scylladb

Author	SHA1	Message	Date
Anna Stuchlik	b87df354e1	doc: update the warning about shared dictionary training This commit updates the inadequate warning on the Advanced Internode (RPC) Compression page. The warning is replaced with a note about how training data is encrypted. Fixes https://github.com/scylladb/scylladb/issues/29109 Closes scylladb/scylladb#29111 (cherry picked from commit `88b98fac3a`)	2026-03-25 10:59:12 +02:00
Jenkins Promoter	63e5de60da	Update pgo profiles - aarch64	2026-03-15 05:09:46 +02:00
Anna Stuchlik	e70543ce2b	doc: fix the unified installer instructions This commit updates the documentation for the unified installer. - The Open Source example is replaced with version 2025.1 (Source Available, currently supported, LTS). - The info about CentOS 7 is removed (no longer supported). - Java 8 is removed. - The example for cassandra-stress is removed (as it was already removed on other installation pages). Fixes https://github.com/scylladb/scylladb/issues/28150 Closes scylladb/scylladb#28152 (cherry picked from commit `855c503c63`) Closes scylladb/scylladb#28910 Closes scylladb/scylladb#28927 Closes scylladb/scylladb#28974	2026-03-10 22:45:44 +02:00
Anna Stuchlik	8664959368	doc: remove redundant Java-related information This commit removes: - Instructions to install scylla-jmx (and all references) - The Java 11 requirement for Ubuntu. Fixes https://github.com/scylladb/scylladb/issues/28249 Fixes https://github.com/scylladb/scylladb/issues/28252 Closes scylladb/scylladb#28254 (cherry picked from commit `64b1798513`) Closes scylladb/scylladb#28888 Closes scylladb/scylladb#28906	2026-03-05 21:12:27 +02:00
Jenkins Promoter	c062b9e664	Update ScyllaDB version to: 2025.3.9	2026-03-05 17:35:21 +02:00
Botond Dénes	0ba3dabcd9	Merge '[Backport 2025.3] docs: update a documentation of adding/removing DC and rebuilding a node' from Scylladb[bot] Describe a procedure to convert tablet keyspace replication factor to rack list. Update the procedures of adding and removing a node to consider tablet keyspaces. Fixes: [SCYLLADB-398](https://scylladb.atlassian.net/browse/SCYLLADB-398) Fixes: https://github.com/scylladb/scylladb/issues/28306. Fixes: https://github.com/scylladb/scylladb/issues/28307. Fixes: https://github.com/scylladb/scylladb/issues/28270. Needs backport to all live branches as they all include tablets. [SCYLLADB-398]: https://scylladb.atlassian.net/browse/SCYLLADB-398?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ - (cherry picked from commit `eefe66b2b2`) - (cherry picked from commit `e08ac60161`) - (cherry picked from commit `1c764cf6ea`) - (cherry picked from commit `e4c42acd8f`) - (cherry picked from commit `9ccc95808f`) Parent PR: #28521 Closes scylladb/scylladb#28778 * github.com:scylladb/scylladb: docs: update nodetool rebuild docs docs: update a procedure of decommissioning a DC docs: update a procedure of adding a DC	2026-03-03 13:27:50 +02:00
Łukasz Paszkowski	77fba1c351	test/pylib/util.py: Add retries and additional logging to start_writes() Consider the following scenario: 1. Let nodes A,B,C form a cluster with RF=3 2. Write query with CL=QUORUM is submitted and is acknowledged by nodes B,C 3. Follow-up read query with CL=QUORUM is sent to verify the write from the previous step 4. Coordinator sends data/digest requests to the nodes A,B. Since the node A is missing data, digest mismatches and data reconciliation is triggered 5. The node A or B fails, becomes unavailable, etc 6. During reconciliation, data requests are sent to nodes A,B and fail, failing the entire read query When the above scenario happens, the tests using `start_writes()` fail with the following stacktrace: ``` ... > await finish_writes() test/cluster/test_tablets_migration.py:259: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ test/pylib/util.py:241: in finish await asyncio.gather(*tasks) test/pylib/util.py:227: in do_writes raise e _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ worker_id = 1 ... > rows = await cql.run_async(rd_stmt, [pk]) E cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test_1767777001181_bmsvk.test - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1} ``` Note that when a node failure happens before/during a read query, there is no test failure as the speculative retries are enabled by default. Hence an additional data/digest read is sent to the third remaining node. However, the same speculative read is cancelled the moment the read query reaches CL, which may trigger a read-repair. 
This change: - Retries the verification read in start_writes() on failure to mitigate races between reads and node failures - Adds additional logging to correlate Python exceptions with Scylla logs Fixes https://github.com/scylladb/scylladb/issues/27478 Fixes https://github.com/scylladb/scylladb/issues/27974 Fixes https://github.com/scylladb/scylladb/issues/27494 Fixes https://github.com/scylladb/scylladb/issues/23529 Note that this change addresses test flakiness observed during tablet transitions. However, it serves as a workaround for a higher-level issue https://github.com/scylladb/scylladb/issues/28125 Closes scylladb/scylladb#28140 (cherry picked from commit `e07fe2536e`) Closes scylladb/scylladb#28826	2026-03-03 13:27:14 +02:00
Jenkins Promoter	a352e4af5b	Update pgo profiles - aarch64	2026-03-01 05:12:00 +02:00
Jenkins Promoter	c8a21f0b2e	Update pgo profiles - x86_64	2026-03-01 04:35:46 +02:00
Aleksandra Martyniuk	8433ae86c9	docs: update nodetool rebuild docs Update nodetool rebuild docs to mention that the command does not work for tablet keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/28270. (cherry picked from commit `9ccc95808f`)	2026-02-26 11:57:31 +01:00
Aleksandra Martyniuk	1c3306aeaf	docs: update a procedure of decommissioning a DC Update a procedure of decommissioning a DC for tablet keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/28307. (cherry picked from commit `e4c42acd8f`)	2026-02-26 11:54:22 +01:00
Aleksandra Martyniuk	f754d95346	docs: update a procedure of adding a DC Update a procedure of adding a DC for tablet keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/28306. (cherry picked from commit `1c764cf6ea`)	2026-02-26 11:52:31 +01:00
Botond Dénes	6edae6c138	Merge '[Backport 2025.3] test: cluster: Fix test_sync_point' from Scylladb[bot] The test `test_sync_point` had a few shortcomings that made it flaky or simply wrong: 1. We were verifying that hints were written by checking the size of in-flight hints. However, that could potentially lead to problems in rare situations. For instance, if all of the hints failed to be written to disk, the size of in-flight hints would drop to zero, but creating a sync point would correspond to the empty state. In such a situation, we should fail immediately and indicate what the cause was. 2. A sync point corresponds to the hints that have already been written to disk. The number of those is tracked by the metric `written`. It's a much more reliable way to make sure that hints have been written to the commitlog. That ensures that the sync point we'll create will really correspond to those hints. 3. The auxiliary function `wait_for` used in the test works like this: it executes the passed callback and looks at the result. If it's `None`, it retries it. Otherwise, the callback is deemed to have finished its execution and no further retries will be attempted. Before this commit, we simply returned a bool, and so the code was wrong. We improve it. --- Note that this fixes scylladb/scylladb#28203, which was a manifestation of scylladb/scylladb#25879. We created a sync point that corresponded to the empty state, and so it immediately resolved, even when node 3 was still dead. As a bonus, we rewrite the auxiliary code responsible for fetching metrics and manipulating sync points. Now it's asynchronous and uses the existing standard mechanisms available to developers. Furthermore, we reduce the time needed for executing `test_sync_point` by 27 seconds. 
--- The total difference in time needed to execute the whole test file (on my local machine, in dev mode): Before: CPU utilization: 0.9% real 2m7.811s user 0m25.446s sys 0m16.733s After: CPU utilization: 1.1% real 1m40.288s user 0m25.218s sys 0m16.566s --- Refs scylladb/scylladb#25879 Fixes scylladb/scylladb#28203 Backport: This improves the stability of our CI, so let's backport it to all supported versions. - (cherry picked from commit `628e74f157`) - (cherry picked from commit `ac4af5f461`) - (cherry picked from commit `c5239edf2a`) - (cherry picked from commit `a256ba7de0`) - (cherry picked from commit `f83f911bae`) Parent PR: #28602 Closes scylladb/scylladb#28621 * github.com:scylladb/scylladb: test: cluster: Reduce wait time in test_sync_point test: cluster: Fix test_sync_point test: cluster: Await sync points asynchronously test: cluster: Create sync points asynchronously test: cluster: Fetch hint metrics asynchronously	2026-02-26 10:08:39 +02:00
Yaron Kaikov	fa67480d27	.github/workflows: enable automatic backport PR creation with Jira sub-issue integration This workflow calls the reusable backport-with-jira workflow from scylladb/github-automation to enable automatic backport PR creation with Jira sub-issue integration. The workflow triggers on: - Push to master/next-/branch- branches (for promotion events) - PR labeled with backport/X.X pattern (for manual backport requests) - PR closed/merged on version branches (for chain backport processing) Features enabled by calling the shared workflow: - Creates Jira sub-issues under the main issue for each backport version - Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2) - Cherry-picks from previous version branch to avoid repeated conflicts - On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR Closes scylladb/scylladb#28804 (cherry picked from commit `b211590bc0`) Closes scylladb/scylladb#28814	2026-02-26 10:08:08 +02:00
Marcin Maliszkiewicz	134f2c1a06	Merge '[Backport 2025.3] transport: fix connection code to consume only initially taken semaphore units' from Scylladb[bot] The connection's `cpu_concurrency_t` struct tracks the state of a connection to manage the admission of new requests and prevent CPU overload during connection storms. When a connection holds units (allowed only 0 or 1), it is considered to be in the "CPU state" and contributes to the concurrency limits used when accepting new connections. The bug stems from the fact that `counted_data_source_impl::get` and `counted_data_sink_impl::put` calls can interleave during execution. This occurs because of `should_parallelize` and `_ready_to_respond`, the latter being a future chain that can run in the background while requests are being read. Consequently, while reading request (N), the system may concurrently be writing the response for request (N-1) on the same connection. This interleaving allows `return_all()` to be called twice before the subsequent `consume_units()` is invoked. While the second `return_all()` call correctly returns 0 units, the matching `consume_units()` call would mistakenly take an extra unit from the semaphore. Over time, a connection blocked on a read operation could end up holding an unreturned semaphore unit. If this pattern repeats across multiple connections, the semaphore units are eventually depleted, preventing the server from accepting any new connections. The fix ensures that we always consume the exact number of units that were previously returned. With this change, interleaved operations behave as follows: get() return_all — returns 1 unit put() return_all — returns 0 units get() consume_units — takes back 1 unit put() consume_units — takes back 0 units Logically, the networking phase ends when the first network operation concludes. But more importantly, when a network operation starts, we no longer hold any units. 
Other solutions are possible but the chosen one seems to be the simplest and safest to backport. Fixes SCYLLADB-485 Backport: all supported affected versions, bug introduced with initial feature implementation in: `ed3e4f33fd` - (cherry picked from commit `0376d16ad3`) - (cherry picked from commit `3b98451776`) Parent PR: #28530 Closes scylladb/scylladb#28713 * github.com:scylladb/scylladb: test: decrease strain in test_startup_response test: auth_cluster: add test for hanged AUTHENTICATING connections transport: fix connection code to consume only initially taken semaphore units transport: remove redundant futurize_invoke from counted data sink and source	2026-02-23 13:20:11 +01:00
Marcin Maliszkiewicz	2c0295962e	test: decrease strain in test_startup_response For 2025.3 and 2025.4 this test runs an order of magnitude slower in debug mode, potentially due to passwords::check running in an alien thread and overwhelming the CPU (this is fixed in newer versions). Decreasing the number of connections in the test makes it fast again, without breaking reproducibility. As an additional measure we double the timeout.	2026-02-20 10:20:18 +01:00
Marcin Maliszkiewicz	180d4d6206	test: auth_cluster: add test for hanged AUTHENTICATING connections Test runtime: Release - 2s Debug - 5s (cherry picked from commit `3b98451`)	2026-02-19 16:33:23 +01:00
Marcin Maliszkiewicz	dfd77a7a9c	transport: fix connection code to consume only initially taken semaphore units The connection's cpu_concurrency_t struct tracks the state of a connection to manage the admission of new requests and prevent CPU overload during connection storms. When a connection holds units (allowed only 0 or 1), it is considered to be in the "CPU state" and contributes to the concurrency limits used when accepting new connections. The bug stems from the fact that `counted_data_source_impl::get` and `counted_data_sink_impl::put` calls can interleave during execution. This occurs because of `should_parallelize` and `_ready_to_respond`, the latter being a future chain that can run in the background while requests are being read. Consequently, while reading request (N), the system may concurrently be writing the response for request (N-1) on the same connection. This interleaving allows `return_all()` to be called twice before the subsequent `consume_units()` is invoked. While the second `return_all()` call correctly returns 0 units, the matching `consume_units()` call would mistakenly take an extra unit from the semaphore. Over time, a connection blocked on a read operation could end up holding an unreturned semaphore unit. If this pattern repeats across multiple connections, the semaphore units are eventually depleted, preventing the server from accepting any new connections. The fix ensures that we always consume the exact number of units that were previously returned. With this change, interleaved operations behave as follows: get() return_all — returns 1 unit put() return_all — returns 0 units get() consume_units — takes back 1 unit put() consume_units — takes back 0 units Logically, the networking phase ends when the first network operation concludes. But more importantly, when a network operation starts, we no longer hold any units. Other solutions are possible but the chosen one seems to be the simplest and safest to backport. 
Fixes SCYLLADB-485 (cherry picked from commit `0376d16`)	2026-02-19 16:33:23 +01:00
Marcin Maliszkiewicz	4cff0fddd4	transport: remove redundant futurize_invoke from counted data sink and source Closes scylladb/scylladb#27526 (cherry picked from commit `d5b63df`)	2026-02-19 16:33:23 +01:00
Dawid Mędrek	4e5eebe422	test: cluster: Reduce wait time in test_sync_point If everything is OK, the sync point will not resolve with node 3 dead. As a result, the waiting will use all of the time we allocate for it, i.e. 30 seconds. That's a lot of time. There's no easy way to verify that the sync point will NOT resolve, but let's at least reduce the waiting to 3 seconds. If there's a bug, it should be enough to trigger it at some point, while reducing the average time needed for CI. (cherry picked from commit `f83f911bae`)	2026-02-19 14:30:56 +01:00
Dawid Mędrek	d59d7defee	test: cluster: Fix test_sync_point The test had a few shortcomings that made it flaky or simply wrong: 1. We were verifying that hints were written by checking the size of in-flight hints. However, that could potentially lead to problems in rare situations. For instance, if all of the hints failed to be written to disk, the size of in-flight hints would drop to zero, but creating a sync point would correspond to the empty state. In such a situation, we should fail immediately and indicate what the cause was. 2. A sync point corresponds to the hints that have already been written to disk. The number of those is tracked by the metric `written`. It's a much more reliable way to make sure that hints have been written to the commitlog. That ensures that the sync point we'll create will really correspond to those hints. 3. The auxiliary function `wait_for` used in the test works like this: it executes the passed callback and looks at the result. If it's `None`, it retries it. Otherwise, the callback is deemed to have finished its execution and no further retries will be attempted. Before this commit, we simply returned a bool, and so the code was wrong. We improve it. Note that this fixes scylladb/scylladb#28203, which was a manifestation of scylladb/scylladb#25879. We created a sync point that corresponded to the empty state, and so it immediately resolved, even when node 3 was still dead. Refs scylladb/scylladb#25879 Fixes scylladb/scylladb#28203 (cherry picked from commit `a256ba7de0`)	2026-02-19 14:30:55 +01:00
Dawid Mędrek	916ba5300f	test: cluster: Await sync points asynchronously There's a dedicated HTTP API for communicating with the cluster, so let's use it instead of yet another custom solution. (cherry picked from commit `c5239edf2a`)	2026-02-19 14:30:54 +01:00
Dawid Mędrek	a865886f7b	test: cluster: Create sync points asynchronously There's a dedicated HTTP API for communicating with the nodes, so let's use it instead of yet another custom solution. (cherry picked from commit `ac4af5f461`)	2026-02-19 14:29:58 +01:00
Avi Kivity	d657044d70	Merge '[Backport 2025.3] s3_client: Fix s3 part size and number of parts calculation' from Scylladb[bot] - Correct `calc_part_size` function since it could return more than 10k parts - Add tests - Add more checks in `calc_part_size` to comply with S3 limits Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640 Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters - (cherry picked from commit `289e910cec`) - (cherry picked from commit `6280cb91ca`) - (cherry picked from commit `960adbb439`) Parent PR: #28592 Closes scylladb/scylladb#28695 * github.com:scylladb/scylladb: s3_client: add more constraints to the calc_part_size s3_client: add tests for calc_part_size s3_client: correct multipart part-size logic to respect 10k limit	2026-02-19 14:14:01 +02:00
Wojciech Mitros	0d1e7002c2	mv: don't mark the view as built if the reader produced no partitions When we build a materialized view we read the entire base table from start to end to generate all required view updates. If a view is created while another view is being built on the same base table, this is optimized - we start generating view updates for the new view from the base table rows that we're currently reading, and we read the missed initial range again after the previous view finishes building. The view building progress is only updated after generating view updates for some read partitions. However, there are scenarios where we'll generate no view updates for the entire read range. If this was not handled we could end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293 To handle this, we mark the view as built if the reader generated no partitions. However, this is not always the correct conclusion. Another scenario where the reader won't encounter any partitions is when view building is interrupted, and then we perform a reshard. In this scenario, we set the reader for all shards to the last unbuilt token for an existing partition before the reshard. However, this partition may not exist on a shard after reshard, and if there are also no partitions with higher tokens, the reader will generate no partitions even though it hasn't finished view building. Additionally, we already have a check that prevents infinite view building loops without taking the partitions generated by the reader into account. At the end of stream, before looping back to the start, we advance current_key to the end of the built range and check for built views in that range. This handles the case where the entire range is empty - the conditions for a built view are: 1. the "next_token" is no greater than "first_token" (the view building process looped back, so we've built all tokens above "first_token") 2. 
the "current_token" is no less than "first_token" (after looping back, we've built all tokens below "first_token") If the range is empty, we'll pass these conditions on an empty range after advancing "current_key" to the end because: 1. after looping back, "next_token" will be set to `dht::minimum_token` 2. "current_key" will be set to `dht::ring_position::max()` In this patch we remove the check for partitions generated by the reader. This fixes the issue with resharding and it does not resurrect the issue with infinite view building that the check was introduced for. Fixes https://github.com/scylladb/scylladb/issues/26523 Closes scylladb/scylladb#26635 (cherry picked from commit `0a22ac3c9e`) Closes scylladb/scylladb#26887	2026-02-18 13:03:05 +02:00
Wojciech Mitros	4f3d22694a	alternator: use storage_proxy from the correct shard in executor::delete_table When we delete a table in alternator, the schema change is performed on shard 0. However, we actually use the storage_proxy from the shard that is handling the delete_table command. This can lead to problems because some information is stored only on shard 0 and using storage_proxy from another shard may make us miss it. In this patch we fix this by using the storage_proxy from shard 0 instead. Fixes https://github.com/scylladb/scylladb/issues/27223 Closes scylladb/scylladb#27224 (cherry picked from commit `3c376d1b64`) Closes scylladb/scylladb#27259	2026-02-18 13:01:05 +02:00
Botond Dénes	09858b7e86	Merge '[Backport 2025.3] service: pass topology guard to RBNO' from Scylladb[bot] Currently, raft-based node operations with streaming use topology guards, but repair-based don't. Topology guards ensure that if a respective session is closed (the operation has finished), each leftover operation being a part of this session fails. Thanks to that we won't incorrectly assume that e.g. the old rpc received late belongs to the newly started operation. This is especially important if the operation involves writes. Pass a topology_guard down from raft_topology_cmd_handler to repair tasks. Repair tasks already support topology guards. Fixes: https://github.com/scylladb/scylladb/issues/27759 No topology_guard in any version; needs backport to all versions - (cherry picked from commit `3fe596d556`) - (cherry picked from commit `2be5ee9f9d`) Parent PR: #27839 Closes scylladb/scylladb#28297 * github.com:scylladb/scylladb: service: use session variable for streaming service: pass topology guard to RBNO	2026-02-18 12:58:10 +02:00
Ernest Zaslavsky	c9c62f1e83	s3_client: limit multipart upload concurrency Prevent launching hundreds or thousands of fibers during multipart uploads by capping concurrent part submissions to 16. Closes scylladb/scylladb#28554 (cherry picked from commit `034c6fbd87`) Closes scylladb/scylladb#28664	2026-02-18 12:56:56 +02:00
Calle Wilund	c55b28f1c2	commitlog: Always abort replenish queue on loop exit Fixes #28678 If replenish loop exits the sleep condition, with an empty queue, when "_shutdown" is already set, a waiter might get stuck, unsignalled waiting for segments, even though we are exiting. Simply move queue abort to always be done on loop exit. Closes scylladb/scylladb#28679 (cherry picked from commit `ab4e4a8ac7`) Closes scylladb/scylladb#28691	2026-02-18 12:56:03 +02:00
Patryk Jędrzejczak	cb6446816f	test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart The test can currently fail like this: ``` > await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}") E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">}) ``` The following happens: - node A is restarted and becomes the group0 leader, - the driver sends the ALTER TABLE request to node B, - the request hits group 0 concurrent modification error 10 times and fails because node A performs tablet migrations at the same time. What is unexpected is that even though the driver session uses the default retry policy, the driver doesn't retry the request on node A. The request is guaranteed to succeed on node A because it's the only node adding group0 entries. The driver doesn't retry the request on node A because of a missing `wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect the driver just in case to prevent hitting scylladb/python-driver#295. Moreover, we can revert the workaround from `4c9efc08d8`, as the fix from this commit also prevents DROP KEYSPACE failures. The commit has been tested in byo with `_concurrent_ddl_retries{0}` to verify that node A really can't hit group 0 concurrent modification error and always receives the ALTER TABLE request from the driver. All 300 runs in each build mode passed. Fixes #25938 Closes scylladb/scylladb#28632 (cherry picked from commit `0693091aff`) Closes scylladb/scylladb#28671	2026-02-18 10:40:22 +01:00
Ernest Zaslavsky	44a89969ab	s3_client: add more constraints to the calc_part_size Enforce more checks on part size and object size as defined in "Amazon S3 multipart upload limits", see https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html and https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingObjects.html (cherry picked from commit `960adbb439`)	2026-02-18 09:40:10 +00:00
Ernest Zaslavsky	fa1ca8096b	s3_client: add tests for calc_part_size Introduce tests that validate the corrected multipart part-size calculation, including boundary conditions and error cases. (cherry picked from commit `6280cb91ca`)	2026-02-18 09:40:10 +00:00
Ernest Zaslavsky	137233b1e6	s3_client: correct multipart part-size logic to respect 10k limit The previous calculation could produce more than 10,000 parts for large uploads because we mixed values in bytes and MiB when determining the part size. This could result in selecting a part size that still exceeded the AWS multipart upload limit. The updated logic now ensures the number of parts never exceeds the allowed maximum. This change also aligns the implementation with the code comment: we prefer a 50 MiB part size because it provides the best performance, and we use it whenever it fits within the 10,000-part limit. If it does not, we increase the part size (in bytes, aligned to MiB) to stay within the limit. (cherry picked from commit `289e910cec`)	2026-02-18 09:40:10 +00:00
Jenkins Promoter	9109d8ef54	Update pgo profiles - aarch64	2026-02-15 05:15:44 +02:00
Dawid Mędrek	783c15d0c5	test: cluster: Fetch hint metrics asynchronously There's a dedicated API for fetching metrics now. Let's use it instead of developing yet another solution that's also worse. (cherry picked from commit `628e74f157`)	2026-02-12 12:12:04 +00:00
Jenkins Promoter	ea9e07c9e4	Update ScyllaDB version to: 2025.3.8	2026-02-10 22:43:29 +02:00
Avi Kivity	78104f93ac	Merge '[Backport 2025.3] Introduce TTL and retries to address resolution' from Scylladb[bot] In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop. Introduce RR TTL and connection failure retry to - re-resolve the RR in a timely manner - forcefully reset and re-resolve addresses - add a special case when the TTL is 0 and the record must be resolved for every request Fixes: CUSTOMER-96 Fixes: CUSTOMER-139 Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3 - (cherry picked from commit `bd9d5ad75b`) - (cherry picked from commit `359d0b7a3e`) - (cherry picked from commit `ce0c7b5896`) - (cherry picked from commit `5b3e513cba`) - (cherry picked from commit `66a33619da`) - (cherry picked from commit `6eb7dba352`) - (cherry picked from commit `a05a4593a6`) - (cherry picked from commit `3a31380b2c`) - (cherry picked from commit `912c48a806`) Parent PR: #27891 Closes scylladb/scylladb#28403 * github.com:scylladb/scylladb: connection_factory: includes cleanup dns_connection_factory: refine the move constructor connection_factory: retry on failure connection_factory: introduce TTL timer connection_factory: get rid of shared_future in dns_connection_factory connection_factory: extract connection logic into a member connection_factory: remove unnecessary `else` connection_factory: use all resolved DNS addresses s3_test: remove client double-close	2026-02-05 13:10:46 +02:00
Patryk Jędrzejczak	658a65f967	Merge '[Backport 2025.3] storage_service: set up topology properly in maintenance mode' from Scylladb[bot] We currently make the local node the only token owner (that owns the whole ring) in maintenance mode, but we don't update the topology properly. The node is present in the topology, but in the `none` state. That's how it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in `scylla_main`. As a result, the node started in maintenance mode crashes in the following way in the presence of a vnodes-based keyspace with the NetworkTopologyStrategy: ``` scylla: locator/network_topology_strategy.cc:207: locator::natural_endpoints_tracker::natural_endpoints_tracker( const token_metadata &, const network_topology_strategy::dc_rep_factor_map &): Assertion `!_token_owners.empty() && !_racks.empty()' failed. ``` Both `_token_owners` and `_racks` are empty. The reason is that `_tm.get_datacenter_token_owners()` and `_tm.get_datacenter_racks_token_owners()` called above filter out nodes in the `none` state. This bug basically made maintenance mode unusable in customer clusters. We fix it by changing the node state to `normal`. We also extend `test_maintenance_mode` to provide a reproducer for Fixes #27988 This PR must be backported to all branches, as maintenance mode is currently unusable everywhere. 
- (cherry picked from commit `a08c53ae4b`) - (cherry picked from commit `9d4a5ade08`) - (cherry picked from commit `c92962ca45`) - (cherry picked from commit `408c6ea3ee`) - (cherry picked from commit `53f58b85b7`) - (cherry picked from commit `867a1ca346`) - (cherry picked from commit `6c547e1692`) - (cherry picked from commit `7e7b9977c5`) Parent PR: #28322 Closes scylladb/scylladb#28497 * https://github.com/scylladb/scylladb: test: test_maintenance_mode: enable maintenance mode properly test: test_maintenance_mode: shutdown cluster connections test: test_maintenance_mode: run with different keyspace options test: test_maintenance_mode: check that group0 is disabled by creating a keyspace test: test_maintenance_mode: get rid of the conditional skip test: test_maintenance_mode: remove the redundant value from the query result storage_proxy: skip validate_read_replica in maintenance mode storage_service: set up topology properly in maintenance mode	2026-02-04 16:52:50 +01:00
Ernest Zaslavsky	46b2baa0fe	connection_factory: includes cleanup (cherry picked from commit `912c48a806`)	2026-02-04 09:39:42 +02:00
Ernest Zaslavsky	4e3de5d209	dns_connection_factory: refine the move constructor Clean up the awkward move constructor that was declared in the header but defaulted in a separate compilation unit, improving clarity and consistency. (cherry picked from commit `3a31380b2c`)	2026-02-04 09:39:42 +02:00
Ernest Zaslavsky	c370887ff8	connection_factory: retry on failure If connecting to a provided address throws, renew the address list and retry once (and only once) before giving up. (cherry picked from commit `a05a4593a6`)	2026-02-04 09:39:42 +02:00
Ernest Zaslavsky	ae047e6419	connection_factory: introduce TTL timer Add a TTL-based timer to connection_factory to automatically refresh resolved host name addresses when they expire. (cherry picked from commit `6eb7dba352`)	2026-02-04 09:39:42 +02:00
Ernest Zaslavsky	9d7055d3f5	connection_factory: get rid of shared_future in dns_connection_factory Move state management from dns_connection_factory into state class itself to encapsulate its internal state and stop managing it from the `dns_connection_factory` (cherry picked from commit `66a33619da`)	2026-02-04 09:31:46 +02:00
Ernest Zaslavsky	4598e0515a	connection_factory: extract connection logic into a member extract connection logic into a private member function to make it reusable (cherry picked from commit `5b3e513cba`)	2026-02-04 09:31:46 +02:00
Ernest Zaslavsky	36bcc95158	connection_factory: remove unnecessary `else` (cherry picked from commit `ce0c7b5896`)	2026-02-04 09:31:45 +02:00
Ernest Zaslavsky	ccad26abf7	connection_factory: use all resolved DNS addresses Improve dns_connection_factory to iterate over all resolved addresses instead of using only the first one. (cherry picked from commit `359d0b7a3e`)	2026-02-04 09:31:45 +02:00
Ernest Zaslavsky	af3f266496	s3_test: remove client double-close `test_chunked_download_data_source_with_delays` was calling `close()` on a client twice, remove the unnecessary call (cherry picked from commit `bd9d5ad75b`)	2026-02-04 09:31:45 +02:00
Patryk Jędrzejczak	d7abd977a0	test: test_maintenance_mode: enable maintenance mode properly The same issue as the one fixed in `394207fd69`. This one didn't cause real problems, but it's still cleaner to fix it. (cherry picked from commit `7e7b9977c5`)	2026-02-03 12:32:08 +01:00
Patryk Jędrzejczak	798938933a	test: test_maintenance_mode: shutdown cluster connections Leaked connections are known to cause inter-test issues. (cherry picked from commit `6c547e1692`)	2026-02-03 12:32:08 +01:00
Patryk Jędrzejczak	ce98d50a93	test: test_maintenance_mode: run with different keyspace options We extend the test to provide a reproducer for #27988 and to avoid similar bugs in the future. The test slows down from ~14s to ~19s on my local machine in dev mode. It seems reasonable. (cherry picked from commit `867a1ca346`)	2026-02-03 12:32:08 +01:00
Patryk Jędrzejczak	74514035dc	test: test_maintenance_mode: check that group0 is disabled by creating a keyspace In the following commit, we make the rest run with multiple keyspaces, and the old check becomes inconvenient. We also move it below to the part of the code that won't be executed for each keyspace. Additionally, we check if the error message is as expected. (cherry picked from commit `53f58b85b7`)	2026-02-03 12:32:08 +01:00
Patryk Jędrzejczak	4ddc7a1720	test: test_maintenance_mode: get rid of the conditional skip This skip has already caused trouble. After `0668c642a2`, the skip was always hit, and the test was silently doing nothing. This made us miss #26816 for a long time. The test was fixed in `222eab45f8`, but we should get rid of the skip anyway. We increase the number of writes from 256 to 1000 to make the chance of not finding the key on server A even lower. If that still happens, it must be due to a bug, so we fail the test. We also make the test insert rows until server A is a replica of one row. The expected number of inserted rows is a small constant, so it should, in theory, make the test faster and cleaner (we need one row on server A, so we insert exactly one such row). It's possible to make the test fully deterministic, by e.g., hardcoding the key and tokens of all nodes via `initial_token`, but I'm afraid it would make the test "too deterministic" and could hide a bug. (cherry picked from commit `408c6ea3ee`)	2026-02-03 12:32:08 +01:00
Patryk Jędrzejczak	ca78b7beac	test: test_maintenance_mode: remove the redundant value from the query result (cherry picked from commit `c92962ca45`)	2026-02-03 12:32:08 +01:00
Patryk Jędrzejczak	f1c1a1267e	storage_proxy: skip validate_read_replica in maintenance mode In maintenance mode, the local node adds only itself to the topology. However, the effective replication map of a keyspace with tablets enabled contains all tablet replicas. It gets them from the tablets map, not the topology. Hence, `network_topology_strategy::sanity_check_read_replicas` hits ``` throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace())); ``` for tablet replicas other than the local node. As a result, all requests to a keyspace with tablets enabled and RF > 1 fail in debug mode (`validate_read_replica` does nothing in other modes). We don't want to skip maintenance mode tests in debug mode, so we skip the check in maintenance mode. We move the `is_debug_build()` check because: - `validate_read_replicas` is a static function with no access to the config, - we want the `!_db.local().get_config().maintenance_mode()` check to be dropped by the compiler in non-debug builds. We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`. (cherry picked from commit `9d4a5ade08`)	2026-02-03 12:32:08 +01:00
Patryk Jędrzejczak	d62734a4a4	storage_service: set up topology properly in maintenance mode We currently make the local node the only token owner (that owns the whole ring) in maintenance mode, but we don't update the topology properly. The node is present in the topology, but in the `none` state. That's how it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in `scylla_main`. As a result, the node started in maintenance mode crashes in the following way in the presence of a vnodes-based keyspace with the NetworkTopologyStrategy: ``` scylla: locator/network_topology_strategy.cc:207: locator::natural_endpoints_tracker::natural_endpoints_tracker( const token_metadata &, const network_topology_strategy::dc_rep_factor_map &): Assertion `!_token_owners.empty() && !_racks.empty()' failed. ``` Both `_token_owners` and `_racks` are empty. The reason is that `_tm.get_datacenter_token_owners()` and `_tm.get_datacenter_racks_token_owners()` called above filter out nodes in the `none` state. This bug basically made maintenance mode unusable in customer clusters. We fix it by changing the node state to `normal`. We also update its rack, datacenter, and shards count. Rack and datacenter are present in the topology somehow, but there is nothing wrong with updating them again. The shard count is also missing, so we better update it to avoid other issues. Fixes #27988 (cherry picked from commit `a08c53ae4b`)	2026-02-03 12:32:07 +01:00
Tomasz Grabiec	9696440054	Merge '[Backport 2025.3] load_stats: fix problem with load_stats refresh throwing no_such_column_family' from Scylladb[bot] When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host. During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail. It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure. This fixes this problem by checking if the table still exists. Fixes: #28359 - (cherry picked from commit `71be10b8d6`) - (cherry picked from commit `92dbde54a5`) Parent PR: #28440 Closes scylladb/scylladb#28469 * github.com:scylladb/scylladb: test: add test and reproducer for load_stats refresh exception load_stats: handle dropped tables when refreshing load_stats	2026-02-03 11:34:55 +01:00
Pavel Emelyanov	22e8a2e4dc	Update seastar submodule (assorted fixes for S3 client update) * seastar 167e47bcc...04d7c63ac (2): > net: expose DNS TTL via net::hostent > http: add virtual close() to connection_factory refs SCYLLADB-435 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28483	2026-02-03 11:40:47 +03:00
Ferenc Szili	b0612a30e8	test: add test and reproducer for load_stats refresh exception This patch adds a test and reproducer for the issue where the load_stats refresh procedure throws exceptions if any of the tables have been dropped since load_stats was produced. (cherry picked from commit `92dbde54a5`)	2026-02-02 19:44:03 +01:00
Jenkins Promoter	aae1f5d3ff	Update pgo profiles - aarch64	2026-02-01 05:09:04 +02:00
Jenkins Promoter	383dd3f728	Update pgo profiles - x86_64	2026-02-01 04:32:56 +02:00
Ferenc Szili	d256d6e509	load_stats: handle dropped tables when refreshing load_stats When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host. During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail. It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure. This patch fixes this problem by checking if the table still exists. (cherry picked from commit `71be10b8d6`)	2026-02-01 00:33:05 +00:00
Botond Dénes	dbdd5cf10b	Merge '[Backport 2025.3] db: batchlog_manager: update _last_replay only if all batches were re…' from Scylladb[bot] …played Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection. Update _last_replay only if all batches were replayed. Fixes: https://github.com/scylladb/scylladb/issues/24415. Needs backport to all live versions. - (cherry picked from commit `4d0de1126f`) - (cherry picked from commit `e3dcb7e827`) Parent PR: #26793 Closes scylladb/scylladb#27092 * github.com:scylladb/scylladb: test: extend test_batchlog_replay_failure_during_repair db: batchlog_manager: update _last_replay only if all batches were replayed	2026-01-30 16:20:47 +02:00
Botond Dénes	3d9f38ccbb	db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot The API contract in partition_version.hh states that when dealing with evictable entries, a real cache tracker pointer has to be passed to all methods that ask for it. The nonpopulating reader violates this, passing a nullptr to the snapshot. This was observed to cause a crash when a concurrent cache read accessed the snapshot with the null tracker. A reproducer is included which fails before and passes after the fix. Fixes: #26847 Closes scylladb/scylladb#28163 (cherry picked from commit `a53f989d2f`) Closes scylladb/scylladb#28278	2026-01-30 16:15:26 +02:00
Patryk Jędrzejczak	43ee6ba035	test: test_gossiper_orphan_remover: get host ID of the bootstrapping node before it crashes The test is currently flaky. It tries to get the host ID of the bootstrapping node via the REST API after the node crashes. This can obviously fail. The test usually doesn't fail, though, as it relies on the host ID being saved in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()` repeatedly called in `ScyllaServer.start()`. However, with a very fast crash and unlucky timings, no such call may succeed. We deflake the test by getting the host ID before the crash. Note that at this point, the bootstrapping node must be serving the REST API requests because `await log.wait_for("finished do_send_ack2_msg")` above guarantees that the node has started the gossip shadow round, which happens after starting the REST API. Fixes #28385 Closes scylladb/scylladb#28388 (cherry picked from commit `a2c1569e04`) Closes scylladb/scylladb#28415	2026-01-29 11:31:36 +01:00
Yaron Kaikov	02994a0a3e	.github/workflows/backport-pr-fixes-validation: support Atlassian URL format in backport PR fixes validation Add support for matching full Atlassian JIRA URLs in the format https://scylladb.atlassian.net/browse/SCYLLADB-400 in addition to the bare JIRA key format (SCYLLADB-400). This makes the validation more flexible by accepting both formats that developers commonly use when referencing JIRA issues. Fixes: https://github.com/scylladb/scylladb/issues/28373 Closes scylladb/scylladb#28374 (cherry picked from commit `3f10f44232`) Closes scylladb/scylladb#28392	2026-01-27 16:06:44 +02:00
Gleb Natapov	826c323c04	topology coordinator: complete pending operation for a replaced node A replaced node may have a pending operation on it. The replace operation will move the node into the 'left' state and the request will never be completed. Moreover, the code does not expect a left node to have a request. It will try to process the request and will crash because the node for the request will not be found. The patch checks if the replaced node has a pending request and completes it with failure. It also changes topology loading code to skip requests for nodes that are in a left state. This is not strictly needed, but makes the code more robust. Fixes #27990 Closes scylladb/scylladb#28009 (cherry picked from commit `bee5f63cb6`) Closes scylladb/scylladb#28177	2026-01-26 11:43:36 +01:00
Patryk Jędrzejczak	e1bd21db7e	test: test_raft_recovery_during_join: get host ID of the bootstrapping node before it crashes The test is currently flaky. It tries to get the host ID of the bootstrapping node via the REST API after the node crashes. This can obviously fail. The test usually doesn't fail, though, as it relies on the host ID being saved in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()` repeatedly called in `ScyllaServer.start()`. However, with a very fast crash and unlucky timings, no such call may succeed. We deflake the test by getting the host ID before the crash. Note that at this point, the bootstrapping node must be serving the REST API requests because `await coordinator_log.wait_for("delay_node_bootstrap: waiting for message")` above guarantees that the node has submitted the join topology request, which happens after starting the REST API. Fixes #28227 Closes scylladb/scylladb#28233 (cherry picked from commit `e503340efc`) Closes scylladb/scylladb#28309	2026-01-22 13:18:56 +01:00
Patryk Jędrzejczak	5577ad4881	test: test_zero_token_nodes_multidc: properly handle reads with CL=LOCAL_ONE The test is currently flaky. It incorrectly assumes that a read with CL=LOCAL_ONE will see the data inserted by a preceding write with CL=LOCAL_ONE in the same datacenter with RF=2. The same issue has already been fixed for CL=ONE in `21edec1ace`. The difference is that for CL=LOCAL_ONE, only dc1 is problematic, as dc2 has RF=1. We fix the issue for CL=LOCAL_ONE by skipping the check for dc1. Fixes #28253 The fix addresses CI flakiness and only changes the test, so it should be backported. Closes scylladb/scylladb#28274 (cherry picked from commit `1f0f694c9e`) Closes scylladb/scylladb#28303	2026-01-22 11:04:46 +01:00
Aleksandra Martyniuk	5482417d30	service: use session variable for streaming Use session that was retrieved at the beginning of the handler for node operations with streaming to ensure that the session id won't change in between. (cherry picked from commit `2be5ee9f9d`)	2026-01-22 11:00:35 +01:00
Aleksandra Martyniuk	d6dc818f03	service: pass topology guard to RBNO Currently, raft-based node operations with streaming use topology guards, but repair-based don't. Topology guards ensure that if a respective session is closed (the operation has finished), each leftover operation being a part of this session fails. Thanks to that we won't incorrectly assume that e.g. the old rpc received late belongs to the newly started operation. This is especially important if the operation involves writes. Pass a topology_guard down from raft_topology_cmd_handler to repair tasks. Repair tasks already support topology guards. Fixes: https://github.com/scylladb/scylladb/issues/27759 (cherry picked from commit `3fe596d556`)	2026-01-22 11:00:10 +01:00
Aleksandra Martyniuk	cd5508a690	test: extend test_batchlog_replay_failure_during_repair Modify test_batchlog_replay_failure_during_repair to also check that there isn't data resurrection if flushing hints falls within the repair cache timeout. (cherry picked from commit `e3dcb7e827`)	2026-01-22 10:39:18 +01:00
Aleksandra Martyniuk	c3a9415e0c	db: batchlog_manager: update _last_replay only if all batches were replayed Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection. Update _last_replay only if all batches were replayed. Fixes: https://github.com/scylladb/scylladb/issues/24415. (cherry picked from commit `4d0de1126f`)	2026-01-22 10:39:00 +01:00
Tomasz Grabiec	afa4d60ac1	Fix lambda-coroutine fiasco in hint_endpoint_manager.cc Found by copilot. No issue was observed yet. Fixes #27520 Closes scylladb/scylladb#27477 (cherry picked from commit `7bc59e93b2`) Closes scylladb/scylladb#27730	2026-01-21 14:15:36 +02:00
Anna Stuchlik	6dd7752e02	doc: fix the default compaction strategy for Materialized Views Fixes https://github.com/scylladb/scylladb/issues/24483 Closes scylladb/scylladb#27725 (cherry picked from commit `84e9b94503`) Closes scylladb/scylladb#28284	2026-01-21 06:45:36 +02:00
Botond Dénes	cb6b20e197	reader_concurrency_semaphore: improve handling of base resources reader_permit::release_base_resources() is a soft evict for the permit: it releases the resources acquired during admission. This is used in cases where a single process owns multiple permits, creating a risk for deadlock, as is the case for repair. In this case, release_base_resources() acts as a manual eviction mechanism to prevent permits blocking each other from admission. Recently we found a bad interaction between release_base_resources() and permit eviction. Repair uses both mechanisms: it marks its permits as inactive and later it also uses release_base_resources(). This practice might be worth reconsidering, but the fact remains that there is a bug in the reader permit which causes the base resources to be released twice when release_base_resources() is called on an already evicted permit. This is incorrect and is fixed in this patch. Improve release_base_resources(): * make _base_resources const * move signal call into the if (_base_resources_consumed()) { } * use reader_permit::impl::signal() instead of reader_concurrency_semaphore::signal() * all places where base resources are released now call release_base_resources() A reproducer unit test is added, which fails before and passes after the fix. Fixes: #28083 Closes scylladb/scylladb#28155 (cherry picked from commit `b7bc48e7b7`) Closes scylladb/scylladb#28242	2026-01-21 06:45:13 +02:00
Aleksandra Martyniuk	94607c36e0	service: node_ops: remove coroutine::lambda wrappers In storage_service::raft_topology_cmd_handler we pass a lambda wrapped in coroutine::lambda to a function that creates streaming_task_impl. The lambda is kept in streaming_task_impl that invokes it in its run method. The lambda captures may be destroyed before the lambda is called, leading to use after free. Do not wrap a lambda passed to streaming_task_impl into coroutine::lambda. Use `this auto` to dissociate the lambda lifetime from the calling statement. Fixes: https://github.com/scylladb/scylladb/issues/28200. Closes scylladb/scylladb#28201 (cherry picked from commit `65cba0c3e7`) Closes scylladb/scylladb#28241	2026-01-21 06:44:45 +02:00
Ernest Zaslavsky	161b66759c	aws_error: fix nested exception handling The loop that unwraps nested exceptions rethrows the nested exception and saves a pointer to the temporary `std::exception& inner` on the stack, then continues. The pointer thus points to a released temporary. Closes scylladb/scylladb#28143 (cherry picked from commit `829bd9b598`) Closes scylladb/scylladb#28239	2026-01-20 11:19:06 +01:00
Asias He	bc8bbf2caa	repair: Allow min max range to be updated for repair history It is observed that: repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb: seastar::rpc::remote_verb_error (repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum token,maximum token) is not in the format of (start, end]) This is because repair checks the end of the range to be repaired needs to be inclusive. When small_table_optimization is enabled for regular repair, a (minimum token,maximum token) will be used. To fix, we can relax the check of (start, end] for the min max range. Fixes #27220 Backport to all active branches. (cherry picked from commit `e97a504`) Parent PR: #27357 Closes scylladb/scylladb#27460	2026-01-19 09:40:55 +02:00
Nikos Dragazis	f89af142da	test: database_test: Fix serialization of partition key The `make_key` lambda erroneously allocates a fixed 8-byte buffer (`sizeof(s.size())`) for variable-length strings, potentially causing uninitialized bytes to be included. If such bytes exist and they are not valid UTF-8 characters, deserialization fails: ``` ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7) ``` Fixes #28195. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#28197 (cherry picked from commit `8aca7b0eb9`) Closes scylladb/scylladb#28208	2026-01-19 09:40:25 +02:00
Tomasz Grabiec	6b88add507	Merge '[Backport 2025.3] topology_coordinator: Add barrier to cleanup_target' from Scylladb[bot] Consider the following scenario: 1. A table has RF=3 and writes use CL=QUORUM 2. One node is down 3. There is a pending tablet migration from the unavailable node that is reverted During the revert, there can be a time window where the pending replica being cleaned up still accepts writes. This leads to write failures, as only two nodes (out of four) are able to acknowledge writes. This patch fixes the issue by adding a barrier to the cleanup_target tablet transition state, ensuring that the coordinator switches back to the previous replica set before cleanup is triggered. Fixes https://github.com/scylladb/scylladb/issues/26512 It's a pre existing issue. Backport is required to all recent 2025.x versions. - (cherry picked from commit `669286b1d6`) - (cherry picked from commit `67f1c6d36c`) - (cherry picked from commit `6163fedd2e`) Parent PR: #27413 Closes scylladb/scylladb#27427 * github.com:scylladb/scylladb: topology_coordinator: Fix the indentation for the cleanup_target case topology_coordinator: Add barrier to cleanup_target test_node_failure_during_tablet_migration: Increase RF from 2 to 3	2026-01-16 16:31:52 +01:00
Calle Wilund	6a7c9cd750	db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc Fixes #27992 When doing a commit log oversized allocation, we lock out all other writers by grabbing the _request_controller semaphore fully (max capacity). We thereafter assert that the semaphore is in fact zero. However, due to how the bookkeeping works here, the semaphore can in fact become negative (some paths will not actually wait for the semaphore, because this could deadlock). Thus, if, after we grab the semaphore and execution actually returns to us (task schedule), new_buffer via segment::allocate is called (due to a non-fully-full segment), we might in fact grab the segment overhead from zero, resulting in a negative semaphore. The same problem applies later when we try to sanity check the return of our permits. The fix is trivial: just accept less-than-zero values, and take the same possible less-than-zero value into account in the exit check (returning units). Added whitebox (special callback interface for sync) unit test that provokes/creates the race condition explicitly (and reliably). Closes scylladb/scylladb#27998 (cherry picked from commit `a7cdb602e1`) Closes scylladb/scylladb#28097	2026-01-16 16:21:22 +02:00
Łukasz Paszkowski	1284c18647	load_sketch: Allow populating load_sketch with normalized current load Currently, tablet allocation intentionally ignores current load (introduced by commit #1e407ab), which could cause identical shard selection when allocating a small number of tablets in the same topology. When a tablet allocator is asked to allocate N tablets (where N is smaller than the number of shards on a node), it selects the first N lowest shards. If multiple such tables are created, each allocator run picks the same shards, leading to tablet imbalance across shards. This change initializes the load sketch with the current shard load, scaled into the [0,1] range, ensuring allocation still remains even while starting from globally least-loaded shards. Fixes https://github.com/scylladb/scylladb/issues/27620 Closes https://github.com/scylladb/scylladb/pull/27802 Closes scylladb/scylladb#28106	2026-01-16 13:50:16 +01:00
Patryk Jędrzejczak	d08045bb61	test: test_group0_schema_versioning: wait for schema sync in system.local `test_schema_versioning_with_recovery` is currently flaky. It performs a write with CL=ALL and then checks if the schema version is the same on all nodes by calling `verify_table_versions_synced`. All nodes are expected to sync their schema before handling the replica write. The node in RECOVERY mode should do it through a schema pull, and other nodes should do it through a group 0 read barrier. The problem is in `verify_local_schema_versions_synced` that compares the schema versions in `system.local`. The node in RECOVERY mode updates the schema version in `system.local` after it acknowledges the replica write as completed. Hence, the check can fail. We fix the problem by making the function wait until the schema versions match. Note that RECOVERY mode is about to be retired together with the whole gossip-based topology in 2026.2. So, this test is about to be deleted. However, we still want to fix it, so that it doesn't bother us in older branches. Fixes #23803 Closes scylladb/scylladb#28114 (cherry picked from commit `6b5923c64e`) Closes scylladb/scylladb#28175	2026-01-16 11:20:45 +01:00
Sergey Zolotukhin	abcf02cbda	test: disable test_start_bootstrapped_with_invalid_seed The test intermittently fails when an invalid DNS name is resolved, likely due to ISP DNS error hijacking (see scylladb/scylladb#28153). Disable this test to unblock CI. Fixes scylladb/scylladb#28153 Closes scylladb/scylladb#28162 (cherry picked from commit `799d837295`)	2026-01-15 17:55:35 +02:00
Avi Kivity	fb15c818f6	Update seastar submodule (accept unbounded recursion) * seastar f61814a489...167e47bcce (1): > net: posix_server_socket_impl: coroutinize accept(), fix unbounded recursion Fixes #28166	2026-01-15 14:35:25 +02:00
Jenkins Promoter	642317d217	Update pgo profiles - aarch64	2026-01-15 05:33:48 +02:00
Jenkins Promoter	022a2f9e8d	Update pgo profiles - x86_64	2026-01-15 04:55:51 +02:00
Patryk Jędrzejczak	6a36195f90	Merge '[Backport 2025.3] raft topology: preserve IP -> ID mapping of a replacing node on restart' from Scylladb[bot] We currently do it only for a bootstrapping node, which is a bug. The missing IP can cause an internal error, for example, in the following scenario: - replace fails during streaming, - all live nodes are shut down before the rollback of replace completes, - all live nodes are restarted, - live nodes start hitting internal error in all operations that require IP of the replacing node (like client requests or REST API requests coming from nodetool). We fix the bug here, but we do it separately for replace with different IP and replace with the same IP. For replace with different IP, we persist the IP -> host ID mapping in `system.peers` just like for bootstrap. That's necessary, since there is no other way to determine IP of the replacing node on restart. For replace with the same IP, we can't do the same. This would require deleting the row corresponding to the node being replaced from `system.peers`. That's fine in theory, as that node is permanently banned, so its IP shouldn't be needed. Unfortunately, we have many places in the code where we assume that IP of a topology member is always present in the address map or that a topology member is always present in the gossiper endpoint set. Examples of such places: - nodetool operations, - REST API endpoints, - `db::hints::manager::store_hint`, - `group0_voter_handler::update_nodes`. We could fix all those places and verify that drivers work properly when they see a node in the token metadata, but not in `system.peers`. However, that would be too risky to backport. We take a different approach. We recover IP of the replacing node on restart based on the state of the topology state machine and `system.peers` just after loading `system.peers`. We rely on the fact that group 0 is set up at this point. 
The only case where this assumption is incorrect is a restart in the Raft-based recovery procedure. However, hitting this problem then seems improbable, and even if it happens, we can restart the node again after ensuring that no client and REST API requests come before replace is rolled back on the new topology coordinator. Hence, it's not worth to complicate the fix (by e.g. looking at the persistent topology state instead of the in-memory state machine). Fixes #28057 Backport this PR to all branches as it fixes a problematic bug. - (cherry picked from commit `fc4c2df2ce`) - (cherry picked from commit `4526dd93b1`) - (cherry picked from commit `749b0278e5`) - (cherry picked from commit `0fed9f94f8`) Manually cherry-picked: - `90b5b2c5f5` - `92b165b8c0` Parent PR: #27435 Closes scylladb/scylladb#28098 * https://github.com/scylladb/scylladb: gossiper: add_saved_endpoint: make generations of excluded nodes negative test: introduce test_full_shutdown_during_replace utils: error_injection: allow aborting wait_for_message raft topology: preserve IP -> ID mapping of a replacing node on restart pylib/rest_client.py: encode injection name utils/error_injection: allow to abort `injection_handler::wait_for_message()`	2026-01-14 10:04:38 +01:00
Michał Jadwiszczak	2a45a98262	docs/dev/service_levels: update docs to service levels on raft Since Scylla 6.0, service levels are managed by Raft group0. This patch updates the table name used by service levels and adds a paragraph describing service levels on raft. Fixes scylladb/scylladb#18177 Closes scylladb/scylladb#26556 (cherry picked from commit `649efd198f`) Closes scylladb/scylladb#28129	2026-01-13 19:32:27 +01:00
Patryk Jędrzejczak	976f16dbaf	gossiper: add_saved_endpoint: make generations of excluded nodes negative The explanation is in the new comment in `gossiper::add_saved_endpoint`. We add a test for this change. It's "extremely white-box", but it's better than nothing. (cherry picked from commit `0fed9f94f8`)	2026-01-13 16:03:22 +01:00
Patryk Jędrzejczak	eb17a8d940	test: introduce test_full_shutdown_during_replace (cherry picked from commit `749b0278e5`)	2026-01-13 16:03:22 +01:00
Patryk Jędrzejczak	810c7b436e	utils: error_injection: allow aborting wait_for_message The test added in the following commit utilizes it. (cherry picked from commit `4526dd93b1`)	2026-01-13 16:03:22 +01:00
Patryk Jędrzejczak	53623205ee	raft topology: preserve IP -> ID mapping of a replacing node on restart We currently do it only for a bootstrapping node, which is a bug. The missing IP can cause an internal error, for example, in the following scenario: - replace fails during streaming, - all live nodes are shut down before the rollback of replace completes, - all live nodes are restarted, - live nodes start hitting internal error in all operations that require IP of the replacing node (like client requests or REST API requests coming from nodetool). We fix the bug here, but we do it separately for replace with different IP and replace with the same IP. For replace with different IP, we persist the IP -> host ID mapping in `system.peers` just like for bootstrap. That's necessary, since there is no other way to determine IP of the replacing node on restart. For replace with the same IP, we can't do the same. This would require deleting the row corresponding to the node being replaced from `system.peers`. That's fine in theory, as that node is permanently banned, so its IP shouldn't be needed. Unfortunately, we have many places in the code where we assume that IP of a topology member is always present in the address map or that a topology member is always present in the gossiper endpoint set. Examples of such places: - nodetool operations, - REST API endpoints, - `db::hints::manager::store_hint`, - `group0_voter_handler::update_nodes`. We could fix all those places and verify that drivers work properly when they see a node in the token metadata, but not in `system.peers`. However, that would be too risky to backport. We take a different approach. We recover IP of the replacing node on restart based on the state of the topology state machine and `system.peers` just after loading `system.peers`. We rely on the fact that group 0 is set up at this point. The only case where this assumption is incorrect is a restart in the Raft-based recovery procedure. 
However, hitting this problem then seems improbable, and even if it happens, we can restart the node again after ensuring that no client and REST API requests come before replace is rolled back on the new topology coordinator. Hence, it's not worth complicating the fix (by e.g. looking at the persistent topology state instead of the in-memory state machine). (cherry picked from commit `fc4c2df2ce`)	2026-01-13 16:03:22 +01:00
Petr Gusev	998b4da424	pylib/rest_client.py: encode injection name Sometimes it's convenient to use slashes in injection names, for example my_component/my_method/my_condition. Without quote() we get 'handler not found' error from Scylla. (cherry picked from commit `92b165b8c0`)	2026-01-13 16:03:22 +01:00
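A minimal sketch of why the encoding in the commit above matters, using only the standard library. The URL prefix and the `injection_url` helper are hypothetical names for illustration, not the actual `rest_client.py` code; passing `safe=""` is an assumption made here so that slashes are escaped.

```python
from urllib.parse import quote


def injection_url(name: str) -> str:
    """Build a REST path for an injection name (hypothetical prefix)."""
    # quote() leaves "/" untouched by default (safe="/"), so safe="" is
    # passed here to make slashes in the name part of one path segment.
    return "/v2/error_injection/injection/" + quote(name, safe="")


encoded = injection_url("my_component/my_method/my_condition")
print(encoded)
```

With the slashes percent-encoded, the whole name stays a single path segment, so a server routing on path segments can match it to one handler.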
Michał Jadwiszczak	161ff63534	utils/error_injection: allow to abort `injection_handler::wait_for_message()` (cherry picked from commit `90b5b2c5f5`)	2026-01-13 16:03:22 +01:00
Anna Stuchlik	48712d4917	doc: add the missing patch upgrade guide for version 2025.3 This upgrade guide has been missing and must be added on branch-2025.3. It does not belong to other branches (including master). Fixes https://github.com/scylladb/scylladb/issues/25522 Closes scylladb/scylladb#28116	2026-01-13 10:38:01 +02:00
Michael Litvak	492cd166f9	db/view/view_update_generator: move discover_staging_sstables to start Call discover_staging_sstables in view_update_generator::start() instead of in the constructor, because the constructor is called during initialization before sstables are loaded. The initialization order was changed in `5d1f74b86a` and caused this regression. It means the view update generator won't discover staging sstables on startup and view updates won't be generated for them. It also causes issues in sstable cleanup. view_update_generator::start() is called in a later stage of the initialization, after sstable loading, so do the discovery of staging sstables there. Fixes scylladb/scylladb#27956 (cherry picked from commit `5077b69c06`) Closes scylladb/scylladb#28091	2026-01-12 13:31:54 +01:00
Patryk Jędrzejczak	cefe7b270f	Merge '[Backport 2025.3] database: truncate_table_on_all_shards: consider can_flush on all shards' from Scylladb[bot] Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them. This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged). Also, change database_test::do_with_some_data to use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`. Fixes #27639 * The issue has always existed and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions - (cherry picked from commit `ec4069246d`) - (cherry picked from commit `5be6b80936`) - (cherry picked from commit `0342a24ee0`) - (cherry picked from commit `02ee341a03`) - (cherry picked from commit `2a803d2261`) - (cherry picked from commit `93b827c185`) - (cherry picked from commit `ebd667a8e0`) Parent PR: #27643 Closes scylladb/scylladb#28071 * https://github.com/scylladb/scylladb: test: database_test: do_with_some_data: randomize keys database: truncate_table_on_all_shards: drop outdated TODO comment database: truncate_table_on_all_shards: consider can_flush on all shards memtable_list: unify can_flush and may_flush test: database_test: add test_flush_empty_table_waits_on_outstanding_flush replica: table, storage_group, compaction_group: add needs_flush test: database_test: do_with_some_data_in_thread: accept void callback function	2026-01-12 11:21:42 +01:00
Jenkins Promoter	f6040eca3c	Update ScyllaDB version to: 2025.3.7	2026-01-11 16:11:56 +02:00
Piotr Dulikowski	4535fe7958	Merge '[Backport 2025.3] service/storage_service: update service levels cache after upgrade to v2' from Scylladb[bot] Service levels cache is empty after upgrade to consistent topology if no mutations are committed to `system.service_levels_v2` or a rolling restart is not done. To fix the bug, this patch adds service levels cache reloading after upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`. Fixes [SCYLLADB-90](https://scylladb.atlassian.net/browse/SCYLLADB-90) This fix should be backported to all versions containing service levels on Raft. [SCYLLADB-90]: https://scylladb.atlassian.net/browse/SCYLLADB-90?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ - (cherry picked from commit `53d0a2b5dc`) - (cherry picked from commit `be16e42cb0`) Parent PR: #27585 Closes scylladb/scylladb#28072 * github.com:scylladb/scylladb: service/storage_service: update service levels cache after upgrade to v2 service/storage_service: check if service levels were already upgraded before doing migration to raft	2026-01-09 21:09:02 +01:00
Tomasz Grabiec	aca5be7c98	test: cluster: Fix NoHostAvailable error in test_not_enough_token_owners The driver must see server_c before we stop server_a, otherwise there will be no live host in the pool when we attempt to drop the keyspace: ``` @pytest.mark.asyncio async def test_not_enough_token_owners(manager: ManagerClient): """ Test that: - the first node in the cluster cannot be a zero-token node - removenode and decommission of the only token owner fail in the presence of zero-token nodes - removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token owners would fall below the RF of some keyspace using tablets """ logging.info('Trying to add a zero-token server as the first server in the cluster') await manager.server_add(config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}, expected_error='Cannot start the first node in the cluster as zero-token') logging.info('Adding the first server') server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"}) logging.info('Adding two zero-token servers') # The second server is needed only to preserve the Raft majority. 
server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0] logging.info(f'Trying to decommission the only token owner {server_a}') await manager.decommission_node(server_a.server_id, expected_error='Cannot decommission the last token-owning node in the cluster') logging.info(f'Stopping {server_a}') await manager.server_stop_gracefully(server_a.server_id) logging.info(f'Trying to remove the only token owner {server_a} by {server_b}') await manager.remove_node(server_b.server_id, server_a.server_id, expected_error='cannot be removed because it is the last token-owning node in the cluster') logging.info(f'Starting {server_a}') await manager.server_start(server_a.server_id) logging.info('Adding a normal server') await manager.server_add(property_file={"dc": "dc1", "rack": "r2"}) cql = manager.get_cql() await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60) > async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name: test/cluster/test_not_enough_token_owners.py:57: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ /usr/lib64/python3.14/contextlib.py:221: in __aexit__ await anext(self.gen) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830> opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }" host = None @asynccontextmanager async def new_test_keyspace(manager: ManagerClient, opts, host=None): """ A utility function for creating a new temporary keyspace with given options. 
It can be used in a "async with", as: async with new_test_keyspace(ManagerClient, '...') as keyspace: """ keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host) try: yield keyspace except: logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation") raise else: > await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host) E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')}) test/cluster/util.py:544: NoHostAvailable ``` Fixes #28011 Closes scylladb/scylladb#28040 (cherry picked from commit `34df158605`) Closes scylladb/scylladb#28070	2026-01-09 19:11:40 +01:00
Botond Dénes	16576da935	reader_concurrency_semaphore: add protection against negative count resource leaks The semaphore has detection and protection against regular resource leaks, where some resources go unaccounted for and are not released by the time the semaphore is destroyed. There is no detection or protection against negative leaks, where resources are "made up" out of thin air. This kind of leak looks benign at first sight: a few extra resources won't hurt anyone so long as the amount is small. But it turns out that even a single extra count resource can defeat a very important anti-deadlock protection in can_admit_read(): the special case which admits a new permit regardless of memory resources when all original count resources are available. This check uses ==, so if resource > original, the protection is defeated indefinitely. Instead of just changing == to >=, we add detection of such negative leaks to signal(), via on_internal_error_noexcept(). At this time I still don't know how this negative leak happens (the code doesn't confess); with this detection, hopefully we'll get a clue from tests or the field. Note that on_internal_error_noexcept() will not generate a coredump, unless ScyllaDB is explicitly configured to do so. In production, it will just generate an error log with a backtrace. The detection also clamps the _resources to _initial_resources, to prevent any damage from the negative leak. I just noticed that there is no unit test for the deadlock protection described above, so one is added in this PR, even if only loosely related to the rest of the patch. Fixes: SCYLLADB-163 Closes scylladb/scylladb#27764 (cherry picked from commit `e4da0afb8d`) Closes scylladb/scylladb#28002	2026-01-09 13:28:14 +02:00
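The negative-leak detection described in the commit above can be sketched as a toy model. This is not ScyllaDB's C++ implementation: the `CountSemaphore` class, its fields, and the `errors` counter (standing in for `on_internal_error_noexcept()`) are hypothetical and only illustrate how an `==` check is defeated by one extra count and how clamping restores it.

```python
from dataclasses import dataclass


@dataclass
class CountSemaphore:
    """Toy model of count resources only (hypothetical, for illustration)."""
    initial: int
    available: int
    errors: int = 0  # stands in for on_internal_error_noexcept() reports

    def signal(self, n: int) -> None:
        """Return n count resources, detecting and clamping negative leaks."""
        self.available += n
        if self.available > self.initial:
            # More resources than ever existed: report and clamp, so the
            # equality check below keeps working.
            self.errors += 1
            self.available = self.initial

    def all_counts_available(self) -> bool:
        """Equality check, as in the anti-deadlock special case."""
        return self.available == self.initial


sem = CountSemaphore(initial=10, available=10)
sem.signal(1)  # a count resource "made up" out of thin air
print(sem.available, sem.errors, sem.all_counts_available())
```

Without the clamp, `available` would stay at 11 and `all_counts_available()` would never return true again, which is the indefinite defeat of the protection that the commit message describes.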
Benny Halevy	c495252be6	test: database_test: do_with_some_data: randomize keys With randomized keys, and since we're inserting only 2 keys, it is possible that they would end up owned only by a single shard, reproducing #27639 in snapshot_list_contains_dropped_tables. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `ebd667a8e0`)	2026-01-09 08:35:29 +02:00
Benny Halevy	b5d537283d	database: truncate_table_on_all_shards: drop outdated TODO comment The comment was added in `83323e155e`. Since then, table::seal_active_memtable was improved to guarantee waiting on outstanding flushes on success (see `d55a2ac762`), so we can remove this TODO comment (it is also not covered by any issue, so nobody is planning to work on it). Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `93b827c185`)	2026-01-09 08:35:00 +02:00
Benny Halevy	fd9ad9a11c	database: truncate_table_on_all_shards: consider can_flush on all shards can_flush might return a different value for each shard, so check it right before deciding whether to flush or clear a memtable shard. Note that under normal conditions can_flush would always return true now that it checks only the presence of the seal memtable function rather than checking memtable_list::empty(). Fixes #27639 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2a803d2261`)	2026-01-09 08:34:31 +02:00
Benny Halevy	4968ea4ab6	memtable_list: unify can_flush and may_flush Now that we have a unit test proving that it's safe to flush an empty memtable list there is no need to distinguish between may_flush and can_flush. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `02ee341a03`)	2026-01-09 08:31:35 +02:00
Benny Halevy	d5dcfd9133	test: database_test: add test_flush_empty_table_waits_on_outstanding_flush Test that table::flush waits on outstanding flushes, even if the active memtable is empty Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `0342a24ee0`)	2026-01-09 08:31:33 +02:00
Benny Halevy	430059f64c	replica: table, storage_group, compaction_group: add needs_flush Table needs flush if not all its memtable lists are empty. To be used in the next patch for a unit test. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `5be6b80936`)	2026-01-09 08:21:13 +02:00
Benny Halevy	0e8b738dff	test: database_test: do_with_some_data_in_thread: accept void callback function Many test cases already assume `func` is being called in a seastar thread, and although the function they pass returns a (ready) future, it serves no purpose other than to conform to the interface. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `ec4069246d`)	2026-01-09 08:16:41 +02:00
Michał Jadwiszczak	c0a7b928ed	service/storage_service: update service levels cache after upgrade to v2 Service levels cache is empty after upgrade to consistent topology if no mutations are committed to `system.service_levels_v2` or a rolling restart is not done. To fix the bug, this commit adds service levels cache reloading after upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`. Fixes SCYLLADB-90 (cherry picked from commit `be16e42cb0`)	2026-01-08 22:45:59 +00:00
Michał Jadwiszczak	ff01aa3548	service/storage_service: check if service levels were already upgraded before doing migration to raft There is no need to call `service_level_controller::upgrade_to_v2()` on every topology state load, we only need to do it once. (cherry picked from commit `53d0a2b5dc`)	2026-01-08 22:45:59 +00:00
Anna Stuchlik	adf2d0efd4	doc: remove cassandra-stress from installation instructions The cassandra-stress tool is no longer part of the default package and cannot be run in the way described. This commit removes the instruction to run cassandra-stress. Fixes https://github.com/scylladb/scylladb/issues/24994 Closes scylladb/scylladb#27726 (cherry picked from commit `624869de86`) Closes scylladb/scylladb#27950	2026-01-08 18:01:04 +02:00
Benny Halevy	9f733ceaee	db: system_keyspace: get_group0_history: unfreeze_gently Prevent stall when the group0 history is too long using unfreeze_gently rather than the synchronous unfreeze() function Fixes #27872 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27873 (cherry picked from commit `f60033db63`) Closes scylladb/scylladb#27908	2026-01-08 18:00:13 +02:00
Geoff Montee	9cb3326ecc	Update update-topology-strategy-from-simple-to-network.rst: Multiple clarifications to page and sub-procedures Fixes #27077 Multiple points can be clarified relating to: * Names of each sub-procedure could be clearer * Requirements of each sub-procedure could be clearer * Clarify which keyspaces are relevant and how to check them * Fix typos in keyspace name Closes scylladb/scylladb#26855 (cherry picked from commit `a0734b8605`) Closes scylladb/scylladb#27154	2026-01-08 17:57:56 +02:00
Botond Dénes	8501820798	Merge '[Backport 2025.3] api: storage_service: tasks: unify sync and async compaction APIs' from Scylladb[bot] Currently, all APIs that start a compaction have two versions: synchronous and asynchronous. They share most of the implementation, but some checks and params have diverged. Unify the handlers of synchronous and asynchronous cleanup, major compaction, and upgrade_sstables. Fixes: https://github.com/scylladb/scylladb/issues/26715. Requires backports to all live versions - (cherry picked from commit `12dabdec66`) - (cherry picked from commit `044b001bb4`) - (cherry picked from commit `fdd623e6bc`) Parent PR: #26746 Closes scylladb/scylladb#26886 * github.com:scylladb/scylladb: api: storage_service: tasks: unify upgrade_sstable api: storage_service: tasks: force_keyspace_cleanup api: storage_service: tasks: unify force_keyspace_compaction	2026-01-08 17:56:18 +02:00
Botond Dénes	4b98e3577d	Merge '[Backport 2025.3] db: repair: do not update repair_time if batchlog replay failed' from Scylladb[bot] Currently, batchlog replay is considered successful even if all batches fail to be sent (they are replayed later). However, repair requires all batches to be sent successfully. Currently, if batchlog isn't cleared, the repair never learns and updates the repair_time. If GC mode is set to "repair", this means that the tombstones written before the repair_time (minus propagation_delay) can be GC'd while not all batches were replayed. Consider a scenario: - Table t has a row with (pk=1, v=0); - There is an entry in the batchlog that sets (pk=1, v=1) in table t; - The row with pk=1 is deleted from table t; - Table t is repaired: - batchlog replay fails; - repair_time is updated; - propagation_delay seconds pass and the tombstone of pk=1 is GC'd; - batchlog is replayed and (pk=1, v=1) inserted - data resurrection! Do not update repair_time if sending any batch fails. The data is still repaired. For tablet repair the repair runs, but at the end the exception is passed to topology coordinator. Thanks to that the repair_time isn't updated. The repair request isn't removed either, so the repair will need to rerun. Apart from that, a batch is removed from the batchlog if its version is invalid or unknown. The condition on which we consider a batch too fresh to replay is updated to consider propagation_delay. 
Fixes: https://github.com/scylladb/scylladb/issues/24415 Data resurrection fix; needs backport to all versions - (cherry picked from commit `502b03dbc6`) - (cherry picked from commit `904183734f`) - (cherry picked from commit `7f20b66eff`) - (cherry picked from commit `e1b2180092`) - (cherry picked from commit `d436233209`) - (cherry picked from commit `1935268a87`) - (cherry picked from commit `6fc43f27d0`) Parent PR: #26319 Closes scylladb/scylladb#26762 * github.com:scylladb/scylladb: repair: throw if flush failed in get_flush_time db: fix indentation test: add reproducer for data resurrection repair: fail tablet repair if any batch wasn't sent successfully db/batchlog_manager: fix making decision to skip batch replay db: repair: throw if replay fails db/batchlog_manager: delete batch with incorrect or unknown version db/batchlog_manager: coroutinize replay_all_failed_batches	2026-01-08 17:55:32 +02:00
Patryk Jędrzejczak	fad9381560	Merge '[Backport 2025.3] test/raft: use valid sentinel in liveness check to prevent digest errors' from Scylladb[bot] Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants. The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value. By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test. The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry. Fixes: scylladb/scylladb#27307 Backporting to active branches - this is a test-only fix (low risk) for a flaky test that exists in older branches (thus affects the CI of active branches). - (cherry picked from commit `3af5183633`) - (cherry picked from commit `4ba3e90f33`) Parent PR: #28010 Closes scylladb/scylladb#28037 * https://github.com/scylladb/scylladb: test/raft: use valid sentinel in liveness check to prevent digest errors test/raft: improve debugging in randomized_nemesis_test test/raft: improve reporting in the randomized_nemesis_test digest functions	2026-01-08 15:39:21 +01:00
Anna Stuchlik	04f3cff44b	doc: fix the syntax of internal links Some internal links had the wrong syntax: they were formatted as external links. As a result, they redirected the user to the outdated Open Source documentation. This commit fixes that bug. Fixes https://github.com/scylladb/scylladb/issues/25899 Closes scylladb/scylladb#27905 (cherry picked from commit `375479d96c`) Closes scylladb/scylladb#28001	2026-01-08 14:56:53 +02:00
Emil Maskovsky	e4a50b8f45	test/raft: use valid sentinel in liveness check to prevent digest errors Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants. The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value. By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test. The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry. Fixes: scylladb/scylladb#27307 (cherry picked from commit `4ba3e90f33`)	2026-01-08 11:17:13 +01:00
Emil Maskovsky	df361a1beb	test/raft: improve debugging in randomized_nemesis_test Move the post-condition check before the assertion to ensure it is always executed first. Before, the wrong value could be passed to the digest_remove assertion, making the pre-check trigger there instead of the post-check as expected. Also, add a check in the append_seq constructor to ensure that the digest value is valid when creating an append_seq object. (cherry picked from commit `3af5183633`)	2026-01-08 11:17:09 +01:00
Emil Maskovsky	693062faeb	test/raft: improve reporting in the randomized_nemesis_test digest functions The Boost ASSERTs in the digest functions of the randomized_nemesis_test were not working well inside the state machine digest functions, leading to unhelpful boost::execution_exception errors that terminated the apply fiber and didn't provide any helpful information. They are replaced by explicit checks with on_fatal_internal_error calls that provide more context about the failure. Also added validation of the digest value after appending or removing an element, which makes it possible to determine which operation produced the wrong value. This effectively reverts the changes done in https://github.com/scylladb/scylladb/pull/19282, but adds improved error reporting. Refs: scylladb/scylladb#27307 Refs: scylladb/scylladb#17030 (cherry picked from commit `d60b908a8e`)	2026-01-08 11:16:59 +01:00
Aleksandra Martyniuk	2699e401e6	test: rename duplicate tests There are two tests named test_repair_options_hosts_tablets in test/nodetool/test_cluster_repair.py and two named test_repair_keyspace in test/nodetool/test_repair.py. As a result, one of each pair is ignored. Rename the tests so that they are unique. Fixes: https://github.com/scylladb/scylladb/issues/27701. Closes scylladb/scylladb#27720 (cherry picked from commit `bbe64e0e2a`) Closes scylladb/scylladb#27847	2026-01-08 12:08:49 +02:00
Botond Dénes	ff10c6eea6	Merge 'Remove noexcept from storage_group and table functions to allow exception propagation' from Tomasz Grabiec Fixed a critical bug where `storage_group::for_each_compaction_group()` was incorrectly marked `noexcept`, causing `std::terminate` when actions threw exceptions (e.g., `utils::memory_limit_reached` during memory-constrained reader creation). Changes made: 1. Removed `noexcept` from `storage_group::for_each_compaction_group()` declaration and implementation 2. Removed `noexcept` from `storage_group::compaction_groups()` overloads (they call for_each_compaction_group) 3. Removed `noexcept` from `storage_group::live_disk_space_used()` and `memtable_count()` (they call compaction_groups()) 4. Kept `noexcept` on `storage_group::flush()` - it's a coroutine that automatically captures exceptions and returns them as exceptional futures 5. Removed `noexcept` from `table_load_stats()` functions in base class, table, and storage group managers Rationale: There's no reason to kill the server if these functions throw. For coroutines returning futures, `noexcept` is appropriate because Seastar automatically captures exceptions and returns them as exceptional futures. For other functions, proper exception handling allows the system to recover gracefully instead of terminating. Fixes #27475 Closes scylladb/scylladb#27476 * github.com:scylladb/scylladb: replica: Remove unnecessary noexcept replica: Remove noexcept from compaction_groups() functions replica: Remove noexcept from storage_group::for_each_compaction_group (cherry picked from commit `730eca5dac`) (cherry picked from commit `2153308cef`) Closes scylladb/scylladb#27946	2026-01-04 16:54:50 +02:00
Jenkins Promoter	79877a9677	Update pgo profiles - aarch64	2026-01-01 05:13:46 +02:00
Jenkins Promoter	e671ba5d83	Update pgo profiles - x86_64	2026-01-01 04:32:12 +02:00
Gleb Natapov	6a0387bed4	raft topology: Notify that a node was removed only once Raft topology goes over all nodes in a 'left' state and triggers 'remove node' notification in case id/ip mapping is available (meaning the node left recently), but the problem is that, since the mapping is not removed immediately, when multiple nodes are removed in succession a notification for the same node can be sent several times. Fix that by sending notification only if the node still exists in the peers table. It will be removed by the first notification and following notification will not be sent. Closes scylladb/scylladb#27743 (cherry picked from commit `4a5292e815`) Closes scylladb/scylladb#27912	2025-12-30 11:20:26 +01:00
Dario Mirovic	8d48cb0860	test: dtest: audit_test.py: fix audit error log detection `test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py` has an insert statement that is expected to fail. Dtest environment uses `FlakyRetryPolicy`, which has `max_retries = 5`. 1 initial fail and 5 retry fails means we expect 6 error audit logs. The test failed because `create keyspace ks` failed once, then succeeded on retry. It allowed the test to proceed properly, but the last part of the test that expects exactly 6 failed queries actually had 7. The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed queries, counting only the query expected to fail. If other queries fail with successful retry, it's fine. If other queries fail without successful retry, the test will fail, as it should in such situations. They are not related to this expected failed insert statement. Fixes #27322 Closes scylladb/scylladb#27378 (cherry picked from commit `f545ed37bc`) Closes scylladb/scylladb#27580	2025-12-29 18:13:22 +02:00
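The counting rule from the commit above (exactly 1 + `max_retries` failures, counting only the statement that is expected to fail) can be sketched as follows. The audit-entry shape (dicts with `statement` and `error` keys) and both helper names are hypothetical; the real test reads audit logs in its own format.

```python
MAX_RETRIES = 5  # FlakyRetryPolicy.max_retries, per the commit message


def expected_failures(max_retries: int) -> int:
    """One initial failure plus one failure per retry."""
    return 1 + max_retries


def count_failures(entries: list[dict], statement: str) -> int:
    """Count error entries for one statement, ignoring all other queries."""
    return sum(1 for e in entries if e["error"] and e["statement"] == statement)


# Hypothetical audit log: CREATE KEYSPACE fails once and then succeeds on
# retry, while the INSERT fails on the initial attempt and all 5 retries.
log = (
    [{"statement": "create keyspace ks", "error": True},
     {"statement": "create keyspace ks", "error": False}]
    + [{"statement": "insert", "error": True}] * 6
)
print(count_failures(log, "insert"), expected_failures(MAX_RETRIES))
```

Counting every error entry in this log would give 7, which is the mismatch the commit message describes; filtering by statement gives the expected 6.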
Gleb Natapov	4e48a046c3	topology coordinator: set session id for streaming at the correct time Commit `d3efb3ab6f` added a streaming session for rebuild, but it set the session at request submission time. The session should be set when the request starts executing, so this patch moves it to the correct place. Closes scylladb/scylladb#27757 (cherry picked from commit `04976875cc`) Closes scylladb/scylladb#27866	2025-12-28 13:33:30 +02:00
Ferenc Szili	56c65d08f6	test: fix flakiness caused by TRUNCATE retries The test test_truncate_during_topology_change tests TRUNCATE TABLE while bootstrapping a new node. With tablets enabled, TRUNCATE is a global topology operation which needs to serialize with bootstrap. When TRUNCATE TABLE is issued, it first checks if there is an already queued truncate for the same table. This can happen if a previous TRUNCATE operation has timed out, and the client retried. The newly issued truncate will only join the queued one if it is waiting to be processed, and will fail immediately if the TRUNCATE is already being processed. In this test, TRUNCATE will be retried after a timeout (1 minute) due to the default retry policy, and will be retried up to 3 times, while the bootstrap is delayed by 2 minutes. This means that the test can validate the result of a truncate which was started after bootstrap was completed. Because of the way truncate joins existing truncate operations, we can also have the following scenario: - TRUNCATE times out after one minute because the new node is being bootstrapped - the client retries the TRUNCATE command which also times out after 1m - the third attempt is received during TRUNCATE being processed which fails the test This patch changes the retry policy of the TRUNCATE operation to FallthroughRetryPolicy which guarantees that TRUNCATE will not be retried on timeout. It also increases the timeout of the TRUNCATE from 1 to 4 minutes. This way the test will actually validate the performance of the TRUNCATE operation which was issued during bootstrap, instead of the subsequent, retried TRUNCATEs which could have been issued after the bootstrap was complete. Fixes: #26347 Closes scylladb/scylladb#27245 (cherry picked from commit `d883ff2317`) Closes scylladb/scylladb#27506	2025-12-23 17:07:33 +02:00
Yaron Kaikov	8efdf6c3ec	auto-backport.py: modify instruction for making PR ready for review Update the comment sent when a PR has conflicts with clear instructions on how to make the PR ready for review Fixes: https://scylladb.atlassian.net/browse/RELENG-152 Closes scylladb/scylladb#27547 (cherry picked from commit `d3e199984e`) Closes scylladb/scylladb#27564	2025-12-22 15:16:32 +02:00
Anna Stuchlik	917b368b38	doc: remove the links to the Download Center This commit removes the remaining links to the Download Center on the website. We no longer use it for installation, and we don't want users to infer that something like that still exists. Fixes https://github.com/scylladb/scylladb/issues/27753 Closes scylladb/scylladb#27756 (cherry picked from commit `f65db4e8eb`) Closes scylladb/scylladb#27783	2025-12-21 19:25:42 +02:00
Emil Maskovsky	431642fc2b	test/raft: fix race condition in failure_detector_test The test had a sporadic failure due to a broken promise exception. The issue was in `test_pinger::ping()` which captured the promise by move into the subscription lambda, causing the promise to be destroyed when the lambda was destroyed during coroutine unwinding. Simplify `test_pinger::ping()` by replacing manual abort_source/promise logic with `seastar::sleep_abortable()`. This removes the risk of promise lifetime/race issues and makes the code simpler and more robust. Fixes: scylladb/scylladb#27136 Backport to active branches: This fixes a CI test issue, so it is beneficial to backport the fix. As this is a test-only fix, it is a low risk change. Closes scylladb/scylladb#27737 (cherry picked from commit `2a75b1374e`) Closes scylladb/scylladb#27782	2025-12-21 14:13:08 +02:00
Patryk Jędrzejczak	59fdf4b5f0	Merge '[Backport 2025.3] topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted' from Scylladb[bot] In several exception handlers, only `raft::request_aborted` was being caught and rethrown, while `seastar::abort_requested_exception` was falling through to the generic catch(...) block. This caused the exception to be incorrectly treated as a failure that triggers rollback, instead of being recognized as an abort signal. For example, during tablet draining, the error log showed: "tablets draining failed with seastar::abort_requested_exception (abort requested). Aborting the topology operation" This change adds `seastar::abort_requested_exception` handling alongside `raft::request_aborted` in all places where it was missing. When rethrown, these exceptions propagate up to the main `run()` loop where `handle_topology_coordinator_error()` recognizes them as normal abort signals and allows the coordinator to exit gracefully without triggering unnecessary rollback operations. Fixes: scylladb/scylladb#27255 No backport: The problem was only seen in tests and not reported in customer tickets, so it's enough to fix it in the main branch. - (cherry picked from commit `37e3dacf33`) Parent PR: #27314 Closes scylladb/scylladb#27662 * https://github.com/scylladb/scylladb: topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted topology_coordinator: consistently rethrow `raft::request_aborted` for direct/global commands	2025-12-20 19:30:02 +01:00
Emil Maskovsky	bfce02ce7e	topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted In several exception handlers, only raft::request_aborted was being caught and rethrown, while seastar::abort_requested_exception was falling through to the generic catch(...) block. This caused the exception to be incorrectly treated as a failure that triggers rollback, instead of being recognized as an abort signal. For example, during tablet draining, the error log showed: "tablets draining failed with seastar::abort_requested_exception (abort requested). Aborting the topology operation" This change adds seastar::abort_requested_exception handling alongside raft::request_aborted in all places where it was missing. When rethrown, these exceptions propagate up to the main run() loop where handle_topology_coordinator_error() recognizes them as normal abort signals and allows the coordinator to exit gracefully without triggering unnecessary rollback operations. Fixes: scylladb/scylladb#27255 (cherry picked from commit `37e3dacf33`)	2025-12-19 16:25:02 +01:00
Patryk Jędrzejczak	8e8e05907b	Merge '[Backport 2025.3] Make direct failure detector verb handler more efficient' from Scylladb[bot] We saw that in large clusters direct failure detector may cause large task queues to be accumulated. The series addresses this issue and also moves the code into the correct scheduling group. Fixes https://github.com/scylladb/scylladb/issues/27142 Backport to all versions where `60f1053087` was backported to since it should improve performance in large clusters. - (cherry picked from commit `82f80478b8`) - (cherry picked from commit `6a6bbbf1a6`) - (cherry picked from commit `86dde50c0d`) Parent PR: #27387 Closes scylladb/scylladb#27482 * https://github.com/scylladb/scylladb: direct_failure_detector: run direct failure detector in the gossiper scheduling group raft: drop invoke_on from the pinger verb handler direct_failure_detector: pass timeout to direct_fd_ping verb	2025-12-19 11:17:03 +01:00
Aleksandra Martyniuk	57932bda21	repair: throw if flush failed in get_flush_time Currently, _flush_time is stored as a std::optional<gc_clock::time_point> and std::nullopt indicates that the flush was needed but failed. It's confusing for the caller and does not work as expected since the _flush_time is initialized with value (not optional). Change _flush_time type to gc_clock::time_point. If a flush is needed but failed, get_flush_time() throws an exception. This was supposed to be a part of https://github.com/scylladb/scylladb/pull/26319 but it was mistakenly overwritten during rebases. Refs: https://github.com/scylladb/scylladb/issues/24415. Closes scylladb/scylladb#26794 (cherry picked from commit `e3e81a9a7a`)	2025-12-17 16:53:25 +01:00
Aleksandra Martyniuk	8309beda47	db: fix indentation (cherry picked from commit `6fc43f27d0`)	2025-12-17 16:39:58 +01:00
Aleksandra Martyniuk	c06b99f218	test: add reproducer for data resurrection Add a reproducer to check that the repair_time isn't updated if the batchlog replay fails. If repair_time was updated, tombstones could be GC'd before the batchlog is replayed. The replay could later cause the data resurrection. (cherry picked from commit `1935268a87`)	2025-12-17 16:39:58 +01:00
Aleksandra Martyniuk	f23d0f556e	repair: fail tablet repair if any batch wasn't sent successfully If any batch replay failed, we cannot update repair_time as we risk the data resurrection. If replay of any batch needs to be retried, run the whole repair but fail at the very end, so that the repair_time for it won't be updated. (cherry picked from commit `d436233209`)	2025-12-17 16:39:56 +01:00
Emil Maskovsky	838ef92141	topology_coordinator: consistently rethrow `raft::request_aborted` for direct/global commands Ensure all direct and global topology commands rethrow the `raft::request_aborted` exception when aborted, typically due to leadership changes. This makes abortion explicit to callers, enabling proper handling such as retries or workflow termination. This change completes the work started in PR scylladb/scylladb#23962, covering all remaining cases where the exception was not rethrown. Fixes: scylladb/scylladb#23589 (cherry picked from commit `943af1ef1c`)	2025-12-17 16:22:22 +01:00
Aleksandra Martyniuk	b71a7ae359	db/batchlog_manager: fix making decision to skip batch replay Currently, we skip batch replay if less than batch_log_timeout has passed since the moment the batch was written. The batch_log_timeout value can be configured. If it is large, the batch won't be replayed for a long time. If the tombstone is GC'd before the batch is replayed, then we risk data resurrection. To ensure safety we can skip only the batches that won't be GC'd. In this patch we skip replay of the batches for which: now() < written_at + min(timeout, propagation_delay) repair_time is set as the start of batchlog replay, so at the moment of the check we will have: repair_time <= now() So we know that: repair_time < written_at + propagation_delay With this condition we are sure that GC won't happen. (cherry picked from commit `e1b2180092`)	2025-12-16 15:38:32 +01:00
Aleksandra Martyniuk	af5a3281a2	db: repair: throw if replay fails Return a flag determining whether all the batches were sent successfully in batchlog_manager::replay_all_failed_batches (batches skipped due to being too fresh are not counted). Throw in repair_flush_hints_batchlog_handler if not all batches were replayed, to ensure that repair_time isn't updated. (cherry picked from commit `7f20b66eff`)	2025-12-16 15:38:32 +01:00
Aleksandra Martyniuk	1167abd6d4	db/batchlog_manager: delete batch with incorrect or unknown version batchlog_manager::replay_all_failed_batches skips batches that have unknown or incorrect version. Next round will process these batches again. Such batches will probably be skipped every time, so there is no point in keeping them. Even if at some point the version becomes correct, we should not replay the batch - it might be old and this may lead to data resurrection. (cherry picked from commit `904183734f`)	2025-12-16 15:38:32 +01:00
Aleksandra Martyniuk	6f372a6bb7	db/batchlog_manager: coroutinize replay_all_failed_batches (cherry picked from commit `502b03dbc6`)	2025-12-16 15:38:31 +01:00
Michael Litvak	412aa9a19f	view_builder: reduce log level for expected aborts during view creation When draining the view builder, we abort ongoing operations using the view builder's abort source, which may cause them to fail with abort_requested_exception or raft::request_aborted exceptions. Since these failures are expected during shutdown, reduce the log level in add_new_view from 'error' to 'debug' for these specific exceptions while keeping 'error' level for unexpected failures. Closes scylladb/scylladb#26297 (cherry picked from commit `6bc41926e2`) Closes scylladb/scylladb#27537	2025-12-15 10:26:28 +01:00
Jenkins Promoter	f4ad5435a5	Update pgo profiles - aarch64	2025-12-15 05:16:18 +02:00
Jenkins Promoter	7e8f7954bf	Update pgo profiles - x86_64	2025-12-15 04:30:48 +02:00
Benny Halevy	e9c31b82ec	utils: error_injection: wait_for_message: print injection_name and caller source_location on timeout When waiting for the condition variable times out we call on_internal_error, but unfortunately, the backtrace it generates is obfuscated by `coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`. To make the log more useful, print the error injection name and the caller's source_location in the timeout error message. Fixes #27531 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27532 (cherry picked from commit `5f13880a91`) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27583	2025-12-12 14:19:39 +01:00
Yaron Kaikov	f5a2fcab72	Add JIRA issue validation to backport PR fixes check Extend the Fixes validation pattern to also accept JIRA issue references (format: [A-Z]+-\d+) in addition to GitHub issue references. This allows backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'. Fixes: https://github.com/scylladb/scylladb/issues/27571 Closes scylladb/scylladb#27572 (cherry picked from commit `3dfa5ebd7f`) Closes scylladb/scylladb#27599	2025-12-12 09:35:49 +02:00
Jenkins Promoter	2fe49ce031	Update ScyllaDB version to: 2025.3.6	2025-12-10 10:46:07 +02:00
Anna Stuchlik	f9d19cab8a	replace the Driver pages with a link to the new Drivers pages This commit removes the now redundant driver pages from the ScyllaDB documentation. Instead, a link to the pages where we moved the driver information is added. Also, the links are updated across the ScyllaDB manual. Redirections are added for all the removed pages. Fixes https://github.com/scylladb/scylladb/issues/26871 Closes scylladb/scylladb#27277 (cherry picked from commit `c5580399a8`) Closes scylladb/scylladb#27440	2025-12-10 09:26:11 +01:00
Tomasz Grabiec	750e9da1e8	Merge '[Backport 2025.3] tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true' from Scylladb[bot] Greatly improves performance of plan making, because we don't consider candidates in other racks, most of which will fail to be selected due to replication constraints (no rack overload). Also (but minor) reduces the overhead of candidate evaluation, as we don't have to evaluate rack load. Enabled only for rf_rack_valid_keyspaces because such setups guarantee that we will not need (because we must not) move tablets across racks, and we don't need to execute the general algorithm for the whole DC. Tested with perf-load-balancing, which performs a single scale-out operation on a cluster which initially has 10 nodes 88 shards each, 2 racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new nodes (same shard count). Time to rebalance the cluster (plan making only, sum of all iterations, no streaming): Before: 16 min 25 s After: 0 min 25 s Before, plan making cost (single incremental iteration) alternated between fast (0.1 [s]) and slow (14.1 [s]): testlog - Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0) testlog - Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0) The slow run chose min and max nodes in different racks, hence the fast path failed to find any candidates and we switched to exhaustive search of candidates in other nodes. After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls. 
Fixes #26016 - (cherry picked from commit `c9f0a9d0eb`) - (cherry picked from commit `0dcaaa061e`) - (cherry picked from commit `2b03a69065`) Parent PR: #26017 Closes scylladb/scylladb#26218 * github.com:scylladb/scylladb: test: perf: perf-load-balancing: Add parallel-scaleout scenario test: perf: perf-load-balancing: Convert to tool_app_template tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true load_balancer: include dead nodes when calculating rack load	2025-12-09 23:54:20 +01:00
Gleb Natapov	c33c09336b	direct_failure_detector: run direct failure detector in the gossiper scheduling group When direct failure detector was introduced the idea was that it will run on the same connection raft group0 verbs are running, but in `60f1053087` raft verbs were moved to run on the gossiper connection while DIRECT_FD_PING was left where it was. This patch moves it to the gossiper connection as well and fixes the pinger code to run in the gossiper scheduling group. (cherry picked from commit `86dde50c0d`)	2025-12-09 17:07:12 +02:00
Gleb Natapov	37010db61a	raft: drop invoke_on from the pinger verb handler Currently raft direct pinger verb jumps to shard 0 to check if group0 is alive before replying. The verb runs relatively often, so it is not very efficient. The patch distributes group0 liveness information (as it changes) to all shards instead, so that the handler itself does not need to jump to shard 0. (cherry picked from commit `6a6bbbf1a6`)	2025-12-09 17:06:06 +02:00
Tomasz Grabiec	ebc07c360f	test: perf: perf-load-balancing: Add parallel-scaleout scenario Simulates rebalancing on a single scale-out involving simultaneous addition of multiple nodes per rack. Default parameters create a cluster with 2 racks, 70 tables, 256 tablets/table, 10 nodes, 88 shards/node. Adds 6 nodes in parallel (3 per rack). Current result on my laptop: testlog - Rebalance took 21.874 [s] after 82 iteration(s) (cherry picked from commit `2b03a69065`)	2025-12-09 14:04:19 +01:00
Tomasz Grabiec	b7db86611c	test: perf: perf-load-balancing: Convert to tool_app_template To support sub-commands for testing different scenarios. The current scenario is given the name "rolling-add-dec". (cherry picked from commit `0dcaaa061e`)	2025-12-09 14:04:19 +01:00
Tomasz Grabiec	37824ec021	tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true Greatly improves performance of plan making, because we don't consider candidates in other racks, most of which will fail to be selected due to replication constraints (no rack overload). Also (but minor) reduces the overhead of candidate evaluation, as we don't have to evaluate rack load. Enabled only for rf_rack_valid_keyspaces because such setups guarantee that we will not need (because we must not) move tablets across racks, and we don't need to execute the general algorithm for the whole DC. Tested with perf-load-balancing, which performs a single scale-out operation on a cluster which initially has 10 nodes 88 shards each, 2 racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new nodes (same shard count). Time to rebalance the cluster (plan making only, sum of all iterations, no streaming): Before: 16 min 25 s After: 0 min 25 s Before, plan making cost (single incremental iteration) alternated between fast (0.1 [s]) and slow (14.1 [s]): Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0) Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0) The slow run chose min and max nodes in different racks, hence the fast path failed to find any candidates and we switched to exhaustive search of candidates in other nodes. After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls. Fixes #26016 (cherry picked from commit `c9f0a9d0eb`)	2025-12-09 14:04:19 +01:00
Wojciech Mitros	ef250e58dd	load_balancer: include dead nodes when calculating rack load Load balancer aims to preserve a balance in rack loads when generating tablet migrations. However, this balance might get broken when dead nodes are present. Currently, these nodes aren't included in rack load calculations, even if they own tablet replicas. As a result, load balancer treats racks with dead nodes as racks with a lower load, so it generates migrations to these racks. This is incorrect, because a dead node might come back alive, which would result in having multiple tablet replicas on the same rack. It's also inefficient even if we know that the node won't come back - when it's being replaced or removed. In that case we know we are going to rebuild the lost tablet replicas so migrating tablets to this rack just doubles the work. Allowing such migrations to happen would also require adjustments in the materialized view pairing code because we'd temporarily allow having multiple tablet replicas on the same rack. So in this patch we include dead nodes when calculating rack loads in the load balancer. The dead nodes still aren't treated as potential migration sources or destinations. We also add a test which verifies that no migrations are performed by doing a node replace with a mv workload in parallel. Before the patch, we'd get pairing errors and after the patch, no pairing errors are detected. Fixes https://github.com/scylladb/scylladb/issues/24485 Closes scylladb/scylladb#26028	2025-12-09 14:04:19 +01:00
Gleb Natapov	28d96c6106	direct_failure_detector: pass timeout to direct_fd_ping verb Currently direct_fd_ping runs without a timeout, but the verb is not waited on forever: the wait is canceled after a timeout, and this timeout simply is not passed to the rpc. It may create a situation where the rpc callback runs on a destination but is no longer waited on. Change the code to pass the timeout to the rpc as well and return earlier from the rpc handler if the timeout is reached by the time the callback is called. This is backwards compatible since the timeout is passed as optional. (cherry picked from commit `82f80478b8`)	2025-12-07 14:57:10 +00:00
Tomasz Grabiec	4acf082686	Merge '[Backport 2025.3] address_map: Use more efficient and reliable replication method' from Scylladb[bot] Primary issue with the old method is that each update is a separate cross-shard call, and all later updates queue behind it. If one of the shards has high latency for such calls, the queue may accumulate and system will appear unresponsive for mapping changes on non-zero shards. This happened in the field when one of the shards was overloaded with sstables and compaction work, which caused frequent stalls which delayed polling for ~100ms. A queue of 3k address updates accumulated, because we update mapping on each change of gossip states. This made bootstrap impossible because nodes couldn't learn about the IP mapping for the bootstrapping node and streaming failed. To protect against that, use a more efficient method of replication which requires a single cross-shard call to replicate all prior updates. It is also more reliable, if replication fails transiently for some reason, we don't give up and fail all later updates. Fixes #26865 - (cherry picked from commit `ed8d127457`) - (cherry picked from commit `4a85ea8eb2`) - (cherry picked from commit `f83c4ffc68`) Parent PR: #26941 Closes scylladb/scylladb#27188 * github.com:scylladb/scylladb: address_map: Use barrier() to wait for replication address_map: Use more efficient and reliable replication method utils: Introduce helper for replicated data structures utils: add "fatal" version of utils::on_internal_error()	2025-12-05 13:23:19 +01:00
Avi Kivity	3f343d70e4	database: fix overflow when computing data distribution over shards We store the per-shard chunk count in a uint64_t vector global_offset, and then convert the counts to offsets with a prefix sum: ```c++ // [1, 2, 3, 0] --> [0, 1, 3, 6] std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus()); ``` However, std::exclusive_scan takes the accumulator type from the initial value, 0, which is an int, instead of from the range being iterated, which is of uint64_t. As a result, the prefix sum is computed as a 32-bit integer value. If it exceeds 0x8000'0000, it becomes negative. It is then extended to 64 bits and stored. The result is a huge 64-bit number. Later on we try to find an sstable with this chunk and fail, crashing on an assertion. An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57 The fix is simple: the initial value is passed as uint64_t instead of int. Fixes https://github.com/scylladb/scylladb/issues/27417 Closes scylladb/scylladb#27418 (cherry picked from commit `9696ee64d0`)	2025-12-04 20:18:13 +02:00
Tomasz Grabiec	fed0f95626	address_map: Use barrier() to wait for replication More efficient than 100 pings. There was one ping in test which was done "so this shard notices the clock advance". It's not necessary, since observing a completed SMP call implies that the local shard sees the clock advancement done within it. (cherry picked from commit `f83c4ffc68`)	2025-12-04 14:50:31 +01:00
Tomasz Grabiec	eba97f80e0	address_map: Use more efficient and reliable replication method Primary issue with the old method is that each update is a separate cross-shard call, and all later updates queue behind it. If one of the shards has high latency for such calls, the queue may accumulate and system will appear unresponsive for mapping changes on non-zero shards. This happened in the field when one of the shards was overloaded with sstables and compaction work, which caused frequent stalls which delayed polling for ~100ms. A queue of 3k address updates accumulated. This made bootstrap impossible, since nodes couldn't learn about the IP mapping for the bootstrapping node and streaming failed. To protect against that, use a more efficient method of replication which requires a single cross-shard call to replicate all prior updates. It is also more reliable, if replication fails transiently for some reason, we don't give up and fail all later updates. Fixes #26865 Fixes #26835 (cherry picked from commit `4a85ea8eb2`)	2025-12-04 14:50:31 +01:00
Tomasz Grabiec	777b54f072	utils: Introduce helper for replicated data structures Key goals: - efficient (batching updates) - reliable (no lost updates) Will be used in data structures maintained on one designated owning shard and replicated to other shards. (cherry picked from commit `ed8d127457`)	2025-12-04 14:50:31 +01:00
Nadav Har'El	6562e844f8	utils: add "fatal" version of utils::on_internal_error() utils::on_internal_error() is a wrapper for Seastar's on_internal_error() which does not require a logger parameter - because it always uses one logger ("on_internal_error"). Not needing a unique logger is especially important when using on_internal_error() in a header file, where we can't define a logger. Seastar also has another similar function, on_fatal_internal_error(), for which we forgot to implement a "utils" version (without a logger parameter). This patch fixes that oversight. In the next patch, we need to use on_fatal_internal_error() in a header file, so the "utils" version will be useful. We will need the fatal version because we will encounter an unexpected situation during server destruction, and if we let the regular on_internal_error() just throw an exception, we'll be left in an undefined state. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `33476c7b06`)	2025-12-04 14:50:31 +01:00
Łukasz Paszkowski	75f8d5a22c	topology_coordinator: Fix the indentation for the cleanup_target case (cherry picked from commit `6163fedd2e`)	2025-12-04 12:25:16 +01:00
Pavel Emelyanov	1d9e0c17a6	Update seastar submodule (SIGABRT on assertion) * seastar 4431d974f...f61814a48 (1): > util: make SEASTAR_ASSERT() failure generate SIGABRT Fixes #27127 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27403	2025-12-04 13:00:30 +03:00
Łukasz Paszkowski	2fa2c3d792	topology_coordinator: Add barrier to cleanup_target Consider the following scenario: 1. A table has RF=3 and writes use CL=QUORUM 2. One node is down 3. There is a pending tablet migration from the unavailable node that is reverted During the revert, there can be a time window where the pending replica being cleaned up still accepts writes. This leads to write failures, as only two nodes (out of four) are able to acknowledge writes. This patch fixes the issue by adding a barrier to the cleanup_target tablet transition state, ensuring that the coordinator switches back to the previous replica set before cleanup is triggered. Fixes https://github.com/scylladb/scylladb/issues/26512 (cherry picked from commit `67f1c6d36c`)	2025-12-04 03:57:10 +00:00
Łukasz Paszkowski	8120fbf5aa	test_node_failure_during_tablet_migration: Increase RF from 2 to 3 The patch prepares the test for additional write workload to be executed in parallel with node failures. With the original RF=2, QUORUM is also 2, which causes writes to fail during node outage. To address it, the third rack with a single node is added and the replication factor is increased to 3. (cherry picked from commit `669286b1d6`)	2025-12-04 03:57:10 +00:00
Ernest Zaslavsky	aa8495d465	s3_client: handle additional transient network errors Add handling for a broader set of transient network-related `std::errc` values in `aws_error::from_system_error`. Treat these conditions as retryable when the client re-creates the socket for each request. Fixes: https://github.com/scylladb/scylladb/issues/27349 Closes scylladb/scylladb#27350 (cherry picked from commit `605f71d074`) Closes scylladb/scylladb#27390	2025-12-03 12:25:15 +03:00
Calle Wilund	7eb51568e9	commitlog::read_log_file: Check for eof position on all data reads Fixes #24346 When reading, we check for each entry and each chunk, if advancing there will hit EOF of the segment. However, IFF the last chunk being read has the last entry _exactly_ matching the chunk size, and the chunk ending at _exactly_ segment size (preset size, typically 32Mb), we did not check the position, and instead complained about not being able to read. This has literally _never_ happened in actual commitlog (that was replayed at least), but has apparently happened more and more in hints replay. Fix is simple, just check the file position against size when advancing said position, i.e. when reading (skipping already does). v2: * Added unit test Closes scylladb/scylladb#27236 (cherry picked from commit `59c87025d1`) Closes scylladb/scylladb#27343	2025-12-03 12:25:02 +03:00
Pavel Emelyanov	232cbe2f69	Merge '[Backport 2025.3] tablet: scheduler: Do not emit conflicting migration in merge colocation' from Scylladb[bot] The tablet scheduler should not emit conflicting migrations for the same tablet. This was addressed initially in scylladb/scylladb#26038 but the check is missing in the merge colocation plan, so add it there as well. Without this check, the merge colocation plan could generate a conflicting migration for a tablet that is already scheduled for migration, as the test demonstrates. This can cause correctness problems, because if the load balancer generates two migrations for a single tablet, both will be written as mutations, and the resulting mutation could contain mixed cells from both migrations. Fixes scylladb/scylladb#27304 backport to existing releases - this is a bug that can affect correctness - (cherry picked from commit `97b7c03709`) Parent PR: #27312 Closes scylladb/scylladb#27330 * github.com:scylladb/scylladb: tablet: scheduler: Do not emit conflicting migration in merge colocation tablet: scheduler: Do not emit conflicting migrations in the plan	2025-12-03 12:24:48 +03:00
Aleksandra Martyniuk	8a3932e4d9	replica: database: change type of tables_metadata::_ks_cf_to_uuid If there are a lot of tables, a node reports an oversized allocation in _ks_cf_to_uuid of type flat_hash_map. Change the type to std::unordered_map to prevent oversized allocations. Fixes: https://github.com/scylladb/scylladb/issues/26787. Closes scylladb/scylladb#27165 (cherry picked from commit `19a7d8e248`) Closes scylladb/scylladb#27198	2025-12-03 12:24:29 +03:00
Ernest Zaslavsky	99a51cf695	streaming:: add more logging Start logging all missed streaming options like `scope`, `primary_replica` and `skip_reshape` flags Fixes: https://github.com/scylladb/scylladb/issues/27299 Closes scylladb/scylladb#27311 (cherry picked from commit `1d5f60baac`) Closes scylladb/scylladb#27341	2025-12-02 12:13:21 +01:00
Jenkins Promoter	a99b7020dd	Update pgo profiles - aarch64	2025-12-01 05:12:52 +02:00
Jenkins Promoter	8456b9520b	Update pgo profiles - x86_64	2025-12-01 04:35:55 +02:00
Michael Litvak	4ac402c5c5	tablet: scheduler: Do not emit conflicting migration in merge colocation The tablet scheduler should not emit conflicting migrations for the same tablet. This was addressed initially in scylladb/scylladb#26038 but the check is missing in the merge colocation plan, so add it there as well. Without this check, the merge colocation plan could generate a conflicting migration for a tablet that is already scheduled for migration, as the test demonstrates. This can cause correctness problems, because if the load balancer generates two migrations for a single tablet, both will be written as mutations, and the resulting mutation could contain mixed cells from both migrations. Fixes scylladb/scylladb#27304 Closes scylladb/scylladb#27312 (cherry picked from commit `97b7c03709`)	2025-11-30 10:13:42 +01:00
Tomasz Grabiec	951d3f50ea	tablet: scheduler: Do not emit conflicting migrations in the plan Plan-making is invoked independently for different DCs (and in the future, racks) and then plans are merged. It could be that the same tablets are selected for migration in different DCs. Only one migration will prevail and be committed to group0, so it's not a correctness problem. Next cycle will recognize that the tablet is in transition and will not be selected by plan-maker. But it makes plan-making less efficient. It may also surprise consumers of the plan, like we saw in #25912. So we should make plan-maker be aware of already scheduled transitions and not consider those tablets as candidates. Fixes #26038 Closes scylladb/scylladb#26048 (cherry picked from commit `981592bca5`)	2025-11-30 10:00:22 +01:00
Patryk Jędrzejczak	a07e0d46ae	Merge '[Backport 2025.3] locator/node: include _excluded in missing places' from Scylladb[bot] We currently ignore the `_excluded` field in `node::clone()` and the verbose formatter of `locator::node`. The first one is a bug that can have unpredictable consequences on the system. The second one can be a minor inconvenience during debugging. We fix both places in this PR. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72 This PR is a bugfix that should be backported to all supported branches. - (cherry picked from commit `4160ae94c1`) - (cherry picked from commit `287c9eea65`) Parent PR: #27265 Closes scylladb/scylladb#27290 * https://github.com/scylladb/scylladb: locator/node: include _excluded in verbose formatter locator/node: preserve _excluded in clone()	2025-11-27 12:29:19 +01:00
Avi Kivity	69871fe600	Merge '[Backport 2025.3] fix notification about expiring erm held for too long' from Scylladb[bot] Commit `6e4803a750` broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assignment operator to call the destructor like it should. Fixes https://github.com/scylladb/scylladb/issues/27141 - (cherry picked from commit `9f97c376f1`) - (cherry picked from commit `5dcdaa6f66`) Parent PR: #27140 Closes scylladb/scylladb#27275 * github.com:scylladb/scylladb: test: test that expired erm that held for too long triggers notification token_metadata: fix notification about expiring erm held for too long	2025-11-27 12:10:38 +02:00
Patryk Jędrzejczak	2307bf891d	locator/node: include _excluded in verbose formatter It can be helpful during debugging. (cherry picked from commit `287c9eea65`)	2025-11-26 23:04:48 +00:00
Patryk Jędrzejczak	f288273ef0	locator/node: preserve _excluded in clone() We currently ignore the `_excluded` field in `clone()`. Losing information about exclusion can have unpredictable consequences. One observed effect (that led to finding this issue) is that the `/storage_service/nodes/excluded` API endpoint sometimes misses excluded nodes. (cherry picked from commit `4160ae94c1`)	2025-11-26 23:04:48 +00:00
Gleb Natapov	aa75444438	test: test that expired erm that held for too long triggers notification (cherry picked from commit `5dcdaa6f66`)	2025-11-26 15:08:41 +00:00
Gleb Natapov	e2d59df166	token_metadata: fix notification about expiring erm held for too long Commit `6e4803a750` broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assignment operator to call the destructor. (cherry picked from commit `9f97c376f1`)	2025-11-26 15:08:41 +00:00
Ernest Zaslavsky	7e6b653e5c	streaming: fix loop break condition in tablet_sstable_streamer::stream Correct the loop termination logic that previously caused certain SSTables to be prematurely excluded, resulting in lost mutations. This change ensures all relevant SSTables are properly streamed and their mutations preserved. (cherry picked from commit `dedc8bdf71`) Closes scylladb/scylladb#27153 Fixes: #26979 Parent PR: #26980 Unfortunately, the pytest-based test cannot be backported because of changes made to the testing harness and scylla-tools	2025-11-25 11:59:01 +03:00
Avi Kivity	84b7e06268	tools: toolchain: prepare: replace 'reg' with 'skopeo' The prepare script uses 'reg' to verify we're not going to overwrite an existing image. The 'reg' command is not available in Fedora 43. Use 'skopeo' instead. Skopeo is part of the podman ecosystem, so it will hopefully live longer. Fixes #27178. Closes scylladb/scylladb#27179 (cherry picked from commit `d6ef5967ef`) Closes scylladb/scylladb#27199	2025-11-24 16:32:04 +02:00
Jenkins Promoter	812fc721cd	Update ScyllaDB version to: 2025.3.5	2025-11-24 15:50:44 +02:00
Raphael S. Carvalho	867cb1e7ac	replica: Fail timed-out single-key read on cleaned up tablet replica Consider the following: 1) single-key read starts, blocks on replica e.g. waiting for memory. 2) the same replica is migrated away 3) single-key read expires, coordinator abandons it, releases erm. 4) migration advances to cleanup stage, barrier doesn't wait on timed-out read 5) compaction group of the replica is deallocated on cleanup 6) that single-key read resumes, but doesn't find sstable set (post cleanup) 7) with abort-on-internal-error turned on, node crashes It's fine for abandoned (= timed out) reads to fail, since the coordinator is gone. For active reads (non timed out), the barrier will wait for them since their coordinator holds erm. This solution consists of failing reads whose underlying tablet replica has been cleaned up, by just converting the internal error to a plain exception. Fixes #26229. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#27078 (cherry picked from commit `74ecedfb5c`) Closes scylladb/scylladb#27155	2025-11-21 17:48:21 +03:00
Patryk Jędrzejczak	a9fc235aee	test: test_raft_recovery_stuck: ensure mutual visibility before using driver Not waiting for nodes to see each other as alive can cause the driver to fail the request sent in `wait_for_upgrade_state()`. scylladb/scylladb#19771 has already replaced concurrent restarts with `ManagerClient.rolling_restart()`, but it has missed this single place, probably because we do concurrent starts here. Fixes #27055 Closes scylladb/scylladb#27075 (cherry picked from commit `e35ba974ce`) Closes scylladb/scylladb#27109	2025-11-20 10:41:58 +02:00
Botond Dénes	78ecb8854a	Merge '[Backport 2025.3] Automatic cleanup improvements' from Scylladb[bot] This series allows an operator to reset the 'cleanup needed' flag if they have already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset the 'cleanup needed' flag in the end), but the new '--global' option allows running cleanup simultaneously on all nodes that need it. Fixes https://github.com/scylladb/scylladb/issues/26866 Backport to all supported versions, since the current automatic cleanup behaviour may create load unexpected by the operator during cluster resizing. - (cherry picked from commit `e872f9cb4e`) - (cherry picked from commit `0f0ab11311`) Parent PR: #26868 Closes scylladb/scylladb#27093 * github.com:scylladb/scylladb: cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster cleanup: Add RESTful API to allow reset cleanup needed flag	2025-11-20 10:41:04 +02:00
Botond Dénes	d2d9140029	Merge '[Backport 2025.3] encryption::kms_host: Add exponential backoff-retry for 503 errors' from Scylladb[bot] Refs #26822 Fixes #27062 AWS says to treat 503 errors, at least in the case of ec2 metadata query, as backoff-retry (generally, we do _not_ retry on provider level, but delegate this to higher levels). This patch adds special treatment for 503:s (service unavailable) for both ec2 meta and actual endpoint, doing exponential backoff. Note: we do _not_ retry forever. Not tested as such, since I don't get any errors when testing (doh!). Should try to set up a mock ec2 meta with injected errors maybe. - (cherry picked from commit `190e3666cb`) - (cherry picked from commit `d22e0acf0b`) Parent PR: #26934 Closes scylladb/scylladb#27063 * github.com:scylladb/scylladb: encryption::kms_host: Add exponential backoff-retry for 503 errors encryption::kms_host: Include http error code in kms_error	2025-11-20 10:40:20 +02:00
Botond Dénes	91e6efdde8	Merge '[Backport 2025.3] service/qos: Fall back to default scheduling group when using maintenance socket' from Scylladb[bot] The service level controller relies on `auth::service` to collect information about roles and the relation between them and the service levels (those attached to them). Unfortunately, the service level controller is initialized way earlier than `auth::service` and so we had to prevent potential invalid queries of user service levels (cf. `46193f5e79`). Unfortunately, that came at a price: it made the maintenance socket incompatible with the current implementation of the service level controller. The maintenance socket starts early, before the `auth::service` is fully initialized and registered, and is exposed almost immediately. If the user attempts to connect to Scylla within this time window, via the maintenance socket, one of the things that will happen is choosing the right service level for the connection. Since the `auth::service` is not registered, Scylla will fail an assertion and crash. A similar scenario occurs when using maintenance mode. The maintenance socket is how the user communicates with the database, and we're not prepared for that either. To avoid unnecessary crashes, we add new branches if the passed user is absent or if it corresponds to the anonymous role. Since the role corresponding to a connection via the maintenance socket is the anonymous role, that solves the problem. Some accesses to `auth::service` are not affected and we do not modify those. Fixes scylladb/scylladb#26816 Backport: yes. This is a fix of a regression. - (cherry picked from commit `c0f7622d12`) - (cherry picked from commit `222eab45f8`) - (cherry picked from commit `394207fd69`) - (cherry picked from commit `b357c8278f`) Parent PR: #26856 Closes scylladb/scylladb#27039 * github.com:scylladb/scylladb: test/cluster/test_maintenance_mode.py: Wait for initialization test: Disable maintenance mode correctly in test_maintenance_mode.py test: Fix keyspace in test_maintenance_mode.py service/qos: Do not crash Scylla if auth_integration absent	2025-11-20 10:39:48 +02:00
Botond Dénes	a067723f55	Merge '[Backport 2025.3] cdc: set column drop timestamp in the future' from Scylladb[bot] When dropping a column from a CDC log table, set the column drop timestamp several seconds into the future. If a value is written to a column concurrently with dropping that column, the value's timestamp may be after the column drop timestamp. If this value is also flushed to an SSTable, the SSTable would be corrupted, because it considers the column missing after the drop timestamp and doesn't allow values for it. While this issue affects general tables, it especially impacts CDC tables because this scenario can occur when writing to a table with CDC preimage enabled while dropping a column from the base table. This happens even if the base mutation doesn't write to the dropped column, because CDC log mutations can generate values for a column even if the base mutation doesn't. For general tables, this issue can be avoided by simply not writing to a column while dropping it. We fix this for the more problematic case of CDC log tables by setting the column drop timestamp several seconds into the future, ensuring that writes concurrent with column drops are much less likely to have timestamps greater than the column drop timestamp. Fixes https://github.com/scylladb/scylladb/issues/26340 the issue affects all previous releases, backport to improve stability - (cherry picked from commit `eefae4cc4e`) - (cherry picked from commit `48298e38ab`) - (cherry picked from commit `039323d889`) - (cherry picked from commit `e85051068d`) Parent PR: #26533 Closes scylladb/scylladb#27036 * github.com:scylladb/scylladb: test: test concurrent writes with column drop with cdc preimage cdc: check if recreating a column too soon cdc: set column drop timestamp in the future	2025-11-20 10:39:18 +02:00
Gleb Natapov	b53bf43844	cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster `97ab3f6622` changed "nodetool cleanup" (without arguments) to run cleanup on all dirty nodes in the cluster. This was somewhat unexpected, so this patch changes it back to run cleanup on the target node only (and reset "cleanup needed" flag afterwards) and it adds "nodetool cluster cleanup" command that runs the cleanup on all dirty nodes in the cluster. (cherry picked from commit `0f0ab11311`)	2025-11-19 10:53:42 +02:00
Gleb Natapov	3d60e5e825	cleanup: Add RESTful API to allow reset cleanup needed flag Cleaning up a node using per keyspace/table interface does not reset cleanup needed flag in the topology. The assumption was that running cleanup on already clean node does nothing and completes quickly. But due to https://github.com/scylladb/scylladb/issues/12215 (which is closed as WONTFIX) this is not the case. This patch provides the ability to reset the flag in the topology if operator cleaned up the node manually already. (cherry picked from commit `e872f9cb4e`)	2025-11-19 10:44:30 +02:00
Avi Kivity	e9e849c2bf	Merge '[Backport 2025.3] Synchronize tablet split and load-and-stream' from Scylladb[bot] Load-and-stream is broken when running concurrently with the finalization step of tablet split. Consider this: 1) split starts 2) split finalization executes barrier and succeeds 3) load-and-stream runs now, starts writing sstable (pre-split) 4) split finalization publishes changes to tablet metadata 5) load-and-stream finishes writing sstable 6) sstable cannot be loaded since it spans two tablets Two possible fixes (maybe both): 1) load-and-stream waits for topology to quiesce 2) perform split compaction on sstable that spans both sibling tablets This patch implements # 1. By waiting for topology to quiesce, we guarantee that load-and-stream only starts when there's no chance coordinator is handling some topology operation like split finalization. Fixes https://github.com/scylladb/scylladb/issues/26455. - (cherry picked from commit `3abc66da5a`) - (cherry picked from commit `4654cdc6fd`) Parent PR: #26456 Closes scylladb/scylladb#26648 * github.com:scylladb/scylladb: sstables_loader: Don't bypass synchronization with busy topology test: Add reproducer for l-a-s and split synchronization issue sstables_loader: Synchronize tablet split and load-and-stream	2025-11-17 17:14:36 +02:00
Calle Wilund	484e7aed2c	encryption::kms_host: Add exponential backoff-retry for 503 errors Refs #26822 AWS says to treat 503 errors, at least in the case of ec2 metadata query, as backoff-retry (generally, we do _not_ retry on provider level, but delegate this to higher levels). This patch adds special treatment for 503:s (service unavailable) for both ec2 meta and actual endpoint, doing exponential backoff. Note: we do _not_ retry forever. Not tested as such, since I don't get any errors when testing (doh!). Should try to set up a mock ec2 meta with injected errors maybe. v2: * Use utils::exponential_backoff_retry (cherry picked from commit `d22e0acf0b`)	2025-11-17 11:48:42 +00:00
Calle Wilund	77407fd704	encryption::kms_host: Include http error code in kms_error Keep track of actual HTTP failure. (cherry picked from commit `190e3666cb`)	2025-11-17 11:48:41 +00:00
Benny Halevy	898f193ef6	scylla-sstable: correctly dump sharding_metadata This patch fixes 2 issues at one go: First, Currently sstables::load clears the sharding metadata (via open_data()), and so scylla-sstable always prints an empty array for it. Second, printing token values would generate invalid json as they are currently printed as binary bytes, and they should be printed simply as numbers, as we do elsewhere, for example, for the first and last keys. Fixes #26982 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26991 (cherry picked from commit `f9ce98384a`) Closes scylladb/scylladb#27037	2025-11-16 15:43:35 +02:00
Michael Litvak	ba40c1eba9	test: test concurrent writes with column drop with cdc preimage add a test that writes to a table concurrently with dropping a column, where the table has CDC enabled with preimage. the test reproduces issue #26340 where this results in a malformed sstable. (cherry picked from commit `e85051068d`)	2025-11-16 10:03:07 +01:00
Michael Litvak	28eaa12af9	cdc: check if recreating a column too soon When we drop a column from a CDC log table, we set the column drop timestamp a few seconds into the future. This can cause unexpected problems if a user tries to recreate a CDC column too soon, before the drop timestamp has passed. To prevent this issue, when creating a CDC column we check its creation timestamp against the existing drop timestamp, if any, and fail with an informative error if the recreation attempt is too soon. (cherry picked from commit `039323d889`)	2025-11-16 10:03:07 +01:00
Michael Litvak	c37d224db6	cdc: set column drop timestamp in the future When dropping a column from a CDC log table, set the column drop timestamp several seconds into the future. If a value is written to a column concurrently with dropping that column, the value's timestamp may be after the column drop timestamp. If this value is also flushed to an SSTable, the SSTable would be corrupted, because it considers the column missing after the drop timestamp and doesn't allow values for it. While this issue affects general tables, it especially impacts CDC tables because this scenario can occur when writing to a table with CDC preimage enabled while dropping a column from the base table. This happens even if the base mutation doesn't write to the dropped column, because CDC log mutations can generate values for a column even if the base mutation doesn't. For general tables, this issue can be avoided by simply not writing to a column while dropping it. We fix this for the more problematic case of CDC log tables by setting the column drop timestamp several seconds into the future, ensuring that writes concurrent with column drops are much less likely to have timestamps greater than the column drop timestamp. Fixes scylladb/scylladb#26340 (cherry picked from commit `48298e38ab`)	2025-11-16 09:34:51 +01:00
Dawid Mędrek	7b32c277fe	test/cluster/test_maintenance_mode.py: Wait for initialization If we try to perform queries too early, before the call to `storage_service::start_maintenance_mode` has finished, we will fail with the following error: ``` ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index! ``` To avoid that, we should wait until initialization is complete. (cherry picked from commit `b357c8278f`)	2025-11-15 22:10:28 +00:00
Dawid Mędrek	6d6f870a5f	test: Disable maintenance mode correctly in test_maintenance_mode.py Although setting the value of `maintenance_mode` to the string `"false"` disables maintenance mode, the testing framework misinterprets the value and thinks that it's actually enabled. As a result, it might try to connect to Scylla via the maintenance socket, which we don't want. (cherry picked from commit `394207fd69`)	2025-11-15 22:10:28 +00:00
Dawid Mędrek	7112e0bfba	test: Fix keyspace in test_maintenance_mode.py The keyspace used in the test is not necessarily called `ks`. (cherry picked from commit `222eab45f8`)	2025-11-15 22:10:28 +00:00
Dawid Mędrek	c96bd48fd0	service/qos: Do not crash Scylla if auth_integration absent If the user connects to Scylla via the maintenance socket, it may happen that `auth_integration` has not been registered in the service level controller yet. One example is maintenance mode when that will never happen; another when the connection occurs before Scylla is fully initialized. To avoid unnecessary crashes, we add new branches if the passed user is absent or if it corresponds to the anonymous role. Since the role corresponding to a connection via the maintenance socket is the anonymous role, that solves the problem. In those cases, we completely circumvent any calls to `auth_integration` and handle them separately. The modified methods are: * `get_user_scheduling_group`, * `with_user_service_level`, * `describe_service_levels`. For the first two, the new behavior is in line with the previous implementation of those functions. The last behaves differently now, but since it's a soft error, crashing the node is not necessary anyway. We throw an exception instead, whose error message should give the user a hint of what might be wrong. The other uses of `auth_integration` within the service level controller are not problematic: * `find_effective_service_level`, * `find_cached_effective_service_level`. They take the name of a role as their argument. Since the anonymous role doesn't have a name, it's not possible to call them with it. Fixes scylladb/scylladb#26816 (cherry picked from commit `c0f7622d12`)	2025-11-15 22:10:28 +00:00
Jenkins Promoter	e6e3678e00	Update pgo profiles - aarch64	2025-11-15 05:11:05 +02:00
Jenkins Promoter	b5f03af147	Update pgo profiles - x86_64	2025-11-15 04:30:41 +02:00
Aleksandra Martyniuk	bec3b87032	api: storage_service: tasks: unify upgrade_sstable Currently, all apis that start a compaction have two versions: synchronous and asynchronous. They share most of the implementation, but some checks and params have diverged. Unify the handlers of /storage_service/keyspace_upgrade_sstables/{keyspace} and /tasks/compaction/keyspace_upgrade_sstables/{keyspace}. (cherry picked from commit `fdd623e6bc`)	2025-11-14 15:24:43 +01:00
Aleksandra Martyniuk	6a72fd4bb4	api: storage_service: tasks: force_keyspace_cleanup Currently, all apis that start a compaction have two versions: synchronous and asynchronous. They share most of the implementation, but some checks and params have diverged. Unify the handlers of /storage_service/keyspace_cleanup/{keyspace} and /tasks/compaction/keyspace_cleanup/{keyspace}. (cherry picked from commit `044b001bb4`)	2025-11-14 15:18:54 +01:00
Aleksandra Martyniuk	998d186709	api: storage_service: tasks: unify force_keyspace_compaction Currently, all apis that start a compaction have two versions: synchronous and asynchronous. They share most of the implementation, but some checks and params have diverged. Add consider_only_existing_data parameter to /tasks/compaction/keyspace_compaction/{keyspace}, to match the synchronous version of the api (/storage_service/keyspace_compaction/{keyspace}). Unify the handlers of both apis. (cherry picked from commit `12dabdec66`)	2025-11-14 15:18:48 +01:00
Raphael S. Carvalho	d63e9342ef	sstables_loader: Don't bypass synchronization with busy topology The patch `c543059f86` fixed the synchronization issue between tablet split and load-and-stream. The synchronization worked only with raft topology, and therefore was disabled with gossip. The check uses storage_service::raft_topology_change_enabled(), but the topology kind is only available/set on shard 0, so it caused the synchronization to be bypassed when load-and-stream runs on any shard other than 0. The reason the reproducer didn't catch it is that it was restricted to a single cpu. It will now run with multiple cpus and catch the problem observed. Fixes #22707 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#26730 (cherry picked from commit `7f34366b9d`) (cherry picked from commit `e8a74d0fb3`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-11-14 10:49:30 -03:00
Botond Dénes	420242646b	Merge '[Backport 2025.3] [schema] Speculative retry rounding fix' from Scylladb[bot] This patch series re-enables support for speculative retry values `0` and `100`. These values were supported some time ago, before [schema: fix issue 21825: add validation for PERCENTILE values in speculative_retry configuration. #21879 ](https://github.com/scylladb/scylladb/pull/21879). When that PR prevented using invalid `101PERCENTILE` values, valid `100PERCENTILE` and `0PERCENTILE` values were prevented too. Reproduction steps from [[Bug]: drop schema and all tables after apply speculative_retry = '99.99PERCENTILE' #26369](https://github.com/scylladb/scylladb/issues/26369) are unable to reproduce the issue after the fix. A test is added to make sure the inclusive border values `0` and `100` are supported. Documentation is updated to give more information to the users. It now states that these border values are inclusive, and also that the precision, with automatic rounding, is 1 decimal digit. Fixes #26369 This is a bug fix. If at any time a client tries to use value >= 99.5 and < 100, the raft error will happen. Backport is needed. The code which introduced inconsistency is introduced in 2025.2, so no backporting to 2025.1. - (cherry picked from commit `da2ac90bb6`) - (cherry picked from commit `5d1913a502`) - (cherry picked from commit `aba4c006ba`) - (cherry picked from commit `85f059c148`) - (cherry picked from commit `7ec9e23ee3`) Parent PR: #26909 Closes scylladb/scylladb#27014 * github.com:scylladb/scylladb: test: cqlpy: add test case for non-numeric PERCENTILE value schema: speculative_retry: update exception type for sstring ops docs: cql: ddl.rst: update speculative-retry-options test: cqlpy: add test for valid speculative_retry values schema: speculative_retry: allow 0 and 100 PERCENTILE values	2025-11-14 10:32:19 +02:00
Botond Dénes	8dd5cc3891	Merge '[Backport 2025.3] cql3: Fix std::bad_cast when deserializing vectors of collections' from Scylladb[bot] cql3: Fix std::bad_cast when deserializing vectors of collections This PR fixes a bug where attempting to INSERT a vector containing collections (e.g., `vector<set<int>,1>`) would fail. On the client side, this manifested as a `ServerError: std::bad_cast`. The cause was "type slicing" issue in the reserialize_value function. When retrieving the vector's element type, the result was being assigned by value (using auto) instead of by reference. This "sliced" the polymorphic abstract_type object, stripping it of its actual derived type information. As a result, a subsequent dynamic_cast would fail, even if the underlying type was correct. To prevent this entire class of bugs from happening again, I've made the polymorphic base class `abstract_type` explicitly uncopyable. Fixes: #26704 This fix needs to be backported as these releases are affected: `2025.4` , `2025.3`. - (cherry picked from commit `960fe3da60`) - (cherry picked from commit `77da4517d2`) Parent PR: #26740 Closes scylladb/scylladb#26997 * github.com:scylladb/scylladb: cql3: Make abstract_type explicitly noncopyable cql3: Fix std::bad_cast when deserializing vectors of collections	2025-11-14 10:30:55 +02:00
Yaron Kaikov	d4861c8068	install-dependencies.sh: update node_exporter to 1.10.2 Update node exporter to solve CVE-2025-22871 [regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz ] Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-5 Closes scylladb/scylladb#26916 (cherry picked from commit `c601371b57`) Closes scylladb/scylladb#26952	2025-11-14 10:28:56 +02:00
Dario Mirovic	143b903203	test: cqlpy: add test case for non-numeric PERCENTILE value Add test case for non-numeric PERCENTILE value, which raises an error different to the out-of-range invalid values. Regex in the test test_invalid_percentile_speculative_retry_values is expanded. Refs #26369 (cherry picked from commit `7ec9e23ee3`)	2025-11-13 19:44:43 +00:00
Dario Mirovic	6237b13959	schema: speculative_retry: update exception type for sstring ops Change speculative_retry::to_sstring and speculative_retry::from_sstring to throw exceptions::configuration_exception instead of std::invalid_argument. These errors can be triggered by CQL, so appropriate CQL exception should be used. Reference: https://github.com/scylladb/scylladb/issues/24748#issuecomment-3025213304 Refs #26369 (cherry picked from commit `85f059c148`)	2025-11-13 19:44:43 +00:00
Dario Mirovic	ee0f821ed2	docs: cql: ddl.rst: update speculative-retry-options Clarify how the value of `XPERCENTILE` is handled: - Values 0 and 100 are supported - The percentile value is rounded to the nearest 0.1 (1 decimal place) Refs #26369 (cherry picked from commit `aba4c006ba`)	2025-11-13 19:44:43 +00:00
Dario Mirovic	8b1547df9c	test: cqlpy: add test for valid speculative_retry values test_valid_percentile_speculative_retry_values is introduced to test that valid values for speculative_retry are properly accepted. Some of the values are moved from the test_invalid_percentile_speculative_retry_values test, because the previous commit added support for them. Refs #26369 (cherry picked from commit `5d1913a502`)	2025-11-13 19:44:43 +00:00
Dario Mirovic	f75c15e076	schema: speculative_retry: allow 0 and 100 PERCENTILE values This patch allows specifying 0 and 100 PERCENTILE values in speculative_retry. It was possible to specify these values before #21825. #21825 prevented specifying invalid values, like -1 and 101, but also prevented using 0 and 100. On top of that, speculative_retry::to_sstring function did rounding when formatting the string, which introduced inconsistency. Fixes #26369 (cherry picked from commit `da2ac90bb6`)	2025-11-13 19:44:43 +00:00
Karol Nowacki	b78c9ec5de	cql3: Make abstract_type explicitly noncopyable The polymorphic abstract_type class serves as an interface and should not be copied. To prevent accidental and unsafe copies, make it explicitly uncopyable. (cherry picked from commit `77da4517d2`)	2025-11-13 11:51:22 +01:00
Karol Nowacki	a8135cf239	cql3: Fix std::bad_cast when deserializing vectors of collections When deserializing a vector whose elements are collections (e.g., set, list), the operation raises a `std::bad_cast` exception. This was caused by type slicing due to an incorrect assignment of a polymorphic type by value instead of by reference. This resulted in a failed `dynamic_cast` even when the underlying type was correct. (cherry picked from commit `960fe3da60`)	2025-11-13 11:51:18 +01:00
Raphael S. Carvalho	bf359388b1	test: Add reproducer for l-a-s and split synchronization issue Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `4654cdc6fd`)	2025-11-12 22:16:41 -03:00
Raphael S. Carvalho	d3ce390e4d	sstables_loader: Synchronize tablet split and load-and-stream Load-and-stream is broken when running concurrently with the finalization step of tablet split. Consider this: 1) split starts 2) split finalization executes barrier and succeeds 3) load-and-stream runs now, starts writing sstable (pre-split) 4) split finalization publishes changes to tablet metadata 5) load-and-stream finishes writing sstable 6) sstable cannot be loaded since it spans two tablets Two possible fixes (maybe both): 1) load-and-stream waits for topology to quiesce 2) perform split compaction on sstable that spans both sibling tablets This patch implements #1. By waiting for topology to quiesce, we guarantee that load-and-stream only starts when there's no chance coordinator is handling some topology operation like split finalization. Fixes #26455. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `3abc66da5a`)	2025-11-12 22:16:38 -03:00
Yaron Kaikov	9a3d5da553	auto-backport: Add support for JIRA issue references - Added support for JIRA issue references in PR body and commit messages - Supports both short format (PKG-92) and full URL format - Maintains existing GitHub issue reference support - JIRA pattern matches https://scylladb.atlassian.net/browse/{PROJECT-ID} - Allows backporting for PRs that reference JIRA issues with 'fixes' keyword Fixes: https://github.com/scylladb/scylladb/issues/26955 Closes scylladb/scylladb#26954 (cherry picked from commit `3ade3d8f5b`) Closes scylladb/scylladb#26965	2025-11-12 22:37:09 +02:00
Botond Dénes	e6b721dfd6	service/storage_proxy: send batches with CL=EACH_QUORUM Batches that fail on the initial send are retried later, until they succeed. These retries happen with CL=ALL, regardless of what the original CL of the batch was. This is unnecessarily strict. We tried to follow Cassandra here, but Cassandra has a big caveat in their use of CL=ALL for batches. They accept saving just a hint for any/all of the endpoints, so a batch which was just logged in hints is good enough for them. We do not plan on replicating this usage of hints at this time, so as a middle ground, the CL is changed to EACH_QUORUM. Fixes: scylladb/scylladb#25432 Closes scylladb/scylladb#26304 (cherry picked from commit `d9c3772e20`) Closes scylladb/scylladb#26929	2025-11-11 10:38:11 +03:00
Ran Regev	bd526cb341	nodetool refresh primary-replica-only Fixes: #26440 1. Added a description of the primary-replica-only option 2. Fixed the code text to better reflect the constraints checked in the code itself, namely: that both primary-replica-only and scope may be applied only if load-and-stream is applied too, and that they are mutually exclusive. Note: when https://github.com/scylladb/scylladb/issues/26584 is implemented (with #26609) there will be a need to align the docs as well - namely, primary-replica-only and scope will no longer be mutually exclusive Signed-off-by: Ran Regev <ran.regev@scylladb.com> Closes scylladb/scylladb#26480 (cherry picked from commit `aaf53e9c42`) Closes scylladb/scylladb#26905	2025-11-11 10:37:58 +03:00
Piotr Dulikowski	aaeb937359	Merge '[Backport 2025.3] transport: call update_scheduling_group for non-auth connections' from Andrzej Jackowski This is a backport of the fix for https://github.com/scylladb/scylladb/issues/26040 and the related test (https://github.com/scylladb/scylladb/pull/26589) to 2025.3. Before this change, unauthorized connections stayed in the main scheduling group. This is not ideal; in such a case, sl:default should be used instead, for consistent behavior with the scenario where a user is authenticated but has no service level assigned. This commit adds a call to update_scheduling_group at the end of connection creation for an unauthenticated user, to make sure the service level is switched to sl:default. Fixes: https://github.com/scylladb/scylladb/issues/26040 Fixes: https://github.com/scylladb/scylladb/issues/26581 (cherry picked from commit `278019c328`) (cherry picked from commit `8642629e8e`) No backport, as it's already a backport (but similar PRs will be created for 2025.4) Closes scylladb/scylladb#26814 * github.com:scylladb/scylladb: test: add test_anonymous_user to test_raft_service_levels transport: call update_scheduling_group for non-auth connections	2025-11-09 00:03:57 +01:00
Jenkins Promoter	508d06e264	Update ScyllaDB version to: 2025.3.4	2025-11-04 12:06:50 +02:00
Jenkins Promoter	a29329d418	Update pgo profiles - aarch64	2025-11-01 05:15:33 +02:00
Jenkins Promoter	2cb0354170	Update pgo profiles - x86_64	2025-11-01 04:55:45 +02:00
Andrzej Jackowski	8b15a6ee50	test: add test_anonymous_user to test_raft_service_levels The primary goal of this test is to reproduce scylladb/scylladb#26040 so the fix (`278019c328`) can be backported to older branches. Scenario: connect via CQL as an anonymous user and verify that the `sl:default` scheduling group is used. Before the fix for #26040 `main` scheduling group was incorrectly used instead of `sl:default`. Control connections may legitimately use `sl:driver`, so the test accepts those occurrences while still asserting that regular anonymous queries use `sl:default`. This adds explicit coverage on master. After scylladb#24411 was implemented, some other tests started to fail when scylladb#26040 was unfixed. However, none of the tests asserted this exact behavior. Refs: scylladb/scylladb#26040 Refs: scylladb/scylladb#26581 Closes scylladb/scylladb#26589 (cherry picked from commit `8642629e8e`)	2025-10-30 18:39:44 +01:00
Andrzej Jackowski	17f724f221	transport: call update_scheduling_group for non-auth connections Before this change, unauthorized connections stayed in the `main` scheduling group. This is not ideal; in such a case, `sl:default` should be used instead, for consistent behavior with the scenario where a user is authenticated but has no service level assigned. This commit adds a call to `update_scheduling_group` at the end of connection creation for an unauthenticated user, to make sure the service level is switched to `sl:default`. Fixes: scylladb/scylladb#26040 (cherry picked from commit `278019c328`)	2025-10-30 18:38:43 +01:00
Pavel Emelyanov	5eb6da551f	Merge '[Backport 2025.3] db/config: Add SSTable compression options for user tables' from Scylladb[bot] ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well. This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality. Fixes #25195. - (cherry picked from commit `1106157756`) - (cherry picked from commit `ea41f652c4`) - (cherry picked from commit `a7e46974d4`) - (cherry picked from commit `e1d9c83406`) - (cherry picked from commit `8d5bd212ca`) - (cherry picked from commit `6ba0fa20ee`) - (cherry picked from commit `8410532fa0`) Parent PR: #26003 Closes scylladb/scylladb#26301 * github.com:scylladb/scylladb: test/cluster: Add tests for invalid SSTable compression options test/boost: Add tests for SSTable compression config options main: Validate SSTable compression options from config db/config: Add SSTable compression options for user tables db/config: Prepare compression_parameters for config system compressor: Validate presence of sstable_compression in parameters compressor: Add missing space in exception message	2025-10-30 10:31:16 +03:00
Pavel Emelyanov	0e6381f14d	lister: Fix race between readdir and stat Sometimes file::list_directory() returns entries without the type set. In that case the lister calls file_type() on the entry name to get it. If the call returns a disengaged type, the code assumes that some error occurred and resolves into an exception. That's not correct. The file_type() method returns a disengaged type only if the file being inspected is missing (i.e. on ENOENT errno). But this can validly happen if a file is removed between readdir and stat. In that case it's not "some error happened"; the entry should just be skipped. If some error did happen, file_type() would resolve into an exceptional future on its own. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26595 (cherry picked from commit `d9bfbeda9a`) Closes scylladb/scylladb#26764	2025-10-29 11:34:47 +02:00
Anna Stuchlik	e154b18786	doc: add --list-active-releases to Web Installer Fixes https://github.com/scylladb/scylladb/issues/26688 V2 of https://github.com/scylladb/scylladb/pull/26687 Closes scylladb/scylladb#26689 (cherry picked from commit `bd5b966208`) Closes scylladb/scylladb#26760	2025-10-29 11:33:55 +02:00
Patryk Jędrzejczak	fd1e1d506d	test: test_raft_recovery_stuck: reconnect driver after rolling restarts It turns out that #21477 wasn't sufficient to fix the issue. The driver may still decide to reconnect the connection after `rolling_restart` returns. One possible explanation is that the driver sometimes handles the DOWN notification after all nodes consider each other UP. Reconnecting the driver after restarting nodes seems to be a reliable workaround that many tests use. We also use it here. Fixes #19959 Closes scylladb/scylladb#26638 (cherry picked from commit `5321720853`) Closes scylladb/scylladb#26758	2025-10-29 11:33:06 +02:00
Anna Stuchlik	0905de5668	doc: add support for Debian 12 Fixes https://github.com/scylladb/scylladb/issues/26640 Closes scylladb/scylladb#26668 (cherry picked from commit `9c0ff7c46b`) Closes scylladb/scylladb#26679	2025-10-29 11:32:21 +02:00
Pavel Emelyanov	dbf0ec460d	Update seastar submodule * seastar 26badcb14...4431d974f (1): > Merge '[Backport 2025.3] all commits required for enabling i7i support' from Robert Bindar split random io buffer size in 2 options Fix hang in io_queue for big write ioproperties numbers Fix incorrect defaults for io queue iops/bandwidth iotune: fix very long warm up duration on systems with high cpu count Add iotune --get-best-iops-with-buffer-sizes option Add sequential buffer size options to IOTune iotune: Ignore measurements during warmup period iotune: Fix warmup calculation bug and botched rebase Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26725	2025-10-28 13:10:12 +03:00
Robert Bindar	5e2dd4ecb3	Make scylla_io_setup detect request size for best write IOPS We noticed during work on scylladb/seastar#2802 that on i7i family (later proved that it's valid for i4i family as well), the disks are reporting the physical sector sizes incorrectly as 512bytes, whilst we proved we can render much better write IOPS with 4096bytes. This is not the case on AWS i3en family where the reported 512bytes physical sector size is also the size we can achieve the best write IOPS. This patch works around this issue by changing `scylla_io_setup` to parse the instance type out of `/sys/devices/virtual/dmi/id/product_name` and run iotune with the correct request size based on the instance type. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#25315 (cherry picked from commit `2c74a6981b`) Closes scylladb/scylladb#26714	2025-10-27 16:53:44 +03:00
Patryk Jędrzejczak	bb99ae205c	Merge '[Backport 2025.3] raft topology: fix group0 tombstone GC in the Raft-based recovery procedure' from Scylladb[bot] Group0 tombstone GC considers only the current group 0 members while computing the group 0 tombstone GC time. It's not enough because in the Raft-based recovery procedure, there can be nodes that haven't joined the current group 0 yet, but they have belonged to a different group 0 and thus have a non-empty group 0 state ID. The current code can cause a data resurrection in group 0 tables. We fix this issue in this PR and add a regression test. This issue was uncovered by `test_raft_recovery_entry_loss`, which became flaky recently. We skipped this test for now. We will unskip it in a following PR because it's skipped only on master, while we want to backport this PR. Fixes #26534 This PR contains an important bugfix, so we should backport it to all branches with the Raft-based recovery procedure (2025.2 and newer). - (cherry picked from commit `1d09b9c8d0`) - (cherry picked from commit `6b2e003994`) - (cherry picked from commit `c57f097630`) Parent PR: #26612 Closes scylladb/scylladb#26680 * https://github.com/scylladb/scylladb: test: test group0 tombstone GC in the Raft-based recovery procedure group0_state_id_handler: remove unused group0_server_accessor group0_state_id_handler: consider state IDs of all non-ignored topology members	2025-10-27 10:20:16 +01:00
Patryk Jędrzejczak	ab843eb034	test: test group0 tombstone GC in the Raft-based recovery procedure We add a regression test for the bug fixed in the previous commits. (cherry picked from commit `c57f097630`)	2025-10-24 11:54:37 +02:00
Pavel Emelyanov	1ee781230b	Merge '[Backport 2025.3] s3_client: handle failures which require http::request updating' from Scylladb[bot] Apply two main changes to the s3_client error handling 1. Add a loop to s3_client's `make_request` for the case when the retry strategy will not help since the request itself has to be updated. For example, authentication token expiration or timestamp on the request header 2. Refine the way we handle exceptions in the `chunked_download_source` background fiber: now we carry the original `exception_ptr` and also wrap EVERY exception in `filler_exception` to prevent the retry strategy from trying to retry the request altogether Fixes: https://github.com/scylladb/scylladb/issues/26483 Should be ported back to 2025.3 and 2025.4 to prevent deadlocks and failures in these versions - (cherry picked from commit `55fb2223b6`) - (cherry picked from commit `db1ca8d011`) - (cherry picked from commit `185d5cd0c6`) - (cherry picked from commit `116823a6bc`) - (cherry picked from commit `43acc0d9b9`) - (cherry picked from commit `58a1cff3db`) - (cherry picked from commit `1d34657b14`) - (cherry picked from commit `4497325cd6`) - (cherry picked from commit `fdd0d66f6e`) Parent PR: #26527 This backport diverges from the original PR patch, as the 2025.3 release lacks the required Seastar changes. Namely, this version of Seastar has no overload of make_request that accepts the request argument by const&. 
Thus here it's handled by removing constness from request arguments when calling http's make_request Closes scylladb/scylladb#26649 * https://github.com/scylladb/scylladb: s3_client: tune logging level s3_client: add logging s3_client: improve exception handling for chunked downloads s3_client: fix indentation s3_client: add max for client level retries s3_client: remove `s3_retry_strategy` s3_client: support high-level request retries s3_client: just reformat `make_request` s3_client: unify `make_request` implementation	2025-10-23 10:22:22 +03:00
Andrei Chekun	14583c2921	test.py: rewrite the wait_for_first_completed Rewrite wait_for_first_completed to return only the first completed task, with a guarantee that all cancelled and finished tasks are awaited (so they disappear). Use wait_for_first_completed to avoid falsely passing tests in the future and issues like #26148. Use gather_safely to await tasks and remove the warning that a coroutine was not awaited. Closes scylladb/scylladb#26435 (cherry picked from commit `24d17c3ce5`) Closes scylladb/scylladb#26662	2025-10-22 21:31:47 +02:00
Patryk Jędrzejczak	ceec9b5508	group0_state_id_handler: remove unused group0_server_accessor It became unused in the previous commit. (cherry picked from commit `6b2e003994`)	2025-10-22 17:12:57 +00:00
Patryk Jędrzejczak	9c809db181	group0_state_id_handler: consider state IDs of all non-ignored topology members It's not enough to consider only the current group 0 members. In the Raft-based recovery procedure, there can be nodes that haven't joined the current group 0 yet, but they have belonged to a different group 0 and thus have a non-empty group 0 state ID. We fix this issue in this commit by considering topology members instead. We don't consider ignored nodes as an optimization. When some nodes are dead, the group 0 state ID handler won't have to wait until all these nodes leave the cluster. It will only have to wait until all these nodes are ignored, which happens at the beginning of the first removenode/replace. As a result, tombstones of group 0 tables will be purged much sooner. We don't rename the `group0_members` variable to keep the change minimal. There seems to be no precise and succinct name for the used set of nodes anyway. We use `std::ranges::join_view` in one place because: - `std::ranges::concat` will become available in C++26, - `boost::range::join` is not a good option, as there is an ongoing effort to minimize external dependencies in Scylla. (cherry picked from commit `1d09b9c8d0`)	2025-10-22 17:12:57 +00:00
Ernest Zaslavsky	c39c560bc3	s3_client: tune logging level Change all logging related to errors in `chunked_download_source` background download fiber to `info` to make it visible right away in logs. (cherry picked from commit `fdd0d66f6e`)	2025-10-22 15:24:06 +03:00
Ernest Zaslavsky	fa3f309877	s3_client: add logging Add logging for the case when we encounter expired credentials. This shouldn't happen, but just in case. (cherry picked from commit `4497325cd6`)	2025-10-22 15:24:06 +03:00
Ernest Zaslavsky	aca20f5ca5	s3_client: improve exception handling for chunked downloads Refactor the wrapping exception used in `chunked_download_source` to prevent the retry strategy from reattempting failed requests. The new implementation preserves the original `exception_ptr`, making the root cause clearer and easier to diagnose. (cherry picked from commit `1d34657b14`)	2025-10-22 15:24:06 +03:00
Ernest Zaslavsky	898f0ebe5e	s3_client: fix indentation Reformat `client::make_request` to fix the indentation of `if` block (cherry picked from commit `58a1cff3db`)	2025-10-22 15:24:06 +03:00
Ernest Zaslavsky	c89bed0a85	s3_client: add max for client level retries To prevent the client from retrying indefinitely on time skew and authentication errors, add `max_attempts` to `client::make_request` (cherry picked from commit `43acc0d9b9`)	2025-10-22 15:24:05 +03:00
Ernest Zaslavsky	779a45e2c9	s3_client: remove `s3_retry_strategy` It never worked as intended, so the credentials handling is moving to the same place where we handle time skew, since we have to reauthenticate the request (cherry picked from commit `116823a6bc`)	2025-10-22 15:24:05 +03:00
Ernest Zaslavsky	85102711ba	s3_client: support high-level request retries Add an option to retry S3 requests at the highest level, including reinitializing headers and reauthenticating. This addresses cases where retrying the same request fails, such as when the S3 server rejects a timestamp older than 15 minutes. (cherry picked from commit `185d5cd0c6`)	2025-10-22 15:24:05 +03:00
Asias He	ff94e2d96b	repair: Fix uuid and nodes_down order in the log Fixes #26536 Closes scylladb/scylladb#26547 (cherry picked from commit `33bc1669c4`) Closes scylladb/scylladb#26629	2025-10-22 11:30:44 +03:00
Ernest Zaslavsky	c1d53eee92	s3_client: just reformat `make_request` Just reformat previously changed methods to improve readability (cherry picked from commit `db1ca8d011`)	2025-10-21 12:26:15 +00:00
Ernest Zaslavsky	268e6720a8	s3_client: unify `make_request` implementation Refactor `make_request` to use a single core implementation that handles authentication and issues the HTTP request. All overloads now delegate to this unified method. (cherry picked from commit `55fb2223b6`)	2025-10-21 12:26:15 +00:00
Botond Dénes	6dde7a9b84	Merge '[Backport 2025.3] raft topology: disable schema pulls in the Raft-based recovery procedure' from Scylladb[bot] Schema pulls should always be disabled when group 0 is used. However, `migration_manager::disable_schema_pulls()` is never called during a restart with `recovery_leader` set in the Raft-based recovery procedure, which causes schema pulls to be re-enabled on all live nodes (excluding the nodes replacing the dead nodes). Moreover, schema pulls remain enabled on each node until the node is restarted, which could be a very long time. We fix this issue and add a regression test in this PR. Fixes #26569 This is an important bug fix, so it should be backported to all branches with the Raft-based recovery procedure (2025.2 and newer branches). - (cherry picked from commit `ec3a35303d`) - (cherry picked from commit `da8748e2b1`) - (cherry picked from commit `71de01cd41`) Parent PR: #26572 Closes scylladb/scylladb#26597 * github.com:scylladb/scylladb: test: test_raft_recovery_entry_loss: fix the typo in the test case name test: verify that schema pulls are disabled in the Raft-based recovery procedure raft topology: disable schema pulls in the Raft-based recovery procedure	2025-10-20 10:41:40 +03:00
Nikos Dragazis	3143b0eebb	test/cluster: Add tests for invalid SSTable compression options Complementary to the previous patch. It triggers semantic validation checks in `compression_parameters::validate()` and expects the server to exit. The tests examine both command line and YAML options. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `8410532fa0`)	2025-10-20 09:28:13 +03:00
Nikos Dragazis	5cfbddfa43	test/boost: Add tests for SSTable compression config options Since patch `03461d6a54`, all boost unit tests depending on `cql_test_env` are compiled into a single executable (`combined_tests`). Add the new test in there. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `6ba0fa20ee`)	2025-10-20 09:28:13 +03:00
Nikos Dragazis	776e1ff055	main: Validate SSTable compression options from config `compression_parameters` provides two levels of validation: * syntactic checks - implemented in the constructor * semantic checks - implemented by `compression_parameters::validate()` The former are applied implicitly when parsing the options from the command line or from scylla.yaml. The latter are currently not applied, but they should be. For lack of a better place, apply them in main, right after joining the cluster, to make sure that the cluster features have been negotiated. The feature needed here is `SSTABLE_COMPRESSION_DICTS`. Validation will fail if the feature is disabled and a dictionary compression algorithm has been selected. Also, mark `validate()` as const so that it can be called from a config object. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `8d5bd212ca`)	2025-10-20 09:28:13 +03:00
Nikos Dragazis	dabe323a43	db/config: Add SSTable compression options for user tables ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size (refer to the default constructor for `compression_parameters`). The same default applies to system tables as well. Add a new configuration option to allow customizing the default for user tables. Use the previously hardcoded default as the new option's default value. Note that the option has no effect on ALTER TABLE statements. An altered table either inherits explicit compression options from the CQL statement, or maintains its existing options. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `e1d9c83406`)	2025-10-20 09:27:51 +03:00
Nikos Dragazis	ae1131af97	db/config: Prepare compression_parameters for config system SSTable compression is currently configurable only per table, via the `compression` property in CREATE/ALTER TABLE statements. This is represented internally via the `compression_parameters` class. We plan to offer the same options via the configuration as well, to make the default compression method for user tables configurable. This patch prepares the ground by making the `compression_parameters` usable as a `config_file::named_value`, namely: * Define an extraction operator (required by `boost::program_options` for parsing the options from command line). * Define a formatter (required by `named_value::operator()`). * Define a template specialization for `config_type_for` (required by `named_value` constructor). * Define a yaml converter (required for parsing the options from scylla.yaml). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `a7e46974d4`)	2025-10-20 09:23:46 +03:00
Patryk Jędrzejczak	920a633a03	test: test_raft_recovery_entry_loss: fix the typo in the test case name (cherry picked from commit `71de01cd41`)	2025-10-17 10:26:59 +00:00
Patryk Jędrzejczak	5a2916dd8c	test: verify that schema pulls are disabled in the Raft-based recovery procedure We do this at the end of `test_raft_recovery_entry_loss`. It's not worth adding a separate regression test, as tests of the recovery procedure are complicated and have a long running time. Also, we choose `test_raft_recovery_entry_loss` out of all tests of the recovery procedure because it does some schema changes. (cherry picked from commit `da8748e2b1`)	2025-10-17 10:26:59 +00:00
Patryk Jędrzejczak	85a67a0f9e	raft topology: disable schema pulls in the Raft-based recovery procedure Schema pulls should always be disabled when group 0 is used. However, `migration_manager::disable_schema_pulls()` is never called during a restart with `recovery_leader` set in the Raft-based recovery procedure, which causes schema pulls to be re-enabled on all live nodes (excluding the nodes replacing the dead nodes). Moreover, schema pulls remain enabled on each node until the node is restarted, which could be a very long time. The old gossip-based recovery procedure doesn't have this problem because we disable schema pulls after completing the upgrade-to-group0 procedure, which is a part of the old recovery procedure. Fixes #26569 (cherry picked from commit `ec3a35303d`)	2025-10-17 10:26:59 +00:00
Jenkins Promoter	b34f11e52b	Update ScyllaDB version to: 2025.3.3	2025-10-15 19:22:53 +03:00
Ernest Zaslavsky	e9bdd13d1b	s3_client: track memory starvation in background filling fiber Introduce a counter metric to monitor instances where the background filling fiber is blocked due to insufficient memory in the S3 client. Closes scylladb/scylladb#26466 (cherry picked from commit `413739824f`) Closes scylladb/scylladb#26553	2025-10-15 12:05:20 +02:00
Michał Chojnowski	75f671ff18	test/boost/sstable_compressor_factory_test: fix thread-unsafe usage of Boost.Test It turns out that Boost assertions are thread-unsafe, (and can't be used from multiple threads concurrently). This causes the test to fail with cryptic log corruptions sometimes. Fix that by switching to thread-safe checks. Fixes scylladb/scylladb#24982 Closes scylladb/scylladb#26472 (cherry picked from commit `7c6e84e2ec`) Closes scylladb/scylladb#26552	2025-10-15 12:13:54 +03:00
Jenkins Promoter	8714e119c9	Update pgo profiles - aarch64	2025-10-15 05:18:53 +03:00
Jenkins Promoter	a46eb49b4d	Update pgo profiles - x86_64	2025-10-15 05:04:15 +03:00
Piotr Wieczorek	445e58bbc5	alternator: Correct RCU undercount in BatchGetItem The `describe_multi_item` function treated the last reference-captured argument as the number of used RCU half units. The caller `batch_get_item`, however, expected this parameter to hold an item size. This RCU value was then passed to `rcu_consumed_capacity_counter::get_half_units`, treating the already-calculated RCU integer as if it were a size in bytes. This caused a second conversion that undercounted the true RCU. During conversion, the number of bytes is divided by `RCU_BLOCK_SIZE_LENGTH` (=4KB), so the double conversion divided the number of bytes by 16 MB. The fix removes the second conversion in `describe_multi_item` and changes the API of `describe_multi_item`. Fixes: https://github.com/scylladb/scylladb/pull/25847 Closes scylladb/scylladb#25842 (cherry picked from commit `a55c5e9ec7`) Closes scylladb/scylladb#26538	2025-10-14 11:49:30 +03:00
Avi Kivity	18796c3173	dist: scylla_raid_setup: don't override XFS block size on modern kernels In `6977064693` ("dist: scylla_raid_setup: reduce xfs block size to 1k"), we reduced the XFS block size to 1k when possible. This is because commitlog wants to write the smallest amount of padding it can, and older Linux could only write a multiple of the block size. Modern Linux [1] can O_DIRECT overwrite a range smaller than a filesystem block. However, this doesn't play well with some SSDs that have 512 byte logical sector size and 4096 byte physical sector size - it causes them to issue read-modify-writes. To improve the situation, if we detect that the kernel is recent enough, format the filesystem with its default block size, which should be optimal. Note that commitlog will still issue sub-4k writes, which can translate to RMW. There, we believe that the amplification is reduced since sequential sub-physical-sector writes can be merged, and that the overhead from commitlog space amplification is worse than the RMW overhead. Tested on AWS i4i.large. 
fsqual report: ``` memory DMA alignment: 512 disk DMA alignment: 512 filesystem block size: 4096 context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0.0003 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.7961 (BAD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0.0001 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8006 (BAD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0.0001 (GOOD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD) ``` The sub-block overwrite cases are GOOD. 
In comparison, the fsqual report for 1k (similar): ``` memory DMA alignment: 512 disk DMA alignment: 512 filesystem block size: 1024 context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0.0005 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.7948 (BAD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0015 (GOOD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0022 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.4999 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.798 (BAD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0012 (GOOD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0019 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.5 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD) ``` Fixes #25441. 
[1] `ed1128c2d0` Closes scylladb/scylladb#25445 (cherry picked from commit `5d1846d783`) Closes scylladb/scylladb#26532	2025-10-14 11:48:58 +03:00
Michał Chojnowski	c05efcf7c6	test_sstable_compression_dictionaries_basic: reconnect robustly after node reboots Using `driver_connect()` after a cluster restart isn't enough to ensure full CQL availability, but the test assumes that it is. Fix that by making the test wait for CQL availability via `get_ready_cql()`. Also, replace some manual usages of wait_for_cql_and_get_hosts with `get_ready_cql()` too. Fixes scylladb/scylladb#25362 Closes scylladb/scylladb#25366 (cherry picked from commit `85fd4d23fa`) Closes scylladb/scylladb#26514	2025-10-12 21:08:23 +03:00
Nadav Har'El	295ed0e9e1	cql: document and test permissions on materialized views and CDC We were recently surprised (in pull request #25797) to "discover" that Scylla does not allow granting SELECT permissions on individual materialized views. Instead, all materialized views of a base table are readable if the base table is readable. In this patch we document this fact, and also add a test to verify that it is indeed true. As usual for cqlpy tests, this test can also be run on Cassandra - and it passes showing that Cassandra also implemented it the same way (which isn't surprising, given that we probably copied our initial implementation from them). The test demonstrates that neither Scylla nor Cassandra prints an error when attempting to GRANT permissions on a specific materialized view - but this GRANT is simply ignored. This is not ideal, but it is the existing behavior in both and it's not important now to change it. Additionally, because pull request #25797 made CDC-log permissions behave the same as materialized views - i.e., you need to make the base table readable to allow reading from the CDC log, this patch also documents this fact and adds a test for it also. Fixes #25800 Closes scylladb/scylladb#25827 (cherry picked from commit `3c969e2122`) Closes scylladb/scylladb#26104	2025-10-10 10:37:25 +03:00
Ernest Zaslavsky	b2dc4647dd	s3_client: fix `when` condition to prevent infinite locking Refine condition variable predicate in filling fiber to avoid indefinite waiting when `close` is invoked. Closes scylladb/scylladb#26449 (cherry picked from commit `c2bab430d7`) Closes scylladb/scylladb#26496	2025-10-10 10:30:54 +03:00
Michał Chojnowski	5ccdcb9459	docs: fix a parameter name in API calls in sstable-dictionary-compression.rst The correct argument name is `cf`, not `table`. Fixes scylladb/scylladb#25275 Closes scylladb/scylladb#26447 (cherry picked from commit `87e3027c81`) Closes scylladb/scylladb#26494	2025-10-10 10:30:39 +03:00
Pavel Emelyanov	8078ad7ee4	Merge '[Backport 2025.3] service/qos: set long timeout for auth queries on SL cache update' from Scylladb[bot] pass an appropriate query state for auth queries called from service level cache reload. we use the function qos_query_state to select a query_state based on caller context - for internal queries, we set a very long timeout. the service level cache reload is called from group0 reload. we want it to have a long timeout instead of the default 5 seconds for auth queries, because we don't have strict latency requirement on the one hand, and on the other hand a timeout exception is undesired in the group0 reload logic and can break group0 on the node. Fixes https://github.com/scylladb/scylladb/issues/25290 backport possible to improve stability - (cherry picked from commit `a1161c156f`) - (cherry picked from commit `3c3dd4cf9d`) - (cherry picked from commit `ad1a5b7e42`) Parent PR: #26180 Closes scylladb/scylladb#26478 * github.com:scylladb/scylladb: service/qos: set long timeout for auth queries on SL cache update auth: add query_state parameter to query functions auth: refactor query_all_directly_granted	2025-10-10 10:30:15 +03:00
Patryk Jędrzejczak	0a71901c4c	raft topology: make the voter handler consider only group 0 members In the Raft-based recovery procedure, we create a new group 0 and add live nodes to it one by one. This means that for some time there are nodes which belong to the topology, but not to the new group 0. The voter handler running on the recovery leader incorrectly considers these nodes while choosing voters. The consequences: - misleading logs, for example, "making servers {<ID of a non-member>} voters", where the non-member won't become a voter anyway, - increased chance of majority loss during the recovery procedure, for example, all 3 nodes that first joined the new group 0 are in the same dc and rack, but only one of them becomes a voter because the voter handler tries to make non-members in other dcs/racks voters. Fixes #26321 Closes scylladb/scylladb#26327 (cherry picked from commit `67d48a459f`) Closes scylladb/scylladb#26427	2025-10-09 18:20:36 +02:00
Michael Litvak	e05c708327	service/qos: set long timeout for auth queries on SL cache update pass an appropriate query state for auth queries called from service level cache reload. we use the function qos_query_state to select a query_state based on caller context - for internal queries, we set a very long timeout. the service level cache reload is called from group0 reload. we want it to have a long timeout instead of the default 5 seconds for auth queries, because we don't have strict latency requirement on the one hand, and on the other hand a timeout exception is undesired in the group0 reload logic and can break group0 on the node. Fixes scylladb/scylladb#25290 (cherry picked from commit `ad1a5b7e42`)	2025-10-09 12:48:08 +00:00
Michael Litvak	192ec59196	auth: add query_state parameter to query functions add a query_state parameter to several auth functions that execute internal queries. currently the queries use the internal_distributed_query_state() query state, and we maintain this as default, but we want also to be able to pass a query state from the caller. in particular, the auth queries currently use a timeout of 5 seconds, and we will want to set a different timeout when executed in some different context. (cherry picked from commit `3c3dd4cf9d`)	2025-10-09 12:48:07 +00:00
Michael Litvak	859e306e9d	auth: refactor query_all_directly_granted rewrite query_all_directly_granted to use execute_internal instead of query_internal in a style that is more consistent with the rest of the module. This will also be useful for a later change because execute_internal accepts an additional parameter of query_state. (cherry picked from commit `a1161c156f`)	2025-10-09 12:48:07 +00:00
Raphael S. Carvalho	40e8f652db	replica: Fix race between drop table and merge completion handling Consider this: 1) merge finishes, wakes up fiber to merge compaction groups 2) drop table happens, which in turn invokes truncate underneath 3) merge fiber stops old groups 4) truncate disables compaction on all groups, but the ones stopped 5) truncate performs a check that compaction has been disabled on all groups, including the ones stopped 6) the check fails because groups being stopped didn't have compaction explicitly disabled on them To fix it, the check on step 6 will ignore groups that have been stopped, since those are not eligible for having compaction explicitly disabled on them. The compaction check is there, so ongoing compaction will not propagate data being truncated, but here it happens in the context of drop table which doesn't leave anything behind. Also, a group stopped is somewhat equivalent to compaction disabled on it, since the procedure to stop a group stops all ongoing compaction and eventually removes its state from compaction manager. Fixes #25551. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#25563 (cherry picked from commit `149f9d8448`) Closes scylladb/scylladb#25632	2025-10-08 06:28:09 +03:00
Botond Dénes	2e81bc14df	Merge '[Backport 2025.3] tools: fix documentation links after change to source-available' from Scylladb[bot] Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change, they still point to the old opensource docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these. Fixes: scylladb/scylladb#26320 Broken documentation link fix for the tool help output, needs backport to all live source-available versions. - (cherry picked from commit `5a69838d06`) - (cherry picked from commit `15a4a9936b`) - (cherry picked from commit `fe73c90df9`) Parent PR: #26322 Closes scylladb/scylladb#26389 * github.com:scylladb/scylladb: tools/scylla-sstable: fix doc links release: adjust doc_link() for the post source-available world tools/scylla-nodetool: remove trailing " from doc urls	2025-10-07 14:11:01 +03:00
Botond Dénes	524da18c05	tools/scylla-sstable: fix doc links The doc links in scylla-sstable help output are static, so they always point to the documentation of the latest stable release, not to the documentation of the release the tool binary is from. On top of that, the links point to old open-source documentation, which is now EOL. Fix both problems: point link at the new source-available documentation pages and make them version aware. (cherry picked from commit `fe73c90df9`)	2025-10-07 10:07:08 +03:00
Botond Dénes	0b0192b9ec	release: adjust doc_link() for the post source-available world There is no more separate enterprise product and the doc urls are slightly different. (cherry picked from commit `15a4a9936b`)	2025-10-03 14:28:44 +00:00
Botond Dénes	96a3481705	tools/scylla-nodetool: remove trailing " from doc urls They are accidental leftover from a previous way of storing command descriptions. (cherry picked from commit `5a69838d06`)	2025-10-03 14:28:44 +00:00
Benny Halevy	7341828f8f	test_tablets_merge: test_tablet_split_merge_with_many_tables: reduce number of tables in debug mode As the test hits timeouts in debug mode on aarch64. Fixes #26252 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26303 (cherry picked from commit `b81c6a339b`) Closes scylladb/scylladb#26325	2025-10-01 14:07:00 +03:00
Asias He	8b4baeb487	repair: Always reset node ops progress to 100% upon completion Always set the node ops progress to 100% when the operation finishes, regardless of success or failure. This ensures the progress never remains below 100%, which would otherwise indicate a pending node operation in case of an error. Fixes #26193 Closes scylladb/scylladb#26194 (cherry picked from commit `b31e651657`) Closes scylladb/scylladb#26267	2025-10-01 14:06:29 +03:00
Botond Dénes	c4c18fdc8d	Merge '[Backport 2025.3] replica: Fix split compaction when tablet boundaries change' from Scylladb[bot] Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. All 2025.* versions are vulnerable, so fix must be backported to them. - (cherry picked from commit `0c1587473c`) - (cherry picked from commit `68f23d54d8`) Parent PR: #25690 Closes scylladb/scylladb#25935 * github.com:scylladb/scylladb: replica: Fix split compaction when tablet boundaries change replica: Futurize split_compaction_options()	2025-10-01 14:03:41 +03:00
Jenkins Promoter	ecfe6fa332	Update pgo profiles - aarch64	2025-10-01 05:17:59 +03:00
Jenkins Promoter	ce76350fc2	Update pgo profiles - x86_64	2025-10-01 04:57:07 +03:00
Avi Kivity	1bfd52b9ea	Merge '[Backport 2025.3] scylla-gdb: Fix fair-queue entry printing' from Scylladb[bot] Catching a live entry in the IO queue is a very rare event, so we haven't seen it so far, but the `_ticket` member was removed ~2 years ago and replaced with `_capacity`, which is a plain 64-bit integer. Fixes #26184 The issue is present in 2025.x as well and looks cheap to backport - (cherry picked from commit `8438c59ad3`) Parent PR: #26185 Also includes a backport of #24835, which also applies to 2025.3 and is now crucial. The scylla_io_queues.ticket() method is renamed by this backport, but without #24835 it will be problematic to fix all callers of it. Closes scylladb/scylladb#26266 * github.com:scylladb/scylladb: scylla-gdb: Fix fair-queue entry printing scylla-gdb: Don't show io_queue executing and queued resources	2025-09-30 16:32:10 +03:00
Pavel Emelyanov	c543f7f282	scylla-gdb: Fix fair-queue entry printing Catching a live entry in the IO queue is a very rare event, so we haven't seen it so far, but the `_ticket` member was removed ~2 years ago and replaced with `_capacity`, which is a plain 64-bit integer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26185 (cherry picked from commit `8438c59ad3`)	2025-09-30 11:29:06 +03:00
Pavel Emelyanov	8cb3b964e0	scylla-gdb: Don't show io_queue executing and queued resources These counters are no longer accounted by io-queue code and are always zero. Even more -- accounting removal happened years ago and we don't have Scylla versions built with seastar older than that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24835	2025-09-30 11:29:06 +03:00
Raphael S. Carvalho	94b0f2fd48	replica: Fix split compaction when tablet boundaries change Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `68f23d54d8`)	2025-09-29 20:29:05 -03:00
Nikos Dragazis	18d6d94fb3	compressor: Validate presence of sstable_compression in parameters SSTable compression parameters should always define an algorithm via the `sstable_compression` sub-option. Add a check in the constructor to ensure this is always provided (unless no options are given, which is interpreted as "no compression"). This change has no user-visible effect, since the same check is already performed at a higher-level, while validating the CQL properties of CREATE TABLE and ALTER TABLE statements (see `cf_prop_defs::validate()`). However, it will become useful in later patches, when compression config options will be introduced. Although now redundant, keep the sanity check in `cf_prop_defs::validate()` to maintain consistency of error messages with Cassandra. Note also that Cassandra uses 'class' instead of 'sstable_compression' since version 3.11.10, but Scylla still doesn't support this, see: https://github.com/scylladb/scylladb/issues/4200 Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `ea41f652c4`)	2025-09-28 20:01:36 +00:00
Nikos Dragazis	f6e0689461	compressor: Add missing space in exception message Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `1106157756`)	2025-09-28 20:01:36 +00:00
Ferenc Szili	9429099d5f	docs: add description of number of tablets computed by tablet allocator This change adds the documentation section which explains the algorithm to compute the absolute number of tablets a table has. Fixes: #25740 Closes scylladb/scylladb#25741 (cherry picked from commit `d462dc8839`) Closes scylladb/scylladb#26264	2025-09-28 20:29:05 +03:00
Aleksandra Martyniuk	28850ac613	test: fix test_two_tablets_concurrent_repair_and_migration_repair_writer_level test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits for the first node that logs info about repair_writer using asyncio.wait. The done group is never awaited, so we never learn about the error. The test itself is incorrect and the log about repair_writer is never printed. We never learn about that, and the test finishes successfully after a 10-minute timeout. Fix the test: - disable hinted handoff; - repair tablets of the whole table: - new table is added so that concurrent migration is possible; - use wait_for_first_completed that awaits done group; - do some cleanups. Remove nightly mark. Fixes: #26148. Closes scylladb/scylladb#26209 (cherry picked from commit `48bbe09c8b`) Closes scylladb/scylladb#26220	2025-09-26 16:42:41 +03:00
Botond Dénes	6e94299ccf	Merge '[Backport 2025.3] compaction: ensure that all compaction executors are stopped' from Scylladb[bot] Currently, while stopping the compaction_manager, we stop task_manager compaction module and concurrently run compaction_manager::really_do_stop. really_do_stop stops and waits for all task_executors that are kept in compaction_manager::_tasks, but nothing ensures that no more tasks will be added there. Due to leftover tasks, we trigger on_fatal_internal_error. Modify the order of compaction_manager::stop. After the change, we stop compaction tasks in the following order: - abort module abort source; - close module gate in the background; - stop_ongoing_compactions (kept in compaction_manager::_tasks); - wait until module gate is closed. Check module abort source before creating compaction executor and adding it to _tasks. Thanks to the above, we can be sure that: - after module::stop there will be no tasks in _tasks; - compaction_manager::stop aborts all tasks; we don't wait for any whole compaction to finish. Fixes: https://github.com/scylladb/scylladb/issues/25806. Fixes a shutdown bug; needs backport to all versions - (cherry picked from commit `17707d0e6b`) - (cherry picked from commit `97c77d7cd5`) Parent PR: #25885 Closes scylladb/scylladb#26224 * github.com:scylladb/scylladb: compaction: move _tasks check compaction: stop compaction module in really_do_stop	2025-09-26 13:20:52 +03:00
Gleb Natapov	6383b9009c	storage_service: change node_ops_info::ignore_nodes to host id It drops the useless translation from id to ip during removenode through the topology coordinator. Closes scylladb/scylladb#25958 (cherry picked from commit `d3badf7406`) Closes scylladb/scylladb#26251	2025-09-26 10:53:47 +02:00
Aleksandra Martyniuk	7c847d76f6	compaction: move _tasks check In compaction_manager::really_do_stop we check whether _tasks list is empty after the compactions are stopped. However, a new task may still sneak in, causing the assertion failure. Such a task won't be there for long - module::make_task will fail as the module is already stopped. Move the assertion, that checks if _tasks is empty, after the compaction_states' gates are closed. Fixes: #25806. (cherry picked from commit `97c77d7cd5`)	2025-09-25 15:56:00 +02:00
Aleksandra Martyniuk	9fc38c01a0	compaction: stop compaction module in really_do_stop Currently, compaction::task_manager_module is stopped in compaction_manager::stop, concurrently to really_do_stop. We can't predict the order of the two. Do not set _task_manager_module to nullptr at stop, because compaction_manager::really_do_stop() may be called before the actual shutdown, while other components still try to use it. compaction::task_manager_module does not keep a pointer to compaction_manager, so we won't end up with memory leak. Stop compaction module in really_do_stop, after ongoing compactions are stopped. It's a preparation for further patches. (cherry picked from commit `17707d0e6b`)	2025-09-25 15:55:55 +02:00
Ferenc Szili	99b69092ef	load_balancer: fix std::out_of_bounds when decommissioning with empty nodes Consider the following: The tablet load balancer is working on: - node1: an empty node (no tablets) with a large disk capacity - node2: an empty node (no tablets) with a lower disk capacity than node1 - node3: is being decommissioned and contains tablet replicas In load_balancer::make_internode_plan() the initial destination node/shard is selected like this: // Pick best target shard. auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)}; load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which adds the host to load_sketch's internal hash maps in case the node was not yet seen by load_sketch. Let's assume dst is a shard on node1. Later in load_balancer::make_internode_plan() we will call pick_candidate() to try to find a better destination node than the initial one: // May choose a different source shard than src.shard or different destination host/shard than dst. auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst, drain_skipped); auto source_tablets = candidate.tablets; src = candidate.src; dst = candidate.dst; If pick_candidate() selects some other empty destination (due to larger capacity: node1) node, and that node has not yet been seen by load_sketch (because it was empty), a subsequent call to load_sketch::pick() will search for the node using std::unordered_map::at(), and because the node is not found it will throw a std::out_of_bounds() exception crashing the load balancer. This problem is fixed by changing load_sketch::populate() to initialize its internal maps with all the nodes which populate()'s arguments filter for. Fixes: #26203 Closes scylladb/scylladb#26207 (cherry picked from commit `c6c9c316a7`) Closes scylladb/scylladb#26240	2025-09-25 09:42:47 +03:00
Dawid Mędrek	c7091b61e4	db/batchlog: Drop batch if table has been dropped If there are pending mutations in the batchlog for a table that has been dropped, we'll keep attempting to replay them but with no success -- `db::no_such_column_family` exceptions will be thrown, and we'll keep trying again and again. To prevent that, we drop the batch in that case just like we do in the case of a non-existing keyspace. A reproducer test has been included in the commit. It fails without the changes in `db/batchlog_manager.cc`, and it succeeds with them. Fixes scylladb/scylladb#24806 Closes scylladb/scylladb#26057 (cherry picked from commit `35f7d2aec6`) Closes scylladb/scylladb#26201	2025-09-25 09:39:55 +03:00
Andrzej Jackowski	07c21c06a4	test: audit: stop using datetime.datetime.now() in syslog converter `line_to_row` is a test function that converts `syslog` audit log to the format of `table` audit log so tests can use the same checks for both types of audit. Because `syslog` audit doesn't have `date` information, the field was filled with the current date. This behavior broke the tests running at 23:59:59 because `line_to_row` returned different results on different days. Fixes: scylladb/scylladb#25509 Closes scylladb/scylladb#26101 (cherry picked from commit `15e71ee083`) Closes scylladb/scylladb#26191	2025-09-25 09:37:45 +03:00
Dawid Mędrek	f4de31a316	test/perf/tablet_load_balancing.cc: Create nodes within one DC In `789a4a1ce7`, we adjusted the test file to work with the configuration option `rf_rack_valid_keyspaces`. Part of the commit was making the two tables used in the test replicate in separate data centers. Unfortunately, that destroyed the point of the test because the tables no longer competed for resources. We fix that by enforcing the same replication factor for both tables. We still accept different values of replication factor when provided manually by the user (by `--rf1` and `--rf2` commandline options). Scylla won't allow for creating RF-rack-invalid keyspaces, but there's no reason to take away the flexibility the user of the test already has. Fixes scylladb/scylladb#26026 Closes scylladb/scylladb#26115 (cherry picked from commit `0d2560c07f`) Closes scylladb/scylladb#26172	2025-09-25 09:34:29 +03:00
Ferenc Szili	4a2ba1dbde	docs: add capacity based balancing explanation Capacity based balancing was introduced in 2025.1. It computes balance based on a node's capacity: the number of tablets located on a node should be directly proportional to that node's storage capacity. This change adds this explanation to the docs. Fixes: #25686 Closes scylladb/scylladb#25687 (cherry picked from commit `de5dab8429`) Closes scylladb/scylladb#26107	2025-09-25 09:31:40 +03:00
Botond Dénes	6fe8f98add	Merge '[Backport 2025.3] compaction/scrub: register sstables for compaction before validation' from Scylladb[bot] compaction/scrub: register sstables for compaction before validation When `scrub --validate` runs, it collects all candidate sstables at the start and validates them one by one in separate compaction tasks. However, scrub in validate mode does not register these sstables for compaction, which allows regular compaction to pick them up and potentially compact them away before validation begins. This leads to scrub failures because the sstables can no longer be found. This patch fixes the issue by first disabling compaction, collecting the sstables, and then registering them for compaction before starting validation. This ensures that the enqueued sstables remain available for the entire duration of the scrub validation task. Fixes #23363 This reported scrub failure occurs on all versions that have the checksum/digest validation feature for uncompressed sstables. So, backport it to older versions. - (cherry picked from commit `84f2e99c05`) - (cherry picked from commit `7cdda510ee`) Parent PR: #26034 Closes scylladb/scylladb#26099 * github.com:scylladb/scylladb: compaction/scrub: register sstables for compaction before validation compaction/scrub: handle exceptions when moving invalid sstables to quarantine	2025-09-25 09:27:31 +03:00
Pavel Emelyanov	702eda371b	s3: Add metrics to show S3 prefetch bytes The chunked download source sends large GET requests and then consumes data as it arrives. Sometimes it can stop reading from the socket early and drop the in-flight data. The existing read-bytes metrics show only the number of consumed bytes, but we also want to know the number of requested bytes. Refs #25770 (accounting of read-bytes) Fixes #25876 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25877 (cherry picked from commit `6fb66b796a`) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26070	2025-09-25 09:26:41 +03:00
Raphael S. Carvalho	ec225e08d1	replica: Futurize split_compaction_options() Preparation for the fix of #24153. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `0c1587473c`)	2025-09-24 20:38:08 -03:00
Patryk Jędrzejczak	d653e710ba	test: deflake driver reconnections in the recovery procedure tests All three tests could hit https://github.com/scylladb/python-driver/issues/295. We use the standard workaround for this issue: reconnecting the driver after the rolling restart, and before sending any requests to local tables (that can fail if the driver closes a connection to the node that restarted last). All three tests perform two rolling restarts, but the latter ones already have the workaround. Fixes #26005 Closes scylladb/scylladb#26056 (cherry picked from commit `a56115f77b`) Closes scylladb/scylladb#26199	2025-09-24 11:52:00 +02:00
Tomasz Grabiec	b3f4bef36b	tablets: scheduler: Run plan-maker in maintenance scheduling group Currently, it runs in the gossiper scheduling group, because it's invoked by the topology coordinator. That scheduling group has the same amount of shares as the user workload. Plan-making can take a significant amount of time during rebalancing, and we don't want that to impact a user workload which happens to run on the same shard. Reduce the impact by running in the maintenance scheduling group. Fixes #26037 Closes scylladb/scylladb#26046 (cherry picked from commit `ddbcea3e2a`) Closes scylladb/scylladb#26168	2025-09-22 15:20:01 +02:00
Pavel Emelyanov	b4598031e6	s3: Fix chunked download source metrics calculations In the S3 client, both read and write metrics have three counters -- number of requests made, number of bytes processed and request latency. In most cases all three counters are updated at once -- upon response arrival. However, in the case of the chunked download source this way of accounting metrics is misleading. In this code the request is made once, and then the obtained bytes are consumed eventually as the data arrive. Currently, each time a new portion of data is read from the socket the number of read requests is incremented. That's wrong: the request is made once, and this counter should also be incremented once, not for every data buffer that arrives in response. The same goes for read request latency -- it's "added" for every data buffer that arrives, but it's a lengthy process, and the _request_ latency should be accounted once per response. Maybe later we'll want to have "data latency" metrics as well, but for what we have now it's request latency. The number of read bytes is accounted properly, so it is not touched here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25770 (cherry picked from commit `9deea3655f`) Closes scylladb/scylladb#26145	2025-09-22 07:35:03 +03:00
Asias He	04776ad19e	streaming: Enclose potential throws in try block and ensure sink close before logging - Move the initialization of log_done inside the try block to catch any exceptions it may throw. - Relocate the failure warning log after sink.close() cleanup to guarantee sink.close() is always called before logging errors. Refs #25497 Closes scylladb/scylladb#25591 (cherry picked from commit `b12404ba52`) Closes scylladb/scylladb#25903	2025-09-21 18:11:43 +03:00
Nadav Har'El	d61bce8685	alternator: fix bug in combination of AttributeUpdates + ReturnValues In test/alternator/test_returnvalues.py we had tests for the ReturnValues feature on UpdateItem requests - but we only tested UpdateItem requests with the "modern" UpdateExpression, and forgot to test the combination of ReturnValues with the old AttributeUpdates API. It turns out this combination is buggy: when both ReturnValues=ALL_OLD and AttributeUpdates need the previous value of the item, we may wrongly std::move() the value out, and the operation will fail with a strange error: An error occurred (ValidationException) when calling the UpdateItem operation: JSON assert failed on condition 'IsObject()' The fix in this patch is trivial - just move the std::move() to the correct place, after both UpdateExpression and AttributeUpdates handling is done. This patch also includes a reproducing test, which fails before this patch and passes with it - and of course passes on DynamoDB. This test reproduces two cases where the bug happened, as well as one case where it didn't (to make sure we don't regress in what already worked). Fixes #25894 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25900 (cherry picked from commit `3c0032deb4`) Closes scylladb/scylladb#26096	2025-09-19 19:25:15 +03:00
Lakshmi Narayanan Sreethar	6e94a73fd4	compaction/scrub: register sstables for compaction before validation When `scrub --validate` runs, it collects all candidate sstables at the start and validates them one by one in separate compaction tasks. However, scrub in validate mode does not register these sstables for compaction, which allows regular compaction to pick them up and potentially compact them away before validation begins. This leads to scrub failures because the sstables can no longer be found. This patch fixes the issue by first disabling compaction, collecting the sstables, and then registering them for compaction before starting validation. This ensures that the enqueued sstables remain available for the entire duration of the scrub validation task. Fixes #23363 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `7cdda510ee`) Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-19 18:38:54 +05:30
Lakshmi Narayanan Sreethar	20501b2ea3	compaction/scrub: handle exceptions when moving invalid sstables to quarantine In validate mode, scrub moves invalid sstables into the quarantine folder. If validation fails because the sstable files are missing from disk, there is nothing to move, and the quarantine step will throw an exception. Handle such exceptions so scrub can return a proper compaction_result instead of propagating the exception to the caller. This will help the testcase for #23363 to reliably determine if the scrub has failed or not. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `84f2e99c05`)	2025-09-19 18:35:31 +05:30
Szymon Malewski	cca78c6568	alternator/expressions.g: Fix antlr3 missing token leak This patch overrides the antlr3 function that allocates the missing tokens that would eventually leak. The override stores these tokens in a vector, ensuring memory is freed whenever the parser is destroyed. Solution is copied from CQL implementation. A unit test to reproduce the issue is added - leak would be reported by ASAN, when running this test in debug mode - the test passed but the leak is discovered when the test file exits. Fixes #25878 Closes scylladb/scylladb#25930 (cherry picked from commit `776f90e2f8`) Closes scylladb/scylladb#26085	2025-09-18 07:50:31 +03:00
Sergey Zolotukhin	8568a8a303	raft: disable caching for raft log. This change disables caching for raft log table due to the following reasons: * Immediate reason is a deficiency in handling emerging range tombstones in the cache, which causes stalls. * Long-term reason is that sequential reads from the raft log do not benefit from the cache, making it better to bypass it to free up space and avoid stalls. Fixes scylladb/scylladb#26027 Closes scylladb/scylladb#26031 (cherry picked from commit `2640b288c2`) Closes scylladb/scylladb#26074	2025-09-18 07:50:05 +03:00
Pavel Emelyanov	1310e61040	Merge '[Backport 2025.3] gossiper: ensure gossiper operations are executed in gossiper scheduling group' from Scylladb[bot] Sometimes gossiper operations invoked from storage_service and other components run under a non-gossiper scheduling group. If these operations acquire gossiper locks, priority inversion can occur: higher-priority gossiper tasks may wait behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness or even failures. This patch ensures that gossiper operations requiring locks on gossiper structures are explicitly executed in the gossiper scheduling group. To help detect similar issues in the future, a warning is logged whenever a gossiper lock is acquired under a non-gossiper scheduling group. Fixes scylladb/scylladb#25907 Refs: scylladb/scylladb#25702 Backport: this patch fixes an issue with gossiper operations scheduling group, that might affect topology operations, therefore backport is needed to 2025.1, 2025.2, 2025.3 - (cherry picked from commit `340413e797`) - (cherry picked from commit `6c2a145f6c`) Parent PR: #25981 Closes scylladb/scylladb#26073 * https://github.com/scylladb/scylladb: gossiper: ensure gossiper operations are executed in gossiper scheduling group gossiper: fix wrong gossiper instance used in `force_remove_endpoint`	2025-09-18 07:49:49 +03:00
Aleksandra Martyniuk	3f345615a5	replica: lower severity of failure log Flush failure with seastar::named_gate_closed_exception is expected if a respective compaction group was already stopped. Lower the severity of a log in dirty_memory_manager::flush_one for this exception. Fixes: https://github.com/scylladb/scylladb/issues/25037. Closes scylladb/scylladb#25355 (cherry picked from commit `a10e241228`) Closes scylladb/scylladb#25650	2025-09-18 07:49:28 +03:00
Sergey Zolotukhin	3bf986170b	gossiper: ensure gossiper operations are executed in gossiper scheduling group Sometimes gossiper operations invoked from storage_service and other components run under a non-gossiper scheduling group. If these operations acquire gossiper locks, priority inversion can occur: higher-priority gossiper tasks may wait behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness or even failures. This patch ensures that gossiper operations requiring locks on gossiper structures are explicitly executed in the gossiper scheduling group. To help detect similar issues in the future, a warning is logged whenever a gossiper lock is acquired under a non-gossiper scheduling group. Fixes scylladb/scylladb#25907 (cherry picked from commit `6c2a145f6c`)	2025-09-17 11:22:31 +00:00
Sergey Zolotukhin	d585211c4a	gossiper: fix wrong gossiper instance used in `force_remove_endpoint` `gossiper::force_remove_endpoint` is always executed on shard 0 using `invoke_on`. Since each shard has its own `gossiper` instance, if `force_remove_endpoint` is called from a shard other than shard 0, `my_host_id()` may be invoked on the wrong `gossiper` object. This results in undefined behavior due to unsynchronized access to resources on another shard. (cherry picked from commit `340413e797`)	2025-09-17 11:22:31 +00:00
Wojciech Mitros	246fcb8b6a	mv: delete previously undetected ghost rows in PRUNE MATERIALIZED VIEW statement The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the view. Ghost rows are rows in the view with no corresponding row in the base table. Before this patch, only rows whose primary key columns of the base table had different values than any of the base rows were treated as ghost rows by the PRUNE statement. However, view rows which have a column in their primary key that's not in the base primary key can also be ghost rows if this column has a different value than the base row with the same values of the remaining primary key columns. That's because these rows won't be deleted unless we change the value of this column in the base table to this specific value. In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic. If this column isn't the same in the base table and the view, these rows are also deleted. Fixes https://github.com/scylladb/scylladb/issues/25655 Closes scylladb/scylladb#25720 (cherry picked from commit `1f9be235b8`) Closes scylladb/scylladb#25956	2025-09-15 12:26:02 +02:00
Jenkins Promoter	93da39020f	Update ScyllaDB version to: 2025.3.2	2025-09-15 11:12:31 +03:00
Jenkins Promoter	04b0d7b629	Update pgo profiles - aarch64	2025-09-15 05:35:35 +03:00
Jenkins Promoter	92d0b05bd0	Update pgo profiles - x86_64	2025-09-15 05:04:20 +03:00
Patryk Jędrzejczak	b5cbe0d50a	Merge '[Backport 2025.3] test: cluster: deflake consistency checks after decommission' from Scylladb[bot] In the Raft-based topology, a decommissioning node is removed from group 0 after the decommission request is considered finished (and the token ring is updated). Therefore, `check_token_ring_and_group0_consistency` called just after decommission might fail when the decommissioned node is still in group 0 (as a non-voter). We deflake all tests that call `check_token_ring_and_group0_consistency` after decommission in this PR. Fixes #25809 This PR improves CI stability and changes only tests, so it should be backported to all supported branches. - (cherry picked from commit `e41fc841cd`) - (cherry picked from commit `bb9fb7848a`) Parent PR: #25927 Closes scylladb/scylladb#25963 * https://github.com/scylladb/scylladb: test: cluster: deflake consistency checks after decommission test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency	2025-09-11 13:01:54 +02:00
Patryk Jędrzejczak	2ce95c429f	test: cluster: deflake consistency checks after decommission In the Raft-based topology, a decommissioning node is removed from group 0 after the decommission request is considered finished (and the token ring is updated). Therefore, `check_token_ring_and_group0_consistency` called just after decommission might fail when the decommissioned node is still in group 0 (as a non-voter). We deflake all tests that call `check_token_ring_and_group0_consistency` after decommission in this commit. Fixes #25809 (cherry picked from commit `bb9fb7848a`)	2025-09-10 17:49:12 +00:00
Patryk Jędrzejczak	b4e64e5adf	test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency In the Raft-based topology, a decommissioning node is removed from group 0 after the decommission request is considered finished (and the token ring is updated). `wait_for_token_ring_and_group0_consistency` doesn't handle such a case; it only handles cases where the token ring is updated later. We fix this in this commit. We rely on the new implementation of `wait_for_token_ring_and_group0_consistency` in the following commit to fix flakiness of some tests. We also update the obsolete docstring in this commit. (cherry picked from commit `e41fc841cd`)	2025-09-10 17:49:12 +00:00
Dawid Mędrek	3dac49c62f	test/perf: Adjust tablet_load_balancing.cc to RF-rack-validity We modify the logic to make sure that all of the keyspaces that the test creates are RF-rack-valid. For that, we distribute the nodes across two DCs and as many racks as the provided replication factor. That may have an effect on the load balancing logic, but since this is a performance test and since tablet load balancing is still taking place, it should be acceptable. This commit also finishes work in adjusting perf tests to pass with the `rf_rack_valid_keyspaces` configuration option enabled. The remaining tests either don't attempt to create keyspaces or they already create RF-rack-valid keyspaces. We don't need to explicitly enable the configuration option. It's already enabled by default by `cql_test_config`. The reason why we haven't run into any issue because of that is that performance tests are not part of our CI. Fixes scylladb/scylladb#25127 Closes scylladb/scylladb#25728 (cherry picked from commit `789a4a1ce7`) Closes scylladb/scylladb#25922	2025-09-10 10:30:40 +03:00
Asias He	ac88ea8152	streaming: Fix use after move in the tablet_stream_files_handler The files object is moved before it is logged when the stream finishes. The files are already logged when the stream starts, so skip logging them at the end of streaming. Fixes #25830 Closes scylladb/scylladb#25835 (cherry picked from commit `451e1ec659`) Closes scylladb/scylladb#25891	2025-09-10 10:30:11 +03:00
Wojciech Mitros	055a6c2cee	storage_proxy: send hints to pending replicas Consider the following scenario: - Current replica set is [A, B, C] - write succeeds on [A, B], and a hint is logged for node C - before the hint is replayed, D bootstraps and the token migrates from C to D - hint is replayed to node C while D is pending, but it's too late, since streaming for that token is already done - C is cleaned up, replayed data is lost, and D has a stale copy until next repair. In this scenario we effectively fail to send the hint. This scenario is also more likely to happen with tablets, as it can happen for every tablet migration. This issue is particularly detrimental to materialized views. View updates use hints by default and a specific view update may be sent to just one view replica (when a single base replica has a different row state due to reordering or missed writes). When we lose a hint for such a view update, we can generate a persistent inconsistency between the base and view - ghost rows can appear due to a lost tombstone and rows may be missing in the view due to a lost row update. Such inconsistencies can't be fixed by repairing either the view or the base table. To handle this, in this patch we add the pending replicas to the list of targets of each hint, even if the original target is still alive. This will cause some updates to be redundant. These updates are probably unavoidable for now, but they shouldn't be too common either. The scenarios for them are: 1. managing to send the hint to the source of a migrating replica before its token is streamed - the write will arrive on the pending replica anyway in streaming 2. the hint target not being the source of the migration - if we managed to apply the original write of the hint to the actual source of the migration, the pending replica will get it during streaming 3. 
sending the same hint to many targets at a similar time - while sending to each target, we'll see the same pending replica for the hint so we'll send it multiple times 4. possible retries where even though the hint was successfully sent to the main target, we failed to send it to the pending replica, so we need to retry the entire write This patch handles both tablet migrations and tablet rebuilds. In the future, for tablet migrations, we can avoid sending the hint to pending replicas if the hint target is not the source of the migration, which would allow us to avoid the redundant writes 2 and 3. For rack-aware RF, this will be as simple as checking whether the replicas are in the same rack. We also add a test case reproducing the issue. Co-Authored-By: Raphael S. Carvalho <raphaelsc@scylladb.com> Fixes https://github.com/scylladb/scylladb/issues/19835 Closes scylladb/scylladb#25590 (cherry picked from commit `10b8e1c51c`) Closes scylladb/scylladb#25882	2025-09-10 10:29:52 +03:00
Pavel Emelyanov	81e4c65f8c	Merge '[Backport 2025.3] Allow users to SELECT from CDC log tables they created.' from Scylladb[bot] Before the patch, a user with CREATE access could create a table with CDC or alter the table enabling CDC, but could not run a SELECT on the CDC table they created. This was because the SELECT permission was checked on the CDC log and then on its "parent", the keyspace, but not on the base table, on which the user had SELECT permission automatically granted on CREATE. This patch matches the behavior of querying the CDC log to the one implemented for Materialized Views: 1. No new permissions are granted on CREATE. 2. When querying SELECT, the permissions on base table SELECT are checked. Fixes: https://github.com/scylladb/scylladb/issues/19798 Fixes: VECTOR-151 - (cherry picked from commit `be54346846`) - (cherry picked from commit `5e72d71188`) Parent PR: #25797 Closes scylladb/scylladb#25870 * github.com:scylladb/scylladb: cqlpy/test_permissions: run the reproducer tests for #19798 select_statement: check for access to CDC base table	2025-09-10 10:29:10 +03:00
Pavel Emelyanov	6977c5eaf1	s3: Export memory usage gauge (metrics) The memory usage is tracked with the help of a semaphore, so just export its "consumed" units. One tricky place here is the need to skip metrics registration for the scylla-sstable tool. The tool starts the storage manager and sstables manager on start, and then some of the tool's operations may want to start both managers again (via the cql environment), causing a double metrics registration exception. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25769 (cherry picked from commit `b26816f80d`) Closes scylladb/scylladb#25865	2025-09-10 10:28:39 +03:00
Yaron Kaikov	bdec3b2bc5	build_docker.sh: enable debug symboles installation Add the latest scylla.repo location to our docker container. This allows installing the scylla-debuginfo package in case it's needed. Fixes: https://github.com/scylladb/scylladb/issues/24271 Closes scylladb/scylladb#25646 (cherry picked from commit `d57741edc2`) Closes scylladb/scylladb#25893	2025-09-09 11:41:17 +03:00
Patryk Jędrzejczak	2792fd6383	Merge '[Backport 2025.3] gossiper: fix issues in processing gossip status during the startup and when messages are delayed to avoid empty host ids' from Scylladb[bot] Populate the local state during gossiper initialization in start_gossiping, preventing an empty state from being added to _endpoint_state_map and returned in get_endpoint_states responses, that was causing an 'empty host id issue' on the other nodes during nodes restart. Check for a race condition in do_apply_state_locally In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked. During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash. This change adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map. Fixes https://github.com/scylladb/scylladb/issues/25831 Fixes https://github.com/scylladb/scylladb/issues/25803 Fixes https://github.com/scylladb/scylladb/issues/25702 Fixes https://github.com/scylladb/scylladb/issues/25621 Ref https://github.com/scylladb/scylla-enterprise/issues/5613 Backport: The issue affects all current releases(2025.x), therefore this PR needs to be backported to all 2025.1-2025.3. 
- (cherry picked from commit `28e0f42a83`) - (cherry picked from commit `f08df7c9d7`) - (cherry picked from commit `775642ea23`) - (cherry picked from commit `b34d543f30`) Parent PR: #25849 Closes scylladb/scylladb#25898 * https://github.com/scylladb/scylladb: gossiper: fix empty initial local node state gossiper: add test for a race condition in start_gossiping gossiper: check for a race condition in `do_apply_state_locally` test/gossiper: add reproducible test for race condition during node decommission	2025-09-09 10:00:30 +02:00
Sergey Zolotukhin	41dd29f5a3	gossiper: fix empty initial local node state This change removes the addition of an empty state to `_endpoint_state_map`. Instead, a new state is created locally and then published via replicate, avoiding the issue of an empty state existing in `_endpoint_state_map` before the preemption point. Since this resolves the issue tested in `test_gossiper_empty_self_id_on_shadow_round`, the `xfail` mark has been removed. Fixes: scylladb/scylladb#25831 (cherry picked from commit `b34d543f30`)	2025-09-08 21:55:16 +00:00
Sergey Zolotukhin	13f43e2872	gossiper: add test for a race condition in start_gossiping This change adds a test for a race condition in `start_gossiping` that can lead to an empty self state sent in `gossip_get_endpoint_states_response`. Test for scylladb/scylladb#25831 (cherry picked from commit `775642ea23`)	2025-09-08 21:55:16 +00:00
Sergey Zolotukhin	ec85ebf419	gossiper: check for a race condition in `do_apply_state_locally` In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked. During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash. This change 1. adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map. 2. Removes xfail from the test_gossiper_race test since the issue is now fixed. 3. Adds exception handling in `do_shadow_round` to skip responses from nodes that sent an empty host ID. This re-applies the commit `13392a40d4` that was reverted in `46aa59fe49`, after fixing the issues that caused the CI to fail. Fixes: scylladb/scylladb#25702 Fixes: scylladb/scylladb#25621 Ref: scylladb/scylla-enterprise#5613 (cherry picked from commit `f08df7c9d7`)	2025-09-08 21:55:16 +00:00
Emil Maskovsky	b53a5f9b3d	test/gossiper: add reproducible test for race condition during node decommission This change introduces a targeted test that simulates the gossiper race condition observed during node decommissioning. The test delays gossip state application and host ID lookup to reliably reproduce the scenario where `gossiper::get_host_id()` is called on a removed endpoint, potentially triggering an abort in `apply_new_states`. A specific error injection is added to widen the race window and increase the likelihood of hitting the race condition. The error injection is designed to delay the application of gossip state updates for the specific node that is being decommissioned. This should then result in a server abort in the gossiper. This re-applies the commit `5dac4b38fb` that was reverted in `dc44fca67c`, but modified to relax the check from "on_internal_error" to just a warning log. The stricter check can be re-introduced later once we are sure that all remaining problems are resolved and it will not break the CI. Refs: scylladb/scylladb#25621 Fixes: scylladb/scylladb#25721 (cherry picked from commit `28e0f42a83`)	2025-09-08 21:55:16 +00:00
Anna Stuchlik	acd4cbbbe1	doc: add support for i7i instances This commit adds currently supported i7i and i7ie instances to the list of instance recommendations. Fixes https://github.com/scylladb/scylladb/issues/25808 Closes scylladb/scylladb#25817 (cherry picked from commit `f66580a28f`) Closes scylladb/scylladb#25853	2025-09-08 10:40:52 +03:00
Dawid Pawlik	4303bb7d56	cqlpy/test_permissions: run the reproducer tests for #19798 Since the previous commit fixes the issue, we can remove the xfail mark. The tests should pass now. (cherry picked from commit `5e72d71188`)	2025-09-08 07:39:52 +00:00
Dawid Pawlik	675f74b4b7	select_statement: check for access to CDC base table Before the patch, a user with CREATE access could create a table with CDC or alter the table enabling CDC, but could not run a SELECT on the CDC table they created. This was because the SELECT permission was checked on the CDC log and then on its "parent", the keyspace, but not on the base table, on which the user had SELECT permission automatically granted on CREATE. This patch matches the behaviour of querying the CDC log to the one implemented for Materialized Views: 1. No new permissions are granted on CREATE. 2. When querying SELECT, the permissions on base table SELECT are checked. Fixes: #19798 (cherry picked from commit `be54346846`)	2025-09-08 07:39:52 +00:00
Avi Kivity	0900a88884	Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive. This PR addresses the issue in two ways: 1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common case). 2) `passwords::check` is moved to a dedicated alien thread. Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However: - The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it is not a good idea to fix or use it. - SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512. - MD5 is no longer considered secure for password hashing. Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce frequent lock contention, but in practice each shard handles only a few hundred new connections per second, even during storms. In addition, the existing `_conns_cpu_concurrency_semaphore` in `generic_server` already limits the number of concurrent connection handlers. Fixes https://github.com/scylladb/scylladb/issues/24524 Backport not needed, as it is a new feature. 
Closes scylladb/scylladb#24924 * github.com:scylladb/scylladb: main: utils: add thread names to alien workers auth: move passwords::check call to alien thread test: wait for 3 clients with given username in test_service_level_api auth: refactor password checking in password_authenticator auth: make SHA-512 the only password hashing scheme for new passwords auth: whitespace change in identify_best_supported_scheme() auth: require scheme as parameter for `generate_salt` auth: check password hashing scheme support on authenticator start (cherry picked from commit `c762425ea7`)	2025-09-07 13:38:33 +03:00
Calle Wilund	2bbf3cf669	system_keyspace: Prune dropped tables from truncation on start/drop Fixes #25683 Once a table drop is complete, there should be no reason to retain truncation records for it, as any replay should skip mutations anyway (no CF), and iff we somehow resurrect a dropped table, this replay-resurrected data is the least problem anyway. Adds a prune phase to the startup drop_truncation_rp_records run, which skips updating and instead deletes records for non-existent tables (which should patch any existing servers with lingering data as well). Also does an explicit delete of records on actual table DROP, to ensure we don't grow this table more than needed even on nodes with long uptime. Small unit test included. Closes scylladb/scylladb#25699 (cherry picked from commit `bc20861afb`) Closes scylladb/scylladb#25815	2025-09-05 19:02:39 +03:00
Botond Dénes	c30c1ec40a	Merge '[Backport 2025.3] drop table: fix crash on drop table with concurrent cleanup' from Scylladb[bot] Consider the following scenario: - A tablet is migrated away from a shard - The tablet cleanup stage closes the storage group's async_gate - A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate - Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash Fixes: #25706 This needs to be backported to all supported versions with tablets - (cherry picked from commit `a0934cf80d`) - (cherry picked from commit `1b8a44af75`) Parent PR: #25708 Closes scylladb/scylladb#25785 * github.com:scylladb/scylladb: test: reproducer and test for drop with concurrent cleanup truncate: check for closed storage group's gate in discard_sstables	2025-09-05 19:02:04 +03:00
Andrei Chekun	2ee1082561	test.py: modify run to use different junit output filenames Currently, run executes pytest twice without modifying the path of the JUnit XML report. As a result, the second pytest execution overwrites the report from the first. This PR fixes the issue so that both reports are stored. Closes scylladb/scylladb#25726 (cherry picked from commit `e55c8a9936`) Closes scylladb/scylladb#25778	2025-09-05 19:01:22 +03:00
Pavel Emelyanov	f1e3dedcd6	Revert "test/gossiper: add reproducible test for race condition during node decommission" This reverts commit `4e17330a1b` because parent PR had been reverted as per #25803	2025-09-05 10:08:29 +03:00
Nadav Har'El	5d6aa6e8c2	utils, alternator: fix detection of invalid base-64 This patch fixes an error-path bug in the base-64 decoding code in utils/base64.cc, which among other things is used in Alternator to decode blobs in JSON requests. The base-64 decoding code has a lookup table, which was wrongly sized 255 bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF) was included in an invalid base-64 string, instead of detecting that this is an invalid byte (since the only valid bytes in a base-64 string are A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a nonsense 6-bit part, or even crash on an out-of-bounds read. Besides the trivial fix, this patch also includes a reproducing test, which tries to write a blob as a supposedly base-64 encoded string with a 0xFF byte in it. The test fails before this patch (the write succeeds, unexpectedly), and passes after this patch (the write fails as expected). The test also passes on DynamoDB. Fixes #25701 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25705 (cherry picked from commit `ff91027eac`) Closes scylladb/scylladb#25767	2025-09-04 11:38:55 +03:00
Pavel Emelyanov	1c8e10231a	Merge '[Backport 2025.3] service/qos: Modularize service level controller to avoid invalid access to auth::service' from Scylladb[bot] Move management of effective service levels from `service_level_controller` to a new dedicated type -- `auth_integration`. Before these changes, it was possible for the service level controller to try to access `auth::service` after it was deinitialized. For instance, it could happen when reloading the cache. That HAS happened as described in the following issue: scylladb/scylladb#24792. Although the problem might have been mitigated or even resolved in scylladb/scylladb@10214e13bd, it's not clear how the service will be used in the future. It's better to prevent similar bugs than to try to fix them later on. The logic responsible for preventing access to an uninitialized `auth::service` was also either non-existent, complex, or insufficient. To prevent accessing `auth::service` by the service level controller, we extract the relevant portion of the code to a separate entity -- `auth_integration`. It's an internal helper type whose sole purpose is to manage effective service levels. Thanks to that, we were able to nest the lifetime of `auth_integration` within the lifetime of `auth::service`. It's now impossible to attempt to dereference it while it's uninitialized. If a bug related to an invalid access is spotted again, though, it might also be easier to debug it now. There should be no visible change to the users of the interface of the service level controller. We strived to make the patch minimal, and the only affected part of the logic should be related to how `auth::service` is accessed. The relevant portion of the initialization and deinitialization flow: (a) Before the changes: 1. Initialize `service_level_controller`. Pass a reference to an uninitialized `auth::service` to it. 2. Initialize other services. 3. Initialize and start `auth::service`. 4. (work) 5. 
Stop and deinitialize `auth::service`. 6. Deinitialize other services. 7. Deinitialize `service_level_controller`. (b) After the changes: 1. Initialize `service_level_controller`. Pass a reference to an uninitialized `auth::service` to it. (*) 2. Initialize other services. 3. Initialize and start `auth::service`. 4. Initialize `auth_integration`. Register it in `service_level_controller`. 5. (work) 6. Unregister `auth_integration` in `service_level_controller` and deinitialize it. 7. Stop and deinitialize `auth::service`. 8. Deinitialize other services. 9. Deinitialize `service_level_controller`. (*): The reference to `auth::service` in `service_level_controller` is still necessary. We need to access the service when dropping a distributed service level. Although it would be best to cut that link between the service level controller and `auth::service` too, effectively separating the entities, it would require more work, so we leave it as-is for now. It shouldn't prove problematic as far as accessing an uninitialized service goes. Trying to drop a service level at the point when we're de-initializing auth should be impossible. For more context, see the function `drop_distributed_service_level` in `service_level_controller`. A trivial test has been included in the PR. Although its value is questionable as we only try to reload the service level cache at a specific moment, it's probably the best we can deliver to provide a reproducer of the issue this patch is resolving. Fixes scylladb/scylladb#24792 Backport: The impact of the bug was minimal as it only affected the shutdown. However, since CI is failing because of it, let's backport the change to all supported versions. 
- (cherry picked from commit `7d0086b093`) - (cherry picked from commit `34afb6cdd9`) - (cherry picked from commit `e929279d74`) - (cherry picked from commit `dd5a35dc67`) - (cherry picked from commit `fc1c41536c`) Parent PR: #25478 Closes scylladb/scylladb#25753 * github.com:scylladb/scylladb: service/qos: Move effective SL cache to auth_integration service/qos: Add auth::service to auth_integration service/qos: Reload effective SL cache conditionally service/qos: Add gate to auth_integration service/qos: Introduce auth_integration	2025-09-04 11:38:17 +03:00
Pavel Emelyanov	d484837a2a	Merge '[Backport 2025.3] db/hints: Improve logs' from Scylladb[bot] Before these changes, the logs in hinted handoff often didn't provide crucial information like the identifier of the node that hints were being sent to. Also, some of the logs were misleading and referred to other places in the code than the one where an exception or some other situation really occurred. We modify those logs, extending them by more valuable information and fixing existing issues. What's more, all of the logs in `hint_endpoint_manager` and `hint_sender` follow a consistent format now: ``` <class_name>[<destination host ID>]:<function_name>: <message> ``` This way, we should always have AT LEAST the basic information. Fixes scylladb/scylladb#25466 Backport: There is no risk in backporting these changes. They only have impact on the logs. On the other hand, they might prove helpful when debugging an issue in hinted handoff. - (cherry picked from commit `2327d4dfa3`) - (cherry picked from commit `d7bc9edc6c`) - (cherry picked from commit `6f1fb7cfb5`) Parent PR: #25470 Closes scylladb/scylladb#25538 * github.com:scylladb/scylladb: db/hints: Add new logs db/hints: Adjust log levels db/hints: Improve logs	2025-09-04 11:36:30 +03:00
Pavel Emelyanov	ad6dbcfdc5	Merge '[Backport 2025.3] generic server: 2 step shutdown' from Scylladb[bot] This PR implements solution proposed in scylladb/scylladb#24481 Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections. The updated shutdown process is as follows: 1. Initial Shutdown Phase * Close the accept gate to block new incoming connections. * Abort all accept() calls. * For all active connections: * Close only the input side of the connection to prevent new requests. * Keep the output side open to allow responses to be sent. 2. Drain Phase * Wait for all in-progress requests to either complete or fail. 3. Final Shutdown Phase * Fully close all connections. Fixes scylladb/scylladb#24481 - (cherry picked from commit `122e940872`) - (cherry picked from commit `3848d10a8d`) - (cherry picked from commit `3610cf0bfd`) - (cherry picked from commit `27b3d5b415`) - (cherry picked from commit `061089389c`) - (cherry picked from commit `7334bf36a4`) - (cherry picked from commit `ea311be12b`) - (cherry picked from commit `4f63e1df58`) Parent PR: #24499 Closes scylladb/scylladb#25519 * github.com:scylladb/scylladb: test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`, decrease request timeout. generic_server: Two-step connection shutdown. transport: consmetic change, remove extra blanks. transport: Handle sleep aborted exception in sleep_until_timeout_passes generic_server: replace empty destructor with `= default` generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output` generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class. test: Add test for query execution during CQL server shutdown	2025-09-04 11:35:55 +03:00
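The three phases listed in the merge commit above can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not the `generic_server` implementation: the `Server` and `Connection` classes, the `in_flight` set, and the `drain_timeout` parameter are hypothetical, and aborting pending `accept()` calls is not modeled.

```python
import asyncio


class Connection:
    """Hypothetical connection with independently closable halves."""

    def __init__(self):
        self.input_open = True
        self.output_open = True

    def shutdown_input(self):
        self.input_open = False

    def shutdown_output(self):
        self.output_open = False


class Server:
    """Hypothetical server that tracks connections and in-flight requests."""

    def __init__(self):
        self.accepting = True
        self.connections = []
        self.in_flight = set()

    def start_request(self, coro):
        # Must be called from within a running event loop.
        task = asyncio.create_task(coro)
        self.in_flight.add(task)
        task.add_done_callback(self.in_flight.discard)
        return task

    async def shutdown(self, drain_timeout):
        # 1. Initial phase: stop accepting and close only the input side,
        #    so no new requests arrive but responses can still be sent.
        self.accepting = False
        for conn in self.connections:
            conn.shutdown_input()
        # 2. Drain phase: wait for in-progress requests to complete or fail.
        if self.in_flight:
            await asyncio.wait(set(self.in_flight), timeout=drain_timeout)
        # 3. Final phase: fully close all connections.
        for conn in self.connections:
            conn.shutdown_output()
```

A request that is still running when `shutdown()` starts sees its connection with the input side closed and the output side open, which is what lets it send its response before the final phase.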
Ran Regev	a79cbd9a9a	docs: backup and restore feature added backup and restore as a feature to documentation Signed-off-by: Ran Regev <ran.regev@scylladb.com> Closes scylladb/scylladb#25608 (cherry picked from commit `515d9f3e21`) Closes scylladb/scylladb#25748	2025-09-03 12:37:45 +03:00
Emil Maskovsky	4e17330a1b	test/gossiper: add reproducible test for race condition during node decommission This change introduces a targeted test that simulates the gossiper race condition observed during node decommissioning. The test delays gossip state application and host ID lookup to reliably reproduce the scenario where `gossiper::get_host_id()` is called on a removed endpoint, potentially triggering an abort in `apply_new_states`. There is a specific error injection added to widen the race window, in order to increase the likelihood of hitting the race condition. The error injection is designed to delay the application of gossip state updates, for the specific node that is being decommissioned. This should then result in the server abort in the gossiper. Refs: scylladb/scylladb#25621 Fixes: scylladb/scylladb#25721 Backport: The test is primarily for an issue found in 2025.1, so it needs to be backported to all the 2025.x branches. Closes scylladb/scylladb#25685 (cherry picked from commit `5dac4b38fb`) Closes scylladb/scylladb#25781	2025-09-02 08:29:27 +02:00
Ferenc Szili	6a7a5f5edc	test: reproducer and test for drop with concurrent cleanup This change adds a reproducer and test for issue #25706 (cherry picked from commit `1b8a44af75`)	2025-09-02 02:18:56 +00:00
Ferenc Szili	34b403747a	truncate: check for closed storage group's gate in discard_sstables Consider the following scenario: - A tablet is migrated away from a shard - The tablet cleanup stage closes the storage group's async_gate - A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate - Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash This patch makes discard_sstables check if the storage group's gate is closed when checking for disabled compaction. (cherry picked from commit `a0934cf80d`)	2025-09-02 02:18:56 +00:00
Piotr Dulikowski	debc637ac1	Merge '[Backport 2025.3] system_keyspace: add peers cache to get_ip_from_peers_table' from Scylladb[bot] The gossiper can call `storage_service::on_change` frequently (see scylladb/scylla-enterprise#5613), which may cause high CPU load and even trigger OOMs or related issues. This PR adds a temporary cache for `system.peers` to resolve host_id -> ip without hitting storage on every call. The cache is short-lived to handle the unlikely case where `system.peers` is updated directly via CQL. This is a temporary fix; a more thorough solution is tracked in https://github.com/scylladb/scylladb/issues/25620. Fixes scylladb/scylladb#25660 backport: this patch needs to be backported to all supported versions (2025.1/2/3). - (cherry picked from commit `91c633371e`) - (cherry picked from commit `de5dc4c362`) - (cherry picked from commit `4b907c7711`) Parent PR: #25658 Closes scylladb/scylladb#25766 * github.com:scylladb/scylladb: storage_service: move get_host_id_to_ip_map to system_keyspace system_keyspace: use peers cache in get_ip_from_peers_table storage_service: move get_ip_from_peers_table to system_keyspace	2025-09-01 21:21:26 +02:00
Taras Veretilnyk	ddb7c8ea12	keys: from_nodetool_style_string don't split single partition keys Users with single-column partition keys that contain colon characters were unable to use certain REST APIs and 'nodetool' commands, because the API split key by colon regardless of the partition key schema. Affected commands: - 'nodetool getendpoints' - 'nodetool getsstables' Affected endpoints: - '/column_family/sstables/by_key' - '/storage_service/natural_endpoints' Refs: #16596 - This does not fully fix the issue, as users with compound keys will face the issue if any column of the partition key contains a colon character. Closes scylladb/scylladb#24829 Closes scylladb/scylladb#25565	2025-09-01 15:36:56 +03:00
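The behaviour described in the commit above can be illustrated with a short sketch. It is not the C++ `from_nodetool_style_string` code: the function name `split_nodetool_key` and its `partition_key_columns` parameter are hypothetical and only show the idea of consulting the schema before splitting on colons.

```python
def split_nodetool_key(key: str, partition_key_columns: int) -> list[str]:
    """Split a nodetool-style key string into partition key components.

    Hypothetical helper: with a single-column partition key the string is
    returned whole, so colons inside the value are preserved.
    """
    if partition_key_columns < 1:
        raise ValueError("a partition key has at least one column")
    if partition_key_columns == 1:
        return [key]
    # Compound keys are still split on every colon, so a colon inside one
    # of the components stays ambiguous (the limitation the commit notes).
    return key.split(":")
```

With this sketch, `split_nodetool_key("2025-09-01T10:30", 1)` keeps the timestamp intact, while a two-column key such as `"user:42"` is still split into two components.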
Petr Gusev	c4386c2aa4	storage_service: move get_host_id_to_ip_map to system_keyspace Reimplemented the function to use the peers cache. It could be replaced with get_ip_from_peers_table, but that would create a coroutine frame for each call. (cherry picked from commit `4b907c7711`)	2025-09-01 11:22:55 +02:00
Petr Gusev	7ec3e166c6	system_keyspace: use peers cache in get_ip_from_peers_table The storage_service::on_change method can be called quite often by the gossiper, see scylladb/scylla-enterprise#5613. In this commit we introduce a temporary cache for system.peers so that we don't have to go to the storage each time we need to resolve host_id -> ip. We keep the cache only for a short time to handle the (unlikely) scenario when the user wants to update the system.peers table from CQL. Fixes scylladb/scylladb#25660 (cherry picked from commit `de5dc4c362`)	2025-09-01 11:22:05 +02:00
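A short-lived cache of the kind described in the commit above might look like the following sketch. It is a generic illustration, not the `system_keyspace` code: the `ShortLivedCache` class, the `loader` callback, and the `ttl_seconds` value are assumptions made for the example.

```python
import time


class ShortLivedCache:
    """Cache the result of `loader` for at most `ttl_seconds`."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader          # hypothetical: reads the full mapping
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so the expiry is testable
        self._value = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        # Reload when nothing is cached yet or the cached copy has expired.
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._value = self._loader()
            self._loaded_at = now
        return self._value

    def invalidate(self):
        # Called by code paths that mutate the underlying table.
        self._loaded_at = None
```

The expiry bounds how long an out-of-band change (such as a direct CQL update) can go unnoticed, while frequent callers are served from memory in between.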
Petr Gusev	5f8664757a	storage_service: move get_ip_from_peers_table to system_keyspace We plan to add a cache to get_ip_from_peers_table in upcoming commits. It's more convenient to do this from system_keyspace, since the only two methods that mutate system.peers (remove_endpoint and update_peers_info) are already there. (cherry picked from commit `91c633371e`)	2025-09-01 11:21:55 +02:00
Calle Wilund	2e08d651a8	system_keyspace: Limit parallelism in drop_truncation_records Fixes #25682 Refs scylla-enterprise#5580 If the truncation table is large in entries, we might create a huge parallel execution, quite possibly consuming loads of resources doing something quite trivial. Limit concurrency to a small-ish number Closes scylladb/scylladb#25678 (cherry picked from commit `2eccd17e70`) Closes scylladb/scylladb#25751	2025-09-01 09:13:44 +03:00
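Limiting concurrency as the commit above describes can be sketched with a semaphore. This is an analogy in Python rather than the Seastar code used by `drop_truncation_records`; the helper name `for_each_limited` and the `max_concurrency` parameter are hypothetical.

```python
import asyncio


async def for_each_limited(items, func, max_concurrency):
    """Run `func(item)` for every item, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        # Each task waits here until one of the limited slots is free.
        async with semaphore:
            await func(item)

    await asyncio.gather(*(run_one(item) for item in items))
```

All items are still processed, but the number of operations in flight at any moment never exceeds the chosen limit, regardless of how many entries the table holds.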
Emil Maskovsky	05f8b0d543	storage: pass host_id as parameter to `maybe_reconnect_to_preferred_ip()` Previously, `maybe_reconnect_to_preferred_ip()` retrieved the host ID using `gossiper::get_host_id()`. Since the host ID is already available in the calling function, we now pass it directly as a parameter. This change simplifies the code and eliminates a potential race condition where `gossiper::get_host_id()` could fail, as described in scylladb/scylladb#25621. Refs: scylladb/scylladb#25621 Fixes: scylladb/scylladb#25715 Backport: Recommended for 2025.x release branches to avoid potential issues from unnecessary calls to `gossiper::get_host_id()` in subscribers. (cherry picked from commit `cfc87746b6`) Closes scylladb/scylladb#25718	2025-09-01 09:13:21 +03:00
kendrick-ren	d7a36c6d8e	Update launch-on-gcp.rst Add the missing '=' mark in --zone option. Otherwise the command complains. Closes scylladb/scylladb#25471 (cherry picked from commit `d6e62aeb6a`) Closes scylladb/scylladb#25645	2025-09-01 09:11:21 +03:00
Benny Halevy	2a6791d246	api: storage_service: fix token_range documentation Note that the token_range type is used only by describe_ring. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25609 (cherry picked from commit `45c496c276`) Closes scylladb/scylladb#25640	2025-09-01 09:11:12 +03:00
Pavel Emelyanov	95d953b7b9	Merge '[Backport 2025.3] cql3: Warn when creating RF-rack-invalid keyspace' from Scylladb[bot] Although RF-rack-valid keyspaces are not universally enforced yet (they're governed by the configuration option `rf_rack_valid_keyspaces`), we'd like to encourage the user to abide by the restriction. To that end, we're introducing a warning when creating or altering a keyspace. If the configuration option is disabled, but the user is trying to create an RF-rack-invalid keyspace, they'll receive a warning. If the option is turned off, we will also log all of the RF-rack-invalid keyspaces at start-up. We provide validation tests. Fixes scylladb/scylladb#23330 Backport: we'd like to encourage the user to abide by the restriction even when they don't enforce it to make it easier in the future to adjust the schema when there's no way to disable it anymore. Because of that, we'd like to backport it to all relevant versions, starting with 2025.1. - (cherry picked from commit `60ea22d887`) - (cherry picked from commit `af8a3dd17b`) - (cherry picked from commit `837d267cbf`) Parent PR: #24785 Closes scylladb/scylladb#25635 * github.com:scylladb/scylladb: main: Log RF-rack-invalid keyspaces at startup cql3/statements: Fix indentation cql3: Warn when creating RF-rack-invalid keyspace	2025-09-01 09:11:01 +03:00
David Garcia	3db935f30f	docs: expose alternator metrics Renders in the docs some metrics introduced in https://github.com/scylladb/scylladb/pull/24046/files that were not being displayed in https://docs.scylladb.com/manual/stable/reference/metrics.html Closes scylladb/scylladb#25561 (cherry picked from commit `c3c70ba73f`) Closes scylladb/scylladb#25629	2025-09-01 09:10:41 +03:00
Michał Chojnowski	05ea29ee8d	sstables/types.hh: fix fmt::formatter<sstables::deletion_time> Obvious typo. Fixes scylladb/scylladb#25556 Closes scylladb/scylladb#25557 (cherry picked from commit `c1b513048c`) Closes scylladb/scylladb#25588	2025-09-01 09:10:21 +03:00
Dawid Mędrek	7f58681482	db/commitlog: Extend error messages for corrupted data We're providing additional information in error messages when throwing an exception related to data corruption: when a segment is truncated and when its content is invalid. That might prove helpful when debugging. Closes scylladb/scylladb#25190 (cherry picked from commit `408b45fa7e`) Closes scylladb/scylladb#25461	2025-09-01 09:08:29 +03:00
Andrei Chekun	8163f4edaa	test.py: use unique hostname for Minio To avoid the situation where the port is already occupied on localhost, use a unique hostname for Minio (cherry picked from commit `c6c3e9f492`) Closes scylladb/scylladb#24775	2025-09-01 08:59:00 +03:00
Pavel Emelyanov	0c6c507704	Merge '[Backport 2025.3] test.py: add missed parameters that should be passed from test.py to pytest' from Scylladb[bot] Several parameters that `test.py` should pass to pytest->boost were missing. This PR adds handling of these parameters: `--random-seed` and `--x-log2-compaction-groups` Since 2025.3 is affected by this issue and this is only a framework change, a backport to that version is needed. Fixes: https://github.com/scylladb/scylladb/issues/24927 - (cherry picked from commit `71b875c932`) - (cherry picked from commit `f7c7877ba6`) Parent PR: #24928 Closes scylladb/scylladb#25035 * github.com:scylladb/scylladb: test.py: add bypassing x_log2_compaction_groups to boost tests test.py: add bypassing random seed to boost tests	2025-09-01 08:58:48 +03:00
Jenkins Promoter	3da82e8572	Update pgo profiles - aarch64	2025-09-01 05:24:53 +03:00
Jenkins Promoter	16c7bd4c6e	Update pgo profiles - x86_64	2025-09-01 05:01:14 +03:00
Jenkins Promoter	3c922d68f0	Update ScyllaDB version to: 2025.3.1	2025-08-31 11:05:24 +03:00
Calle Wilund	fe87af4674	commitlog: Ensure segment deletion is re-entrant Fixes #25709 If we have large allocations, spanning more than one segment, and the internal segment references from lead to secondary are the only thing keeping a segment alive, the implicit drop in discard_unused_segments and orphan_all can cause a recursive call to discard_unused_segments, which in turn can lead to vector corruption/crash, or even double free of segment (iterator confusion). Need to separate the modification of the vector (_segments) from actual releasing of objects. Using temporaries is the easiest solution. To further reduce recursion, we can also do an early clear of segment dependencies in callbacks from segment release (cf release). Closes scylladb/scylladb#25719 (cherry picked from commit `cc9eb321a1`) Closes scylladb/scylladb#25756	2025-08-30 18:50:47 +03:00
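The idea of separating container modification from object release, as described in the commit above, can be shown with a simplified model. This is an analogy only: the `Segment` and `SegmentManager` classes, the `dependents` list, and the `released` log are hypothetical and do not mirror the commitlog's C++ types.

```python
class Segment:
    """Hypothetical segment whose release can re-enter the manager."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner
        self.in_use = True
        self.dependents = []  # secondary segments kept alive by this one

    def release(self):
        # Dropping a lead segment frees its dependents, which in turn
        # triggers another discard pass on the owner (the re-entrant call).
        for dep in self.dependents:
            dep.in_use = False
        self.dependents = []
        self.owner.discard_unused_segments()


class SegmentManager:
    def __init__(self):
        self.segments = []
        self.released = []

    def discard_unused_segments(self):
        # Step 1: update the container first, moving unused segments into
        # a temporary list.
        unused = [s for s in self.segments if not s.in_use]
        self.segments = [s for s in self.segments if s.in_use]
        # Step 2: only now release the objects. A re-entrant call sees a
        # consistent container and cannot pick up the same segment twice.
        for seg in unused:
            self.released.append(seg.name)
            seg.release()
```

Because each pass removes segments from the container before releasing them, a nested pass started from `release()` operates on a list that is not being iterated, and every segment is released exactly once.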
Dawid Mędrek	5c947a936e	service/qos: Move effective SL cache to auth_integration Since `auth_integration` manages effective service levels, let's move the relevant cache from `service_level_controller` to it. (cherry picked from commit `fc1c41536c`)	2025-08-29 22:58:21 +00:00
Dawid Mędrek	fcdd21948d	service/qos: Add auth::service to auth_integration The new service, `auth_integration`, has taken over the responsibility for managing effective service levels from `service_level_controller`. However, before these changes, it still accessed `auth::service` via the service level controller. Let's change that. Note that we also remove a check that `auth::service` has been initialized. It's not necessary anymore because the lifetime of `auth_integration` is strictly nested within the lifetime of `auth::service`. In actuality, `service_level_controller` should lose its reference to `auth::service` completely. All of the management of effective service levels has already been moved to `auth_integration`. However, the reference is still needed when dropping a distributed service level because we need to update the corresponding attribute for relevant roles. That should not lead to invalid accesses, though. Dropping a service level should not be possible when `auth::service` is not initialized. (cherry picked from commit `dd5a35dc67`)	2025-08-29 22:58:21 +00:00
Dawid Mędrek	aea5805c1f	service/qos: Reload effective SL cache conditionally Since `service_level_controller` outlives `auth_integration`, it may happen that we try to access it when it has already been deinitialized. To prevent that, we only try to reload or clear the effective service level cache when the object is still alive. These changes solve an existing problem with an invalid memory access. For more context, see issue scylladb/scylladb#24792. We provide a reproducer test that consistently fails before these changes but passes after them. Fixes scylladb/scylladb#24792 (cherry picked from commit `e929279d74`)	2025-08-29 22:58:20 +00:00
Dawid Mędrek	753305763a	service/qos: Add gate to auth_integration We add a named gate to `auth_integration` that will aid us in synchronizing ongoing tasks with stopping the service. (cherry picked from commit `34afb6cdd9`)	2025-08-29 22:58:20 +00:00
Dawid Mędrek	4b69c74385	service/qos: Introduce auth_integration We introduce a new type, `auth_integration`, that will be used internally by `service_level_controller`. Its purpose is to take over the responsibility for managing effective service levels. The main problem of the current implementation of service level controller is its dependency on `auth::service` whose lifetime is strictly nested within the lifetime of service level controller. That may lead, and already has led, to invalid memory accesses; for an example, see issue scylladb/scylladb#24792. Our strategy is to split service level controller into smaller parts and ensure that we access `auth::service` only when it's valid to do so. This commit is the first step towards that. We don't change anything in the logic yet, just add the new type. Further adjustments will be made in following commits. (cherry picked from commit `7d0086b093`)	2025-08-29 22:58:20 +00:00
Jenkins Promoter	d9e492a90c	Update ScyllaDB version to: 2025.3.0	2025-08-27 14:38:30 +03:00
Andrei Chekun	6ee92600e2	test.py: add bypassing x_log2_compaction_groups to boost tests Bypassing argument to pytest->boost that was missing. (cherry picked from commit `f7c7877ba6`)	2025-08-25 15:15:30 +02:00
Andrei Chekun	c55919242d	test.py: add bypassing random seed to boost tests Bypassing argument to pytest->boost that was missing. Fixes: https://github.com/scylladb/scylladb/issues/24927 (cherry picked from commit `71b875c932`)	2025-08-25 15:14:52 +02:00
Dawid Mędrek	9652a1260f	main: Log RF-rack-invalid keyspaces at startup When the configuration option `rf_rack_valid_keyspaces` is enabled and there is an RF-rack-invalid keyspace, starting a node fails. However, when the configuration option is disabled, but there still is a keyspace that violates the condition, we'd like Scylla to print a warning informing the user about the fact. That's what happens in this commit. We provide a validation test. (cherry picked from commit `837d267cbf`)	2025-08-22 14:31:49 +00:00
Dawid Mędrek	5f13044627	cql3/statements: Fix indentation (cherry picked from commit `af8a3dd17b`)	2025-08-22 14:31:49 +00:00
Dawid Mędrek	cd795170b4	cql3: Warn when creating RF-rack-invalid keyspace Although RF-rack-valid keyspaces are not universally enforced yet (they're governed by the configuration option `rf_rack_valid_keyspaces`), we'd like to encourage the user to abide by the restriction. To that end, we're introducing a warning when creating or altering a keyspace. If the configuration option is disabled, but the user is trying to create an RF-rack-invalid keyspace, they'll receive a warning. We provide a validation test. (cherry picked from commit `60ea22d887`)	2025-08-22 14:31:49 +00:00
Ferenc Szili	acb542606e	test: remove test_tombstone_gc_disabled_on_pending_replica The test test_tombstone_gc_disabled_on_pending_replica was added when we fixed (#20788) the potential problem with data resurrection during file based streaming. The issue was occurring only in Enterprise, but we added the fix in OSS to limit code divergence. This test was added together with the fix in OSS with the idea to guard this change in OSS. The real reproducer and test for this fix was added later, after the fix was ported into Enterprise. It is in: test/cluster/test_resurrection.py Since Enterprise has been merged into OSS, there is no more need to keep the test test_tombstone_gc_disabled_on_pending_replica. Also, it is flaky with very low probability of failure, making it difficult to investigate the cause of failure. Fixes: #22182 Refs: scylladb/scylladb#25448 Closes scylladb/scylladb#25134 (cherry picked from commit `7ce96345bf`) Closes scylladb/scylladb#25573	2025-08-19 16:01:22 +03:00
Piotr Dulikowski	8bd92d4dd0	Merge '[Backport 2025.3] test: test_mv_backlog: fix to consider internal writes' from Scylladb[bot] The PR fixes a test flakiness issue in test_mv_backlog related to reading metrics. The first commit fixes a more general issue in the ScyllaMetrics helper class where it doesn't return the value of all matching lines when a specific shard is requested, but it breaks after the first match. The second commit fixes a test issue where it expects exactly one write to be throttled, not taking into account other internal writes that may be executed during this time. Fixes https://github.com/scylladb/scylladb/issues/23139 backport to improve CI stability - test only change - (cherry picked from commit `5c28cffdb4`) - (cherry picked from commit `276a09ac6e`) Parent PR: #25279 Closes scylladb/scylladb#25475 * github.com:scylladb/scylladb: test: test_mv_backlog: fix to consider internal writes test/pylib/rest_client: fix ScyllaMetrics filtering	2025-08-19 09:48:01 +02:00
Patryk Jędrzejczak	e631d2e872	test: test_maintenance_socket: use cluster_con for driver sessions The test creates all driver sessions by itself. As a consequence, all sessions use the default request timeout of 10s. This can be too low for the debug mode, as observed in scylladb/scylla-enterprise#5601. In this commit, we change the test to use `cluster_con`, so that the sessions have the request timeout set to 200s from now on. Fixes scylladb/scylla-enterprise#5601 This commit changes only the test and is a CI stability improvement, so it should be backported all the way to 2024.2. 2024.1 doesn't have this test. Closes scylladb/scylladb#25510 (cherry picked from commit `03cc34e3a0`) Closes scylladb/scylladb#25547	2025-08-18 16:41:03 +02:00
Dawid Mędrek	d12fdcaa75	db/hints: Add new logs We're adding new logs in just a few places that may however prove important when debugging issues in hinted handoff in the future. (cherry picked from commit `6f1fb7cfb5`)	2025-08-18 16:02:01 +02:00
Dawid Mędrek	325831afad	db/hints: Adjust log levels Some of the logs could be clogging Scylla's logs, so we demote their level to a lower one. On the other hand, some of the logs would most likely not do that, and they could be useful when debugging -- we promote them to debug level. (cherry picked from commit `d7bc9edc6c`)	2025-08-18 16:02:00 +02:00
Dawid Mędrek	7b212edd0c	db/hints: Improve logs Before these changes, the logs in hinted handoff often didn't provide crucial information like the identifier of the node that hints were being sent to. Also, some of the logs were misleading and referred to other places in the code than the one where an exception or some other situation really occurred. We modify those logs, extending them by more valuable information and fixing existing issues. What's more, all of the logs in `hint_endpoint_manager` and `hint_sender` follow a consistent format now: ``` <class_name>[<destination host ID>]:<function_name>: <message> ``` This way, we should always have AT LEAST the basic information. (cherry picked from commit `2327d4dfa3`)	2025-08-18 16:01:57 +02:00
Sergey Zolotukhin	bad157453b	test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`, decrease request timeout. In debug mode, queries may sometimes take longer than the default 30 seconds. To address this, the timeout value `request_timeout_on_shutdown_in_seconds` during tests is aligned with other request timeouts. Change request timeout for tests from 180s to 90s since we must keep the request timeout during shutdown significantly lower than the graceful shutdown timeout(2m), or else a request timeout would cause a graceful shutdown timeout and fail a test. (cherry picked from commit `4f63e1df58`)	2025-08-18 15:47:08 +02:00
Sergey Zolotukhin	9b7886ed71	generic_server: Two-step connection shutdown. When shutting down in `generic_server`, connections are now closed in two steps. First, only the RX (receive) side is shut down. Then, after all ongoing requests have completed or a timeout has occurred, the connections are fully closed. Fixes scylladb/scylladb#24481 (cherry picked from commit `ea311be12b`)	2025-08-18 15:46:46 +02:00
Sergey Zolotukhin	e2aed2e860	transport: consmetic change, remove extra blanks. (cherry picked from commit `7334bf36a4`)	2025-08-18 14:55:16 +02:00
Anna Stuchlik	977a4a110a	doc: add support for RHEL 10 This commit adds RHEL 10 to the list of supported platforms. Fixes https://github.com/scylladb/scylladb/issues/25436 Closes scylladb/scylladb#25437 (cherry picked from commit `1322f301f6`) Closes scylladb/scylladb#25447	2025-08-18 12:20:19 +03:00
Wojciech Przytuła	666985bbe0	Fix link to ScyllaDB manual The link would point to outdated OS docs. I fixed it to point to up-to-date Enterprise docs. Closes scylladb/scylladb#25328 (cherry picked from commit `7600ccfb20`) Closes scylladb/scylladb#25486	2025-08-15 13:31:06 +03:00
Wojciech Mitros	bb6e681b58	test: run mv tests depending on metrics on a standalone instance The test_base_partition_deletion_with_metrics test case (and the batch variant) uses the metric of view updates done during its runtime to check if we didn't perform too many of them. The test runs in the cqlpy suite, which runs all test cases sequentially on one Scylla instance. Because of this, if another test case starts a process which generates view updates and doesn't wait for it to finish before it exits, we may observe too many view updates in test_base_partition_deletion_with_metrics and fail the test. In all test cases we make sure that all tables that were created during the test are dropped at the end. However, that doesn't stop the view building process immediately, so the issue can happen even if we drop the view. I confirmed it by adding a test just before test_base_partition_deletion_with_metrics which builds a big materialized view and drops it at the end - the metrics check still failed. The issue could be caused by any of the existing test cases where we create a view and don't wait for it to be built. Note that even if we start adding rows after creating the view, some of them may still be included in the view building, as the view building process is started asynchronously. In such a scenario, the view building also doesn't cause any issues with the data in these tests - writes performed after view creation generate view updates synchronously when they're local (and we're running a single Scylla server), so the corresponding view updates generated during view building are redundant. Because we have many test cases which could be causing this issue, instead of waiting for the view building to finish in every single one of them, we move the susceptible test cases to be run on separate Scylla instances, in the "cluster" suite. There, no other test cases will influence the results. 
Fixes https://github.com/scylladb/scylladb/issues/20379 Closes scylladb/scylladb#25209 (cherry picked from commit `2ece08ba43`) Closes scylladb/scylladb#25504	2025-08-15 13:30:53 +03:00
Ernest Zaslavsky	8a017834a0	s3_client: add memory fallback in `chunked_download_source` Introduce fallback logic in `chunked_download_source` to handle memory exhaustion. When memory is low, feed the `deque` with only one uncounted buffer at a time. This allows slow but steady progress without getting stuck on the memory semaphore. Fixes: https://github.com/scylladb/scylladb/issues/25453 Fixes: https://github.com/scylladb/scylladb/issues/25262 Closes scylladb/scylladb#25452 (cherry picked from commit `dd51e50f60`) Closes scylladb/scylladb#25511	2025-08-15 13:30:38 +03:00
Anna Stuchlik	d8d5ab1032	doc: document support for new z3 instance types This commit adds new z3 instances we now support to the list of GCP instance types. Fixes https://github.com/scylladb/scylladb/issues/25438 Closes scylladb/scylladb#25446 (cherry picked from commit `841ba86609`) Closes scylladb/scylladb#25512	2025-08-15 13:30:11 +03:00
Andrzej Jackowski	82ee1bf9cb	test: audit: add logging of get_audit_log_list and set_of_rows_before Without those logs, analysing some test failures is difficult. Refs: scylladb/scylladb#25442 Closes scylladb/scylladb#25485 (cherry picked from commit `bf8be01086`) Closes scylladb/scylladb#25514	2025-08-15 13:29:56 +03:00
Abhinav Jha	5e018831f8	raft: replication test: change rpc_propose_conf_change test to SEASTAR_THREAD_TEST_CASE RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet loss named name_drops. The framework makes hard-coded assumptions about the leader which don't hold in case of packet loss. This short-term fix disables the packet drop variant of the specified test. It should be safe to re-enable it once the whole framework is reworked to remove these hard-coded assumptions. This PR fixes a bug. Hence we need to backport it. Fixes: scylladb/scylladb#23816 Closes scylladb/scylladb#25489 (cherry picked from commit `a0ee5e4b85`) Closes scylladb/scylladb#25528	2025-08-15 13:29:42 +03:00
Jenkins Promoter	a162e0256e	Update pgo profiles - aarch64	2025-08-15 05:28:08 +03:00
Jenkins Promoter	adbbbf87c3	Update pgo profiles - x86_64	2025-08-15 05:05:35 +03:00
Sergey Zolotukhin	e2dcd559b6	transport: Handle sleep aborted exception in sleep_until_timeout_passes In PR #23156, a new function `sleep_until_timeout_passes` was introduced to wait until a read request times out or completes. However, the function did not handle cases where the sleep is aborted via _abort_source, which could result in WARN messages like "Exceptional future is ignored" during shutdown. This change adds proper handling for that exception, eliminating the warning. (cherry picked from commit `061089389c`)	2025-08-14 13:22:36 +00:00
Sergey Zolotukhin	665530e479	generic_server: replace empty destructor with `= default` This change improves code readability by explicitly marking the destructor as defaulted. (cherry picked from commit `27b3d5b415`)	2025-08-14 13:22:36 +00:00
Sergey Zolotukhin	d729529226	generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output` This change improves logging and modifies the behavior to attempt closing the output side of a connection even if an error occurs while closing the input side. (cherry picked from commit `3610cf0bfd`)	2025-08-14 13:22:36 +00:00
Sergey Zolotukhin	2fef421534	generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class. The functions are just wrappers for _fd.shutdown_input() and _fd.shutdown_output(), with added error reporting. Needed by later changes. (cherry picked from commit `3848d10a8d`)	2025-08-14 13:22:36 +00:00
Sergey Zolotukhin	0f99fa76de	test: Add test for query execution during CQL server shutdown This test simulates a scenario where a query is being executed while the query coordinator begins shutting down the CQL server and client connections. The shutdown process should wait until the query execution is either completed or timed out. Test for scylladb/scylladb#24481 (cherry picked from commit `122e940872`)	2025-08-14 13:22:36 +00:00
Ernest Zaslavsky	c70ba8384e	s3_client: make memory semaphore acquisition abortable Add `abort_source` to the `get_units` call for the memory semaphore in the S3 client, allowing the acquisition process to be aborted. Fixes: https://github.com/scylladb/scylladb/issues/25454 Closes scylladb/scylladb#25469 (cherry picked from commit `380c73ca03`) Closes scylladb/scylladb#25499	2025-08-14 10:34:28 +02:00
Michael Litvak	761f722b6f	test: test_mv_backlog: fix to consider internal writes The test executes a single write, fetching metrics before and after the write, and expects the total throttled writes count to be increased exactly by one. However, other internal writes (compaction for example) may be executed during this time and be throttled, causing the metrics to be increased by more than expected. To address this, we filter the metrics by the scheduling group label of the user write, to filter out the compaction writes that run in the compaction scheduling group. Fixes scylladb/scylladb#23139 (cherry picked from commit `276a09ac6e`)	2025-08-12 14:51:10 +00:00
Michael Litvak	3a3b5bb14c	test/pylib/rest_client: fix ScyllaMetrics filtering In the ScyllaMetrics `get` function, when requesting the value for a specific shard, it is expected to return the sum of all values of metrics for that shard that match the labels. However, it would return the value of the first matching line it finds instead of summing all matching lines. For example, if we have two lines for one shard like: some_metric{scheduling_group_name="compaction",shard="0"} 1 some_metric{scheduling_group_name="sl:default",shard="0"} 2 The result of this call would be 1 instead of 3: get('some_metric', shard="0") We fix this to sum all matching lines. The filtering of lines by labels is fixed to allow specifying only some of the labels. Previously, for the line to match the filter, either the filter needs to be empty, or all the labels in the metric line had to be specified in the filter parameter and match its value, which is unexpected, and breaks when more labels are added. We also simplify the function signature and the implementation - instead of having the shard as a separate parameter, it can be specified as a label, like any other label. (cherry picked from commit `5c28cffdb4`)	2025-08-12 14:51:09 +00:00
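The filtering behaviour described in 3a3b5bb14c can be sketched roughly as follows. This is a minimal illustration, not the actual `ScyllaMetrics.get` implementation: the function name `sum_metric`, its signature, and the parsing regexes are assumptions made for the example. It sums every line of a metric whose labels match the given subset, treating `shard` as an ordinary label.

```python
import re

# Hypothetical helper for illustration only; not the real test/pylib code.
_LINE = re.compile(r'^(?P<name>[^{\s]+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')

def sum_metric(text, name, **labels):
    """Sum all lines of metric `name` whose labels include every given label.

    Labels that are not mentioned in the filter are ignored, so adding new
    labels to a metric does not break existing callers.
    Returns None when no line matches.
    """
    total = None
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if m is None or m.group("name") != name:
            continue
        line_labels = dict(_LABEL.findall(m.group("labels") or ""))
        if all(line_labels.get(k) == v for k, v in labels.items()):
            total = (total or 0.0) + float(m.group("value"))
    return total

sample = (
    'some_metric{scheduling_group_name="compaction",shard="0"} 1\n'
    'some_metric{scheduling_group_name="sl:default",shard="0"} 2\n'
)
by_shard = sum_metric(sample, "some_metric", shard="0")
by_group = sum_metric(sample, "some_metric", scheduling_group_name="compaction")
```

With the two sample lines from the commit message, filtering by `shard="0"` alone adds both values, while also filtering by scheduling group narrows the result to a single line, which is the approach 761f722b6f relies on.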
Patryk Jędrzejczak	b999aa85b9	docs: Raft recovery procedure: recommend verifying participation in Raft recovery This instruction adds additional safety. The faster we notice that a node didn't restart properly, the better. The old gossip-based recovery procedure had a similar recommendation to verify that each restarting node entered `RECOVERY` mode. Fixes #25375 This is a documentation improvement. We should backport it to all branches with the new recovery procedure, so 2025.2 and 2025.3. Closes scylladb/scylladb#25376 (cherry picked from commit `7b77c6cc4a`) Closes scylladb/scylladb#25440	2025-08-11 15:49:20 +02:00
Anna Stuchlik	a655c0e193	doc: add new and removed metrics to the 2025.3 upgrade guide This commit adds the list of new and removed metrics to the already existing upgrade guide from 2025.2 to 2025.3. Fixes https://github.com/scylladb/scylladb/issues/24697 Closes scylladb/scylladb#25385 (cherry picked from commit `f3d9d0c1c7`) Closes scylladb/scylladb#25416	2025-08-11 06:56:31 +03:00
Botond Dénes	9775b2768b	Merge '[Backport 2025.3] GCP Key Provider: Fix authentication issues' from Scylladb[bot] * Fix discovery of application default credentials by using fully expanded pathnames (no tildes). * Fix grant type in token request with user credentials. Fixes #25345. - (cherry picked from commit `77cc6a7bad`) - (cherry picked from commit `b1d5a67018`) Parent PR: #25351 Closes scylladb/scylladb#25407 * github.com:scylladb/scylladb: encryption: gcp: Fix the grant type for user credentials encryption: gcp: Expand tilde in pathnames for credentials file	2025-08-11 06:52:38 +03:00
Botond Dénes	2048ac88f1	Merge '[Backport 2025.3] test.py: native pytest repeats' from Scylladb[bot] Previous way of execution repeat was to launch pytest for each repeat. That was resource consuming, since each time pytest was doing discovery of the tests. Now all repeats are done inside one pytest process. Backport for 2025.3 is needed, since this functionality is framework only, and 2025.3 affected with this slow repeats as well. Fixes: https://github.com/scylladb/scylladb/issues/25391 - (cherry picked from commit `cc75197efd`) - (cherry picked from commit `557293995b`) - (cherry picked from commit `853bdec3ec`) - (cherry picked from commit `d0e4045103`) Parent PR: #25073 Closes scylladb/scylladb#25392 * github.com:scylladb/scylladb: test.py: add repeats in pytest test.py: add directories and filename to the log files test.py: rename log sink file for boost tests test.py: better error handling in boost facade	2025-08-11 06:51:55 +03:00
Szymon Malewski	4c375b257b	test/alternator: enable more relevant logs in CI. This patch sets, for alternator test suite, all 'alternator-*' loggers and 'paxos' logger to trace level. This should significantly ease debugging of failed tests, while it has no effect on test time and increases log size only by 7%. This affects running alternator tests only with `test.py`, not with `test/alternator/run`. Closes #24645 Closes scylladb/scylladb#25327 (cherry picked from commit `eb11485969`) Closes scylladb/scylladb#25383	2025-08-11 06:51:23 +03:00
Botond Dénes	ea6d0c880a	Merge '[Backport 2025.3] test: audit: ignore cassandra user audit logs in AUTH tests' from Scylladb[bot] Audit tests are vulnerable to noise from LOGIN queries (because AUTH audit logs can appear at any time). Most tests already use the `filter_out_noise` mechanism to remove this noise, but tests focused on AUTH verification did not, leading to sporadic failures. This change adds a filter to ignore AUTH logs generated by the default "cassandra" user, so tests only verify logs from the user created specifically for each test. Additionally, this PR: - Adds missing `nonlocal new_rows` statement that prevented some checks from being called - Adds a testcase for audit logs of `cassandra` user Fixes: https://github.com/scylladb/scylladb/issues/25069 Better backport those test changes to 2025.3. 2025.2 and earlier don't have `./cluster/dtest/audit_test.py`. - (cherry picked from commit `e634a2cb4f`) - (cherry picked from commit `daf1c58e21`) - (cherry picked from commit `aef6474537`) - (cherry picked from commit `21aedeeafb`) Parent PR: #25111 Closes scylladb/scylladb#25140 * github.com:scylladb/scylladb: test: audit: add cassandra user test case test: audit: ignore cassandra user audit logs in AUTH tests test: audit: change names of `filter_out_noise` parameters	2025-08-11 06:49:36 +03:00
Andrei Chekun	a4ea7b42c8	test.py: add repeats in pytest The previous way of executing repeats was to launch pytest for each repeat. That was resource-consuming, since pytest performed test discovery each time. Now all repeats are done inside one pytest process. (cherry picked from commit `d0e4045103`)	2025-08-08 15:27:25 +02:00
Benny Halevy	9a34622a47	scylla-sstable: print_query_results_json: continue loop if row is disengaged Otherwise it is accessed right when exiting the if block. Add a unit test reproducing the issue and validating the fix. Fixes #25325 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25326 (cherry picked from commit `5e5e63af10`) Closes scylladb/scylladb#25379	2025-08-08 11:43:34 +03:00
Nikos Dragazis	8838d8df5f	encryption: gcp: Fix the grant type for user credentials Exchanging a refresh token for an access token requires the "refresh_token" grant type [1]. [1] https://datatracker.ietf.org/doc/html/rfc6749#section-6 Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `b1d5a67018`)	2025-08-07 21:46:24 +00:00
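As a rough illustration of the grant type mentioned in 8838d8df5f, the form body of a refresh request per RFC 6749 section 6 can be built like this. The helper name `refresh_request_body` and the sample token and client values are hypothetical; the actual fix lives in ScyllaDB's C++ GCP key provider and is not reproduced here.

```python
from urllib.parse import urlencode, parse_qs

def refresh_request_body(refresh_token, client_id=None, client_secret=None):
    """Build the form-encoded body for exchanging a refresh token.

    RFC 6749 section 6 requires grant_type to be "refresh_token" and the
    token itself to be sent in the refresh_token parameter.
    """
    params = {"grant_type": "refresh_token", "refresh_token": refresh_token}
    # Client credentials are optional here; how they are sent depends on
    # the authorization server (assumption for this sketch).
    if client_id is not None:
        params["client_id"] = client_id
    if client_secret is not None:
        params["client_secret"] = client_secret
    return urlencode(params)

body = refresh_request_body("example-refresh-token", client_id="example-client")
fields = parse_qs(body)
```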
Nikos Dragazis	a69afb0d0b	encryption: gcp: Expand tilde in pathnames for credentials file The GCP host searches for application default credentials in known locations within the user's home directory using `seastar::file_exists()`. However, this function does not perform tilde expansion in pathnames. Replace tildes with the home directory from the HOME environment variable. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> (cherry picked from commit `77cc6a7bad`)	2025-08-07 21:46:24 +00:00
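The tilde replacement described in a69afb0d0b can be sketched as below. This is only an illustration of the idea in Python, not the C++ code from the patch; `expand_tilde` and the example paths are hypothetical. A leading `~` is replaced with the value of `HOME` before the path is handed to a function that does not expand it.

```python
import os

def expand_tilde(path, env=None):
    """Replace a leading '~' with $HOME; leave every other path untouched."""
    env = os.environ if env is None else env
    home = env.get("HOME")
    # Only "~" and "~/..." are handled; "~user/..." is left as-is (assumption).
    if home and (path == "~" or path.startswith("~/")):
        return home + path[1:]
    return path

fake_env = {"HOME": "/home/scylla"}
expanded = expand_tilde("~/credentials.json", fake_env)
unchanged = expand_tilde("/etc/credentials.json", fake_env)
```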
Andrei Chekun	8766750228	test.py: add directories and filename to the log files Currently, only the test function name is used for output and log files. For better clarity, add the file's path relative to the test directory, without the extension, to these file names. Before: test_aggregate_avg.1.log test_aggregate_avg_stdout.1.log After: boost.aggregate_fcts_test.test_aggregate_avg.1.log boost.aggregate_fcts_test.test_aggregate_avg_stdout.3.log (cherry picked from commit `853bdec3ec`)	2025-08-07 10:46:58 +00:00
Andrei Chekun	c4cefc5195	test.py: rename log sink file for boost tests Log sink is outputted in XML format not just simple text file. Renaming to have better clarity (cherry picked from commit `557293995b`)	2025-08-07 10:46:58 +00:00
Andrei Chekun	5f8e69a5d9	test.py: better error handling in boost facade If test was not executed for some reason, for example not known parameter passed to the test, but boost framework was able to finish correctly, log file will have data but it will be parsed to an empty list. This will raise an exception in pytest execution, rather than produce test output. This change will handle this situation. (cherry picked from commit `cc75197efd`)	2025-08-07 10:46:58 +00:00
Avi Kivity	0d54b72f21	Merge '[Backport 2025.3] truncate: change check for write during truncate into a log warning' from Scylladb[bot] TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised. The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands. This commit changes the on_internal_error() into a warning log entry. It also outputs to keyspace/table names, and the offending replay positions which caused the check to fail. This PR also adds a test which validates that TRUNCATE works correctly with concurrent writes. More specifically, it checks that: - all data written before TRUNCATE starts is deleted - none of the data after TRUNCATE completes is deleted Fixes: #25173 Fixes: #25013 Backport is needed in versions which check for truncate with concurrent writes using `on_internal_error()`: 2025.3 2025.2 2025.1 - (cherry picked from commit `268ec72dc9`) - (cherry picked from commit `33488ba943`) Parent PR: #25174 Closes scylladb/scylladb#25350 * github.com:scylladb/scylladb: truncate: add test for truncate with concurrent writes truncate: change check for write during truncate into a log warning	2025-08-07 12:19:45 +03:00
Andrzej Jackowski	1be1306233	test: audit: add cassandra user test case Audit tests use the `filter_out_noise` function to remove noise from audit logs generated by user authentication. As a result, none of the existing tests covered audit logs for the default `cassandra` user. This change adds a test case for that user. Refs: scylladb/scylladb#25069 (cherry picked from commit `21aedeeafb`)	2025-08-07 10:03:27 +02:00
Patryk Jędrzejczak	1863386bc8	Merge '[Backport 2025.3] Raft-based recovery procedure: simplify rolling restart with recovery_leader' from Scylladb[bot] The following steps are performed in sequence as part of the Raft-based recovery procedure: - set `recovery_leader` to the host ID of the recovery leader in `scylla.yaml` on all live nodes, - send the `SIGHUP` signal to all Scylla processes to reload the config, - perform a rolling restart (with the recovery leader being restarted first). These steps are not intuitive and more complicated than they could be. In this PR, we simplify these steps. From now on, we will be able to simply set `recovery_leader` on each node just before restarting it. Apart from making necessary changes in the code, we also update all tests of the Raft-based recovery procedure and the user-facing documentation. Fixes scylladb/scylladb#25015 The Raft-based procedure was added in 2025.2. This PR makes the procedure simpler and less error-prone, so it should be backported to 2025.2 and 2025.3. - (cherry picked from commit `ec69028907`) - (cherry picked from commit `445a15ff45`) - (cherry picked from commit `23f59483b6`) - (cherry picked from commit `ba5b5c7d2f`) - (cherry picked from commit `9e45e1159b`) - (cherry picked from commit `f408d1fa4f`) Parent PR: #25032 Closes scylladb/scylladb#25335 * https://github.com/scylladb/scylladb: docs: document the option to set recovery_leader later test: delay setting recovery_leader in the recovery procedure tests gossip: add recovery_leader to gossip_digest_syn db: system_keyspace: peers_table_read_fixup: remove rows with null host_id db/config, gms/gossiper: change recovery_leader to UUID db/config, utils: allow using UUID as a config option	2025-08-07 09:58:13 +02:00
Taras Veretilnyk	606db56cf3	docs: Sort commands list in nodetool.rst Fixes scylladb/scylladb#25330 Closes scylladb/scylladb#25331 (cherry picked from commit `bcb90c42e4`) Closes scylladb/scylladb#25372	2025-08-06 20:49:21 +03:00
Nikos Dragazis	26174a9c67	test: kmip: Fix segfault from premature destruction of port_promise `kmip_test_helper()` is a utility function to spawn a dedicated PyKMIP server for a particular Boost test case. The function runs the server as an external process and uses a thread to parse the port from the server's logs. The thread communicates the port to the main thread via a promise. The current implementation has a bug where the thread may set a value to the promise after its destruction, causing a segfault. This happens when the server does not start within 20 seconds, in which case the port future throws and the stack unwinding machinery destroys the port promise before the thread that writes to it. Fix the bug by declaring the promise before the cleanup action. The bug has been encountered in CI runs on slow machines, where the PyKMIP server takes too long to create its internal tables (due to slow fdatasync calls from SQLite). This patch does not improve CI stability - it only ensures that the error condition is properly reflected in the test output. This patch is not a backport. The same bug has been fixed in master as part of a larger rewrite of the `kmip_test_helper()` (see `722e2bce96`). Refs #24747, #24842. Fixes #24574. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#25030	2025-08-06 11:59:40 +03:00
Pavel Emelyanov	4fcf0a620c	Merge '[Backport 2025.3] Simplify credential reload: remove internal expiration checks' from Scylladb[bot] This PR introduces a refinement in how credential renewal is triggered. Previously, the system attempted to renew credentials one hour before their expiration, but the credentials provider did not recognize them as expired—resulting in a no-op renewal that returned existing credentials. This led the timer fiber to immediately retry renewal, causing a renewal storm. To resolve this, we remove expiration (or any other checks) in `reload` method, assuming that whoever calls this method knows what he does. Fixes: https://github.com/scylladb/scylladb/issues/25044 Should be backported to 2025.3 since we need this fix for the restore - (cherry picked from commit `68855c90ca`) - (cherry picked from commit `e4ebe6a309`) - (cherry picked from commit `837475ec6f`) Parent PR: #24961 Closes scylladb/scylladb#25347 * github.com:scylladb/scylladb: s3_creds: code cleanup s3_creds: Make `reload` unconditional s3_creds: Add test exposing credentials renewal issue	2025-08-06 11:33:19 +03:00
Aleksandra Martyniuk	2282a11405	tasks: change _finished_children type Parent task keeps a vector of statuses (task_essentials) of its finished children. When the number of children is large - for example because we have many tables and a child task is created for each table - we may hit an oversized allocation while adding a new child's essentials to the vector. Keep task_essentials of children in chunked_vector. Fixes: #25040. Closes scylladb/scylladb#25064 (cherry picked from commit `b5026edf49`) Closes scylladb/scylladb#25319	2025-08-06 07:36:04 +03:00
Michał Jadwiszczak	c31f47026d	storage_service, group0_state_machine: move SL cache update from `topology_state_load()` to `load_snapshot()` Currently the service levels cache is unnecessarily updated in every call of `topology_state_load()`. But it is enough to reload it only when a snapshot is loaded. (The cache is also already updated when there is a change to one of `service_levels_v2`, `role_members`, `role_attributes` tables.) Fixes scylladb/scylladb#25114 Fixes scylladb/scylladb#23065 Closes scylladb/scylladb#25116 (cherry picked from commit `10214e13bd`) Closes scylladb/scylladb#25305	2025-08-06 07:27:48 +03:00
Aleksandra Martyniuk	4f0e5bf429	api: storage_service: do not log the exception that is passed to user The exceptions that are thrown by the tasks started with API are propagated to users. Hence, there is no need to log it. Remove the logs about exception in user started tasks. Fixes: https://github.com/scylladb/scylladb/issues/16732. Closes scylladb/scylladb#25153 (cherry picked from commit `e607ef10cd`) Closes scylladb/scylladb#25298	2025-08-06 07:27:05 +03:00
Andrei Chekun	85769131a2	docs: update documentation with new way of running C++ tests Documentation had outdated information how to run C++ test. Additionally, some information added about gathered test metrics. Closes scylladb/scylladb#25180 (cherry picked from commit `a6a3d119e8`) Closes scylladb/scylladb#25291	2025-08-06 07:25:44 +03:00
Dawid Mędrek	c5e1e28076	test: Enable RF-rack-valid keyspaces in all Python suites We're enabling the configuration option `rf_rack_valid_keyspaces` in all Python test suites. All relevant tests have been adjusted to work with it enabled. That encompasses the following suites: * alternator, * broadcast_tables, * cluster (already enabled in scylladb/scylladb@ee96f8dcfc), * cql, * cqlpy (already enabled in scylladb/scylladb@be0877ce69), * nodetool, * rest_api. Two remaining suites that use tests written in Python, redis and scylla_gdb, are not affected, at least not directly. The redis suite requires creating an instance of Scylla manually, and the tests don't do anything that could violate the restriction. The scylla_gdb suite focuses on testing the capabilities of scylla-gdb.py, but even then it reuses the `run` file from the cqlpy suite. Fixes scylladb/scylladb#25126 Closes scylladb/scylladb#24617 (cherry picked from commit `b41151ff1a`) Closes scylladb/scylladb#25231	2025-08-06 07:17:40 +03:00
Tomasz Grabiec	71dd30fc25	topology_coordinator: Trigger load stats refresh after replace Otherwise, tablet rebuild will be delayed for up to 60s, as the tablet scheduler needs load stats for the new (replacing) node to make decisions. Fixes #25163 Closes scylladb/scylladb#25181 (cherry picked from commit `55116ee660`) Closes scylladb/scylladb#25216	2025-08-06 07:16:45 +03:00
Ferenc Szili	db3777c703	truncate: add test for truncate with concurrent writes test_validate_truncate_with_concurrent_writes checks if truncate deletes all the data written before the truncate starts, and does not delete any data after truncate completes. (cherry picked from commit `33488ba943`)	2025-08-06 00:52:15 +00:00
Ferenc Szili	0248f555da	truncate: change check for write during truncate into a log warning TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised. The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands. This commit changes the on_internal_error() into a warning log entry. It also outputs to keyspace/table names, the truncated_at timepoint, the offending replay positions which caused the check to fail. Fixes: #25173 Fixes: #25013 (cherry picked from commit `268ec72dc9`)	2025-08-06 00:52:15 +00:00
Ernest Zaslavsky	88779a6884	s3_creds: code cleanup Remove unnecessary code that is no longer used (cherry picked from commit `837475ec6f`)	2025-08-06 00:50:36 +00:00
Ernest Zaslavsky	ccdc98c8f0	s3_creds: Make `reload` unconditional Assume that any caller invoking `reload` intends to refresh credentials. Remove conditional logic that checks for expiration before reloading. (cherry picked from commit `e4ebe6a309`)	2025-08-06 00:50:36 +00:00
Ernest Zaslavsky	bd79ae3826	s3_creds: Add test exposing credentials renewal issue Add a test demonstrating that renewing credentials does not update their expiration. After requesting credentials again, the expiration remains unchanged, indicating no actual update occurred. (cherry picked from commit `68855c90ca`)	2025-08-06 00:50:36 +00:00
Botond Dénes	f212f6af28	Merge 'repair: distribute tablet_repair_task_metas between shards' from Aleksandra Martyniuk Currently, in repair_service::repair_tablets a shard that initiates the repair keeps repair_tablet_metas of all tablets that have a replica on this node (on any shard). This may lead to oversized allocations. Modify tablet_repair_task_impl to repair only the tablets which replicas are kept on this shard. Modify repair_service::repair_tablets to gather repair_tablet_metas only on local shard. repair_tablets is invoked on all shards. Add a new legacy_tablet_repair_task_impl that covers tablet repair started with async_repair. A user can use sequence number of this task to manage the repair using storage_service API. In a test that reproduced this, we have seen 11136 tablets and 5636096 bytes allocation failure. If we had a node with 250 shards, 100 tablets each, we could reach 12MB kept on one shard for the whole repair time. Fixes: https://github.com/scylladb/scylladb/issues/23632 Needs backport to all live branches as they are all vulnerable to such crashes. Closes scylladb/scylladb#24194 * github.com:scylladb/scylladb: repair: distribute tablet_repair_task_meta among shards repair: do not keep erm in tablet_repair_task_meta	2025-08-05 22:36:59 +03:00
Avi Kivity	a8193bd503	Merge '[Backport 2025.3] transport: remove throwing protocol_exception on connection start' from Dario Mirovic `protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future. This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future. There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it. Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance. In order to see if there is a measurable difference, a modified version of `test_protocol_version_mismatch` Python is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. First test run has been executed on the current code, with throwing protocol exceptions. Second test run has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test. 
Testing Build: `release` Test file: `test/cqlpy/test_protocol_exceptions.py` Test name: `test_protocol_version_mismatch` (modified for mass connection requests) Test arguments: ``` max_attempts=100'000 num_parallel=10 ``` Throwing `protocol_exception` results: ``` real=1:26.97 user=10:00.27 sys=2:34.55 cpu=867% real=1:26.95 user=9:57.10 sys=2:32.50 cpu=862% real=1:26.93 user=9:56.54 sys=2:35.59 cpu=865% real=1:26.96 user=9:54.95 sys=2:32.33 cpu=859% real=1:26.96 user=9:53.39 sys=2:33.58 cpu=859% real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862% # average ``` Returning `protocol_exception` as `result_with_exception` or an exceptional future: ``` real=1:18.46 user=9:12.21 sys=2:19.08 cpu=881% real=1:18.44 user=9:04.03 sys=2:17.91 cpu=869% real=1:18.47 user=9:12.94 sys=2:19.68 cpu=882% real=1:18.49 user=9:13.60 sys=2:19.88 cpu=883% real=1:18.48 user=9:11.76 sys=2:17.32 cpu=878% real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879% # average ``` This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567. Refs: #24567 Fixes: #25271 This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting. 
* (cherry picked from commit `7aaeed012e`) * (cherry picked from commit `30d424e0d3`) * (cherry picked from commit `9f4344a435`) * (cherry picked from commit `5390f92afc`) * (cherry picked from commit `4a6f71df68`) Parent PR: #24738 Closes scylladb/scylladb#25117 * github.com:scylladb/scylladb: test/cqlpy: add cpp exception metric test conditions transport/server: replace protocol_exception throws with returns utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception transport/server: avoid exception-throw overhead in handle_error test/cqlpy: add protocol_exception tests	2025-08-05 14:16:14 +03:00
Patryk Jędrzejczak	eb8ea703d5	docs: document the option to set recovery_leader later In one of the previous commits, we made it possible to set `recovery_leader` on each node just before restarting it. Here, we update the corresponding documentation. (cherry picked from commit `f408d1fa4f`)	2025-08-05 10:59:39 +00:00
Patryk Jędrzejczak	ac7945e044	test: delay setting recovery_leader in the recovery procedure tests In the previous commit, we made it possible to set `recovery_leader` on each node just before restarting it. Here, we change all the tests of the Raft-based recovery procedure to use and test this option. (cherry picked from commit `9e45e1159b`)	2025-08-05 10:59:39 +00:00
Patryk Jędrzejczak	79c27454d4	gossip: add recovery_leader to gossip_digest_syn In the new Raft-based recovery procedure, live nodes join the new group 0 one by one during a rolling restart. There is a time window when some of them are in the old group 0, while others are in the new group 0. This causes a group 0 mismatch in `gossiper::handle_syn_msg`. The current solution for this problem is to ignore group 0 mismatches if `recovery_leader` is set on the local node and to ask the administrator to perform the rolling restart in the following way: - set `recovery_leader` in `scylla.yaml` on all live nodes, - send the `SIGHUP` signal to all Scylla processes to reload the config, - proceed with the rolling restart. This commit makes `gossiper::handle_syn_msg` ignore group 0 mismatches when exactly one of the two gossiping nodes has `recovery_leader` set. We achieve this by adding `recovery_leader` to `gossip_digest_syn`. This change makes setting `recovery_leader` earlier on all nodes and reloading the config unnecessary. From now on, the administrator can simply restart each node with `recovery_leader` set. However, note that nodes that join group 0 must have `recovery_leader` set until all nodes join the new group 0. For example, assume that we are in the middle of the rolling restart and one of the nodes in the new group 0 crashes. It must be restarted with `recovery_leader` set, or else it would reject `gossip_digest_syn` messages from nodes in the old group 0. To avoid problems in such cases, we will continue to recommend setting `recovery_leader` in `scylla.yaml` instead of passing it as a command line argument. (cherry picked from commit `ba5b5c7d2f`)	2025-08-05 10:59:39 +00:00
Patryk Jędrzejczak	4294669e72	db: system_keyspace: peers_table_read_fixup: remove rows with null host_id Currently, `peers_table_read_fixup` removes rows with no `host_id`, but not with null `host_id`. Null host IDs are known to appear in system tables, for example in `system.cluster_status` after a failed bootstrap. We better make sure we handle them properly if they ever appear in `system.peers`. This commit guarantees that null UUID cannot belong to `loaded_endpoints` in `storage_service::join_cluster`, which in particular ensures that we throw a runtime error when a user sets `recovery_leader` to null UUID during the recovery procedure. This is handled by the code verifying that `recovery_leader` belongs to `loaded_endpoints`. (cherry picked from commit `23f59483b6`)	2025-08-05 10:59:39 +00:00
Patryk Jędrzejczak	74cf95a675	db/config, gms/gossiper: change recovery_leader to UUID We change the type of the `recovery_leader` config parameter and `gossip_config::recovery_leader` from sstring to UUID. `recovery_leader` is supposed to store host ID, so UUID is a natural choice. After changing the type to UUID, if the user provides an incorrect UUID, parsing `recovery_leader` will fail early, but the start-up will continue. Outside the recovery procedure, `recovery_leader` will then be ignored. In the recovery procedure, the start-up will fail on: ``` throw std::runtime_error( "Cannot start - Raft-based topology has been enabled but persistent group 0 ID is not present. " "If you are trying to run the Raft-based recovery procedure, you must set recovery_leader."); ``` (cherry picked from commit `445a15ff45`)	2025-08-05 10:59:39 +00:00
Patryk Jędrzejczak	d18d2fa0cf	db/config, utils: allow using UUID as a config option We change the `recovery_leader` option to UUID in the following commit. (cherry picked from commit `ec69028907`)	2025-08-05 10:59:39 +00:00
Jenkins Promoter	3d4ec918ff	Update ScyllaDB version to: 2025.3.0-rc3	2025-08-03 15:50:47 +03:00
Nikos Dragazis	257ebbeca9	test: Use in-memory SQLite for PyKMIP server The PyKMIP server uses an SQLite database to store artifacts such as encryption keys. By default, SQLite performs a full journal and data flush to disk on every CREATE TABLE operation. Each operation triggers three fdatasync(2) calls. If we multiply this by 16, that is the number of tables created by the server, we get a significant number of file syncs, which can last for several seconds on slow machines. This behavior has led to CI stability issues from KMIP unit tests where the server failed to complete its schema creation within the 20-second timeout (observed on spider9 and spider11). Fix this by configuring the server to use an in-memory SQLite. Fixes #24842. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#24995 (cherry picked from commit `2656fca504`) Closes scylladb/scylladb#25300	2025-08-02 17:12:05 +03:00
Ran Regev	7aa7f50b3a	scylla.yaml: add recommended value for stream_io_throughput_mb_per_sec Fixes: #24758 Updated scylla.yaml and the help for scylla --help Closes scylladb/scylladb#24793 (cherry picked from commit `db4f301f0c`) Closes scylladb/scylladb#25280	2025-08-01 15:02:01 +03:00
Piotr Dulikowski	0dc700de70	Merge '[Backport 2025.3] qos: don't populate effective service level cache until auth is migrated to raft' from Scylladb[bot] Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart). - (cherry picked from commit `2bb800c004`) - (cherry picked from commit `3a082d314c`) Parent PR: #25188 Closes scylladb/scylladb#25285 * github.com:scylladb/scylladb: test: sl: verify that legacy auth is not queried in sl to raft upgrade qos: don't populate effective service level cache until auth is migrated to raft	2025-08-01 08:49:13 +02:00
Jenkins Promoter	308400895f	Update pgo profiles - aarch64	2025-08-01 05:19:18 +03:00
Jenkins Promoter	54b259bec9	Update pgo profiles - x86_64	2025-08-01 05:02:34 +03:00
Piotr Dulikowski	f27a3be62b	test: sl: verify that legacy auth is not queried in sl to raft upgrade Adjust `test_service_levels_upgrade`: right before upgrade to topology on raft, enable an error injection which triggers when the standard role manager is about to query the legacy auth tables in the system_auth keyspace. The preceding commit which fixes scylladb/scylladb#24963 makes sure that the legacy tables are not queried during upgrade to topology on raft, so the error injection does not trigger and does not cause a problem; without that commit, the test fails. (cherry picked from commit `3a082d314c`)	2025-07-31 15:13:57 +00:00
Piotr Dulikowski	ba70b39486	qos: don't populate effective service level cache until auth is migrated to raft Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 (cherry picked from commit `2bb800c004`)	2025-07-31 15:13:57 +00:00
Andrzej Jackowski	12866e8f2e	test: audit: ignore cassandra user audit logs in AUTH tests Audit tests are vulnerable to noise from LOGIN queries (because AUTH audit logs can appear at any time). Most tests already use the `filter_out_noise` mechanism to remove this noise, but tests focused on AUTH verification did not, leading to sporadic failures. This change adds a filter to ignore AUTH logs generated by the default "cassandra" user, so tests only verify logs from the user created specifically for each test. Fixes: scylladb/scylladb#25069 (cherry picked from commit `aef6474537`)	2025-07-31 17:01:29 +02:00
Andrzej Jackowski	e77a190f1a	test: audit: change names of `filter_out_noise` parameters This is a refactoring commit that changes the names of the parameters of the `filter_out_noise` function, as well as names of related variables. The motivation for the change is the introduction of more complex filtering logic in the next commit of this patch series. Refs: scylladb/scylladb#25069 (cherry picked from commit `daf1c58e21`)	2025-07-31 16:58:36 +02:00
Aleksandra Martyniuk	132e6495a3	repair: distribute tablet_repair_task_meta among shards Currently, in repair_service::repair_tablets a shard that initiates the repair keeps tablet_repair_task_meta of all tablets that have a replica on this node (on any shard). This may lead to oversized allocations. Add remote_metas class which takes care of distributing tablet_repair_task_meta among different shards. An additional class remote_metas_builder was added in order to ensure safety and separate writes and reads to meta vectors. Fixes: #23632	2025-07-31 15:56:53 +02:00
Aleksandra Martyniuk	603a2dbb10	repair: do not keep erm in tablet_repair_task_meta Do not keep erm in tablet_repair_task_meta to avoid non-owner shared pointer access when metas are distributed among shards. Pass std::chunked_vector of erms to tablet_repair_task_impl to preserve safety.	2025-07-31 15:56:43 +02:00
Dario Mirovic	7d300367c0	test/cqlpy: add cpp exception metric test conditions Tested code paths should not throw exceptions. `scylla_reactor_cpp_exceptions` metric is used. This is a global metric. To address potential test flakiness, each test runs multiple times: - `run_count = 100` - `cpp_exception_threshold = 10` If a change in the code introduced an exception, expectation is that the number of registered exceptions will be > `cpp_exception_threshold` in `run_count` runs. In which case the test fails. Fixes: #25271 (cherry picked from commit `4a6f71df68`)	2025-07-31 11:53:00 +02:00
Anna Stuchlik	4bc531d48d	doc: add the upgrade guide from 2025.2 to 2025.3 This PR adds the upgrade guide from version 2025.2 to 2025.3. Also, it removes the upgrade guide existing for the previous version that is irrelevant in 2025.2 (upgrade from 2025.1 to 2025.2). Note that the new guide does not include the "Enable Consistent Topology Updates" page and note, as users upgrading to 2025.3 have consistent topology updates already enabled. Fixes https://github.com/scylladb/scylladb/issues/24696 Closes scylladb/scylladb#25219 (cherry picked from commit `8365219d40`) Closes scylladb/scylladb#25248	2025-07-31 12:19:33 +03:00
Anna Stuchlik	f3ca644a55	doc: add OS support for ScyllaDB 2025.3 This commit adds the information about support for platforms in ScyllaDB version 2025.3. Fixes https://github.com/scylladb/scylladb/issues/24698 Closes scylladb/scylladb#25220 (cherry picked from commit `b67bb641bc`) Closes scylladb/scylladb#25249	2025-07-31 12:17:36 +03:00
Anna Stuchlik	573bbace20	doc: add tablets support information to the Drivers table This commit: - Extends the Drivers support table with information on which driver supports tablets and since which version. - Adds the driver support policy to the Drivers page. - Reorganizes the Drivers page to accommodate the updates. In addition: - The CPP-over-Rust driver is added to the table. - The information about Serverless (which we don't support) is removed and replaced with tablets to correctly describe the contents of the table. Fixes https://github.com/scylladb/scylladb/issues/19471 Refs https://github.com/scylladb/scylladb-docs-homepage/issues/69 Closes scylladb/scylladb#24635 (cherry picked from commit `18b4d4a77c`) Closes scylladb/scylladb#25251	2025-07-31 12:17:21 +03:00
Aleksandra Martyniuk	4630a2f9c5	streaming: close sink when exception is thrown If an exception is thrown in result_handling_cont in streaming, then the sink does not get closed. This leads to a node crash. Close sink in exception handler. Fixes: https://github.com/scylladb/scylladb/issues/25165. Closes scylladb/scylladb#25238 (cherry picked from commit `99ff08ae78`) Closes scylladb/scylladb#25268	2025-07-31 12:17:05 +03:00
Dario Mirovic	38a8318466	transport/server: replace protocol_exception throws with returns Replace throwing protocol_exception with returning it as a result or an exceptional future in the transport server module. This improves performance, for example during connection storms and server restarts, where protocol exceptions are more frequent. In functions already returning a future, protocol exceptions are propagated using an exceptional future. In functions not already returning a future, result_with_exception is used. Notable change is checking v.failed() before calling v.get() in process_request function, to avoid throwing in case of an exceptional future. Refs: #24567 Fixes: #25271 (cherry picked from commit `5390f92afc`)	2025-07-30 21:35:24 +02:00
Dario Mirovic	1078a1f03a	utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception Make make_bytes_ostream and make_fragmented_temporary_buffer accept writer callbacks that return utils::result_with_exception instead of forcing them to throw on error. This lets callers propagate failures by returning an error result rather than throwing an exception. Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer concepts to simplify and document the template requirements on writer callbacks. This patch does not modify the actual callbacks passed, except for the syntax changes needed for successful compilation, without changing the logic. Refs: #24567 Fixes: #25271 (cherry picked from commit `9f4344a435`)	2025-07-30 21:35:15 +02:00
Dario Mirovic	0679a7bb78	transport/server: avoid exception-throw overhead in handle_error Previously, connection::handle_error always called f.get() inside a try/catch, forcing every failed future to throw and immediately catch an exception just to classify it. This change eliminates that extra throw/catch cycle by first checking f.failed(), getting the stored std::exception_ptr via f.get_exception(), and then dispatching on its type via utils::try_catch<T>(eptr). The error-response logic is not changed - cassandra_exception, std::exception, and unknown exceptions are caught and processed, and any exceptions thrown by write_response while handling those exceptions continues to escape handle_error. Refs: #24567 Fixes: #25271 (cherry picked from commit `30d424e0d3`)	2025-07-30 21:34:56 +02:00
Dario Mirovic	918d4ab5fb	test/cqlpy: add protocol_exception tests Add a helper to fetch scylla_transport_cql_errors_total{type="protocol_error"} counter from Scylla's metrics endpoint. These metrics are used to track protocol error count before and after each test. Add cql_with_protocol context manager utility for session creation with parameterized protocol_version value. This is used for testing connection establishment with different protocol versions, and proper disposal of successfully established sessions. The tests cover two failure scenarios: - Protocol version mismatch in test_protocol_version_mismatch which tests both supported and unsupported protocol version - Malformed frames via raw socket in _protocol_error_impl, used by several test functions, and also test_no_protocol_exceptions test to assert that the error counters never decrease during test execution, catching unintended metric resets Refs: #24567 Fixes: #25271 (cherry picked from commit `7aaeed012e`)	2025-07-30 21:34:31 +02:00
Patryk Jędrzejczak	7164f11b99	Merge '[Backport 2025.3] Revert 24418: main.cc: fix group0 shutdown order' from Petr Gusev This PR reverts the changes of #24418 since they can cause use-after-free. The `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`). However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used. This PR reverts two of the three commits from #24418. The commit [`e456d2d`](`e456d2d507`) is not reverted because it only affects logging and does not impact correctness. Fixes scylladb/scylladb#25221 Backport: this PR is a backport Closes scylladb/scylladb#25206 * https://github.com/scylladb/scylladb: Revert "main.cc: fix group0 shutdown order" Revert "storage_service: test_group0_apply_while_node_is_being_shutdown"	2025-07-30 16:18:13 +02:00
Pavel Emelyanov	99f328b7a7	Merge '[Backport 2025.3] s3_client: Enhance s3_client error handling' from Scylladb[bot] Enhance and fix error handling in the `chunked_download_source` to prevent errors seeping from the request callback. Also stop retrying on seastar's side since it is going to break the integrity of data which may be downloaded more than once for the same range. Fixes: https://github.com/scylladb/scylladb/issues/25043 Should be backported to 2025.3 since we have an intention to release native backup/restore feature - (cherry picked from commit `d53095d72f`) - (cherry picked from commit `b7ae6507cd`) - (cherry picked from commit `ba910b29ce`) - (cherry picked from commit `fc2c9dd290`) Parent PR: #24883 Closes scylladb/scylladb#25137 * github.com:scylladb/scylladb: s3_client: Disable Seastar-level retries in HTTP client creation s3_test: Validate handling of non-`aws_error` exceptions s3_client: Improve error handling in chunked_download_source aws_error: Add factory method for `aws_error` from exception	2025-07-29 14:42:45 +03:00
Pavel Emelyanov	07f46a4ad5	Merge '[Backport 2025.3] storage_service: cancel all write requests after stopping transports' from Scylladb[bot] When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore. If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out. This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped. Fixes scylladb/scylladb#23665 Backport: since this fixes an issue that frequently causes issues in CI, backport to 2025.1, 2025.2, and 2025.3. - (cherry picked from commit `bc934827bc`) - (cherry picked from commit `e0dc73f52a`) Parent PR: #24714 Closes scylladb/scylladb#25170 * github.com:scylladb/scylladb: storage_service: Cancel all write requests on storage_proxy shutdown test: Add test for unfinished writes during shutdown and topology change	2025-07-29 14:42:25 +03:00
Taras Veretilnyk	a9f5e7d18f	docs: fix typo in command name enbleautocompaction -> enableautocompaction Renamed the file and updated all references from 'enbleautocompaction' to the correct 'enableautocompaction'. Fixes scylladb/scylladb#25172 Closes scylladb/scylladb#25175 (cherry picked from commit `6b6622e07a`) Closes scylladb/scylladb#25218	2025-07-29 14:41:50 +03:00
Petr Gusev	d8f6a497a5	Revert "main.cc: fix group0 shutdown order" This reverts commit `6b85ab79d6`.	2025-07-28 17:50:38 +02:00
Petr Gusev	c98dde92db	Revert "storage_service: test_group0_apply_while_node_is_being_shutdown" This reverts commit `b1050944a3`.	2025-07-28 17:49:03 +02:00
Aleksandra Martyniuk	8efee38d6f	tasks: do not use binary progress for task manager tasks Currently, progress of a parent task depends on expected_total_workload, expected_children_number, and children progresses. Basically, if total workload is known or all children have already been created, progresses of children are summed up. Otherwise binary progress is returned. As a result, two tasks of the same type may return progress in different units. If they are children of the same task and this parent gathers the progress - it becomes meaningless. Drop expected_children_number as we can't assume that children are able to show their progresses. Modify get_progress method - progress is calculated based on children progresses. If expected_total_workload isn't specified, the total progress of a task may grow. If expected_total_workload isn't specified and no children are created, empty progress (0/0) is returned. Fixes: https://github.com/scylladb/scylladb/issues/24650. Closes scylladb/scylladb#25113 (cherry picked from commit `a7ee2bbbd8`) Closes scylladb/scylladb#25200	2025-07-28 13:11:45 +03:00
Michael Litvak	934260e9a9	storage service: drain view builder before group0 The view builder uses group0 operations to coordinate view building, so we should drain the view builder before stopping group0. Fixes scylladb/scylladb#25096 Closes scylladb/scylladb#25101 (cherry picked from commit `3ff388cd94`) Closes scylladb/scylladb#25198	2025-07-28 13:05:14 +03:00
Nadav Har'El	583c118ccd	Merge '[Backport 2025.3] alternator: avoid oversized allocation in Query/Scan' from Scylladb[bot] This series fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator. The first patch in the series is the main fix - the later patches are cleanups requested by reviewers but also involved other pre-existing code, so I did those cleanups as separate patches. Alternator's Scan or Query operation return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page. In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running test/alternator/run --runveryslow \ test_query.py::test_query_large_page_small_rows reports in the log: oversized allocation: 573440 bytes. After this patch, this warning no longer appears. The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e, a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that. The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. 
When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before. Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test. Fixes #23535 The stalls caused by large allocations were seen by actual users, so it makes sense to backport this patch. On the other hand, the patch while not big is fairly intrusive (modifies the normal Scan and Query path and also the later patches do some cleanup of additional code) so there is some small risk involved in the backport. - (cherry picked from commit `2385fba4b6`) - (cherry picked from commit `d8fab2a01a`) - (cherry picked from commit `13ec94107a`) - (cherry picked from commit `a248336e66`) Parent PR: #24480 Closes scylladb/scylladb#25194 * github.com:scylladb/scylladb: alternator: clean up by co-routinizing alternator: avoid spamming the log when failing to write response alternator: clean up and simplify request_return_type alternator: avoid oversized allocation in Query/Scan	2025-07-27 14:12:49 +03:00
Nadav Har'El	f1c5350141	alternator: clean up by co-routinizing Reviewers of the previous patch complained about some ugly pre-existing code in alternator/executor.cc, where returning from an asynchronous (future) function requires lengthy verbose casts. So this patch cleans up a few instances of these ugly casts by using co_return instead of return. For example, the long and verbose return make_ready_future<executor::request_return_type>( rjson::print(std::move(response))); can be changed to the shorter and more readable co_return rjson::print(std::move(response)); This patch should not have any functional implications, and also not any performance implications: I only coroutinized slow-path functions and one function that was already "partially" coroutinized (and this was especially ugly and deserved being fixed). Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `a248336e66`)	2025-07-27 07:42:01 +00:00
Nadav Har'El	f897f38003	alternator: avoid spamming the log when failing to write response Both make_streamed() and new make_streamed_with_extra_array() functions, used when returning a long response in Alternator, would write an error- level log message if it failed to write the response. This log message is probably not helpful, and may spam the log if the application causes repeated errors intentionally or accidentally. So drop these log messages. The exception is still thrown as usual. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `13ec94107a`)	2025-07-27 07:42:01 +00:00
Nadav Har'El	fe037663ea	alternator: clean up and simplify request_return_type The previous patch introduced a function make_streamed_with_extra_array which was a duplicate of the existing make_streamed. Reviewers complained how baroque the new function is (just like the old function), having to jump through hoops to return a copyable function working on non-copyable objects, making strange-named copies and shared pointers of everything. We needed to return a copyable function (std::function) just because Alternator used Seastar's json::json_return_type in the return type from executor function (request_return_type). This json_return_type contained either a sstring or an std::function, but neither was ever really appropriate: 1. We want to return noncopyable_function, not an std::function! 2. We want to return an std::string (which rjson::print()) returns, not an sstring! So in this patch we stop using seastar::json::json_return_type entirely in Alternator. Alternator's request_return_type is now an std::variant of three types: 1. std::string for short responses, 2. noncopyable_function for long streamed response 3. api_error for errors. The ugliest parts of make_streamed() where we made copies and shared pointers to allow for a copyable function are all gone. Even nicer, a lot of other ugly relics of using seastar::json_return_type are gone: 1. We no longer need obscure classes and functions like make_jsonable() and json_string() to convert strings to response bodies - an operation can simply return a string directly - usually returning rjson::print(value) or a fixed string like "" and it just works. 2. There is no more usage of seastar::json in Alternator (except one minor use of seastar::json::formatter::to_json in streams.cc that can be removed later). Alternator uses RapidJSON for its JSON needs, we don't need to use random pieces from a different JSON library. 
Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `d8fab2a01a`)	2025-07-27 07:42:01 +00:00
Nadav Har'El	b7da50d781	alternator: avoid oversized allocation in Query/Scan This patch fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator. Alternator's Scan or Query operation return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page. In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running test/alternator/run --runveryslow \ test_query.py::test_query_large_page_small_rows reports in the log: oversized allocation: 573440 bytes. After this patch, this warning no longer appears. The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e, a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that. The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before. 
Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test. Fixes #23535 (cherry picked from commit `2385fba4b6`)	2025-07-27 07:42:01 +00:00
Pavel Emelyanov	7c04619ecf	Merge '[Backport 2025.3] encryption_at_rest_test: Fix some spurious errors' from Scylladb[bot] Fixes #24574 * Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert * Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free. - (cherry picked from commit `ee98f5d361`) - (cherry picked from commit `8d37e5e24b`) Parent PR: #24633 Closes scylladb/scylladb#24772 * github.com:scylladb/scylladb: encryption_at_rest_test: Add exception handler to ensure proxy stop encryption: Ensure stopping timers in provider cache objects	2025-07-24 16:35:53 +03:00
Pavel Emelyanov	b07f4fb26b	Merge '[Backport 2025.3] streaming: Avoid deadlock by running view checks in a separate scheduling group' from Scylladb[bot] This issue happens with removenode, when RBNO is disabled, so range streamer is used. The deadlock happens in a scenario like this: 1. Start 3 nodes: {A, B, C}, RF=2 2. Node A is lost 3. removenode A 4. Both B and C gain ownership of ranges. 5. Streaming sessions are started with crossed directions: B->C, C->B Readers created by sender side exhaust streaming semaphore on B and C. Receiver side attempts to obtain a permit indirectly by calling check_needs_view_update_path(), which reads local tables. That read is blocked and times-out, causing streaming to fail. The streaming writer is already using a tracking-only permit. Even if we didn't deadlock, and the streaming semaphore was simply exhausted by other receiving sessions (via tracking-only permit), the query may still time-out due to starvation. To avoid that, run the query under a different scheduling group, which translates to the system semaphore instead of the maintenance semaphore, to break the dependency. The gossip group was chosen because it shouldn't be contended and this change should not interfere with it much. Fixes #24807 Fixes #24925 - (cherry picked from commit `ee2fa58bd6`) - (cherry picked from commit `dff2b01237`) Parent PR: #24929 Closes scylladb/scylladb#25058 * github.com:scylladb/scylladb: streaming: Avoid deadlock by running view checks in a separate scheduling group service: migration_manager: Run group0 barrier in gossip scheduling group	2025-07-24 16:35:24 +03:00
Ran Regev	c5f4ad3665	nodetool restore: sstable list from a file Fixes: #25045 Added the ability to supply the list of files to restore from a given file. Mainly required for local testing. Signed-off-by: Ran Regev <ran.regev@scylladb.com> Closes scylladb/scylladb#25077 (cherry picked from commit `dd67d22825`) Closes scylladb/scylladb#25124	2025-07-24 16:35:04 +03:00
Ran Regev	013e0d685c	docs: update nodetool restore documentation for --sstables-file-list Fixes: #25128 A leftover from #25077 Closes scylladb/scylladb#25129 (cherry picked from commit `3d82b9485e`) Closes scylladb/scylladb#25139	2025-07-24 16:34:39 +03:00
Jakub Smolar	800f819b5b	gdb: handle zero-size reads in managed_bytes Fixes: https://github.com/scylladb/scylladb/issues/25048 Closes scylladb/scylladb#25050 (cherry picked from commit `6e0a063ce3`) Closes scylladb/scylladb#25142	2025-07-24 16:34:04 +03:00
Sergey Zolotukhin	8ac6aaadaf	storage_service: Cancel all write requests on storage_proxy shutdown During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown` as one of the first steps. However, even after RPCs are shut down, some write handlers in `storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM. Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block the messaging server shutdown and delay the entire shutdown process until the write timeout occurs. This change introduces the cancellation of all outstanding write handlers in `storage_proxy` during shutdown to prevent unnecessary delays. Fixes scylladb/scylladb#23665 (cherry picked from commit `e0dc73f52a`)	2025-07-24 13:03:32 +00:00
Sergey Zolotukhin	16a8cd9514	test: Add test for unfinished writes during shutdown and topology change This test reproduces an issue where a topology change and an ongoing write query during query coordinator shutdown can cause the node to get stuck. When a node receives a write request, it creates a write handler that holds a copy of the current table's ERM (Effective Replication Map). The ERM ensures that no topology or schema changes occur while the request is being processed. After the query coordinator receives the required number of replica write ACKs to satisfy the consistency level (CL), it sends a reply to the client. However, the write response handler remains alive until all replicas respond — the remaining writes are handled in the background. During shutdown, when all network connections are closed, these responses can no longer be received. As a result, the write response handler is only destroyed once the write timeout is reached. This becomes problematic because the ERM held by the handler blocks topology or schema change commands from executing. Since shutdown waits for these commands to complete, this can lead to unnecessary delays in node shutdown and restarts, and occasional test case failures. Test for: scylladb/scylladb#23665 (cherry picked from commit `bc934827bc`)	2025-07-24 13:03:32 +00:00
Ernest Zaslavsky	e45852a595	s3_client: Disable Seastar-level retries in HTTP client creation Prevent Seastar from retrying HTTP requests to avoid buffer double-feed issues when an entire request is retried. This could cause data corruption in `chunked_download_source`. The change is global for every instance of `s3_client`, but it is still safe because: * Seastar's `http_client` resets connections regardless of retry behavior * `s3_client` retry logic handles all error types—exceptions, HTTP errors, and AWS-specific errors—via `http_retryable_client` (cherry picked from commit `fc2c9dd290`)	2025-07-22 16:46:54 +00:00
Ernest Zaslavsky	fdf706a6eb	s3_test: Validate handling of non-`aws_error` exceptions Inject exceptions not wrapped in `aws_error` from request callback lambda to verify they are properly caught and handled. (cherry picked from commit `ba910b29ce`)	2025-07-22 16:46:53 +00:00
Ernest Zaslavsky	2bc3accf9c	s3_client: Improve error handling in chunked_download_source Create aws_error from raised exceptions when possible and respond appropriately. Previously, non-aws_exception types leaked from the request handler and were treated as non-retryable, causing potential data corruption during download. (cherry picked from commit `b7ae6507cd`)	2025-07-22 16:46:53 +00:00
Ernest Zaslavsky	0106d132bd	aws_error: Add factory method for `aws_error` from exception Move `aws_error` creation logic out of `retryable_http_client` and into the `aws_error` class to support reuse across components. (cherry picked from commit `d53095d72f`)	2025-07-22 16:46:53 +00:00
Pavel Emelyanov	53637fdf61	Merge '[Backport 2025.3] storage: add `make_data_or_index_source` to the storages' from Scylladb[bot] Add `make_data_or_index_source` to the storages to utilize new S3 based data source which should improve restore performance * Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior. * Add `make_data_or_index_source` to the `storage` interface, implement it for `filesystem_storage` storage which just creates `data_source` from a file and for the `s3_storage` create a (maybe) decrypting source from s3 make_download_source. This change should solve performance improvement for reading large objects from S3 and should not affect anything for the `filesystem_storage` Fixes: https://github.com/scylladb/scylladb/issues/22458 - (cherry picked from commit `211daeaa40`) - (cherry picked from commit `7e5e3c5569`) - (cherry picked from commit `0de61f56a2`) - (cherry picked from commit `8ac2978239`) - (cherry picked from commit `dff9a229a7`) - (cherry picked from commit `8d49bb8af2`) Parent PR: #23695 Closes scylladb/scylladb#25016 * github.com:scylladb/scylladb: sstables: Start using `make_data_or_index_source` in `sstable` sstables: refactor readers and sources to use coroutines sstables: coroutinize futurized readers sstables: add `make_data_or_index_source` to the `storage` encryption: refactor key retrieval encryption: add `encrypted_data_source` class	2025-07-21 18:05:53 +03:00
Piotr Dulikowski	fdfcd67a6e	Merge '[Backport 2025.3] cdc: Forbid altering columns of CDC log tables directly' from Scylladb[bot] The set of columns of a CDC log table should be managed automatically by Scylla, and the user should not have the ability to manipulate them directly. That could lead to disastrous consequences such as a segmentation fault. In this commit, we're restricting those operations. We also provide two validation tests. One of the existing tests had to be adjusted as it modified the type of a column in a CDC log table. Since the test simply verifies that the user has sufficient permissions to perform `ALTER TABLE` on the log table, the test is still valid. Fixes scylladb/scylladb#24643 Backport: we should backport the change to all affected branches to prevent the consequences that may affect the user. - (cherry picked from commit `20d0050f4e`) - (cherry picked from commit `59800b1d66`) Parent PR: #25008 Closes scylladb/scylladb#25108 * github.com:scylladb/scylladb: cdc: Forbid altering columns of inactive CDC log table cdc: Forbid altering columns of CDC log tables directly	2025-07-21 16:22:31 +02:00
Dawid Mędrek	dc6cb5cfad	cdc: Forbid altering columns of inactive CDC log table When CDC becomes disabled on the base table, the CDC log table still exists (cf. scylladb/scylladb@adda43edc7). If it continues to exist up to the point when CDC is re-enabled on the base table, no new log table will be created -- instead, the old log table will be re-attached. Since we want to avoid situations when the definition of the log table has become misaligned with the definition of the base table due to actions of the user, we forbid modifying the set of columns or renaming them in CDC log tables, even when they're inactive. Validation tests are provided. (cherry picked from commit `59800b1d66`)	2025-07-21 11:43:49 +00:00
Dawid Mędrek	10a9ced4d1	cdc: Forbid altering columns of CDC log tables directly The set of columns of a CDC log table should be managed automatically by Scylla, and the user should not have the ability to manipulate them directly. That could lead to disastrous consequences such as a segmentation fault. In this commit, we're restricting those operations. We also provide two validation tests. One of the existing tests had to be adjusted as it modified the type of a column in a CDC log table. Since the test simply verifies that the user has sufficient permissions to perform `ALTER TABLE` on the log table, the test is still valid. Fixes scylladb/scylladb#24643 (cherry picked from commit `20d0050f4e`)	2025-07-21 11:43:49 +00:00
Ernest Zaslavsky	934359ea28	s3_client: parse multipart response XML defensively Ensure robust handling of XML responses when initiating multipart uploads. Check for the existence of required nodes before access, and throw an exception if the XML is empty or malformed. Refs: https://github.com/scylladb/scylladb/issues/24676 Closes scylladb/scylladb#24990 (cherry picked from commit `342e94261f`) Closes scylladb/scylladb#25057	2025-07-21 12:03:00 +02:00
Piotr Dulikowski	74d97711fd	Merge '[Backport 2025.3] cdc: throw error if column doesn't exist' from Scylladb[bot] In the CDC log transformer, when creating a CDC mutation based on some base table mutation, for each value of a base column we set the value in the CDC column with the same name. When looking up the column in the CDC schema by name, we may get a null pointer if a column by that name is not found. This shouldn't happen normally because the base schema and CDC schema should be compatible, and for each base column there should be a CDC column with the same name. However, there are scenarios where the base schema and CDC schema are incompatible for a short period of time when they are being altered. When a base column is being added or dropped, we could get a base mutation with this column set, and then the CDC transformer picks up the latest CDC schema which doesn't have this column. If such a thing happens, we fix the code to throw an exception instead of crashing on null pointer dereference. Currently we don't have a safer approach to handle this, but this might be changed in the future. The other alternative is dropping that data silently which we prefer not to do. Throwing an error is acceptable because this scenario most likely indicates this behavior by the user: * The user adds a new column, and starts writing values to the column before the ALTER is complete. or, * The user drops a column, and continues writing values to the column while it's being dropped. Both cases might as well fail with an error because the column is not found in the base table. Fixes scylladb/scylladb#24952 backport needed - simple fix for a node crash - (cherry picked from commit `b336f282ae`) - (cherry picked from commit `86dfa6324f`) Parent PR: #24986 Closes scylladb/scylladb#25067 * github.com:scylladb/scylladb: test: cdc: add test_cdc_with_alter cdc: throw error if column doesn't exist	2025-07-21 11:18:06 +02:00
Jenkins Promoter	fc7a6b66e2	Update ScyllaDB version to: 2025.3.0-rc2	2025-07-20 15:44:21 +03:00
Michael Litvak	594ec7d66d	test: cdc: add test_cdc_with_alter Add a test that tests adding and dropping a column to a table with CDC enabled while writing to it. (cherry picked from commit `86dfa6324f`)	2025-07-20 09:04:00 +02:00
Michael Litvak	338ff18dfe	cdc: throw error if column doesn't exist In the CDC log transformer, when creating a CDC mutation based on some base table mutation, for each value of a base column we set the value in the CDC column with the same name. When looking up the column in the CDC schema by name, we may get a null pointer if a column by that name is not found. This shouldn't happen normally because the base schema and CDC schema should be compatible, and for each base column there should be a CDC column with the same name. However, there are scenarios where the base schema and CDC schema are incompatible for a short period of time when they are being altered. When a base column is being added or dropped, we could get a base mutation with this column set, and then the CDC transformer picks up the latest CDC schema which doesn't have this column. If such a thing happens, we fix the code to throw an exception instead of crashing on null pointer dereference. Currently we don't have a safer approach to handle this, but this might be changed in the future. The other alternative is dropping that data silently which we prefer not to do. Throwing an error is acceptable because this scenario most likely indicates this behavior by the user: * The user adds a new column, and starts writing values to the column before the ALTER is complete. or, * The user drops a column, and continues writing values to the column while it's being dropped. Both cases might as well fail with an error because the column is not found in the base table. Fixes scylladb/scylladb#24952 (cherry picked from commit `b336f282ae`)	2025-07-18 10:36:44 +00:00
Tomasz Grabiec	888e92c969	streaming: Avoid deadlock by running view checks in a separate scheduling group This issue happens with removenode, when RBNO is disabled, so range streamer is used. The deadlock happens in a scenario like this: 1. Start 3 nodes: {A, B, C}, RF=2 2. Node A is lost 3. removenode A 4. Both B and C gain ownership of ranges. 5. Streaming sessions are started with crossed directions: B->C, C->B Readers created by sender side exhaust streaming semaphore on B and C. Receiver side attempts to obtain a permit indirectly by calling check_needs_view_update_path(), which reads local tables. That read is blocked and times-out, causing streaming to fail. The streaming writer is already using a tracking-only permit. To avoid that, run the query under a different scheduling group, which translates to the system semaphore instead of the maintenance semaphore, to break the dependency. The gossip group was chosen because it shouldn't be contended and this change should not interfere with it much. Fixes: #24807 (cherry picked from commit `dff2b01237`)	2025-07-17 17:25:44 +00:00
Tomasz Grabiec	f424c773a4	service: migration_manager: Run group0 barrier in gossip scheduling group Fixes two issues. One is potential priority inversion. The barrier will be executed using scheduling group of the first fiber which triggers it, the rest will block waiting on it. For example, CQL statements which need to sync the schema on replica side can block on the barrier triggered by streaming. That's undesirable. This is theoretical, not proved in the field. The second problem is blocking the error path. This barrier is called from the streaming error handling path. If the streaming concurrency semaphore is exhausted, and streaming fails due to timeout on obtaining the permit in check_needs_view_update_path(), the error path will block too because it will also attempt to obtain the permit as part of the group0 barrier. Running it in the gossip scheduling group prevents this. Fixes #24925 (cherry picked from commit `ee2fa58bd6`)	2025-07-17 17:25:44 +00:00
Piotr Dulikowski	e49b312be9	auth: fix crash when migration code runs parallel with raft upgrade The functions password_authenticator::start and standard_role_manager::start have a similar structure: they spawn a fiber which invokes a callback that performs some migration until that migration succeeds. Both handlers set a shared promise called _superuser_created_promise (those are actually two promises, one for the password authenticator and the other for the role manager). The handlers are similar in both cases. They check if auth is in legacy mode, and behave differently depending on that. If in legacy mode, the promise is set (if it was not set before), and some legacy migration actions follow. In auth-on-raft mode, the superuser is attempted to be created, and if it succeeds then the promise is _unconditionally_ set. While it makes sense at a glance to set the promise unconditionally, there is a non-obvious corner case during upgrade to topology on raft. During the upgrade, auth switches from the legacy mode to auth on raft mode. Thus, if the callback didn't succeed in legacy mode and then tries to run in auth-on-raft mode and succeeds, it will unconditionally set a promise that was already set - this is a bug and triggers an assertion in seastar. Fix the issue by surrounding the `shared_promise::set_value` call with an `if` - like it is already done for the legacy case. Fixes: scylladb/scylladb#24975 Closes scylladb/scylladb#24976 (cherry picked from commit `a14b7f71fe`) Closes scylladb/scylladb#25019	2025-07-17 13:32:35 +02:00
Ernest Zaslavsky	549d139e84	sstables: Start using `make_data_or_index_source` in `sstable` Convert all necessary methods to be awaitable. Start using `make_data_or_index_source` when creating data_source for data and index components. For proper working of compressed/checksummed input streams, start passing stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`. (cherry picked from commit `8d49bb8af2`)	2025-07-16 12:45:58 +00:00
Ernest Zaslavsky	4a47262167	sstables: refactor readers and sources to use coroutines Refactor readers and sources to support coroutine usage in preparation for integration with `make_data_or_index_source`. Move coroutine-based member initialization out of constructors where applicable, and defer initialization until first use. (cherry picked from commit `dff9a229a7`)	2025-07-16 12:45:58 +00:00
Ernest Zaslavsky	81d356315b	sstables: coroutinize futurized readers Coroutinize futurized readers and sources to get ready for using `make_data_or_index_source` in `sstable` (cherry picked from commit `8ac2978239`)	2025-07-16 12:45:58 +00:00
Ernest Zaslavsky	4ffd72e597	sstables: add `make_data_or_index_source` to the `storage` Add `make_data_or_index_source` to the `storage` interface, implement it for `filesystem_storage` storage which just creates `data_source` from a file and for the `s3_storage` create a (maybe) decrypting source from s3 make_download_source. This change should solve performance improvement for reading large objects from S3 and should not affect anything for the `filesystem_storage`. (cherry picked from commit `0de61f56a2`)	2025-07-16 12:45:58 +00:00
Ernest Zaslavsky	8998f221ab	encryption: refactor key retrieval Get the encryption schema extension retrieval code out of `wrap_file` method to make it reusable elsewhere (cherry picked from commit `7e5e3c5569`)	2025-07-16 12:45:58 +00:00
Ernest Zaslavsky	243ba1fb66	encryption: add `encrypted_data_source` class Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior. NOTE: The wrapped source MUST read from offset 0, `encrypted_data_source` assumes it is Co-authored-by: Calle Wilund <calle@scylladb.com> (cherry picked from commit `211daeaa40`)	2025-07-16 12:45:58 +00:00
Patryk Jędrzejczak	7caacf958b	test: test_zero_token_nodes_multidc: properly handle reads with CL=ONE The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when: - both writes succeeded with the same replica responding first, - one of the following reads succeeded with the other replica responding before it applied mutations from any of the writes. We fix the test by not expecting reads with CL=ONE to return a row. We also harden the test by inserting different rows for every pair (CL, coordinator), where one of the two coordinators is a normal node from DC1, and the other one is a zero-token node from DC2. This change makes sure that, for example, every write really inserts a row. Fixes scylladb/scylladb#22967 The fix addresses CI flakiness and only changes the test, so it should be backported. Closes scylladb/scylladb#23518 (cherry picked from commit `21edec1ace`) Closes scylladb/scylladb#24985	2025-07-15 15:47:43 +02:00
Botond Dénes	489e4fdb4e	Merge '[Backport 2025.3] S3 chunked download source bug fixes' from Scylladb[bot] - Fix missing negation in the `if` in the background downloading fiber - Add test to catch this case - Improve the s3 proxy to inject errors if the same resource is requested more than once - Suppress client retry since retrying the same request when each produces multiple buffers may lead to the same data appearing more than once in the buffer deque - Inject exception from the test to simulate response callback failure in the middle No need to backport anything since this class is not used yet - (cherry picked from commit `f1d0690194`) - (cherry picked from commit `e73b83e039`) - (cherry picked from commit `6d9cec558a`) - (cherry picked from commit `ec59fcd5e4`) - (cherry picked from commit `c75acd274c`) - (cherry picked from commit `d2d69cbc8c`) - (cherry picked from commit `e50f247bf1`) - (cherry picked from commit `49e8c14a86`) - (cherry picked from commit `a5246bbe53`) - (cherry picked from commit `acf15eba8e`) Parent PR: #24657 Closes scylladb/scylladb#24943 * github.com:scylladb/scylladb: s3_test: Add s3_client test for non-retryable error handling s3_test: Add trace logging for default_retry_strategy s3_client: Fix edge case when the range is exhausted s3_client: Fix indentation in try..catch block s3_client: Stop retries in chunked download source s3_client: Enhance test coverage for retry logic s3_client: Add test for Content-Range fix s3_client: Fix missing negation s3_client: Refine logging s3_client: Improve logging placement for current_range output	2025-07-15 15:28:48 +03:00
Michael Litvak	26738588db	tablets: stop storage group on deallocation When a tablet transitions to a post-cleanup stage on the leaving replica we deallocate its storage group. Before the storage can be deallocated and destroyed, we must make sure it's cleaned up and stopped properly. Normally this happens during the tablet cleanup stage, when table::cleanup_table is called, so by the time we transition to the next stage the storage group is already stopped. However, it's possible that tablet cleanup did not run in some scenario: 1. The topology coordinator runs tablet cleanup on the leaving replica. 2. The leaving replica is restarted. 3. When the leaving replica starts, still in `cleanup` stage, it allocates a storage group for the tablet. 4. The topology coordinator moves to the next stage. 5. The leaving replica deallocates the storage group, but it was not stopped. To address this scenario, we always stop the storage group when deallocating it. Usually it will be already stopped and complete immediately, and otherwise it will be stopped in the background. Fixes scylladb/scylladb#24857 Fixes scylladb/scylladb#24828 Closes scylladb/scylladb#24896 (cherry picked from commit `fa24fd7cc3`) Closes scylladb/scylladb#24909	2025-07-15 13:14:35 +03:00
Aleksandra Martyniuk	f69f59afbd	repair: Reduce max row buf size when small table optimization is on If small_table_optimization is on, a repair works on a whole table simultaneously. It may be distributed across the whole cluster and all nodes might participate in repair. On a repair master, row buffer is copied for each repair peer. This means that the memory scales with the number of peers. In large clusters, repair with small_table_optimization leads to OOM. Divide the max_row_buf_size by the number of repair peers if small_table_optimization is on. Use max_row_buf_size to calculate number of units taken from mem_sem. Fixes: https://github.com/scylladb/scylladb/issues/22244. Closes scylladb/scylladb#24868 (cherry picked from commit `17272c2f3b`) Closes scylladb/scylladb#24907	2025-07-15 13:13:49 +03:00
Łukasz Paszkowski	e1e0c721e7	test.py: Fix test_compactionhistory_rows_merged_time_window_compaction_strategy The test has two major problems 1. Wrongly computed time windows. Data was not spread across two 1-minute windows causing the test to generate even three sstables instead of two 2. Timestamp was not propagated to the prepared CQL statements. So in fact, a current time was used implicitly 3. Because of the incorrect timestamp issue, the remaining tests testing purged tombstones were affected as well. Fixes https://github.com/scylladb/scylladb/issues/24532 Closes scylladb/scylladb#24609 (cherry picked from commit `a22d1034af`) Closes scylladb/scylladb#24791	2025-07-15 13:12:39 +03:00
Yaron Kaikov	05a6d4da23	dist/common/scripts/scylla_sysconfig_setup: fix `SyntaxWarning: invalid escape sequence` There are invalid escape sequence warnings where raw strings should be used for the regex patterns Fixes: https://github.com/scylladb/scylladb/issues/24915 Closes scylladb/scylladb#24916 (cherry picked from commit `fdcaa9a7e7`) Closes scylladb/scylladb#24970	2025-07-15 11:01:28 +02:00
Yaron Kaikov	1e1aeed3cd	auto-backport.py: Avoid bot push to existing backport branches Changed the backport logic so that the bot only pushes the backport branch if it does not already exist in the remote fork. If the branch exists, the bot skips the push, allowing only users to update (force-push) the branch after the backport PR is open. Fixes: https://github.com/scylladb/scylladb/issues/24953 Closes scylladb/scylladb#24954 (cherry picked from commit `ed7c7784e4`) Closes scylladb/scylladb#24969	2025-07-15 10:25:30 +02:00
Jenkins Promoter	af10d6f03b	Update pgo profiles - aarch64	2025-07-15 05:21:25 +03:00
Jenkins Promoter	0d3742227d	Update pgo profiles - x86_64	2025-07-15 04:58:36 +03:00
Yaron Kaikov	c6987e3fed	packaging: add `ps` command to dependencies ScyllaDB container image doesn't have ps command installed, while this command is used by perftune.py script shipped within the same image. This breaks node and container tuning in Scylla Operator. Fixes: #24827 Closes scylladb/scylladb#24830 (cherry picked from commit `66ff6ab6f9`) Closes scylladb/scylladb#24956	2025-07-14 14:19:17 +03:00
Ernest Zaslavsky	873c8503cd	s3_test: Add s3_client test for non-retryable error handling Introduce a test that injects a non-retryable error and verifies that the chunked download source throws an exception as expected. (cherry picked from commit `acf15eba8e`)	2025-07-13 13:17:14 +00:00
Ernest Zaslavsky	dbf4bd162e	s3_test: Add trace logging for default_retry_strategy Introduce trace-level logging for `default_retry_strategy` in `s3_test` to improve visibility into retry logic during test execution. (cherry picked from commit `a5246bbe53`)	2025-07-13 13:17:14 +00:00
Ernest Zaslavsky	7f303bfda3	s3_client: Fix edge case when the range is exhausted Handle case where the download loop exits after consuming all data, but before receiving an empty buffer signaling EOF. Without this, the next request is sent with a non-zero offset and zero length, resulting in "Range request cannot be satisfied" errors. Now, an empty buffer is pushed to indicate completion and exit the fiber properly. (cherry picked from commit `49e8c14a86`)	2025-07-13 13:17:14 +00:00
Ernest Zaslavsky	22739df69f	s3_client: Fix indentation in try..catch block Correct indentation in the `try..catch` block to improve code readability and maintain consistent formatting. (cherry picked from commit `e50f247bf1`)	2025-07-13 13:17:14 +00:00
Ernest Zaslavsky	54db6ca088	s3_client: Stop retries in chunked download source Disable retries for S3 requests in the chunked download source to prevent duplicate chunks from corrupting the buffer queue. The response handler now throws an exception to bypass the retry strategy, allowing the next range to be attempted cleanly. This exception is only triggered for retryable errors; unretryable ones immediately halt further requests. (cherry picked from commit `d2d69cbc8c`)	2025-07-13 13:17:14 +00:00
Ernest Zaslavsky	c841ffe398	s3_client: Enhance test coverage for retry logic Extend the S3 proxy to support error injection when the client makes multiple requests to the same resource—useful for testing retry behavior and failure handling. (cherry picked from commit `c75acd274c`)	2025-07-13 13:17:14 +00:00
Ernest Zaslavsky	c748a97170	s3_client: Add test for Content-Range fix Introduce a test that accurately verifies the Content-Range behavior, ensuring the previous fix is properly validated. (cherry picked from commit `ec59fcd5e4`)	2025-07-13 13:17:14 +00:00
Ernest Zaslavsky	00f10e7f1d	s3_client: Fix missing negation Restore a missing `not` in a conditional check that caused incorrect behavior during S3 client execution. (cherry picked from commit `6d9cec558a`)	2025-07-13 13:17:14 +00:00
Ernest Zaslavsky	4cd1792528	s3_client: Refine logging Fix typo in log message to improve clarity and accuracy during S3 operations. (cherry picked from commit `e73b83e039`)	2025-07-13 13:17:14 +00:00
Ernest Zaslavsky	115e8c85e4	s3_client: Improve logging placement for current_range output Relocated logging to occur after determining the `current_range`, ensuring more relevant output during S3 client operations. (cherry picked from commit `f1d0690194`)	2025-07-13 13:17:14 +00:00
Gleb Natapov	087d3bb957	api: unregister raft_topology_get_cmd_status on shutdown In `c8ce9d1c60` we introduced raft_topology_get_cmd_status REST api but the commit forgot to unregister the handler during shutdown. Fixes #24910 Closes scylladb/scylladb#24911 (cherry picked from commit `89f2edf308`) Closes scylladb/scylladb#24923	2025-07-13 15:15:52 +03:00
Avi Kivity	f3297824e3	Revert "config: decrease default large allocation warning threshold to 128k" This reverts commit `04fb2c026d`. 2025.3 got the reduced threshold, but won't get many of the fixes the warning will generate, leaving it very noisy. Better to avoid the noise for this release. Fixes #24384.	2025-07-10 14:12:14 +03:00
Avi Kivity	4eb220d3ab	service: tablet_allocator: avoid large contiguous vector in make_repair_plan() make_repair_plan() allocates a temporary vector which can grow larger than our 128k basic allocation unit. Use a chunked vector to avoid stalls due to large allocations. Fixes #24713. Closes scylladb/scylladb#24801 (cherry picked from commit `0138afa63b`) Closes scylladb/scylladb#24902	2025-07-10 12:41:35 +03:00
Patryk Jędrzejczak	c9de7d68f2	Merge '[Backport 2025.3] Make it easier to debug stuck raft topology operation.' from Scylladb[bot] The series adds more logging and provides new REST api around topology command rpc execution to allow easier debugging of stuck topology operations. Backport since we want to have it in production as quickly as possible. Fixes #24860 - (cherry picked from commit `c8ce9d1c60`) - (cherry picked from commit `4e6369f35b`) Parent PR: #24799 Closes scylladb/scylladb#24881 * https://github.com/scylladb/scylladb: topology coordinator: log a start and an end of topology coordinator command execution at info level topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc	2025-07-09 12:55:48 +02:00
Piotr Dulikowski	b535f44db2	Merge '[Backport 2025.3] batchlog_manager: abort replay of a failed batch on shutdown or node down' from Scylladb[bot] When replaying a failed batch and sending the mutation to all replicas, make the write response handler cancellable and abort it on shutdown or if some target is marked down. also set a reasonable timeout so it gets aborted if it's stuck for some other unexpected reason. Previously, the write response handler is not cancellable and has no timeout. This can cause a scenario where some write operation by the batchlog manager is stuck indefinitely, and node shutdown gets stuck as well because it waits for the batchlog manager to complete, without aborting the operation. backport to relevant versions since the issue can cause node shutdown to hang Fixes scylladb/scylladb#24599 - (cherry picked from commit `8d48b27062`) - (cherry picked from commit `fc5ba4a1ea`) - (cherry picked from commit `7150632cf2`) - (cherry picked from commit `74a3fa9671`) - (cherry picked from commit `a9b476e057`) - (cherry picked from commit `d7af26a437`) Parent PR: #24595 Closes scylladb/scylladb#24882 * github.com:scylladb/scylladb: test: test_batchlog_manager: batchlog replay includes cdc test: test_batchlog_manager: test batch replay when a node is down batchlog_manager: set timeout on writes batchlog_manager: abort writes on shutdown batchlog_manager: create cancellable write response handler storage_proxy: add write type parameter to mutate_internal	2025-07-08 12:35:55 +02:00
Michael Litvak	ec1dd1bf31	test: test_batchlog_manager: batchlog replay includes cdc Add a new test that verifies that when replaying batch mutations from the batchlog, the mutations include cdc augmentation if needed. This is done in order to verify that it works currently as expected and doesn't break in the future. (cherry picked from commit `d7af26a437`)	2025-07-08 06:25:36 +00:00
Michael Litvak	7b30f487dd	test: test_batchlog_manager: test batch replay when a node is down Add a test of the batchlog manager replay loop applying failed batches while some replica is down. The test reproduces an issue where the batchlog manager tries to replay a failed batch, doesn't get a response from some replica, and becomes stuck. It verifies that the batchlog manager can eventually recover from this situation and continue applying failed batches. (cherry picked from commit `a9b476e057`)	2025-07-08 06:25:36 +00:00
Michael Litvak	c3c489d3d4	batchlog_manager: set timeout on writes Set a timeout on writes of replayed batches by the batchlog manager. We want to avoid having infinite timeout for the writes in case it gets stuck for some unexpected reason. The timeout is set to be high enough to allow any reasonable write to complete. (cherry picked from commit `74a3fa9671`)	2025-07-08 06:25:36 +00:00
Michael Litvak	6fb6bb8dc7	batchlog_manager: abort writes on shutdown On shutdown of batchlog manager, abort all writes of replayed batches by the batchlog manager. To achieve this we set the appropriate write_type to BATCH, and on shutdown cancel all write handlers with this type. (cherry picked from commit `7150632cf2`)	2025-07-08 06:25:36 +00:00
Michael Litvak	02c038efa8	batchlog_manager: create cancellable write response handler When replaying a batch mutation from the batchlog manager and sending it to all replicas, create the write response handler as cancellable. To achieve this we define a new wrapper type for batchlog mutations - batchlog_replay_mutation, and this allows us to overload create_write_response_handler for this type. This is similar to how it's done with hint_wrapper and read_repair_mutation. (cherry picked from commit `fc5ba4a1ea`)	2025-07-08 06:25:36 +00:00
Michael Litvak	d3175671b7	storage_proxy: add write type parameter to mutate_internal Currently mutate_internal has a boolean parameter `counter_write` that indicates whether the write is of counter type or not. We replace it with a more general parameter that allows to indicate the write type. It is compatible with the previous behavior - for a counter write, the type COUNTER is passed, and otherwise a default value will be used as before. (cherry picked from commit `8d48b27062`)	2025-07-08 06:25:36 +00:00
Gleb Natapov	4651c44747	topology coordinator: log a start and an end of topology coordinator command execution at info level Those calls are relatively rare and the output may help to analyze issues in production. (cherry picked from commit `4e6369f35b`)	2025-07-08 06:24:22 +00:00
Gleb Natapov	0e67f6f6c2	topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc The topology coordinator executes several topology cmd rpc against some nodes during a topology change. A topology operation will not proceed unless rpc completes (successfully or not), but sometimes it appears that it hangs and it is hard to tell on which nodes it did not complete yet. Introduce new REST endpoint that can help with debugging such cases. If executed on the topology coordinator it returns currently running topology rpc (if any) and a list of nodes that did not reply yet. (cherry picked from commit `c8ce9d1c60`)	2025-07-08 06:24:21 +00:00
Avi Kivity	859d9dd3b1	Merge '[Backport 2025.3] Improve background disposal of tablet_metadata' from Scylladb[bot] As seen in #23284, when the tablet_metadata contains many tables, even empty ones, we're seeing a long queue of seastar tasks coming from the individual destruction of `tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>`. This change improves `tablet_metadata::clear_gently` to destroy the `tablet_map_ptr` objects on their owner shard by sorting them into vectors, per- owner shard. Also, background call to clear_gently was added to `~token_metadata`, as it is destroyed arbitrarily when automatic token_metadata_ptr variables go out of scope, so that the contained tablet_metadata would be cleared gently. Finally, a unit test was added to reproduce the `Too long queue accumulated for gossip` symptom and verify that it is gone with this change. Fixes #24814 Refs #23284 This change is not marked as fixing the issue since we still need to verify that there is no impact on query performance, reactor stalls, or large allocations, with a large number of tablet-based tables. 
* Since the issue exists in 2025.1, requesting backport to 2025.1 and upwards - (cherry picked from commit `3acca0aa63`) - (cherry picked from commit `493a2303da`) - (cherry picked from commit `e0a19b981a`) - (cherry picked from commit `2b2cfaba6e`) - (cherry picked from commit `2c0bafb934`) - (cherry picked from commit `4a3d14a031`) - (cherry picked from commit `6e4803a750`) Parent PR: #24618 Closes scylladb/scylladb#24864 * github.com:scylladb/scylladb: token_metadata_impl: clear_gently: release version tracker early test: cluster: test_tablets_merge: add test_tablet_split_merge_with_many_tables token_metadata: clear_and_destroy_impl when destroyed token_metadata: keep a reference to shared_token_metadata token_metadata: move make_token_metadata_ptr into shared_token_metadata class replica: database: get and expose a mutable locator::shared_token_metadata locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction	2025-07-07 14:02:19 +03:00
Gleb Natapov	a25bd068bf	topology coordinator: do not set request_type field for truncation command if topology_global_request_queue feature is not enabled yet Old nodes do not expect global topology request names to be in request_type field, so set it only if a cluster is fully upgraded already. Closes scylladb/scylladb#24731 (cherry picked from commit `ca7837550d`) Closes scylladb/scylladb#24833	2025-07-07 11:50:55 +02:00
Benny Halevy	9bc487e79e	token_metadata_impl: clear_gently: release version tracker early No need to wait for all members to be cleared gently. We can release the version earlier since the held version may be awaited for in barriers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `6e4803a750`)	2025-07-07 09:42:29 +03:00
Benny Halevy	41dc86ffa8	test: cluster: test_tablets_merge: add test_tablet_split_merge_with_many_tables Reproduces #23284 Currently skipped in release mode since it requires the `short_tablet_stats_refresh_interval` interval. Ref #24641 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `4a3d14a031`)	2025-07-07 09:42:26 +03:00
Benny Halevy	f78a352a29	token_metadata: clear_and_destroy_impl when destroyed We have a lot of places in the code where a token_metadata_ptr is kept in an automatic variable and destroyed when it leaves the scope. Since it's a reference-counted lw_shared_ptr, the token_metadata object is rarely destroyed in those cases, but when it is, it doesn't go through clear_gently, and in particular its tablet_metadata is not cleared gently, leading to inefficient destruction of potentially many foreign_ptr:s. This patch calls clear_and_destroy_impl that gently clears and destroys the impl object in the background using the shared_token_metadata. Fixes #13381 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2c0bafb934`)	2025-07-07 09:38:17 +03:00
Benny Halevy	b647dbd547	token_metadata: keep a reference to shared_token_metadata To be used by a following patch to gently clean and destroy the token_data_impl in the background. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2b2cfaba6e`)	2025-07-07 09:34:10 +03:00
Benny Halevy	0e7d3b4eb9	token_metadata: move make_token_metadata_ptr into shared_token_metadata class So we can use the local shared_token_metadata instance for safe background destroy of token_metadata_impl:s. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `e0a19b981a`)	2025-07-07 09:30:01 +03:00
Benny Halevy	c8043e05c1	replica: database: get and expose a mutable locator::shared_token_metadata Prepare for the next patch, which will use this shared_token_metadata to make mutable_token_metadata_ptr:s Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `493a2303da`)	2025-07-07 09:27:06 +03:00
Benny Halevy	54fb9ed03b	locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction Sort all tablet_map_ptr:s by shard_id and then destroy them on each shard to prevent long cross-shard task queues for foreign_ptr destructions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `3acca0aa63`)	2025-07-07 09:27:01 +03:00
Avi Kivity	f60c54df77	storage_proxy: avoid large allocation when storing batch in system.batchlog Currently, when computing the mutation to be stored in system.batchlog, we go through data_value. In turn this goes through `bytes` type (#24810), so it causes a large contiguous allocation if the batch is large. Fix by going through the more primitive, but less contiguous, atomic_cell API. Fixes #24809. Closes scylladb/scylladb#24811 (cherry picked from commit `60f407bff4`) Closes scylladb/scylladb#24846	2025-07-05 00:37:09 +03:00
Patryk Jędrzejczak	f1ec51133e	docs: handling-node-failures: fix typo Replacing "from" is incorrect. The typo comes from recently merged #24583. Fixes #24732 Requires backport to 2025.2 since #24583 has been backported to 2025.2. Closes scylladb/scylladb#24733 (cherry picked from commit `fa982f5579`) Closes scylladb/scylladb#24832	2025-07-04 19:35:00 +02:00
Jenkins Promoter	648fe6a4e8	Update ScyllaDB version to: 2025.3.0-rc1	2025-07-03 11:35:01 +03:00
Michał Chojnowski	1bd536a228	utils/alien_worker: fix a data race in submit() We move a `seastar::promise` on the external worker thread, after the matching `seastar::future` was returned to the shard. That's illegal. If the `promise` move occurs concurrently with some operation (move, await) on the `future`, it becomes a data race which could cause various kinds of corruption. This patch fixes that by keeping the promise at a stable address on the shard (inside a coroutine frame) and only passing through the worker. Fixes #24751 Closes scylladb/scylladb#24752 (cherry picked from commit `a29724479a`) Closes scylladb/scylladb#24780	2025-07-03 10:45:51 +03:00
Avi Kivity	d5b11098e8	repair: row_level: unstall to_repair_rows_on_wire() destroying its input to_repair_rows_on_wire() moves the contents of its input std::list and is careful to yield after each element, but the final destruction of the input list still deals with all of the list elements without yielding. This is expensive as not all contents of repair_row are moved (_dk_with_hash is of type lw_shared_ptr<const decorated_key_with_hash>). To fix, destroy each row element as we move along. This is safe as we own the input and don't reference row_list other than for the iteration. Fixes #24725. Closes scylladb/scylladb#24726 (cherry picked from commit `6aa71205d8`) Closes scylladb/scylladb#24771	2025-07-03 10:44:58 +03:00
Tomasz Grabiec	775916132e	Merge '[Backport 2025.3] repair: postpone repair until topology is not busy ' from Scylladb[bot] Currently, repair_service::repair_tablets starts repair if there is no ongoing tablet operations. The check does not consider global topology operations, like tablet resize finalization. Hence, if: - topology is in the tablet_resize_finalization state; - repair starts (as there is no tablet transitions) and holds the erm; - resize finalization finishes; then the repair sees a topology state different than the actual - it does not see that the storage groups were already split. Repair code does not handle this case and it results with on_internal_error. Start repair when topology is not busy. The check isn't atomic, as it's done on a shard 0. Thus, we compare the topology versions to ensure that the business check is valid. Fixes: https://github.com/scylladb/scylladb/issues/24195. Needs backport to all branches since they are affected - (cherry picked from commit `df152d9824`) - (cherry picked from commit `83c9af9670`) Parent PR: #24202 Closes scylladb/scylladb#24783 * github.com:scylladb/scylladb: test: add test for repair and resize finalization repair: postpone repair until topology is not busy	2025-07-02 13:17:08 +02:00
Calle Wilund	46e3794bde	encryption_at_rest_test: Add exception handler to ensure proxy stop If boost test is run such that we somehow except even in a test macro such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy used, causing a use after free. (cherry picked from commit `8d37e5e24b`)	2025-07-02 10:13:08 +00:00
Calle Wilund	b7a82898f0	encryption: Ensure stopping timers in provider cache objects utils::loading cache has a timer that can, if we're unlucky, be running while the encryption context/extensions referencing the various host objects containing them are destroyed in the case of unit testing. Add a stop phase in encryption context shutdown closing the caches. (cherry picked from commit `ee98f5d361`)	2025-07-02 10:13:08 +00:00
Jenkins Promoter	76bf279e0e	Update pgo profiles - aarch64	2025-07-02 13:06:18 +03:00
Jenkins Promoter	61364624e3	Update pgo profiles - x86_64	2025-07-02 12:34:58 +03:00
Botond Dénes	6e6c00dcfe	docs: cql/types.rst: remove reference to frozen-only UDTs ScyllaDB supports non-frozen UDTs since 3.2, no need to keep referencing this limitation in the current docs. Replace the description of the limitation with general description of frozen semantics for UDTs. Fixes: #22929 Closes scylladb/scylladb#24763 (cherry picked from commit `37ef9efb4e`) Closes scylladb/scylladb#24784	2025-07-02 12:11:25 +03:00
Aleksandra Martyniuk	c26eb8ef14	test: add test for repair and resize finalization Add test that checks whether repair does not start if there is an ongoing resize finalization. (cherry picked from commit `83c9af9670`)	2025-07-01 20:26:53 +00:00
Aleksandra Martyniuk	8a1d09862e	repair: postpone repair until topology is not busy Currently, repair_service::repair_tablets starts repair if there is no ongoing tablet operations. The check does not consider global topology operations, like tablet resize finalization. This may cause a data race and unexpected behavior. Start repair when topology is not busy. (cherry picked from commit `df152d9824`)	2025-07-01 20:26:53 +00:00
Yaron Kaikov	e64bb3819c	Update ScyllaDB version to: 2025.3.0-rc0	2025-07-01 10:34:39 +03:00
Anna Stuchlik	b61641cf57	doc: remove support for Ubuntu 20.04 Fixes https://github.com/scylladb/scylladb/issues/24564 Closes scylladb/scylladb#24565	2025-06-30 12:33:29 +02:00
Nadav Har'El	7db5e9a3e9	test/cqlpy: reproducer for decimal parsing with very high exponent This patch adds tests reproducing issue #24581, where Scylla incorrectly parsed "decimal"-type literals in CQL with very high exponents, near or above the 32-bit limit. For example, 1.1234e-2147483647 was incorrectly read as 1.1234E+2147483649, while it should be (as we explain in comments in the test) an error. The tests in this patch failed (in multiple checks) before #24581 was fixed, and pass after it was fixed. These tests all pass on Cassandra 3, confirming our understanding on the limits of "decimal" to be correct. But they fail on Cassandra 4 and 5 due to a regression https://issues.apache.org/jira/browse/CASSANDRA-20723 in Cassandra, that mistakenly limited "decimal" exponents to just 309. Refs #24581 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24646	2025-06-30 10:37:13 +03:00
Anna Stuchlik	b7683d0eba	doc: remove duplicated content This commit removes the Non-Reserved CQL Keywords and Reserved CQL Keywords pages-keyword as that content is already covered on the Appendices page. Redirections are added to avoid 404s for the removed pages. In addition, the Appendices page title is extended with "Reserved CQL Keywords and Types" to help users understand what those appendices are about. Fixes https://github.com/scylladb/scylladb/issues/24319 Closes scylladb/scylladb#24320	2025-06-30 10:30:13 +03:00
Botond Dénes	ee6d7c6ad9	test/boost/memtable_test: only inject error for test table Currently the test indiscriminately injects failures into the flushes of any table, via the IO extension mechanism. The tests want to check that the node correctly handles the IO error by self isolating, however the indiscriminate IO errors can have unintended consequences when they hit raft, leading to disorderly shutdown and failure of the tests. Testing raft's resiliency to IO errors is of course worth doing, but it is not the goal of this particular test, so to avoid the fallout, the IO errors are limited to the test tables only. Fixes: https://github.com/scylladb/scylladb/issues/24637 Closes scylladb/scylladb#24638	2025-06-30 10:08:49 +03:00
Avi Kivity	07c5edcc30	tools: add patchelf utility We use patchelf to rewrite the dynamic loader (known as the interpreter) of the binaries we ship, so we can point to our shipped dynamic loader, which is compatible with our binaries, rather than rely on the distribution's dynamic loader, which is likely to be incompatible. Upstream patchelf is losing compatibility [1] with Linux 5.17 and below. This change was also picked up by Fedora 42, so we cannot update the toolchain to that distribution until we have an alternative. Here we add a minimal patchelf alternative. It was mostly written by Claude. It is minimal in that it only supports --set-interpreter and --print-interpreter, and works well enough for our needs. We still use the original patchelf for --remove-rpath; this reduces our maintenance needs. [1] `43b75fbc9f` [2] `4b015255d1` Closes scylladb/scylladb#24695	2025-06-30 07:24:05 +03:00
Avi Kivity	e2cda38b0f	Merge 'alternator: improve, document and test table/index name lengths' from Nadav Har'El Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255 characters each, Alternator currently has different (and lower) limitations: 1. A table name must be up to 222 characters. 2. For a GSI, the sum of the table's and GSI's name length, plus 1, must be up to 222 characters. 3. For an LSI, the sum of the table's and LSI's name length, plus 2, must be up to 222 characters. The first patch documents these existing limitations, improves their testing, and fixes a tiny bug found by one of the tests (where UpdateTable adding a GSI's limit testing is off by one). The second patch unfortunately shows with a reproducer (issue #24598) this limit of 222 is problematic and we may need to lower it: If a user creates a table of length 222 and then enables Alternator streams, Scylla shuts down on an IO error. This will need to be fixed later, but at least this patch properly documents the existing behavior. No need to backport this patch - it is a very minor improvement that it is unlikely users care about and there is no potential for harm. Closes scylladb/scylladb#24597 * github.com:scylladb/scylladb: test/alternator: reproducer for streams bug with long table name alternator: improve, document and test table/index name lengths	2025-06-29 18:53:48 +03:00
Avi Kivity	b33dd2bd7d	Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mismatch results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery. Add a full-stack test which checks that rows with bad keys are correctly handled. Fixes: https://github.com/scylladb/scylladb/issues/24489 The bug is present in all versions, has to be backported to all supported versions.
Closes scylladb/scylladb#24492 * github.com:scylladb/scylladb: test/boost/sstable_datafile_test: add test for corrupt data sstables/mx/writer: handler rows with empty keys test/lib/cql_assertions: introduce columns_assertions sstables: add corrupt_data_handler to sstables::sstables tools/scylla-sstable: make large_data_handler a local db: introduce corrupt_data_handler mutation: introduce frozen_mutation_fragment_v2 mutation/mutation_partition_view: read_{clustering,static}_row(): return row type mutation/mutation_partition_view: extract de-ser of {clustering,static} row idl-compiler.py: generate skip() definition for enums serializers idl: extract full_position.idl from position_in_partition.idl db/system_keyspace: add apply_mutation() db/system_keyspace: introduce the corrupt_data table	2025-06-29 18:18:36 +03:00
Avi Kivity	48d9f3d2e3	Merge 'mutation: check key of inserted rows' from Botond Dénes Make sure the keys are full prefixes as it is expected to be the case for rows. On several occasions we have seen empty row keys make their way into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations. Fixes: https://github.com/scylladb/scylladb/issues/24506 Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters. Closes scylladb/scylladb#24497 * github.com:scylladb/scylladb: mutation: check key of inserted rows compound: optimize is_full() for single-component types	2025-06-29 18:10:17 +03:00
Pavel Emelyanov	ef396ecf7a	api: Reserve resulting vector with schema versions The get_schema_versions handler gets unordered_map from storage service, then converts it to API returning type, which is a vector. This vector can be reserved, the final number of elements is known in advance. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24715	2025-06-29 14:37:45 +03:00
Nadav Har'El	50d370f06e	test/alternator: reproducer for streams bug with long table name The two tests in this patch reproduce issue #24598: When enabling Alternator streams on an Alternator table with a very long name, such as the maximum allowed name length 222, the result is an I/O error and a Scylla shutdown. The two tests are currently marked "skip", otherwise they would crash the Scylla being tested. Refs #24598 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-06-29 11:40:55 +03:00
Nadav Har'El	0ce0b2934f	alternator: improve, document and test table/index name lengths Whereas DynamoDB limits the names of tables, LSIs and GSIs to 255 characters each, Alternator currently has different (and lower) limitations: 1. A table name must be up to 222 characters. 2. For a GSI, the sum of the table's and GSI's name length, plus 1, must be up to 222 characters. 3. For an LSI, the sum of the table's and LSI's name length, plus 2, must be up to 222 characters. These specific limitations were never documented, so in this patch we add this information to docs/alternator/compatibility.md. Moreover, these limitations were only partially tested, so in this patch we add testing for more cases that we forgot to check - such as length of LSI names (only GSI were checked before this patch), or adding a GSI to an existing table. It is important to check all these corner cases because there is a risk that if we attempt to create a table without checking its length, we can end up with an I/O error that brings down Scylla. In one case - UpdateTable adding a GSI to an existing table - the new test exposed a trivial bug: Because UpdateTable wants to verify the new GSI doesn't have the same name as an existing LSI, it mistakenly applied the LSI's name length limit instead of the GSI's name length limit, which is one byte less than it should be. So this patch fixes this trivial bug as well. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-06-29 11:40:55 +03:00
Emil Maskovsky	c6307aafd5	test.py: handle cancellation gracefully to avoid TypeError Previously, if test execution was cancelled, `run_all_tests()` could return `None`. This caused a `TypeError` when the result was unconditionally unpacked into `total_tests_pytest, failed_pytest_tests`. This commit updates the code to handle the cancellation appropriately, preventing the confusing `TypeError` exception and ensuring clean cancellation behavior. Closes scylladb/scylladb#24624	2025-06-27 20:14:35 +03:00
Pavel Emelyanov	23d86ede72	Merge 'audit: introduce debug level logs on happy path' from Dario Mirovic Audit component defines `audit` logger which it uses only for `error` and `info` logs, regarding `audit` module initialization and errors during audit log writing. This change introduces `debug` level logs on the happy path of audit log writes. Fixes: https://github.com/scylladb/scylladb/issues/23773 No backport needed - this is a small quality-of-life improvement. Closes scylladb/scylladb#24658 * github.com:scylladb/scylladb: audit: change audit test logger level to `debug` audit: introduce debug level logs on happy path	2025-06-27 20:10:54 +03:00
Anna Stuchlik	2367330513	doc: remove OSS mention from the SI notes This commit removes a confusing reference to an Open Source version from the Local Secondary Indexes page. Fixes https://github.com/scylladb/scylladb/issues/24668 Closes scylladb/scylladb#24673	2025-06-27 20:07:51 +03:00
Anna Stuchlik	7537f5f260	doc: fix the headings in the Admin Guide This commit fixes incorrect headings in the Admin Guide and the files that are included in that guide. The purpose is to properly organize the content and improve the search, as well as prevent potential build problems caused by a poor heading organization. Fixes https://github.com/scylladb/scylladb/issues/24441 Closes scylladb/scylladb#24700	2025-06-27 20:07:09 +03:00
Dario Mirovic	ec6249b581	audit: change audit test logger level to `debug` Audit module tests should show the `debug` level messages. This change makes audit_test.py `audit` module log level to `debug`. Closes scylladb/scylladb#23773	2025-06-27 16:27:33 +02:00
Dario Mirovic	666364f651	audit: introduce debug level logs on happy path Audit component defines `audit` logger which it uses only for `error` and `info` logs, regarding `audit` module initialization and errors during audit log writing. This change introduces `debug` level logs on the happy path of audit log writes. Ref: scylladb/scylladb#23773	2025-06-27 16:27:27 +02:00
Botond Dénes	495f607e73	test/cluster/test_read_repair: write 100 rows in trace test This test asserts that a read repair really happened. To ensure this happens it writes a single partition after enabling the database_apply error injection point. For some reason, the write is sometimes reordered with the error injection and the write will get replicated to both nodes and no read repair will happen, failing the test. To make the test less sensitive to such rare reordering, add a clustering column to the table and write 100 rows. The chance of all 100 of them being reordered with the error injection should be low enough that it doesn't happen again (famous last words). Fixes: #24330 Closes scylladb/scylladb#24403	2025-06-27 16:23:08 +03:00
Pavel Emelyanov	4c0154f156	Merge 'test.py: enhance allure reporting' from Andrei Chekun Add a run ID to the process output file so that it is not overwritten in the following case: the first run failed and the second passed. Both runs use the same file name, so the second run would overwrite and delete the file. This helps to investigate C++ test failures. Attach Scylla log files to the allure report when a test fails. This is an alternative to the link in the JUnit report that exists in CI. That change will help to investigate cluster test failures. An example can be found in the failed [job](https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/2980/allure/). Backport is not needed, these are only framework enhancements Closes scylladb/scylladb#24677 * github.com:scylladb/scylladb: test.py: Attach node logs in allure report in case of fail test.py: Add run id to the boost output file	2025-06-27 16:22:03 +03:00
Botond Dénes	e715a150b9	tools/scylla-nodetool: backup: add --move-files parameter Allow opting in for backup to move the files instead of copying them. Fixes: https://github.com/scylladb/scylladb/issues/24372 Closes scylladb/scylladb#24503	2025-06-27 16:21:39 +03:00
Piotr Dulikowski	9d70e7a067	Merge 'docs: document the new recovery procedure' from Patryk Jędrzejczak We replace the documentation of the old recovery procedure with the documentation of the new recovery procedure. The new recovery procedure requires the Raft-based topology to be enabled, so to remove the old procedure from the documentation, we must assume users have the Raft-based topology enabled. We can do it in 2025.2 because the upgrade guides to 2025.1 state that enabling the Raft-based topology is a mandatory step of the upgrade. Another reminder is the upgrade guides to 2025.2. Since we rely on the Raft-based topology being enabled, we remove the obsolete parts of the documentation. We will make the Raft-based topology mandatory in the code in the future, hopefully in 2025.3. For this reason, we also don't touch the dev docs in this PR. Fixes scylladb/scylladb#24530 Requires backport to 2025.2 because 2025.2 contains the new recovery procedure. Closes scylladb/scylladb#24583 * github.com:scylladb/scylladb: docs: rely on the Raft-based topology being enabled docs: handling-node-failures: document the new recovery procedure	2025-06-26 17:07:37 +02:00
Gleb Natapov	5f953eb092	storage_proxy: retry paxos repair even if repair write succeeded After paxos state is repaired in begin_and_repair_paxos we need to re-check the state regardless if write back succeeded or not. This is how the code worked originally but it was unintentionally changed when co-routinized in `61b2e41a23`. Fixes #24630 Closes scylladb/scylladb#24651	2025-06-26 17:06:02 +02:00
Andrei Chekun	2c726c5074	test.py: Attach node logs in allure report in case of fail Currently, the allure report has no node logs when a test fails. Attaching them allows viewing the logs in one place without going anywhere else.	2025-06-26 15:37:33 +02:00
Piotr Dulikowski	2f7ed8b1d4	Merge 'Fix for cassandra role gets recreated after DROP ROLE' from Marcin Maliszkiewicz This patchset fixes regression introduced by `7e749cd848` when we started re-creating default superuser role and password from the config, even if new custom superuser was created by the user. Now we'll check, first with CL LOCAL_ONE if there is a need to create default superuser role or password, confirm it with CL QUORUM and only then atomically create role or password. If server is started without cluster quorum we'll skip creating role or password. Fixes https://github.com/scylladb/scylladb/issues/24469 Backport: all versions since 2024.2 Closes scylladb/scylladb#24451 * github.com:scylladb/scylladb: test: auth_cluster: add test for password reset procedure auth: cache roles table scan during startup test: auth_cluster: add test for replacing default superuser test: pylib: add ability to specify default authenticator during server_start test: pylib: allow rolling restart without waiting for cql auth: split auth-v2 logic for adding default superuser password auth: split auth-v2 logic for adding default superuser role auth: ldap: fix waiting for underlying role manager auth: wait for default role creation before starting authorizer and authenticator	2025-06-26 14:36:25 +02:00
Lakshmi Narayanan Sreethar	279253ffd0	utils/big_decimal: fix scale overflow when parsing values with large exponents The exponent of a big decimal string is parsed as an int32, adjusted for the removed fractional part, and stored as an int32. When parsing values like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32 limit, and since the scale is stored as an int32, it overflows and wraps around, losing the value. This patch fixes that by parsing the exponent as an int64 value and then adjusting it for the fractional part. The adjusted scale is then checked to see if it is still within int32 limits before storing. An exception is thrown if it is not within the int32 limits. Note that strings with exponents that exceed the int32 range, like `0.01E2147483650`, were previously not parseable as a big decimal. They are now accepted if the final adjusted scale fits within int32 limits. For the above value, unscaled_value = 1 and scale = -2147483648, so it is now accepted. This is in line with how Java's `BigDecimal` parses strings. Fixes: #24581 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#24640	2025-06-26 15:29:28 +03:00
Patryk Jędrzejczak	203ea5d8f9	docs: rely on the Raft-based topology being enabled In 2025.2, we don't force enabling the Raft-based topology in the code, but we stated in the upgrade guides that it's a mandatory step of the upgrade to 2025.1. We also remind users to enable the Raft-based topology in the upgrade guides to 2025.2. Hence, we can rely in the documentation on the Raft-based topology being enabled. If it is still disabled, we can just send the user to the upgrade guides. Hence: - we remove all documentation related to enabling the Raft-based topology, enabling the Raft-based schema (enabled Raft-based topology implies enabled Raft-based schema), and the gossip-based topology, - we can replace the documentation of the old manual recovery procedure with the documentation of the new manual recovery procedure (done in the previous commit).	2025-06-26 14:17:54 +02:00
Patryk Jędrzejczak	4e256182a0	docs: handling-node-failures: document the new recovery procedure We replace the documentation of the old recovery procedure with the documentation of the new recovery procedure. We can get rid of the old procedure from the documentation because we requested users to enable the Raft-based topology during upgrades to 2025.1 and 2025.2. We leave the note that enabling the Raft-based topology is required to use the new recovery procedure just in case, since we didn't force enabling the Raft-based topology in the code.	2025-06-26 14:17:50 +02:00
Andrei Chekun	156e7d2e7a	test.py: Add run id to the boost output file To avoid overwriting the test output, add the run id to its file name. Previously, when the first repeat failed and the second passed, the output was overwritten and deleted because both repeats used the same file name and the second one passed.	2025-06-26 12:51:15 +02:00
Marcin Maliszkiewicz	5e7ac34822	test: auth_cluster: add test for password reset procedure	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	0ffddce636	auth: cache roles table scan during startup It may be particularly beneficial during connection storms on startup. In such cases, it can happen that none of the user's read requests succeed, preventing the cache from being populated. This, in turn, makes it more difficult for subsequent reads to succeed, reducing resiliency against such storms.	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	67a4bfc152	test: auth_cluster: add test for replacing default superuser This test demonstrates creating custom superuser guide: https://opensource.docs.scylladb.com/stable/operating-scylla/security/create-superuser.html	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	a3bb679f49	test: pylib: add ability to specify default authenticator during server_start Sometimes we may not want to use default cassandra role for control connection, especially when we test dropping default role.	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	d9ec746c6d	test: pylib: allow rolling restart without waiting for cql Waiting for CQL requires default superuser being present in db. In some cases we may delete it and still want to do rolling restart. Additionally if we need CQL we may want to wait after restart is complete (once, and not for each node).	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	f85d73d405	auth: split auth-v2 logic for adding default superuser password In raft mode (auth-v2) we need to do atomic write after read as we give stricter consistency guarantees. Instead of patching legacy logic this commit adds a different path as: - old code may be less tested now so it's best to not change it - new code path avoids quorum selects in a typical flow (passwords set) There may be a case when a user deletes a superuser or password right before restarting a node; in such a case we may omit updating a password but: - this is a trade-off between quorum reads on startup - it's far more important to not update password when it shouldn't be - if needed password will be updated on next node restart If there is no quorum on startup we'll skip creating password because we can't perform any raft operation. Additionally this fixes a problem when password is created despite having non default superuser in auth-v2.	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	2e2ba84e94	auth: split auth-v2 logic for adding default superuser role In raft mode (auth-v2) we need to do atomic write after read as we give stricter consistency guarantees. Instead of patching legacy logic this commit adds different path as: - old code may be less tested now so it's best to not change it - new code path avoids quorum selects in a typical flow (roles set) This fixes a problem when superuser role is created despite having non default superuser in auth-v2. If there is no quorum on startup we'll skip creating role because we can't perform any raft operation.	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	c96c5bfef5	auth: ldap: fix waiting for underlying role manager ldap_role_manager depends on standard_role_manager, therefore it needs to wait for superuser initialization. If this is missing, the password authenticator will start checking the default password too early and may fail to create the default password if there is no default role yet. Currently password authenticator will create password together with the role in such case but in following commits we want to separate those responsibilities correctly.	2025-06-26 12:28:08 +02:00
Marcin Maliszkiewicz	68fc4c6d61	auth: wait for default role creation before starting authorizer and authenticator There is a hidden dependency: the creation of the default superuser role is split between the password authenticator and the role manager. To work correctly, they must start in the right order: role manager first, then password authenticator.	2025-06-26 12:28:08 +02:00
Piotr Dulikowski	62efe6616a	Merge 'mapreduce: add tablet-aware dispatching algorithm' from Andrzej Jackowski The primary motivation for this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB is generally transitioning towards tablets, and using tablets simplifies work dispatching, the decision was made to design the new algorithm specifically for tablets. The goal of the algorithm is to divide the work in such a way that each `tablet_replica` (that is, a <host, shard> pair) processes two tablets at a time. The new algorithm can be summarized as follows: 1. Prepare a tablet_replica -> partition_range mapping where the values cover the entire space. 2. For each tablet_replica, in parallel, take two partition ranges and dispatch them to the node hosting the replica. The ERM is released and re-acquired in each iteration, allowing the destination (i.e., tablet_replica) to change for each partition range (in such cases, the partition range is assigned to the appropriate tablet_replica). In step 1, the main difference compared to the old algorithm (dispatch_to_vnodes) is that partition ranges are assigned to a tablet_replica rather than just the host. In step 2, the main difference is that the work is divided into smaller batches, and the ERM is released and re-acquired for each batch. In the current implementation, each node can correctly handle every partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. This is because mapreduce_service::execute_on_this_shard creates a new pager that coordinates the partition range read, including obtaining its own ERM. However, every partition range that is absent locally is handled by shard 0. Therefore, proper routing of partition ranges is necessary to avoid shard 0 overload. 
This is why, in step 2, the ERM is retained during each batch processing, and the tablet_replica is refreshed for each processed range. Additionally, shard_id is added to the mapreduce request. When shard_id is set, the entire partition range is handled by the specified shard. As the new tablet-aware mapreduce algorithm balances the workload across shards, shard_id ensures that the balance is preserved, even during events such as tablet splits. This patch series: - Slightly refactors the mapreduce service, to facilitate having two algorithm versions (one for vnodes and one for tablets). - Implements tablet-aware dispatching algorithm. - Adds shard_id to mapreduce request and uses the information to handle requests entirely by selected shard. - Adds test_long_query_timeout_erm to verify the new functionality. Fixes: scylladb#21831 No backport, as it is a new feature rather than a bugfix. Closes scylladb/scylladb#24383 * github.com:scylladb/scylladb: mapreduce: add missing comma and space in mapreduce_request operator<< mapreduce: add shard_id_hint to mapreduce request test: add test_long_query_timeout_erm mapreduce: add tablet-aware dispatching algorithm storage_proxy: make storage_proxy::is_alive public mapreduce: remove _shared_token_metadata from mapreduce_service mapreduce: move dispatching logic to dispatch_to_vnodes mapreduce: remove underscores from variable names mapreduce: move req_with_modified_pr handling to a new function mapreduce: change next_vnode lambda to get_next_partition_range function	2025-06-26 12:25:39 +02:00
Avi Kivity	947906e6fd	Merge 'Make uuid sstable generations mandatory' from Benny Halevy As a step towards eradicating the numerical sstable generations, this series completes https://github.com/scylladb/scylladb/issues/20337 by disabling the use of numerical sstable generations where we can and making sure the feature is never disabled. Note that until the cluster feature is enabled in the startup process on first boot, numerical generation might be used for local system tables. Refs #24248 * Enhancement. No backport required Closes scylladb/scylladb#24554 * github.com:scylladb/scylladb: feature_service: never disable UUID_SSTABLE_IDENTIFIERS test: sstable_move_test: always use uuid sstable generation test: sstable_directory_test: always use uuid sstable generation sstables: sstable_generation_generator: set last_generation=0 by default test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation test: lib: test_env: always use uuid sstable generation test: sstable_test: always use uuid sstable generation test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation test: sstable_compaction_test: always use uuid sstable generation	2025-06-26 12:25:38 +02:00
Szymon Malewski	f28bab741d	utils/exceptions.cc: Added check for `exceptions::request_timeout_exception` in `is_timeout_exception` function. This solves the issue where, in some cases, timeout exceptions in CAS operations are logged incorrectly as a general failure. Fixes #24591 Closes scylladb/scylladb#24619	2025-06-26 12:25:38 +02:00
Pavel Emelyanov	0f5b358c47	test: Use test sched groups, not database ones Some tests want to switch between sched groups. For that there's cql-test-env facility to create and use them. However, there's a test that uses replica::database as sched groups provider, which is not nice. Fix it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24615	2025-06-26 12:25:38 +02:00
Avi Kivity	ff508ce82c	Merge 'sstables: purge SCYLLA_ASSERT from the sstable read/parse paths' from Botond Dénes Introduce `sstables::parse_assert()`, to replace `SCYLLA_ASSERT()` on the read/parse path. SSTables can get corrupt for various reasons, some outside of the database's control. A bad SSTable should not bring down the database; the parsing should simply be aborted, with as much information printed as possible for the investigation of the nature of the corruption. The newly introduced `parse_assert()` uses `on_internal_error()` under the hood, which prints a backtrace and optionally allows aborting on the error, to generate a coredump. Fixes https://github.com/scylladb/scylladb/issues/20845 We just hit another case of `SCYLLA_ASSERT()` triggering due to corrupt sstables bringing down nodes in the field. This should be backported to all releases, so we don't hit this in the future Closes scylladb/scylladb#24534 * github.com:scylladb/scylladb: sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path sstables/exceptions: introduce parse_assert()	2025-06-26 12:25:38 +02:00
Ferenc Szili	96267960f8	logging: Add row count to large partition warning message When writing large partitions, that is: partitions with size or row count above a configurable threshold, ScyllaDB outputs a warning to the log: WARN ... large_data - Writing large partition test/test: (1200031 bytes) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db This warning contains the information about the size of the partition, but it does not contain the number of rows written. This can lead to confusion because in cases where the warning was written because of the row count being larger than the threshold, but the partition size is below the threshold, the warning will only contain the partition size in bytes, leading the user to believe the warning was output because of the partition size, when in reality it was the row count that triggered the warning. See #20125 This change adds a size_desc argument to cql_table_large_data_handler::try_record(), which will contain the description of the size of the object written. This method is used to output warnings for large partitions, row counts, row sizes and cell sizes. This change does not modify the warning message for row and cell sizes, only for partition size and row count. The warning for large partitions and row counts will now look like this: WARN ... large_data - Writing large partition test/test: (1200031 bytes/100001 rows) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db Closes scylladb/scylladb#22010	2025-06-26 12:25:38 +02:00
Yaniv Michael Kaul	198ecd8039	Do not perform blkdiscard by default on the disks during RAID setup. This is not needed on clean disks, which is often the case with cloud instances, but can be useful on bare metal servers with disks that were used before. Therefore, the default is to skip blkdiscard operation, which makes overall installation faster. If the user wishes to run it anyway, use the newly introduced --blkdiscard option of scylla_raid_setup to perform it. Note: since we either perform online discard or schedule fstrim, the (previously used) space will gradually get trimmed, this way or another. Fixes: https://github.com/scylladb/scylladb/issues/24470 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#24579	2025-06-26 12:25:38 +02:00
Piotr Dulikowski	23f0d275c8	Merge 'generic_server: fix connections semaphore config observer' from Marcin Maliszkiewicz In `ed3e4f33fd` we introduced a new connection throttling feature which is controlled by the uninitialized_connections_semaphore_cpu_concurrency config. But live updating of it was broken; this patch fixes it. When the temporary value from observer() is destroyed, it disconnects from updateable_value, so observation stops right away. We need to retain the observer. Backport: to 2025.2 where this feature was added Fixes: https://github.com/scylladb/scylladb/issues/24557 Closes scylladb/scylladb#24484 * github.com:scylladb/scylladb: test: add test for live updates of generic server config utils: don't allow to discard updateable_value observer generic_server: fix connections semaphore config observer	2025-06-26 12:25:38 +02:00
Andrzej Jackowski	ba6ed45d7f	mapreduce: add missing comma and space in mapreduce_request operator<< This change is introduced to fix the broken formatting of mapreduce_request `operator<<`. Due to the lack of ", " before "cmd", the output was `reductions=[...]cmd=read_command{...}` instead of `reductions=[...], cmd=read_command{...}`.	2025-06-25 19:23:07 +02:00
Andrzej Jackowski	26403df9ea	mapreduce: add shard_id_hint to mapreduce request If a partition range is not present locally, `partition_ranges_owned_by_this_shard` assigns it to shard 0, which can overload shard 0. To address this, this commit adds a `shard_id_hint` to the mapreduce request. When `shard_id_hint` is set, the entire partition range in the request is handled by the specified shard. The `shard_id_hint` is set by the new tablet-aware mapreduce algorithm, introduced in `dispatch_to_tablets`. This algorithm balances the workload across shards, so the changes in this commit ensure that load balancing is preserved, even during events such as tablet splits. Fixes: scylladb#21831	2025-06-25 19:23:07 +02:00
Andrzej Jackowski	5f31011111	test: add test_long_query_timeout_erm This test verifies the effectiveness of the mechanism for releasing ERM introduced in this patch series. In test scenario, during processing of a query in mapreduce service, reads are intentionally blocked by an injected error. However, when table uses tablets, ERM is now often released by the mapreduce service, so the topology is not blocked to the end of the request. As a result, it is possible to add a new node before the query finishes. Refs. scylladb#21831	2025-06-25 19:22:48 +02:00
Robert Bindar	6e7cab5b45	Add repository layout dev documentation This change adds an md file which gives a high level overview of the scylladb repository, the components each path contains and a basic description for each one of them. This is mainly intended for onboarding engineers to help get a mental picture when starting ramping up on Scylla concepts. Refs #22908 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#23010	2025-06-25 13:58:05 +03:00
Patryk Jędrzejczak	cc8c618356	Merge 'LWT for tablets: fix paxos state for intranode migration' from Petr Gusev This PR fixes the "intra-node tablet migration" issue from the [LWT over tablets spec](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.uk3mizf7gvs1). We make `get_replica_lock` to acquire locks on both shards to avoid races. We also implement read_repair for paxos state -- if `load_paxos_state` returns different states on two shards, we 'repair' it by choosing the values with maximum timestamp and writing the 'repaired' state to both shards. LWT for tablets is not enabled yet. It requires migrating paxos state to colocated tablets, which is blocked on [this PR](https://github.com/scylladb/scylladb/pull/22906). Regarding testing: * We could possibly arrange a test case for the locking commit through some error injection magic. We'll return to this when LWT for tablets is enabled. * We can't think of a clear test case for the read_repair commit. Any suggestions are welcome (@gleb-cloudius). Backport: no need, since it's a new feature. Closes scylladb/scylladb#24478 * https://github.com/scylladb/scylladb: paxos_state: read repair for intranode_migration paxos_state: fix get_replica_lock for intranode_migration	2025-06-25 11:08:39 +02:00
Sergey Zolotukhin	0d7de90523	Fix regexp in `check_node_log_for_failed_mutations` The regexp that was added in https://github.com/scylladb/scylladb/pull/23658 does not work as expected: `TRACE`, `INFO` and `DEBUG` level messages are not ignored. This patch corrects the pattern to ensure those log levels are excluded. Fixes scylladb/scylladb#23688 Closes scylladb/scylladb#23889	2025-06-25 12:00:16 +03:00
Anna Stuchlik	592d45a156	doc: remove references to Open Source from README This commit removes the references to ScyllaDB Open Source from the README file for documentation. In addition, it updates the link where the documentation is currently published. We've removed Open Source from all the documentation, but the README was missed. This commit fixes that. Closes scylladb/scylladb#24477	2025-06-25 11:38:46 +03:00
Michał Chojnowski	cace55aaaf	test_sstable_compression_dictionaries_basic.py: fix a flaky check test_dict_memory_limit trains new dictionaries and checks (via metrics) that the old dictionaries are appropriately cleaned up. The problem is that the cleanup is asynchronous (because the lifetimes are handled by foreign_ptr, which sends the destructor call to the owner shard asynchronously), so the metrics might be checked a few milliseconds before the old dictionary is cleaned up. The dict lifetimes are lazy on purpose, the right thing to do is to just let the test retry the check. Fixes scylladb/scylladb#24516 Closes scylladb/scylladb#24526	2025-06-25 11:30:28 +03:00
Amnon Heiman	51cf2c2730	api/failure_detector.cc: stream endpoints Previously, get_all_endpoint_states accumulated all results in memory, which could lead to large allocations when dealing with many endpoints. This change uses the stream_range_as_array helper to stream the results. Fixes #24386 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes scylladb/scylladb#24405	2025-06-25 11:28:37 +03:00
Guy Shtub	71ba1f8bc9	docs: update third party driver list with Exandra Elixir driver Closes scylladb/scylladb#24260	2025-06-25 11:27:03 +03:00
Kefu Chai	e212b1af0c	build: add p11-kit's cflags to user_cflags instead of args.user_cflags Fix an issue introduced in commit `083f7353` where p11-kit's compiler flags were incorrectly added to `args.user_cflags` instead of `user_cflags`. This created the following problem: When using CMake generation mode, these flags were added to `CMAKE_CXX_FLAGS`, causing them to be passed to all compiler invocations including linking stages where they were irrelevant. This change moves p11-kit's cflags to `user_cflags`, which ensures the flags are correctly included in compilation commands but not in linking commands. This maintains the proper behavior in the ninja build system while fixing the issue in the CMake build system. `args.user_cflags` is preserved for its intended purpose of storing user-specified compiler flags passed via command line options. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23988	2025-06-25 11:24:09 +03:00
Andrzej Jackowski	ea2bdae45a	mapreduce: add tablet-aware dispatching algorithm The primary goal of this change is to reduce the time during which the Effective Replication Map (ERM) is retained by the mapreduce service. This ensures that long aggregate queries do not block topology operations. As ScyllaDB transitions towards tablets, which simplify work dispatching, the new algorithm is designed specifically for tablets. The algorithm divides work so that each `tablet_replica` (a <host, shard> pair) processes two tablets at a time. After processing of each `tablet_replica`, the ERM is released and re-acquired. The new algorithm can be summarized as follows: 1. Prepare a set of exclusive `partition_ranges`, where each range represents one tablet. This set is called `ranges_left`, because it contains ranges that still need processing. 2. Loop until `ranges_left` is empty: I. Create `tablet_replica` -> `ranges` mapping for the current ERM and `ranges_left`. Store this mapping and the number representing current ERM version as `ranges_per_replica`. II. In parallel, for each tablet_replica, iterate through ranges_per_tablet_replica. Select independently up to two ranges that still exist in ranges_left. Remove each range selected for processing from ranges_left. Before each iteration, verify that ERM version has not changed. If it has, return to Step I. Steps I and II are exclusive to simplify maintaining `ranges_left` and `ranges_per_replica`: - Step I iterates through `ranges_left` and creates `ranges_per_replica` - Step II iterates through `ranges_per_replica` and removes processed ranges from `ranges_left` To maintain the exclusivity, the algorithm uses `parallel_for_each` in Step II, requiring all ongoing `tablet_replica` processing to finish before returning to Step I. Currently, each node can handle any partition range, even if the mapreduce supercoordinator does not retain the ERM and the range is absent locally. 
This is because `execute_on_this_shard` creates a new pager to coordinate the partition range read, including obtaining its own ERM. However, absent ranges are handled by shard 0, so proper routing is necessary to avoid overloading shard 0. Thus, in Step II, the ERM is retained during each `tablet_replica` processing. The tablet split scenario is not well-handled in this implementation. After a split, the entire pre-split range is sent to a node hosting the `tablet_replica` containing the range's `end_token`. The node will typically not have other tablets in the range, and as mentioned above, absent ranges are handled by shard 0. As a result, in such a scenario, shard 0 handles a significant portion of the range. This issue is addressed later in this patch series by introducing `shard_id` in `mapreduce_request`. Ref. scylladb#21831	2025-06-25 10:18:02 +02:00
Kefu Chai	7d4dc12741	build: cmake: Use LINKER: prefix for consistent linker option handling Previously, we passed dynamic linker options like "-dynamic-linker=..." directly to the compiler driver with padded paths. This approach created inconsistency with the build commands generated by `configure.py`. This change implements a more consistent approach by: - Using the CMake "LINKER:" prefix to mark options that should be passed directly to the linker - Ensuring Clang properly receives these options via the `-Xlinker` flag The result is improved consistency between CMake-generated build commands and those created by `configure.py`, making the build system more maintainable and predictable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23987	2025-06-25 11:17:15 +03:00
Nadav Har'El	16c1365332	test,alternator: test server-side load balancing with zero-token node In issue #6527 it was suggested that a zero-token node (a.k.a. coordinator-only node, or data-less node) could serve as a topology-aware Alternator load balancer - requests could be sent to it and they will be forwarded to the right node. This feature was implemented, but we never tested that it actually works for Alternator requests. So this patch tests this by starting a 5-node cluster with 4 regular nodes and one zero-token node, and testing that requests to the zero-token node work as expected. It is important to know that this feature does indeed work as expected, and also to have a regression test for it so the feature doesn't break in the future. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23114	2025-06-25 11:13:15 +03:00
Pablo Idiaquez	8137f34424	docs: troubleshooting/report-scylla-problem.rst: fix upload URL The URL/hostname was wrong: it pointed to a deprecated S3 bucket (we now use a GCP bucket for uploads). Fixes scylladb/scylladb#24639 Closes scylladb/scylladb#23533	2025-06-25 10:32:37 +03:00
Andrzej Jackowski	6d358cd7b2	storage_proxy: make storage_proxy::is_alive public The motivation is to allow other components (specifically mapreduce service) to use the method, just as storage_proxy::get_live_endpoints.	2025-06-25 08:59:04 +02:00
Andrzej Jackowski	9dbb1468b4	mapreduce: remove _shared_token_metadata from mapreduce_service Before this change, `mapreduce_service` used `_shared_token_metadata` to get the topology. However, the token was used in a part of the code that already had its own ERM with its own metadata token. Moreover, as mapreduce_service's token and ERM's token are not guaranteed to be the same, inconsistencies could occur. Therefore, this commit removes `_shared_token_metadata` and its usage.	2025-06-25 08:42:16 +02:00
Andrzej Jackowski	94ce5a0ed6	mapreduce: move dispatching logic to dispatch_to_vnodes This commit moves the current dispatching logic of the mapreduce service to a new dispatch_to_vnodes function. The moved code was written before tablets were introduced, and although it works with tablets, the variable naming still refers to vnodes (e.g., vnodes_per_addr, vnodes_generator). The motivation for this change is that later in this patch series, a new algorithm for tablets is introduced, and both algorithms need to coexist. Ref. scylladb#21831	2025-06-25 08:42:03 +02:00
Andrzej Jackowski	48aced87f5	mapreduce: remove underscores from variable names This commit removes unnecessary underscores from tr_state_ and dispatcher_ variable names, that were left after moving code to a separate function in the previous commit.	2025-06-25 08:41:21 +02:00
Andrzej Jackowski	d238a2f73e	mapreduce: move req_with_modified_pr handling to a new function The motivation for this change is to enable code reuse when a new implementation of the mapreduce algorithm for tablets is introduced later in this patch series. Ref. scylladb#21831	2025-06-25 08:40:02 +02:00
Aleksandra Martyniuk	0deb9209a0	test: rest_api: fix test_repair_task_progress test_repair_task_progress checks the progress of children of root repair task. However, nothing ensures that the children are already created. Wait until at least one child of a root repair task is created. Fixes: #24556. Closes scylladb/scylladb#24560	2025-06-25 09:08:06 +03:00
Botond Dénes	edc2906892	test/boost/sstable_datafile_test: add test for corrupt data * create a table with random schema * generate data: random mutations + one row with bad key * write data to sstable * check that only good data is written to sstable * check that the bad data was saved to system.corrupt_data	2025-06-25 08:41:29 +03:00
Botond Dénes	592ca789e2	sstables/mx/writer: handle rows with empty keys Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written; consequently, if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mismatch results in parsing failures, as the parser parses part of the row content as a key, resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Use the recently introduced corrupt_data_handler to handle rows with such corrupt keys. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.	2025-06-25 08:41:29 +03:00
Botond Dénes	aae212a87c	test/lib/cql_assertions: introduce columns_assertions To enable targeted and optionally typed assertions against individual columns in a row.	2025-06-25 08:41:29 +03:00
Botond Dénes	ebd9420687	sstables: add corrupt_data_handler to sstables::sstables Similar to how large_data_handler is handled, propagate through sstables::sstables_manager and store its owner: replica::database. Tests and tools are also patched. Mostly mechanical changes, updating constructors and patching callers.	2025-06-25 08:41:26 +03:00
Botond Dénes	46ff7f9c12	tools/scylla-sstable: make large_data_handler a local No reason for it to be a global, not even convenience.	2025-06-25 08:35:19 +03:00
Andrei Chekun	d81e0d0754	test.py: pytest c++ facades should respect saving logs on success BoostFacade and UnitFacade save the logs only when a test fails, ignoring the -s parameter that should allow saving logs on success. This PR adds a check for this parameter. Closes scylladb/scylladb#24596	2025-06-24 20:53:32 +03:00
Botond Dénes	3e1c50e9a7	db: introduce corrupt_data_handler Similar to large_data_handler, this interface allows sstable writers to delegate the handling of corrupt data. Two implementations are provided: * system_table_corrupt_data_handler - saves corrupt data in system.corrupt_data, with a TTL=10days (non-configurable for now) * nop_corrupt_data_handler - drops corrupt data	2025-06-24 14:57:00 +03:00
Botond Dénes	b931145a26	mutation: introduce frozen_mutation_fragment_v2 Mirrors frozen_mutation_fragment and shares most of the underlying serialization code, the only exception is replacing range_tombstone with range_tombstone_change in the mutation fragment variant.	2025-06-24 11:05:31 +03:00
Botond Dénes	64f8500367	mutation/mutation_partition_view: read_{clustering,static}_row(): return row type Instead of mutation_fragment, let caller convert into mutation_fragment. Allows reuse in future callers which will want to convert to mutation_fragment_v2.	2025-06-24 11:05:31 +03:00
Botond Dénes	678deece88	mutation/mutation_partition_view: extract de-ser of {clustering,static} row From the visitor in frozen_mutation_fragment::unfreeze(). We will want to re-use it in the future frozen_mutation_fragment_v2::unfreeze(). Code-movement only, the code is not changed.	2025-06-24 11:05:31 +03:00
Botond Dénes	093d4f8d69	idl-compiler.py: generate skip() definition for enum serializers Currently they only have the declaration and so far they got away with it; it looks like no users exist, but this is about to change, so generate the definition too.	2025-06-24 11:05:31 +03:00
Botond Dénes	b0d5462440	idl: extract full_position.idl from position_in_partition.idl A future user of position_in_partition.idl doesn't need full_position and so doesn't want to include full_position.hh to fix compile errors when including position_in_partition.idl.hh. Extract it to a separate idl file: it has a single user in a storage_proxy VERB.	2025-06-24 11:05:30 +03:00
Botond Dénes	0753643606	db/system_keyspace: add apply_mutation() Allow applying writes in the form of mutations directly to the keyspace. Allows lower-level mutation API to build writes. Advantageous if writes can contain large cells that would otherwise possibly cause large allocation warnings if used via the internal CQL API.	2025-06-24 11:05:30 +03:00
Botond Dénes	92b5fe8983	db/system_keyspace: introduce the corrupt_data table To serve as a place to store corrupt mutation fragments. These fragments cannot be written to sstables, as they would be spread around by compaction and/or repair. They might even make parsing the sstable impossible. So they are stored in this special table instead, kept around to be inspected later and restored if possible.	2025-06-24 11:05:30 +03:00
Abhinav Jha	5ff693eff6	group0: modify `start_operation` logic to account for synchronize phase race condition Currently, the bootstrapping node goes through the synchronize phase after group0 is initialized, then enters the post_raft phase and becomes fully ready for group0 operations. The topology coordinator is unaware of this and issues the stream ranges command as soon as the node successfully completes `join_group0`. For a node booting into an already upgraded cluster, the time the node spends in the synchronize phase is negligible, but this race condition still causes trouble in a small percentage of cases: the stream ranges operation fails and the node fails to bootstrap. This commit addresses the issue by updating the error-throwing logic to account for this edge case, letting the node wait (with timeouts) for the synchronize phase to finish instead of throwing an error. A regression test is also added to confirm that the change works. The test makes the newly joining node wait in the synchronize phase and releases it only after execution reaches the synchronize case in the `start_operation` function. This shows that, in the updated code, `start_operation` waits for the node to finish the synchronize phase instead of throwing an error. This PR fixes a bug, so it needs to be backported. Fixes: scylladb/scylladb#23536 Closes scylladb/scylladb#23829	2025-06-24 10:04:39 +02:00
Botond Dénes	bce89c0f5e	sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path So parse errors on corrupt SSTables don't result in crashes, instead just aborting the read in process. There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This patch tried to focus on those usages which are in the read path. Some places not only used on the read path may have been converted too, where the usage of said method is not clear.	2025-06-24 09:16:28 +03:00
Botond Dénes	27e26ed93f	sstables/exceptions: introduce parse_assert() To replace SCYLLA_ASSERT on the read/parse path. SSTables can get corrupt for various reasons, some outside of the database's control. A bad SSTable should not bring down the database; the parsing should simply be aborted, with as much information printed as possible for the investigation of the nature of the corruption. The newly introduced parse_assert() uses on_internal_error() under the hood, which prints a backtrace and optionally allows for aborting on the error, to generate a coredump.	2025-06-24 09:15:29 +03:00
Jenkins Promoter	b0a7fcf21b	Update pgo profiles - aarch64	2025-06-23 19:20:50 +03:00
Jenkins Promoter	e15e5a6081	Update pgo profiles - x86_64	2025-06-23 19:20:50 +03:00
Marcin Maliszkiewicz	68ead01397	test: add test for live updates of generic server config Affected config: uninitialized_connections_semaphore_cpu_concurrency	2025-06-23 17:56:26 +02:00
Marcin Maliszkiewicz	45392ac29e	utils: don't allow do discard updateable_value observer If the object returned from observe() is destroyed, it stops observing, potentially causing subtle bugs. Typically, the observer object is retained as a class member.	2025-06-23 17:54:01 +02:00
Marcin Maliszkiewicz	c6a25b9140	generic_server: fix connections semaphore config observer When the temporary value returned by observer() is destructed, it disconnects from updateable_value, so the code immediately stops observing. To fix it, we need to retain the observer in the class object.	2025-06-23 17:54:01 +02:00
Patryk Jędrzejczak	6489308ebc	Merge 'Introduce a queue of global topology requests.' from Gleb Natapov Currently only one global topology request (such as truncate, cdc repair, cleanup and alter table) can be pending. If one is already pending others will be rejected with an error. This is not very user friendly, so this series introduces a queue of global requests which allows queuing many global topology requests simultaneously. Fixes: #16822 No need to backport since this is a new feature. Closes scylladb/scylladb#24293 * https://github.com/scylladb/scylladb: topology coordinator: simplify truncate handling in case request queue feature is disable topology coordinator: fix indentation after the previous patch topology coordinator: allow running multiple global commands in parallel topology coordinator: Implement global topology request queue topology coordinator: Do not cancel global requests in cancel_all_requests topology coordinator: store request type for each global command topology request: make it possible to hold global request types in request_type field topology coordinator: move alter table global request parameters into topology_request table topology coordinator: move cleanup global command to report completion through topology_request table topology coordinator: no need to create updates vector explicitly topology coordinator: use topology_request_tracking_mutation_builder::done() instead of open code it topology coordinator: handle error during new_cdc_generation command processing topology coordinator: remove unneeded semicolon topology coordinator: fix indentation after the last commit topology coordinator: move new_cdc_generation topology request to use topology_request table for completion gms/feature_service: add TOPOLOGY_GLOBAL_REQUEST_QUEUE feature flag	2025-06-23 16:08:09 +03:00
Aleksandra Martyniuk	9c3fd2a9df	nodetool: repair: repair only vnode keyspaces nodetool repair command repairs only vnode keyspaces. If a user tries to repair a tablet keyspace, an exception is thrown. Closes scylladb/scylladb#23660	2025-06-23 16:08:09 +03:00
Avi Kivity	52f11e140f	tools: optimized_clang: make it work in the presence of a scylladb profile optimized_clang.sh trains the compiler using profile-guided optimization (pgo). However, while doing that, it builds scylladb using its own profile stored in pgo/profiles and decompressed into build/profile.profdata. Due to the funky directory structure used for training the compiler, that path is invalid during the training and the build fails. The workaround was to build on a cloud machine instead of a workstation - this worked because the cloud machine didn't have git-lfs installed, and therefore did not see the stored profile, and the whole mess was averted. To make this work on a machine that does have access to stored profiles, disable use of the stored profile even if it exists. Fixes #22713 Closes scylladb/scylladb#24571	2025-06-23 16:08:09 +03:00
Botond Dénes	ab96c703ff	mutation: check key of inserted rows Make sure the keys are full prefixes as it is expected to be the case for rows. On several occasions we have seen empty row keys make their way into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations. The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to work with the changes: it has to make the schema use compact storage, otherwise the non-full changes used by this test are rejected by the new checks. Fixes: https://github.com/scylladb/scylladb/issues/24506	2025-06-23 09:38:45 +03:00
Botond Dénes	8b756ea837	compound: optimize is_full() for single-component types For such compounds, unserializing the key is not necessary to determine whether the key is full or not.	2025-06-23 09:38:45 +03:00
Nadav Har'El	85c19d21bb	Merge 'cql, schema: Extend keyspace, table, views, indexes name length limit from 48 to 192 bytes' from Karol Nowacki cql, schema: Extend name length limit from 48 to 192 bytes This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes. The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389) and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint. This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases. The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data. When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID. For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name. The directory name for this log table becomes the longest possible representation. Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas. To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows: 255 bytes (common filesystem limit for a path component) - 32 bytes (for the 32-character UUID string) - 1 byte (for the '-' separator) - 15 bytes (for the '_scylla_cdc_log' suffix) - 15 bytes (reserved for future use) ---------- = 192 bytes (Maximum allowed name length) This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038). 
This patch also updates/adds all associated tests to validate the new 192-byte limit. The documentation has been updated accordingly. Fixes #4480 Backport 2025.2: The significantly shorter maximum table name length in Scylla compared to Cassandra is becoming a more common issue for users in the latest release. Closes scylladb/scylladb#24500 * github.com:scylladb/scylladb: cql, schema: Extend name length limit from 48 to 192 bytes replica: Remove unused keyspace::init_storage()	2025-06-22 17:41:10 +03:00
Avi Kivity	770b91447b	Merge 'memtable: ensure _flushed_memory doesn't grow above total_memory' from Michał Chojnowski `dirty_memory_manager` tracks two quantities about memtable memory usage: "real" and "unspooled" memory usage. "real" is the total memory usage (sum of `occupancy().total_space()`) by all memtable LSA regions, plus an upper-bound estimate of the size of memtable data which has already moved to the cache region but isn't evictable (merged into the cache) yet. "unspooled" is the difference between total memory usage by all memtable LSA regions, and the total flushed memory (sum of `_flushed_memory`) of memtables. `dirty_memory_manager` controls the shares of compaction and/or blocks writes when these quantities cross various thresholds. "Total flushed memory" isn't a well defined notion, since the actual consumption of memory by the same data can vary over time due to LSA compactions, and even the data present in memtable can change over the course of the flush due to removals of outdated MVCC versions. So `_flushed_memory` is merely an approximation computed by `flush_reader` based on the data passing through it. This approximation is supposed to be a conservative lower bound. In particular, `_flushed_memory` should not be greater than `occupancy().total_space()`. Otherwise, for example, "unspooled" memory could become negative (and/or wrap around) and weird things could happen. There is an assertion in `~flush_memory_accounter` which checks that `_flushed_memory < occupancy().total_space()` at the end of flush. But it can fail. Without additional treatment, the memtable reader sometimes emits data which is already deleted. (In particular, it emits rows covered by a partition tombstone in a newer MVCC version.) This data is seen by `flush_reader` and accounted in `_flushed_memory`. But this data can be garbage-collected by the `mutation_cleaner` later during the flush and decrease `total_memory` below `_flushed_memory`. 
There is a piece of code in `mutation_cleaner` intended to prevent that. If `total_memory` decreases during a `mutation_cleaner` run, `_flushed_memory` is lowered by the same amount, just to preserve the asserted property. (This could also make `_flushed_memory` quite inaccurate, but that's considered acceptable). But that only works if `total_memory` is decreased during that run. It doesn't work if the `total_memory` decrease (enabled by the new allocator holes made by `mutation_cleaner`'s garbage collection work) happens asynchronously (due to memory reclaim for whatever reason) after the run. This patch fixes that by tracking the decreases of `total_memory` closer to the source. Instead of relying on `mutation_cleaner` to notify the memtable if it lowers `total_memory`, the memtable itself listens for notifications about LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's estimate of flushed memory decreased by the change in `total_memory` since the beginning of flush (if it was positive), and it keeps the amount of "spooled" memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`. Fixes scylladb/scylladb#21413 Backport candidate because it fixes a crash that can happen in existing stable branches. Closes scylladb/scylladb#21638 * github.com:scylladb/scylladb: memtable: ensure _flushed_memory doesn't grow above total memory usage replica/memtable: move region_listener handlers from dirty_memory_manager to memtable	2025-06-22 11:19:25 +03:00
Michał Chojnowski	975e7e405a	memtable: ensure _flushed_memory doesn't grow above total memory usage dirty_memory_manager tracks two quantities about memtable memory usage: "real" and "unspooled" memory usage. "real" is the total memory usage (sum of `occupancy().total_space()`) by all memtable LSA regions, plus an upper-bound estimate of the size of memtable data which has already moved to the cache region but isn't evictable (merged into the cache) yet. "unspooled" is the difference between total memory usage by all memtable LSA regions, and the total flushed memory (sum of `_flushed_memory`) of memtables. dirty_memory_manager controls the shares of compaction and/or blocks writes when these quantities cross various thresholds. "Total flushed memory" isn't a well defined notion, since the actual consumption of memory by the same data can vary over time due to LSA compactions, and even the data present in memtable can change over the course of the flush due to removals of outdated MVCC versions. So `_flushed_memory` is merely an approximation computed by `flush_reader` based on the data passing through it. This approximation is supposed to be a conservative lower bound. In particular, `_flushed_memory` should not be greater than `occupancy().total_space()`. Otherwise, for example, "unspooled" memory could become negative (and/or wrap around) and weird things could happen. There is an assertion in ~flush_memory_accounter which checks that `_flushed_memory < occupancy().total_space()` at the end of flush. But it can fail. Without additional treatment, the memtable reader sometimes emits data which is already deleted. (In particular, it emits rows covered by a partition tombstone in a newer MVCC version.) This data is seen by `flush_reader` and accounted in `_flushed_memory`. But this data can be garbage-collected by the mutation_cleaner later during the flush and decrease `total_memory` below `_flushed_memory`. 
There is a piece of code in mutation_cleaner intended to prevent that. If `total_memory` decreases during a `mutation_cleaner` run, `_flushed_memory` is lowered by the same amount, just to preserve the asserted property. (This could also make `_flushed_memory` quite inaccurate, but that's considered acceptable). But that only works if `total_memory` is decreased during that run. It doesn't work if the `total_memory` decrease (enabled by the new allocator holes made by `mutation_cleaner`'s garbage collection work) happens asynchronously (due to memory reclaim for whatever reason) after the run. This patch fixes that by tracking the decreases of `total_memory` closer to the source. Instead of relying on `mutation_cleaner` to notify the memtable if it lowers `total_memory`, the memtable itself listens for notifications about LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's estimate of flushed memory decreased by the change in `total_memory` since the beginning of flush (if it was positive), and it keeps the amount of "spooled" memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.	2025-06-20 11:42:30 +02:00
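The accounting rule described in the commit above can be modelled in a few lines. This is only an illustrative Python sketch, not ScyllaDB's C++ code: the function and parameter names are hypothetical, and it assumes that "the change in `total_memory` (if it was positive)" means the amount by which `total_memory` has shrunk since the flush began.

```python
def spooled_memory(reader_estimate: int,
                   total_at_flush_start: int,
                   total_now: int) -> int:
    """Return the "spooled" amount to report for one memtable.

    All names are hypothetical; the arithmetic follows the commit text.
    """
    # Only a shrink of total_memory is subtracted (assumption, see lead-in).
    shrink = max(0, total_at_flush_start - total_now)
    # The reader's estimate, corrected for memory freed during the flush.
    flushed_memory = reader_estimate - shrink
    # Never report a negative amount to the manager.
    return max(0, flushed_memory)


if __name__ == "__main__":
    print(spooled_memory(100, 1000, 950))
```

With this clamping, the reported value cannot exceed the reader's estimate and cannot go negative, which is the property the commit message says the old approach failed to guarantee when memory was reclaimed asynchronously.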
Michał Chojnowski	7d551f99be	replica/memtable: move region_listener handlers from dirty_memory_manager to memtable The memtable wants to listen for changes in its `total_memory` in order to decrease its `_flushed_memory` in case some of the freed memory has already been accounted as flushed. (This can happen because the flush reader sees and accounts even outdated MVCC versions, which can be deleted and freed during the flush). Today, the memtable doesn't listen to those changes directly. Instead, some calls which can affect `total_memory` (in particular, the mutation cleaner) manually check the value of `total_memory` before and after they run, and they pass the difference to the memtable. But that's not good enough, because `total_memory` can also change outside of those manually-checked calls -- for example, during LSA compaction, which can occur anytime. This makes memtable's accounting inaccurate and can lead to unexpected states. But we already have an interface for listening to `total_memory` changes actively, and `dirty_memory_manager`, which also needs to know it, does just that. So what happens e.g. when `mutation_cleaner` runs is that `mutation_cleaner` checks the value of `total_memory` before it runs, then it runs, causing several changes to `total_memory` which are picked up by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of `total_memory` and passes the difference to `memtable`, which corrects whatever was observed by `dirty_memory_manager`. To allow memtable to modify its `_flushed_memory` correctly, we need to make `memtable` itself a `region_listener`. Also, instead of the situation where `dirty_memory_manager` receives `total_memory` change notifications from `logalloc` directly, and `memtable` fixes the manager's state later, we want only the memtable to listen for the notifications and pass them, already adjusted accordingly, to the manager, so that there are no intermediate wrong states. 
This patch moves the `region_listener` callbacks from the `dirty_memory_manager` to the `memtable`. It's not intended to be a functional change, just a source code refactoring. The next patch will be a functional change enabled by this.	2025-06-20 11:42:30 +02:00
Łukasz Paszkowski	a9a53d9178	compaction_manager: cancel submission timer on drain The `drain` method cancels all running compactions and moves the compaction manager into the disabled state. To move it back to the enabled state, the `enable` method shall be called. This, however, throws an assertion error, as the submission timer is not cancelled and re-enabling the manager tries to arm the already armed timer. Thus, cancel the timer when calling the drain method to disable the compaction manager. Fixes https://github.com/scylladb/scylladb/issues/24504 All versions are affected. So it's a good candidate for a backport. Closes scylladb/scylladb#24505	2025-06-20 11:33:49 +03:00
Nadav Har'El	70f5a6a4d6	test/cqlpy: fix run-cassandra script to ignore CASSANDRA_HOME As test/cqlpy/README.md explains, the way to tell the run-cassandra script which version of Cassandra should be run is through the "CASSANDRA" variable, for example: CASSANDRA=$HOME/apache-cassandra-4.1.6/bin/cassandra \ test/cqlpy/run-cassandra test_file.py::test_function But all the Cassandra scripts, of all versions, have one strange feature: If you set CASSANDRA_HOME, then instead of running the actual Cassandra script you tried to run (in this case, 4.1.6), the Cassandra script goes to run the other Cassandra from CASSANDRA_HOME! This means that if a user happens to have, for some reason, set CASSANDRA_HOME, then the documented "CASSANDRA" variable doesn't work. The simple fix is to clear CASSANDRA_HOME in the environment that run-cassandra passes to Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24546	2025-06-20 11:31:02 +03:00
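The fix described in the commit above amounts to dropping one variable from the environment handed to the child process. The following is a minimal sketch of that idea, not the actual run-cassandra code; the helper name `cassandra_env` is hypothetical.

```python
import os


def cassandra_env(base_env=None):
    """Return a copy of the environment without CASSANDRA_HOME.

    `base_env` defaults to the current process environment; the helper
    name and signature are hypothetical, for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Cassandra's launcher scripts prefer CASSANDRA_HOME over their own
    # location, so remove it to make the "CASSANDRA" variable effective.
    env.pop("CASSANDRA_HOME", None)
    return env


if __name__ == "__main__":
    cleaned = cassandra_env({"CASSANDRA_HOME": "/opt/other", "PATH": "/bin"})
    print(sorted(cleaned))
```

The returned mapping can then be passed as the `env` argument of `subprocess.Popen` or `subprocess.run` when launching the Cassandra script.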
Anna Stuchlik	17eabbe712	doc: improve the tablets limitations section This PR improves the Limitations and Unsupported Features section for tablets, as it has been confusing to the customers. Refs https://github.com/scylladb/scylla-enterprise/issues/5465 Fixes https://github.com/scylladb/scylladb/issues/24562 Closes scylladb/scylladb#24563	2025-06-20 11:28:38 +03:00
Gleb Natapov	e364995e28	api: return error from get_host_id_map if gossiper is not enabled yet. Token metadata api is initialized before gossiper is started. get_host_id_map REST endpoint cannot function without the fully initialized gossiper though. The gossiper is started deep in the join_cluster call chain, but if we move token_metadata api initialization after the call it means that no api will be available during bootstrap. This is not what we want. Make a simple fix by returning an error from the api if the gossiper is not initialized yet. Fixes: #24479 Closes scylladb/scylladb#24575	2025-06-20 11:27:28 +03:00
Andrei Chekun	392a7fc171	test.py: Fix the boost output file name The file name for the boost test does not use run_id, so each subsequent run overwrites the logs from the previous one. If the first repeat fails and the second passes, the failed log is overwritten. This PR allows saving the failed one. Closes scylladb/scylladb#24580	2025-06-20 11:26:16 +03:00
Asias He	c5a136c3b5	storage_service: Use utils::chunked_vector to avoid big allocation The following was seen: ``` !WARNING \| scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at [Backtrace #0] void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99 seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848 seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911 operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706 std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> 
>::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596 locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80 seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635 std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684 ``` Fix by using chunked_vector. Fixes #24158 Closes scylladb/scylladb#24561	2025-06-19 16:51:01 +03:00
Andrei Chekun	fcc2ad8ff5	test.py: Fix test result are overwritten Currently, CI uses several nodes to execute the different modes to reduce the overall execution time. While copying the results from the nodes to the main job, test reports are overwritten, since they use the same directory and the same name. This patch makes it possible to distinguish these results so that they are not overwritten. Closes scylladb/scylladb#24559	2025-06-19 16:51:01 +03:00
Pavel Emelyanov	dc166be663	s3: Mark claimed_buffer constructor noexcept It just std::move-s a buffer and a semaphore_units object; both moves are noexcept, and so is the constructor itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24552	2025-06-18 20:36:45 +03:00
Avi Kivity	c89ab90554	Merge 'main: don't start maintenance auth service if not enabled' from Marcin Maliszkiewicz In `f96d30c2b5` we introduced the maintenance service, which is an additional instance of auth::service. But this service has a somewhat confusing 2-level startup mechanism: it's initialized with sharded<Service>::start and then auth::service::start (different method with the same name to confuse even more). When maintenance_socket was disabled (default setting), the code did only the first part of the startup. This registered a config observer but didn't create a permission_cache instance. As a result, a crash on SIGHUP when config is reloaded can occur. Fixes: https://github.com/scylladb/scylladb/issues/24528 Backport: all not eol versions since 6.0 and 2025.1 Closes scylladb/scylladb#24527 * github.com:scylladb/scylladb: test: add test for live updates of permissions cache config main: don't start maintenance auth service if not enabled	2025-06-18 20:28:53 +03:00
Karol Nowacki	4577c66a04	cql, schema: Extend name length limit from 48 to 192 bytes This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes. The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389) and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint. This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases. The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data. When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID. For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name. The directory name for this log table becomes the longest possible representation. Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas. To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows: 255 bytes (common filesystem limit for a path component) - 32 bytes (for the 32-character UUID string) - 1 byte (for the '-' separator) - 15 bytes (for the '_scylla_cdc_log' suffix) - 15 bytes (reserved for future use) ---------- = 192 bytes (Maximum allowed name length) This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038). This patch also updates/adds all associated tests to validate the new 192-byte limit. The documentation has been updated accordingly.	2025-06-18 14:08:38 +02:00
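The arithmetic in the commit above can be checked with a short sketch. The constant names and the `name_fits` helper are hypothetical and are not taken from the ScyllaDB source; only the numbers come from the commit message.

```python
# Values quoted in the commit message; names are illustrative only.
FILESYSTEM_NAME_MAX = 255                      # common limit for a path component
UUID_LEN = 32                                  # 32-character UUID string
SEPARATOR_LEN = len("-")                       # dash between table name and UUID
CDC_LOG_SUFFIX_LEN = len("_scylla_cdc_log")    # 15 bytes
RESERVED_LEN = 15                              # reserved for future use

MAX_NAME_LEN = (FILESYSTEM_NAME_MAX - UUID_LEN - SEPARATOR_LEN
                - CDC_LOG_SUFFIX_LEN - RESERVED_LEN)


def name_fits(name: str) -> bool:
    """Check a name against the limit, measured in bytes (UTF-8 assumed)."""
    return len(name.encode("utf-8")) <= MAX_NAME_LEN


if __name__ == "__main__":
    print(MAX_NAME_LEN)
```

Measuring the encoded length rather than the character count is an assumption made here because the commit states the limit in bytes.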
Karol Nowacki	a41c12cd85	replica: Remove unused keyspace::init_storage() This function was declared but had no implementation or callers. It is being removed as minor code cleanup.	2025-06-18 14:08:38 +02:00
Petr Gusev	45f5efb9ba	paxos_state: read repair for intranode_migration A replica is not marked as 'pending' during intranode_migration. The sp::get_paxos_participants returns the same set of endpoints as before or after migration. No 'double quorum' means the replica should behave as a single paxos acceptor. This is done by making sure that the state on both shards is the same when reading and repairing it before continuing if it is not.	2025-06-18 12:11:32 +02:00
Petr Gusev	583fb0e402	paxos_state: fix get_replica_lock for intranode_migration Suppose a replica gets two requests at roughly the same time for the same key. The requests are coming from two different LWT coordinators, one is holding tablet_transition_stage::streaming erm, another - tablet_transition_stage::write_both_read_new erm. The read shard is different for these requests, so they don't wait for each other in get_replica_lock. The first request reads the state, the second request does the whole RMW for paxos state and responds to its coordinator, then the first request blindly overwrites the state -- the effects of the second request are lost. In this commit we fix this problem by taking the lock on both shards, starting from the smaller shard ID to the larger one, to avoid deadlocks.	2025-06-18 12:11:32 +02:00
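The deadlock-avoidance rule in the commit above (always lock the smaller shard ID first) is the classic ordered-locking technique. Below is a generic Python illustration using threads; it is not the Seastar-based implementation, and `shard_locks`, `lock_order` and `with_both_shards` are hypothetical names.

```python
import threading
from contextlib import ExitStack

# One lock per shard; three shards are assumed purely for the example.
shard_locks = {shard: threading.Lock() for shard in range(3)}


def lock_order(shard_a: int, shard_b: int) -> list:
    """Return the distinct shard IDs in ascending order."""
    return sorted({shard_a, shard_b})


def with_both_shards(shard_a: int, shard_b: int, fn):
    """Run fn() while holding the locks of both shards.

    Every caller acquires the locks in ascending shard order, so two
    callers can never each hold one lock while waiting for the other.
    """
    with ExitStack() as stack:
        for shard in lock_order(shard_a, shard_b):
            stack.enter_context(shard_locks[shard])
        return fn()


if __name__ == "__main__":
    print(with_both_shards(2, 0, lambda: "state updated"))
```

Deduplicating the shard IDs covers the case where both requests map to the same shard, in which only one lock is taken.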
Marcin Maliszkiewicz	dd01852341	test: add test for live updates of permissions cache config	2025-06-18 11:27:08 +02:00
Marcin Maliszkiewicz	97c60b8153	main: don't start maintenance auth service if not enabled In `f96d30c2b5` we introduced the maintenance service, which is an additional instance of auth::service. But this service has a somewhat confusing 2-level startup mechanism: it's initialized with sharded<Service>::start and then auth::service::start (different method with the same name to confuse even more). When maintenance_socket was disabled (default setting), the code did only the first part of the startup. This registered a config observer but didn't create a permission_cache instance. As a result, a crash on SIGHUP when config is reloaded can occur.	2025-06-18 11:27:08 +02:00
Botond Dénes	da1a3dd640	Merge 'test: introduce upgrade tests to test.py, add a SSTable dict compression upgrade test' from Michał Chojnowski This PR adds an upgrade test for SSTable compression with shared dictionaries, and adds some bits to pylib and test.py to support that. In the series, we: 1. Mount `$XDG_CACHE_DIR` into dbuild. 2. Add a pylib function which downloads and installs a released ScyllaDB package into a subdirectory of `$XDG_CACHE_DIR/scylladb/test.py`, and returns the path to `bin/scylla`. 3. Add new methods and params to the cluster manager, which let the test start nodes with historical Scylla executables, and switch executables during the test. 4. Add a test which uses the above to run an upgrade test between the released package and the current build. 5. Add `--run-internet-dependent-tests` to `test.py` which lets the user of `test.py` skip this test (and potentially other internet-dependent tests in the future). (The patch modifying `wait_for_cql_and_get_hosts` is a part of the new test — the new test needs it to test how particular nodes in a mixed-version cluster react to some CQL queries.) This is a follow-up to #23025, split into a separate PR because the potential addition of upgrade tests to `test.py` deserved a separate thread. Needs backport to 2025.2, because that's where the tested feature is introduced. Fixes #24110 Closes scylladb/scylladb#23538 * github.com:scylladb/scylladb: test: add test_sstable_compression_dictionaries_upgrade.py test.py: add --run-internet-dependent-tests pylib/manager_client: add server_switch_executable test/pylib: in add_server, give a way to specify the executable and version-specific config pylib: pass scylla_env environment variables to the topology suite test/pylib: add get_scylla_2025_1_executable() pylib/scylla_cluster: give a way to pass executable-specific options to nodes dbuild: mount "$XDG_CACHE_HOME/scylladb"	2025-06-18 12:21:21 +03:00
Benny Halevy	7c867b308f	feature_service: never disable UUID_SSTABLE_IDENTIFIERS The config option is unused since `6da758d74c` Refs #10459 Refs #20337 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	ecc7272a07	test: sstable_move_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	49ca442e7c	test: sstable_directory_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	15bee9f232	sstables: sstable_generation_generator: set last_generation=0 by default Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	079c5fe5e3	test: database_test: test_distributed_loader_with_pending_delete: use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	f0f7c83705	test: lib: test_env: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	0310a03de6	test: sstable_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	b00b805da6	test: sstable_resharding_test::sstable_resharding_over_s3_test: use default use_uuid in config Which is `true` by default anyhow. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	f644c5896f	test: sstable_datafile_test: compound_sstable_set_basic_test: use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Benny Halevy	bfa0bb78f9	test: sstable_compaction_test: always use uuid sstable generation Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Avi Kivity	2177ec8dc1	gdb: adjust unordered container accessors for libstdc++15 In libstdc++15, the internal structure of an unordered container hashtable node changed from _M_storage._M_storage.__data to just _M_storage._M_storage (though the layout is the same). Adjust the code to work with both variants. Closes scylladb/scylladb#24549	2025-06-18 09:15:03 +03:00
Michał Chojnowski	27f66fb110	test/boost/mutation_reader_test: fix a use-after-free in `test_fast_forwarding_combined_reader_is_consistent_with_slicing` The contract in mutation_reader.hh says: ``` // pr needs to be valid until the reader is destroyed or fast_forward_to() // is called again. future<> fast_forward_to(const dht::partition_range& pr) { ``` `test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates this by passing a temporary to `fast_forward_to`. Fix that. Fixes scylladb/scylladb#24542 Closes scylladb/scylladb#24543	2025-06-17 19:30:50 +03:00
Anna Stuchlik	648d8caf27	doc: add support for z3 GCP This commit adds support for z3-highmem-highlssd instance types to Cloud Instance Recommendations for GCP. Fixes https://github.com/scylladb/scylladb/issues/24511 Closes scylladb/scylladb#24533	2025-06-17 13:50:46 +03:00
Robert Bindar	1dd37ba47a	Add dev documentation for manipulating s3 data manually This patch intends to give an overview of where, when and how we store data in S3 and provide a quick set of commands which help gain local access to the data in case there is a need for manual intervention. The patch also collects in the same place links/descriptions for all formats we use in S3. Fixes #22438 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#24323	2025-06-17 13:21:30 +03:00
Pavel Emelyanov	b0766d1e73	Merge 's3_client: Refactor `range` class for state validation' from Ernest Zaslavsky Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3. This should address and prevent future problems related to this issue https://github.com/minio/minio/issues/21333 No backport needed since this problem related only to this change https://github.com/scylladb/scylladb/pull/23880 Closes scylladb/scylladb#24312 * github.com:scylladb/scylladb: s3_client: headers cleanup s3_client: Refactor `range` class for state validation	2025-06-17 10:34:55 +03:00
Ernest Zaslavsky	e398576795	s3_client: Fix hang in get() on EOF by signaling condition variable * Ensure _get_cv.signal() is called when an empty buffer received * Prevents `get()` from stalling indefinitely while waiting on EOF * Found when testing https://github.com/scylladb/scylladb/pull/23695 Closes scylladb/scylladb#24490	2025-06-17 10:33:19 +03:00
Calle Wilund	4a98c258f6	http: Add missing thread_local specifier for static Refs #24447 Patch adding this somehow managed to leave out the thread_local specifier. While gnutls cert object can be shared across shards just fine, the actual shared_ptr here cannot, thus we could cause memory errors. Closes scylladb/scylladb#24514	2025-06-17 10:23:52 +03:00
Avi Kivity	cd79a8fc25	Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz" This reverts commit `0b516da95b`, reversing changes made to `30199552ac`. It breaks cluster.random_failures.test_random_failures.test_random_failures in debug mode (at least). Fixes #24513	2025-06-16 22:38:12 +03:00
Ernest Zaslavsky	1b20e0be4a	s3_client: headers cleanup	2025-06-16 16:02:30 +03:00
Ernest Zaslavsky	9ad7a456fe	s3_client: Refactor `range` class for state validation Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3.	2025-06-16 16:02:24 +03:00
Pavel Emelyanov	5c2e5890a6	Merge 'test.py: Integrate pytest c++ test execution to test.py' from Andrei Chekun With current changes, pytest executes boost tests. Gathering metrics added to the pytest BoostFacade and UnitFacade to have the possibility to get them for C++ test as previously. Since boost, raft, unit, and ldap directories aren't executed by test.py, suite.yaml files are renamed to test_config.yaml to preserve the old way of test configuration and removing them from execution by test.py Pytest executes all modes by itself, JUnit report for the C++ test will be one for the run. That means that there is no possibility to output them in testlog in different folders. So testlog/report directory is used to store all kinds of reports generated during tests. JUnit reports should be testlog/report/junit, Allure reports should be in testlog/report/allure. Breaking changes: 1. Terminal output changed. test.py will run pytest for the next directories: `test/boost`, `test/ldap`, `test/raft`, `test/unit`. `test.py` will blindly translate the output of the pytest to the terminal. Then when all these tests are finished, `test.py` will continue to show previous output for the rest of the test. 2. The format of execution of C++ test directories mentioned above has been changed. Now it will be a simple path to the file with extension. For example, instead of `boost/aggregate_fcts_test` now you need to use `test/boost/aggregate_fcts_test.cc` 3. This PR creates a spike in test amount. The previous logic was to consolidate the boost results from different runs and different modes to one report. So for the three repeats and three modes (nine test results) in CI was shown one result. Now it shows nine results, with differentiating them by mode and run. Note: Pytest uses pytest-xdist module to run tests in parallel. The Frozen toolchain has this dependency installed, for the local use, please install it manually. Changes for CI https://github.com/scylladb/scylla-pkg/pull/4949. 
It will be merged once the current PR is in master. A short disruption is expected until the PR in scylla-pkg is merged. Fixes: https://github.com/scylladb/qa-tasks/issues/1777 Closes scylladb/scylladb#22894 * github.com:scylladb/scylladb: test.py: clean code that isn't used anymore test.py: switch off C++ tests from test.py discovery test.py: Integrate pytest c++ test execution to test.py	2025-06-16 16:01:37 +03:00
Pavel Emelyanov	0b6532a895	api: Shorten get_simple_states() handler The one collects map<ip, state> then converts it to a jsonable vector of helper objects with key and value members. This patch removes the intermediate map and creates the vector instantly. With that change the handler makes less data manipulations and behaves like the get_all_endpoint_states one. Very similar change was done in `12420dc644` with get_host_to_id_map handler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24456	2025-06-16 15:21:27 +03:00
Tomasz Grabiec	cdb1499898	Merge 'interval: reduce memory footprint' from Avi Kivity The interval class's memory footprint isn't important for single objects, but intervals are frequently held in moderately sized collections. In #3335 this caused a stall. Therefore reducing interval's memory footprint and reduce allocation pressure. This series does this by consolidating badly-padded booleans in the object tree spanned by interval into 5 booleans that are consecutive in memory. This reduces the space required by these booleans from 40 bytes to 8 bytes. perf-simple-query report (with refresh-pgo-profiles.sh for each measurement): before: 252127.60 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37128 insns/op, 18147 cycles/op, 0 errors) INFO 2025-06-07 21:00:34,010 [shard 0:main] group0_tombstone_gc_handler - Setting reconcile time to 1749319231 (min id=4dbed2f4-43c9-11f0-cbc6-87d1a08b4ca4) 246492.37 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37153 insns/op, 18411 cycles/op, 0 errors) 253633.11 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37127 insns/op, 17941 cycles/op, 0 errors) 254029.93 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37155 insns/op, 17951 cycles/op, 0 errors) 254465.76 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37123 insns/op, 17906 cycles/op, 0 errors) throughput: mean= 252149.75 standard-deviation=3282.75 median= 253633.11 median-absolute-deviation=1880.17 maximum=254465.76 minimum=246492.37 instructions_per_op: mean= 37137.24 standard-deviation=15.71 median= 37127.54 median-absolute-deviation=14.45 maximum=37155.24 minimum=37122.79 cpu_cycles_per_op: mean= 18071.19 standard-deviation=212.25 median= 17950.62 median-absolute-deviation=130.10 maximum=18411.50 minimum=17906.13 after: 252561.26 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37039 insns/op, 18075 cycles/op, 0 errors) 256876.44 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37022 insns/op, 17785 cycles/op, 0 errors) 
257084.38 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37030 insns/op, 17840 cycles/op, 0 errors) 257305.35 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37042 insns/op, 17804 cycles/op, 0 errors) 258088.53 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37028 insns/op, 17778 cycles/op, 0 errors) throughput: mean= 256383.19 standard-deviation=2185.22 median= 257084.38 median-absolute-deviation=922.16 maximum=258088.53 minimum=252561.26 instructions_per_op: mean= 37032.17 standard-deviation=8.06 median= 37030.46 median-absolute-deviation=6.44 maximum=37041.83 minimum=37021.93 cpu_cycles_per_op: mean= 17856.60 standard-deviation=124.70 median= 17804.16 median-absolute-deviation=71.24 maximum=18075.50 minimum=17777.95 A small improvement is observed in instructions_per_op. It could be random fluctuations in the compiler performance, or maybe the default constructor/destructor of interval are meaningful even in this simple test. Small performance improvement, so not a backport candidate. Closes scylladb/scylladb#24232 * github.com:scylladb/scylladb: interval: reduce sizeof interval: change start()/end() not to return references to data members interval: rename start_ref() back to start() (and end_ref() etc). interval: rename start() to start_ref() (and end() etc). test: wrapping_interval_test: add more tests for intervals	2025-06-16 09:23:56 +02:00
Botond Dénes	898ce98500	db/batchlog_manager: remove unused member _total_batches_replayed And its getter. There are no users for either. Closes scylladb/scylladb#24416	2025-06-16 09:37:00 +03:00
Nadav Har'El	847d9c0911	alternator: update documentation that ttl with tablets does work Our documentation docs/alternator/new-apis.md claims that Alternator TTL does not work with tablets, due to issue #16567. However, we fixed that issue in commit `de96c28625`. So let's drop the outdated statement that it doesn't work. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24427	2025-06-16 09:36:11 +03:00
Ernest Zaslavsky	2b300c8eb9	s3_client: Improve reporting of S3 client statistics Revise how we report statistics for `chunked_download_source`. Ensure metrics for downloaded but unconsumed data are visible, as they do not contribute to read amplification, which is tracked separately. Closes scylladb/scylladb#24491	2025-06-16 09:33:57 +03:00
Pavel Emelyanov	9aaa33c15a	Merge 'main.cc: fix group0 shutdown order' from Petr Gusev Applier fiber needs local storage, so before shutting down local storage we need to make sure that group0 is stopped. We also improve the logs for the case when `gate_closed_exception` is thrown while a mutation is being written. Fixes [scylladb/scylladb#24401](https://github.com/scylladb/scylladb/issues/24401) Backport: no backport -- not safe and the problem is minor. Closes scylladb/scylladb#24418 * github.com:scylladb/scylladb: storage_service: test_group0_apply_while_node_is_being_shutdown main.cc: fix group0 shutdown order storage_proxy: log gate_closed_exception	2025-06-16 09:32:34 +03:00
Amnon Heiman	55b21b01ee	alternator/stats.cc, metrics-config.yml: docs fix per-table metrics This patch updates alternator/stats.cc and the get_description.py configuration (metrics-config.yml) to restore compatibility with per-table alternator metrics in the documentation generation process. Previously, the group name for metrics was selected using an inline expression like (has_table)? "alternator_table" : "alternator", which made it difficult to maintain a straightforward mapping in the configuration file. With this change, the group name is now assigned to a variable in alternator/stats.cc, allowing metrics-config.yml to map group names directly. This makes the configuration easier to maintain and enables get_description.py to document both global and per-table metrics correctly. This is a minimal, targeted fix to get the documentation working again with the new per-table metrics format. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes scylladb/scylladb#24509	2025-06-15 18:06:36 +03:00
Jenkins Promoter	1b5eee6a12	Update pgo profiles - aarch64	2025-06-15 04:57:59 +03:00
Jenkins Promoter	e0c2d591c7	Update pgo profiles - x86_64	2025-06-15 04:44:13 +03:00
Avi Kivity	42d7ae1082	interval: reduce sizeof An interval object stores five booleans: start()->is_inclusive(), a boolean since start() itself is an std::optional, two more for end(), and is_singular(). Due to bad packing, these five booleans occupy 8 bytes each, for a total of 40 bytes. Re-pack the interval class by storing those booleans explicitly close by. Since we lose std::optional's ability to store a maybe-constructed object, we re-implement it using anonymous unions and therefore have to implement the 5 special methods. This helps save space when vectors of intervals are used, as seen in #3335 for example.	2025-06-14 21:29:43 +03:00
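The padding effect described in the commit above can be illustrated outside C++. The following is a minimal sketch using Python's `ctypes`, not the actual `interval` layout: the structure names and the choice of five 8-byte values are assumptions made purely to show how scattered flags pay padding that consecutive flags do not.

```python
import ctypes


class BoundWithFlag(ctypes.Structure):
    # Hypothetical layout: one flag stored next to an 8-byte value.
    # The flag is padded out to the alignment of the value after it.
    _fields_ = [("flag", ctypes.c_bool), ("value", ctypes.c_uint64)]


class ScatteredFlags(ctypes.Structure):
    # Five flag/value pairs, each paying its own padding.
    _fields_ = [(f"pair{i}", BoundWithFlag) for i in range(5)]


class ConsecutiveFlags(ctypes.Structure):
    # The same five values, with the five flags stored back to back.
    _fields_ = [(f"value{i}", ctypes.c_uint64) for i in range(5)] + [
        (f"flag{i}", ctypes.c_bool) for i in range(5)
    ]


scattered_size = ctypes.sizeof(ScatteredFlags)
consecutive_size = ctypes.sizeof(ConsecutiveFlags)
print(f"scattered: {scattered_size} bytes, consecutive: {consecutive_size} bytes")
```

The exact sizes depend on the platform's alignment rules, so the sketch prints them rather than assuming particular numbers.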
Avi Kivity	f3dccc2215	interval: change start()/end() not to return references to data members We'd like to change the data layout of `interval` to save space. As a result, start() and end() which return references to data members must return objects (not references). Since we'd like to maintain zero-copy for these functions, we change them to return objects containing references (rather than references to objects), avoiding copying of potentially expensive objects. We repurpose the interval_bound class to hold references (by instantiating it with `const T&` instead of `T`) and provide converting constructors. To make transform_bounds() retain zero-copy, we add start() and end() that take *this by rvalue reference.	2025-06-14 21:26:17 +03:00
Avi Kivity	16fb68bb5e	interval: rename start_ref() back to start() (and end_ref() etc). To reduce noise, rename start_ref() back to its original name start(), after it was changed in the previous patch to force an audit of all calls.	2025-06-14 21:26:16 +03:00
Avi Kivity	3363bc41e2	interval: rename start() to start_ref() (and end() etc). We are about to change start() to return a proxy object rather than a `const interval_bound<T>&`. This is generally transparent, except in one case: `auto x = i.start()`. With the current implementation, we'll copy object referred to and assign it to x. With the planned implementation, the proxy object will be assigned to `x`, but it will keep referring to `i`. To prevent such problems, rename start() to start_ref() and end() to end_ref(). This forces us to audit all calls, and redirect calls that will break to new start_copy() and end_copy() methods.	2025-06-14 21:26:16 +03:00
Avi Kivity	674118fd2e	test: wrapping_interval_test: add more tests for intervals In this series, we will make interval manage its memory directly, specifically it will directly construct and destroy T values that it contains rather than let std::optional<T> manage those values itself. Add tests that expose bugs encountered during development (actually, review) of this series. The tests pass before the series, fail with series as it was before fixing, and pass with the series as it is now. The tests use a class maybe_throwing_interval_payload that can be set to throw at strategic locations and exercise all the interesting interval shapes.	2025-06-14 21:26:14 +03:00
Patryk Jędrzejczak	c4cf95aeb3	Merge 'raft: simplify voter handler code to not pass node references around' from Emil Maskovsky Refactor the voter handler logic to only pass around node IDs (`raft::server_id`), instead of pairs of IDs and node descriptor references. Node descriptors can always be efficiently retrieved from the original nodes map, which remains valid throughout the calculation. This change reduces unnecessary reference passing and simplifies the code. All node detail lookups are now performed via the central nodes map as needed. Additional cleanup has been done: * removing redundant comments (that just repeat what the code does) * use explicit comparators for the datacenter and rack information priorities (instead of the comparison operator) to be more explicit about the prioritization Fixes: scylladb/scylladb#24035 No backport: This change does not fix any bug and doesn't change the behavior, just cleans up the code in master, therefore no backport is needed. Closes scylladb/scylladb#24452 * https://github.com/scylladb/scylladb: raft: simplify voter handler code to not pass node references around raft: reformat voter handler for consistent indentation raft: use explicit priority comparators for datacenters and racks raft: clean up voter handler by removing redundant comments	2025-06-13 19:02:07 +02:00
Anna Stuchlik	e2b7302183	doc: extend 2025.2 upgrade with a note about consistent topology updates This commit adds a note that the user should enable consistent topology updates before upgrading to 2025.2 if they didn't do it (for some reason) when previously upgrading to version 2025.1. Fixes https://github.com/scylladb/scylladb/issues/24467 Closes scylladb/scylladb#24468	2025-06-13 13:54:59 +03:00
Piotr Dulikowski	238fc24800	Merge 'test: dtest: move audit_test.py to test.py' from Andrzej Jackowski Copied the entire audit_test.py from scylladb/scylla-dtest, to remove the entire file from scylla-dtest after this patch series is merged. The motivation is to move all audit testing out of dtests, to make it easier to maintain and more reliable. After audit_test.py was moved from dtests to test.py, some issues that require fixing arose due to differences between the frameworks. No backport, moving audit_test.py to test.py is a new testing effort. Closes scylladb/scylladb#24231 * github.com:scylladb/scylladb: test: audit: filter out LOGIN and USE audit logs test: audit: remove require mark test: audit: wait until raft state is applied in test_permissions test: audit: fix problems in audit_test.py test: dtest: add dict support to populate in scylla_cluster.py test: dtest: copied get_node_ip from dtests to scylla_cluster.py test: dtest: copy run_rest_api from dtests to cluster.py test: dtest: copy run_in_parallel from dtests to data.py test: audit: copy unmodified audit_test.py from dtests	2025-06-12 09:03:45 +02:00
Andrei Chekun	570aaa2ecb	test.py: clean code that isn't used anymore Clean code that is not used anymore	2025-06-11 18:29:26 +02:00
Andrei Chekun	9dca7719b1	test.py: switch off C++ tests from test.py discovery Switch off C++ tests from test.py discovery. With this change, test.py loses the ability to directly see and run the C++ tests. Instead, it delegates all of that to pytest. Since the boost, raft, unit, and ldap directories aren't executed by test.py, suite.yaml files are renamed to test_config.yaml to preserve the old way of test configuration while removing them from execution by test.py. Before this patch, boost tests were visible to both test.py and pytest. So if test.py was invoked without a test name, it executed boost tests twice: with the test.py executor and with the pytest executor. The executor was chosen according to the test name. For example, test/boost/aggregate_fcts_test.cc was executed by pytest, while boost/aggregate_fcts_test was executed by the test.py executor.	2025-06-11 18:29:26 +02:00
Andrei Chekun	42d9dbe66a	test.py: Integrate pytest c++ test execution to test.py With current changes pytest executes boost tests. Gathering metrics added to the pytest BoostFacade and UnitFacade to have the possibility to get them for C++ test as previously. Since pytest executes all modes by itself JUnit report for the C++ test will be one for the run. That means that there is no possibility to output them in testlog in different folders. So testlog/report directory is used to store all kinds of reports generated during tests. JUnit reports should be testlog/report/junit, Allure reports should be in testlog/report/allure. Breaking changes: 1. Terminal output changed. test.py will run pytest for next directories: test/boost, test/ldap, test/raft, test/unit. test.py will blindly translate the output of the pytest to the terminal. Then when all these tests are finished, test.py will continue to show previous output for the rest of the test. 2. The format of execution of C++ test directories mentioned above has been changed. Now it will be a simple path to the file with extension. For example, instead of boost/aggregate_fcts_test now you need to use test/boost/aggregate_fcts_test.cc 3. This PR creates a spike in test amount. The previous logic was to consolidate the boost results from different runs and different modes to one report. So for the three repeats and three modes (nine test results) in CI was shown one result. Now it shows nine results with differentiating them by mode and run. Note: Pytest uses pytest-xdist module to run tests in parallel. Frozen toolchain has this dependency installed, for the local use, please install it manually.	2025-06-11 18:29:23 +02:00
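Breaking change 2 above replaces names such as `boost/aggregate_fcts_test` with paths such as `test/boost/aggregate_fcts_test.cc`. A minimal sketch of that mapping follows. The helper `to_new_style` is hypothetical and not part of `test.py`, and it assumes every C++ test source uses the `.cc` extension.

```python
from pathlib import PurePosixPath

# Directories the commit message lists as now being run through pytest.
PYTEST_CPP_DIRS = {"boost", "ldap", "raft", "unit"}


def to_new_style(test_name: str) -> str:
    """Hypothetical helper: map an old-style C++ test name to the new path form."""
    path = PurePosixPath(test_name)
    if path.parts and path.parts[0] == "test":
        # Already rooted at test/; leave it untouched.
        return test_name
    if len(path.parts) >= 2 and path.parts[0] in PYTEST_CPP_DIRS:
        # Prefix with test/ and make sure the name carries the .cc extension.
        return str(PurePosixPath("test") / path.with_suffix(".cc"))
    # Anything else is outside the scope of this change.
    return test_name


print(to_new_style("boost/aggregate_fcts_test"))
```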
Tomasz Grabiec	eabc1fa6ff	Merge 'tablets: deallocate storage state on end_migration' from Michael Litvak When a tablet is migrated and cleaned up, deallocate the tablet storage group state on `end_migration` stage, instead of `cleanup` stage: * When the stage is updated from `cleanup` to `end_migration`, the storage group is removed on the leaving replica. * When the table is initialized, if the tablet stage is `end_migration` then we don't allocate a storage group for it. This happens for example if the leaving replica is restarted during tablet migration. If it's initialized in `cleanup` stage then we allocate a storage group, and it will be deallocated when transitioning to `end_migration`. This guarantees that the storage group is always deallocated on the leaving replica by `end_migration`, and that it is always allocated if the tablet wasn't cleaned up fully yet. The case is similar for the pending replica when the migration is aborted. We deallocate the state on `revert_migration` which is the stage following `cleanup_target`. Previously the storage group would be allocated when the tablet is initialized on any of the tablet replicas - also on the leaving replica, and when the tablet stage is `cleanup` or `end_migration`, and deallocated during `cleanup`. This fixes the following issue: 1. A migrating tablet enters cleanup stage 2. the tablet is cleaned up successfully 3. The leaving replica is restarted, and allocates storage group 4. tablet cleanup is not called because it's already cleaned up 5. the storage group remains allocated on the leaving replica after the migration is completed - it's not cleaned up properly.
Fixes https://github.com/scylladb/scylladb/issues/23481 backport to all relevant releases since it's a bug that results in a crash Closes scylladb/scylladb#24393 * github.com:scylladb/scylladb: test/cluster/test_tablets: test restart during tablet cleanup test: tablets: add get_tablet_info helper tablets: deallocate storage state on end_migration	2025-06-11 17:37:02 +02:00
Gleb Natapov	c00a0554e0	topology coordinator: simplify truncate handling in case request queue feature is disabled After allowing multiple commands to run in parallel, the code that handles multiple truncates to the same table can be simplified: it is now executed only if the request queue feature is disabled, so it does not need to handle the case where a request may be in the queue.	2025-06-11 11:29:33 +03:00
Gleb Natapov	01dd4b7f30	topology coordinator: fix indentation after the previous patch	2025-06-11 11:29:33 +03:00
Gleb Natapov	a9e99d1d3c	topology coordinator: allow running multiple global commands in parallel Now that we have a global request queue do not check that there is global request before adding another one. Amend truncation test that expects it explicitly and add another one that checks that two truncates can be submitted in parallel.	2025-06-11 11:29:33 +03:00
Gleb Natapov	a0a3a034e0	topology coordinator: Implement global topology request queue Requests, together with their parameters, are added to the topology_request tables and the queue of active global requests is kept in topology state. They are processed one by one by the topology state machine. Fixes: #16822	2025-06-11 11:29:33 +03:00
Andrzej Jackowski	e23d79cb62	test: audit: filter out LOGIN and USE audit logs LOGIN entries can appear at many points during testing, for example, when a driver creates a new session. Similarly, `USE ks` statements can appear unexpectedly, especially when the python-driver calls `set_keyspace_async` for new connections. To avoid test checks failures, this commit filters out LOGIN and USE entries in tests that are not intended to verify these two types of audit logs.	2025-06-11 09:43:51 +02:00
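The filtering described above might look roughly like the sketch below. The entry shape (a dict with an `operation` field) and the helper name `filter_out_noise` are assumptions for illustration; the real audit rows and test helpers in `audit_test.py` may differ.

```python
def filter_out_noise(entries):
    """Drop LOGIN and USE entries that drivers can emit at unpredictable points."""

    def is_noise(entry):
        operation = entry["operation"].strip().upper()
        # LOGIN appears whenever a driver opens a session; `USE ks` appears
        # when the driver sets the keyspace on a new connection.
        return operation == "LOGIN" or operation.startswith("USE ")

    return [entry for entry in entries if not is_noise(entry)]


# Hypothetical audit entries, in the order they were logged.
observed = [
    {"operation": "LOGIN"},
    {"operation": "USE ks"},
    {"operation": "INSERT INTO ks.t (pk) VALUES (1)"},
    {"operation": "use ks"},
    {"operation": "SELECT * FROM ks.t"},
]
remaining = filter_out_noise(observed)
print([entry["operation"] for entry in remaining])
```

Tests that are meant to verify LOGIN or USE auditing would skip this filter and compare against the unfiltered entries.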
Andrzej Jackowski	876eaf459b	test: audit: remove require mark After moving audit tests from dtests to test.py, require marks are no longer needed because the tests and the code are in the same repository.	2025-06-11 09:43:51 +02:00
Marcin Maliszkiewicz	111cccf8ba	test: audit: wait until raft state is applied in test_permissions Otherwise test is flaky, expecting permissions to be enforced before they get applied.	2025-06-11 09:43:51 +02:00
Andrzej Jackowski	6c6234979c	test: audit: fix problems in audit_test.py After audit_test.py was moved from dtests to test.py, the following issues arose due to differences between the frameworks: - Some imports were unnecessary or broken - The @pytest.mark.dtest_full decorator was no longer needed - The `issue_open` attribute in `xmark` is not supported - Support for sending SIGHUP is encapsulated by `server_update_config` in test.py - A workaround for scylladb#24473 was required Moreover, suite.yaml was changed to start running audit_test.py in dev mode. Ref. scylladb#24473 Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>	2025-06-11 09:43:44 +02:00
Michał Chojnowski	0ade15df33	transport/server: silence the oversized allocation warning in snappy_compress It has been observed to generate ~200 kiB allocations. Since we have already been made aware of that, we can silence the warning to clean up the logs. Closes scylladb/scylladb#24360	2025-06-10 19:13:26 +03:00
Petr Gusev	b1050944a3	storage_service: test_group0_apply_while_node_is_being_shutdown	2025-06-10 17:25:03 +02:00
Petr Gusev	6b85ab79d6	main.cc: fix group0 shutdown order group0 persistence relies on local storage, so before shutting down local storage we need to make sure that group0 is stopped. Fixes scylladb/scylladb#24401	2025-06-10 16:06:22 +02:00
Wojciech Mitros	5eb4466789	Return correct creation date time in describe table Add a system:table_creation_time tag whose value is the table's creation timestamp in milliseconds. If the tag is present, it is used to fill the creation timestamp value (when CreateTable or DescribeTable is called). If the tag is missing, the value 0 is substituted for the timestamp (in other words, the table is reported as created on January 1, 1970). Update the test to change how we make sure the timestamp is actually used: we create two tables one after another and make sure their creation timestamps are in the correct order. Update tests that work with tags to filter system tags out. Fixes #5013 Closes scylladb/scylladb#24007	2025-06-10 15:25:57 +03:00
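The fallback rule in the commit above can be sketched as follows. Only the tag name `system:table_creation_time` comes from the commit message. The function names are hypothetical, and treating every tag that starts with `system:` as a system tag is an assumption.

```python
from datetime import datetime, timedelta, timezone

CREATION_TIME_TAG = "system:table_creation_time"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def creation_datetime(tags: dict) -> datetime:
    """Return the creation time stored in the tag, or the epoch if it is missing."""
    millis = int(tags.get(CREATION_TIME_TAG, "0"))
    return EPOCH + timedelta(milliseconds=millis)


def user_visible_tags(tags: dict) -> dict:
    """Filter system tags out (assumes they all share the `system:` prefix)."""
    return {key: value for key, value in tags.items() if not key.startswith("system:")}


tags = {CREATION_TIME_TAG: "1749513600000", "owner": "alice"}
print(creation_datetime(tags))
print(creation_datetime({}))
print(user_visible_tags(tags))
```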
Nadav Har'El	ed3a0a81d6	test/cqlpy: add some more tests of secondary index system tables This patch adds a couple of basic tests for system tables related to secondary indexes - system."IndexInfo" and system_schema.indexes. I wanted to understand these system tables better when writing documentation for them - so wrote these tests. These tests can also serve as regression tests that verify that we don't accidentally lose support for these system tables. I checked that these tests also pass in Cassandra 3, 4 and 5. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24137	2025-06-10 15:00:51 +03:00
Tomasz Grabiec	0b516da95b	Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft. Pulling `database::apply()` out of schema merging code will allow to batch changes to subsystems. Future generic code will first call `prepare()` on all implementations, then single `database::apply()` and then `update()` on all implementations, then on each shard it will call `commit()` for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then `post_commit()`. Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/19649 Closes scylladb/scylladb#20853 * github.com:scylladb/scylladb: storage_service: always wake up load balancer on update tablet metadata db: schema_applier: call destroy also when exception occurs db: replica: simplify seeding ERM during schema change db: remove cleanup from add_column_family db: abort on exception during schema commit phase db: make user defined types changes atomic replica: db: make keyspace schema changes atomic db: atomically apply changes to tables and views replica: make truncate_table_on_all_shards get whole schema from table_shards service: split update_tablet_metadata into two phases service: pull out update_tablet_metadata from migration_listener db: service: add store_service dependency to schema_applier service: simplify load_tablet_metadata and update_tablet_metadata db: don't perform move on tablet_hint reference replica: split add_column_family_and_make_directory into steps replica: db: split drop_table into steps db: don't move map references in merge_tables_and_views() db: introduce commit_on_shard function db: access types during schema merge via special storage replica: make non-preemptive keyspace create/update/delete functions public replica: split update keyspace into two phases replica: split creating keyspace into two functions db: rename create_keyspace_from_schema_partition db: decouple functions and aggregates schema change notification from merging code db: store functions and aggregates change batch in schema_applier db: decouple tables and views schema change notifications from merging code db: store tables and views schema diff in schema_applier db: decouple user type schema change notifications from types merging code service: unify keyspace notification functions arguments db: replica: decouple keyspace schema change notifications to a separate function db: add class encapsulating schema merging	2025-06-10 13:45:32 +02:00
Ernest Zaslavsky	30199552ac	s3_client: Mitigate connection exhaustion in `download_source` The existing `download_source` implementation optimizes performance by keeping the connection to S3 open and draining data directly from the socket. While this eliminates the overhead (60-100ms) of repeatedly establishing new connections, it leads to rapid exhaustion of client-side connections. On a single shard, two `mx_readers` for load and stream are enough to trigger this issue. Since each client typically holds two connections, readers keeping index and data sources open can cause deadlocks where processes stall due to unavailable connections. Introduce `chunked_download_source`, a new S3 download method built on `download_source`, to dynamically manage connections: - Buffers data in 5MiB chunks using a producer-consumer model - Closes connections once buffers reach capacity, returning them to the pool for other clients - Uses a filling fiber that resumes fetching once buffers are consumed from the queue Performance remains comparable to `download_source`, achieving 95MiB/s for sequential 1GiB downloads from S3. However, preloading large chunks may cause read amplification. Fixes: https://github.com/scylladb/scylladb/issues/23785 Closes scylladb/scylladb#23880	2025-06-10 12:58:24 +03:00
Anna Stuchlik	b0ced64c88	doc: remove the limitation for disabling CDC This commit removes the instruction to stop all writes before disabling CDC with ALTER. Fixes https://github.com/scylladb/scylla-docs/issues/4020 Closes scylladb/scylladb#24406	2025-06-10 12:53:09 +03:00
Robert Bindar	ca1a9c8d01	Add support for nodetool refresh --skip-reshape This patch adds the new option in nodetool, patches the load_new_ss_tables REST request with a new parameter and skips the reshape step in refresh if this flag is passed. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#24409 Fixes: #24365	2025-06-10 12:52:13 +03:00
David Garcia	62fdebfe78	chore: exclude OS and ENT from google Closes scylladb/scylladb#24417	2025-06-10 12:50:37 +03:00
Emil Maskovsky	b7e0a01fcc	raft: simplify voter handler code to not pass node references around Refactor the voter handler logic to only pass around node IDs (`raft::server_id`), instead of pairs of IDs and node descriptor references. Node descriptors can always be efficiently retrieved from the original nodes map, which remains valid throughout the calculation. This change reduces unnecessary reference passing and simplifies the code. All node detail lookups are now performed via the central nodes map as needed. Fixes: scylladb/scylladb#24035	2025-06-10 11:04:56 +02:00
Emil Maskovsky	839e0bf40d	raft: reformat voter handler for consistent indentation Reformatted the voter handler implementation to comply with clang-format automatic formatting rules. No functional changes.	2025-06-10 11:04:56 +02:00
Emil Maskovsky	05392e6ef3	raft: use explicit priority comparators for datacenters and racks Refactor the voter handler to use explicit priority comparator classes for datacenter and rack selection. This makes the prioritization logic more transparent and robust, and reduces the risk of subtle bugs that could arise from relying on implicit comparison operators.	2025-06-10 11:04:54 +02:00
Emil Maskovsky	e93bf3f05a	raft: clean up voter handler by removing redundant comments Remove comments from the group0 voter handler that simply restate the code or do not provide meaningful clarification. This improves code readability and maintainability by reducing noise and focusing on essential documentation.	2025-06-10 11:03:20 +02:00
Calle Wilund	80feb8b676	utils::http::dns_connection_factory: Use a shared certificate_credentials Fixes #24447 This factory type, which is really more a data holder/connection producer per connection instance, creates, if using https, a new certificate_credentials on every instance. Which when used by S3 client is per client and scheduling groups. Which eventually means that we will do a set_system_trust + "cold" handshake for every tls connection created this way. This will cause both IO and cold/expensive certificate checking -> possible stalls/wasted CPU. Since the credentials object in question is literally a "just trust system", it could very well be shared across the shard. This PR adds a thread local static cached credentials object and uses this instead. Could consider moving this to seastar, but maybe this is too much. Closes scylladb/scylladb#24448	2025-06-10 11:20:21 +03:00
Petr Gusev	e456d2d507	storage_proxy: log gate_closed_exception gate_closed_exception likely signals that we have shutdown order issues. If we just swallow it we lose information what exact component was shutdown prematurely. For example, we stopped local storage before group0 during shutdown in main.cc. If a group0 command arrives, topology_state_load might try to write something and get mutation_write_failure_exception, which results in 'applier fiber stopped because of the error'. There is no other information in the logs in this case, other than 'mutation_write_failure_exception'. It's not clear what the original problem is and what component is triggering it. In this commit we add a warning to the logs when gate_closed_exception is thrown from lmutate or rmutate. Another option is to just remove the try_catch_nested line and allow gate_closed_exception to be logged as an error below. However, this might break some tests which check ERROR lines in the logs.	2025-06-10 10:04:04 +02:00
Andrzej Jackowski	c4e8a2c44e	mapreduce: change next_vnode lambda to get_next_partition_range function The motivation of this code reorganization is to shorten the time when ERM is being kept, done later in this patch series. Ref. scylladb#21831	2025-06-10 09:06:17 +02:00
Michael Litvak	bd88ca92c8	test/cluster/test_tablets: test restart during tablet cleanup Add a test that reproduces issue scylladb/scylladb#23481. The test migrates a tablet from one node to another, and while the tablet is in some stage of cleanup - either before or right after, depending on the parameter - the leaving replica, on which the tablet is cleaned, is restarted. This is interesting because when the leaving replica starts and loads its state, the tablet could be in different stages of cleanup - the SSTables may still exist or they may have been cleaned up already, and we want to make sure the state is loaded correctly.	2025-06-09 17:27:45 +03:00
Michael Litvak	fb18fc0505	test: tablets: add get_tablet_info helper Add a helper for tests to get the tablet info from system.tablets for a tablet owning a given token.	2025-06-09 16:59:07 +03:00
Michael Litvak	34f15ca871	tablets: deallocate storage state on end_migration When a tablet is migrated and cleaned up, deallocate the tablet storage group state on `end_migration` stage, instead of `cleanup` stage: * When the stage is updated from `cleanup` to `end_migration`, the storage group is removed on the leaving replica. * When the table is initialized, if the tablet stage is `end_migration` then we don't allocate a storage group for it. This happens for example if the leaving replica is restarted during tablet migration. If it's initialized in `cleanup` stage then we allocate a storage group, and it will be deallocated when transitioning to `end_migration`. This guarantees that the storage group is always deallocated on the leaving replica by `end_migration`, and that it is always allocated if the tablet wasn't cleaned up fully yet. It is a similar case also for the pending replica when the migration is aborted. We deallocate the state on `revert_migration` which is the stage following `cleanup_target`. Previously the storage group would be allocated when the tablet is initialized on any of the tablet replicas - also on the leaving replica, and when the tablet stage is `cleanup` or `end_migration`, and deallocated during `cleanup`. This fixes the following issue: 1. A migrating tablet enters cleanup stage 2. The tablet is cleaned up successfully 3. The leaving replica is restarted, and allocates storage group 4. Tablet cleanup is not called because it was already cleaned up 5. The storage group remains allocated on the leaving replica after the migration is completed - it's not cleaned up properly. Fixes scylladb/scylladb#23481	2025-06-09 16:58:38 +03:00
Michael Litvak	8aeb404893	test_cdc_generation_clearing: wait for generations to propagate In test_cdc_generation_clearing we trigger events that update CDC generations, verify the generations are updated as expected, and verify the system topology and CDC generations are consistent on all nodes. Before checking that all nodes are consistent and have the same CDC generations, we need to consider that the changes are propagated through raft and take some time to propagate to all nodes. Currently, we wait for the change to be applied only on the first server which runs the CDC generation publisher fiber and read the CDC generations from this single node. The consistency check that follows could fail if the change was not propagated to some other node yet. To fix that, before checking consistency with all nodes, we execute a read barrier on all nodes so they all see the same state as the leader. Fixes scylladb/scylladb#24407 Closes scylladb/scylladb#24433	2025-06-09 12:59:04 +02:00
Gleb Natapov	bb29591daf	topology coordinator: Do not cancel global requests in cancel_all_requests This was mistakenly added by `fbd75c5c06`. The function is called after checking that no topology request can proceed, so it cancels them, but this has nothing to do with global request. Also, for some reason, the cancellation was added in the loop over topology requests.	2025-06-09 13:38:49 +03:00
Gleb Natapov	be0b328b19	topology coordinator: store request type for each global command	2025-06-09 13:38:49 +03:00
Gleb Natapov	00fd427be0	topology request: make it possible to hold global request types in request_type field topology_request table has a field to hold a request type, but currently it can hold only per node requests. This patch makes it possible to store global request types there as well.	2025-06-09 13:38:49 +03:00
Gleb Natapov	3a496067c6	topology coordinator: move alter table global request parameters into topology_request table Currently parameters to alter table global topology command are stored in static column in the topology table, but this way there can be only one outstanding alter table request. This patch moves the parameters to the topology_request table where parameters are stored per request.	2025-06-09 13:38:49 +03:00
Gleb Natapov	a9244bf037	topology coordinator: move cleanup global command to report completion through topology_request table We want to unify all commands to report completion through the topology_requests table.	2025-06-09 13:38:49 +03:00
Gleb Natapov	6a52ba2251	topology coordinator: no need to create updates vector explicitly	2025-06-09 13:38:49 +03:00
Gleb Natapov	69dacb5894	topology coordinator: use topology_request_tracking_mutation_builder::done() instead of open-coding it	2025-06-09 13:38:49 +03:00
Gleb Natapov	7257391c8f	topology coordinator: handle error during new_cdc_generation command processing Currently if there is an error during new_cdc_generation command it is retried in a loop. Since the status of the command execution is now reported through the topology request table we can fail the command instead.	2025-06-09 13:38:48 +03:00
Gleb Natapov	389f0f6280	topology coordinator: remove unneeded semicolon	2025-06-09 13:38:48 +03:00
Gleb Natapov	ba371c09fc	topology coordinator: fix indentation after the last commit	2025-06-09 13:38:48 +03:00
Gleb Natapov	b8c11f330a	topology coordinator: move new_cdc_generation topology request to use topology_request table for completion Currently it checks the completion by waiting for new generation to appear, but we want to unify all commands to check for completion in topology_request table.	2025-06-09 13:38:48 +03:00
Gleb Natapov	6d09c76a12	gms/feature_service: add TOPOLOGY_GLOBAL_REQUEST_QUEUE feature flag Will be needed to coordinate between old and new nodes during upgrade.	2025-06-09 13:38:48 +03:00
Anna Stuchlik	93a7146250	doc: add redirections to fix 404 This commit adds redirections for pages on the master branch that were unexpectedly indexed by Google. Those pages no longer exist and return 404. Fixes https://github.com/scylladb/scylladb/issues/24397 Closes scylladb/scylladb#24422	2025-06-09 12:38:10 +02:00
Pavel Emelyanov	46557b3927	table: Touch and sync snapshot directory only once The table::take_snapshot() touches the snapshot directory, which is good. It happens on all shards, which is not that good, because all shards just step on each other's toes when doing it, the directory is not sharded. Same for post-snapshot directory sync -- it can happen once, after all shards finish creating snapshot links. Move both, touching and syncing up one level. There's only one caller of the method, so only one caller to update. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24154	2025-06-09 13:36:49 +03:00
Michał Chojnowski	7d26d3c7cb	db/config: add an option that disables dict-aware sstable compressors in DDL statements For reasons, we want to be able to disallow dictionary-aware compressors in chosen deployments. This patch adds a knob for that. When the knob is disabled, dictionary-aware compressors will be rejected in the validation stage of CREATE and ALTER statements. Closes scylladb/scylladb#24355	2025-06-09 13:30:40 +03:00
Raphael S. Carvalho	2d716f3ffe	replica: Fix truncate assert failure Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed a preexisting fragility which I missed. 1) truncate gets RP mark X, truncated_at = second T 2) new sstable written during snapshot or later, also at second T (difference of MS) 3) discard_sstables() gets RP Y > saved RP X, since creation time of sstable with RP Y is equal to truncated_at = second T. So the problem is that truncate is using a clock of second granularity for filtering out sstables written later, and after we got low mark and truncate time, it can happen that a sstable is flushed later within the same second, but at a different millisecond. By switching to a millisecond clock (db_clock), we allow sstables written later within the same second to be filtered out. It's not perfect but extremely unlikely a new write lands and gets flushed in the same millisecond we recorded truncated_at timepoint. In practice, truncate will not be used concurrently to writes, so this should be enough for our tests performing such concurrent actions. We're moving away from gc_clock which is our cheap lowres_clock, but time is only retrieved when creating sstable objects, whose frequency of creation is low enough for not having significant consequences, and also db_clock should be cheap enough since it's usually syscall-less. Fixes #23771. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24426	2025-06-08 15:59:15 +03:00
Nadav Har'El	a714079a62	Merge 'Add Support for Per-Table Metrics in Alternator' from Amnon Heiman This series introduces per-table metrics support for Alternator. It includes the following commits: Add optional per-table metrics for Alternator Introduces a shared_ptr-based mechanism that allows Alternator to register per-table metrics. These metrics follow the table's lifecycle, similar to how CQL metrics are handled. The use of shared_ptr ensures no direct dependency between table stats and Alternator. Enable registration of stats objects per table Adds support for registering a stats object using a keyspace and table name. Per-table metrics are prefixed with alternator_table to differentiate them from per-shard metrics. Metrics are reported once per node, and those not meaningful at the table level (e.g. create/delete) are excluded. All metrics use the skip_when_empty flag. Update per-table metrics handling Adds a helper function to retrieve the stats object from a table schema. Updates both per-shard and per-table metrics, resulting in some code duplication. Add tests for per-table metrics Extends existing tests to also validate the per-table metrics. These tests ensure that the new metrics are correctly registered and updated. This series improves observability in Alternator by enabling fine-grained per-table metrics without disrupting existing per-shard metrics. No need to backport Fixes #19824 Closes scylladb/scylladb#24046 * github.com:scylladb/scylladb: alternator/test_metrics.py: Test the per-table metrics alternator/executor.cc: Update per-table metrics alternator/stats: Add per-table metrics replica/database.hh: Add alternator per-table metrics alternator/stats.hh: Introduce a per-table stats container	2025-06-08 10:42:05 +03:00
Botond Dénes	8498bd6376	Merge 'Replace container_to_vec with std::ranges' from Pavel Emelyanov The helper in question converts an iterable collection to a vector of fmt::to_string()-s of the collection elements. Patch the caller to use standard library and remove the helper. Closes scylladb/scylladb#24357 * github.com:scylladb/scylladb: api: Drop no longer used container_to_vec helper api: Use std::ranges to stringify collections api: Use std::ranges to convert std::set<sstring> to std::vector<string> api: Use db::config::data_file_directories()' vector directly api: Coroutinize get_live_endpoint()	2025-06-06 10:57:06 +03:00
Pavel Emelyanov	12420dc644	api: Shorten get_host_to_id_map() handler The handler does - gets host IDs from local token metadata - for each ID gets the host IP and generates IP:ID std::pair - converts the sequence of generated pairs into std::unordered_map - converts the unordered map into vector of jsonable key:value objects This patch removes the 3rd step and makes the needed jsonable object in step 2 directly, thus eliminating the interposing unordered_map creation. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24354	2025-06-06 10:54:23 +03:00
Pavel Emelyanov	428edd41f5	api: Make use of database::get_all_keyspaces() There are two places in the API that want to get the list of keyspace names. For that they call database::get_keyspaces() and then extract keys from the returned name to class keyspace map. There's a database::get_all_keyspaces() method that does exactly that. Remove the map_keys helper from the api/api.hh that becomes unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24353	2025-06-06 10:53:09 +03:00
Marcin Maliszkiewicz	2090e44283	storage_service: always wake up load balancer on update tablet metadata Lack of wakeup is error-prone, as it relies on a wakeup occurring elsewhere.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	ddc0656eb5	db: schema_applier: call destroy also when exception occurs Otherwise objects may be destroyed on wrong shard, and assert will trigger in ~sharded().	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	547bb1f663	db: replica: simplify seeding ERM during schema change We know that caller is running on shard 0 so we can avoid some extra boilerplate.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	97cdb72d4d	db: remove cleanup from add_column_family Since we abort now on failure during schema commit there is no need for cleanup as it only manages in-memory state. Explicit cf.stop was added to code paths outside of schema merging to avoid unnecessary regressions.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	d5075c70ef	db: abort on exception during schema commit phase As we have no way to recover from partial commit.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	858db822dc	db: make user defined types changes atomic The same order of creation/destruction is preserved as in the original code, looking from single shard point of view. create_types() is called on each shard separately, while in theory we should be able reuse results similarly as diff_rows(). But we don't introduce this optimization yet.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	5b2e4140cc	replica: db: make keyspace schema changes atomic Now all keyspace related schema changes are observable on given shard as they would be applied atomically. This is achieved by commit_on_shard() function being non-preemptive (no futures, no co_awaits). In the future we'll extend this to the whole schema and also other subsystems.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	556e89bc9d	db: atomically apply changes to tables and views In this commit we make use of the split functions introduced before. Pattern is as follows: - in merge_tables_and_views we call some preparatory functions - in schema_applier::update we call non-yielding step - in schema_applier::post_commit we call cleanups and other finalizing async functions Additionally we introduce frozen_schema_diff because converting schema_ptr to global_schema_ptr triggers schema registration and with atomic changes we need to place registration only in commit phase. Schema freezing is the same method global_schema_ptr uses to transport schema across shards (via schema_registry cache).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	a27776b4ff	replica: make truncate_table_on_all_shards get whole schema from table_shards Before for views and indexes it was fetching base schema from db (and couple other properties). This is a problem once we introduce atomic tables and views deletion (in the following commit). Because once we delete table it can no longer be fetched from db object, and truncation is performed after atomically deleting all relevant tables/views/indexes. Now the whole relevant schema will be fetched via global_table_ptr (table_shards) object.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	ac254e9722	service: split update_tablet_metadata into two phases In following commits calls will be split in schema_applier.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	21a5a3c01f	service: pull out update_tablet_metadata from migration_listener It's not a good usage as there is only one non-empty implementation. Also we need to change it further in the following commit which makes it incompatible with listener code.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	92e3d69f79	db: service: add store_service dependency to schema_applier There is already implicit logical dependency via migration_notifier but in the next commits we'll be moving store_service out from it as we need better control (i.e. return a value from the call).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	1c8fd3a65d	service: simplify load_tablet_metadata and update_tablet_metadata - remove load_tablet_metadata(), instead we add wake_up_load_balancer flag to update_tablet_metadata(), it reduces number of public functions and also serves as a comment (removed comment with very similar meaning) - reimplement the code to not use mutate_token_metadata(), this way it's more readable and it's also needed as we'll split update_tablet_metadata() in following commits so that we can have subroutine which doesn't yield (for ensuring atomicity)	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	3119a02edd	db: don't perform move on tablet_hint reference This lambda is called several times so there should be no move. Currently the bug likely doesn't manifest as code does work only on shard 0.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	1ad14f02f1	replica: split add_column_family_and_make_directory into steps This is similar work as for drop_table in previous commit. add_column_family_and_make_directory() behaves exactly the same as before but calls to it in schema_applier will be replaced by calls directly to split steps. Other usages will remain intact as they don't need atomicity (like creating system tables at startup).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	141a5643e5	replica: db: split drop_table into steps This is done so that actual dropping can be an atomic step which could be composed with other schema operations, and eventually all subsystems modified via raft so that we could introduce atomic changes which span across different subsystems. We split drop_table_on_all_shards() into: - prepare_tables_metadata_change_on_all_shards() - prepare_drop_table_on_all_shards() - drop_table() - cleanup_drop_table_on_all_shards() prepare_tables_metadata_change_on_all_shards() is necessary because when applying multiple schema changes at once (e.g. drop and add tables) we need to lock only once. We add legacy_drop_table_on_all_shards() which behaves exactly like old drop_table_on_all_shards() to be compatible with code which doesn't need to play with atomicity. Usages of legacy_drop_table_on_all_shards() in schema_applier will be replaced with direct calls to split functions in the following commits - that's the place we will take advantage of drop_table not yielding (as it returns void now).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	2bae38e252	db: don't move map references in merge_tables_and_views() Since they are const it's not needed and misleading.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	85f19e165a	db: introduce commit_on_shard function This will be the place for all atomic schema switching operations. Note that atomicity is observed only from single shard point of view. All shards may switch at slightly different times as global locking for this is not feasible.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	b3730282c3	db: access types during schema merge via special storage Once we create types atomically the code which is before commit may depend on newly added types, so it has to access both old and new types. New storage called in_progress_types_storage was added.	2025-06-06 08:50:33 +02:00
Pavel Emelyanov	f5743c6afc	Merge 'test/alternator: make tests runnable on DynamoDB Local' from Nadav Har'El The Alternator tests should pass on Alternator (of course), and almost always also on DynamoDB to verify that the tests themselves are correct and don't just enshrine Alternator's incorrect behavior. Although much less important, it is sometimes useful to be able to check if the tests also pass on other DynamoDB clones, especially "DynamoDB Local" - Amazon's DynamoDB mock written in Java. In issue https://github.com/scylladb/scylladb/issues/7775 we noted that some of our tests don't actually pass on DynamoDB Local, for different reasons, but at the time that issue was created most of the tests did work. However, checking now on a newer version of DynamoDB Local (2.6.1), I notice that _all_ tests failed because of some silly reasons that are easy to fix - and this is what the two patches in this series fix. After these fixes, most of the Alternator tests pass on DynamoDB Local. But not all of them - #7775 is still open. No backport needed - these are just test framework improvements for developers. Closes scylladb/scylladb#24361 * github.com:scylladb/scylladb: test/alternator: any response from healthcheck means server is alive test/alternator: fall back to legal-looking access key id	2025-06-06 08:50:58 +03:00
Nadav Har'El	b0f98f7d4b	mv: test that view's SELECT automatically includes primary key Both ScyllaDB's and Datastax's documentation suggest that when creating a view with CREATE MATERIALIZED VIEW, its SELECT clause doesn't need to list the view's primary key columns because those are selected automatically. For example, our documentation has an example in https://docs.scylladb.com/manual/stable/features/materialized-views.html ``` CREATE MATERIALIZED VIEW building_by_city2 AS SELECT meters FROM buildings WHERE city IS NOT NULL PRIMARY KEY(city, name); ``` Note how the primary key columns - city and name - are not explicitly SELECTed. I just discovered that while this behavior was indeed true in Cassandra 3 (and still true in ScyllaDB), it actually got broken in Cassandra 4 and 5. I reported this apparent regression to Cassandra (CASSANDRA-20701), and am proposing the regression test in this patch to ensure that Scylla can't suffer a similar regression in the future. The new test passes on ScyllaDB and Cassandra 3, but fails on Cassandra 4 and 5 (and is therefore tagged with "cassandra_bug"). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24399	2025-06-05 16:52:49 +02:00
Piotr Szymaniak	de96c28625	alternator: Add support for TTL when using tablets Support for TTL-based data removal when using tablets. The essence of this commit is a separate code path for finding token ranges owned by the current shard for the cases when tablets are used and not vnodes. At the same time, the vnodes-case is not touched not to cause any regressions. The TTL-caused data removal is normally performed by the primary replica (both when using vnodes and tablets). For the tablets case, the already-existing method tablet_map::get_primary_replica(tablet_id) is used to know if a shard executing the TTL-related data removal is the primary replica for each tablet. A new method tablet_map::get_secondary_replica(tablet_id) has been added. It is needed by the data invalidation procedure to remove data when the primary replica node is down - the data is then removed by the secondary replica node. The mechanism is the same as in the vnodes case. Since alternator now supports TTL, the test `test_ttl_enable_error_with_tablets` has been removed. Also, tests in the test_ttl.py have been made to run twice, once with vnodes and once with tablets. When run with tablets, due to the lack of support for LWT with tablets (#18068), tests use 'system:write_isolation' of 'unsafe_rmw'. This approach allows early regression testing with tablets and is meant only as a tentative solution. Fixes scylladb/scylladb#16567 Closes scylladb/scylladb#23662	2025-06-05 17:39:29 +03:00
Amnon Heiman	760c8c3333	alternator/test_metrics.py: Test the per-table metrics This patch adds tests for the newly added per-table metrics. It mainly redoes existing tests, but verifies that the per-table metrics are updated correctly. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-06-05 15:12:19 +03:00
Amnon Heiman	3ad7a24eee	alternator/executor.cc: Update per-table metrics This patch adds support for updating per-table metrics. It introduces a helper function that retrieves the stats object from a table schema. The code uses a lw_shared_ptr for the stats object to ensure safe updates even if the table holding it has been deleted. There is some duplication in the updated code, as both per-shard and per-table metrics are updated. The rmw_operation::execute function now accepts two stats objects: one for the global metrics and one for the per-table metrics. The use of execute was also modified—rather than modifying the WCU directly, a parameter is used so both global and per-table stats can be updated. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-06-05 15:12:13 +03:00
Amnon Heiman	d6afd42342	alternator/stats: Add per-table metrics This patch allows registering a stats object per table. The per-table stats object needs its metrics registry to be part of the table's lifecycle, but there could be a scenario in which a table is already deleted while some Alternator operations are still in progress. To handle this, the patch separates the registry from the metrics holder. It is safe to modify a parameter that is not registered. Metrics registration is performed via functions instead of the constructor. The registration accepts a keyspace and table name as parameters. The per-table metrics use an alternator_table prefix to distinguish them from their per-shard equivalents. The metrics are aggregated and reported once per node. Metrics that do not make sense to report per table (such as create and delete) are not registered. All metrics are marked with skip_when_empty. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-06-05 14:44:03 +03:00
Amnon Heiman	005df0c5c4	replica/database.hh: Add alternator per-table metrics This patch adds optional per-table metrics for Alternator. Like CQL, some of Alternator's statistics should be per-table. The shared_ptr allows Alternator to register such metrics in a way that makes them part of the table's lifecycle. Using a shared_ptr does not create dependencies between the table_stats and Alternator. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-06-05 14:38:14 +03:00
Amnon Heiman	af262317b5	alternator/stats.hh: Introduce a per-table stats container A per-table stats container will be used to safely hold alternator per-table stats. It is built in a way that even if the metrics it holds are no longer registered, it is still safe to use. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-06-05 14:38:14 +03:00
Andrzej Jackowski	e6eb741e95	test: dtest: add dict support to populate in scylla_cluster.py Co-authored-by: Evgeniy Naydanov <evgeniy.naydanov@scylladb.com>	2025-06-05 08:20:09 +02:00
Andrzej Jackowski	e3f052d6fb	test: dtest: copied get_node_ip from dtests to scylla_cluster.py Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>	2025-06-05 08:20:09 +02:00
Andrzej Jackowski	40e71ad1e6	test: dtest: copy run_rest_api from dtests to cluster.py Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>	2025-06-05 08:20:09 +02:00
Andrzej Jackowski	3da86f04a5	test: dtest: copy run_in_parallel from dtests to data.py Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>	2025-06-05 08:19:54 +02:00
Andrzej Jackowski	a1b1d810f9	test: audit: copy unmodified audit_test.py from dtests Copied the entire audit_test.py from scylladb/scylla-dtest, to remove the entire file from scylla-dtest after this patch series is merged. The motivation is to move all audit testing out of dtests, to make it easier to maintain and more reliable. Changed suite.yaml to prevent audit_test.py from running, because audit_test.py needs improvement before it starts passing. Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>	2025-06-05 08:19:44 +02:00
Ernest Zaslavsky	a39b773d36	encryption_test: Catch exact exception Apparently `test_kms_network_error` will succeed under any circumstances, since most of our exceptions derive from `std::exception`: whatever happens to the test, and for whatever reason it throws, the test will be marked as passed. Start catching the exact exception that we expect to be thrown. Maybe somewhat related to https://github.com/scylladb/scylladb/issues/22628 Fixes: https://github.com/scylladb/scylladb/issues/24145 reapplies reverted: https://github.com/scylladb/scylladb/pull/24065 Should be backported to 2025.2. Closes scylladb/scylladb#24242	2025-06-05 08:32:51 +03:00
Benny Halevy	8b387109fc	disk_space_monitor: add space_source_registration Register the current space_source_fn in an RAII object that resets monitor._space_source to the previous function when the RAII object is destroyed. Use space_source_registration in database_test:: mutation_dump_generated_schema_deterministic_id_version to prevent use-after-stack-return in the test. Fixes #24314 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24342	2025-06-04 16:25:24 +03:00
Ernest Zaslavsky	1446f57635	minio: update CLI usage, remove deprecated `mc` options Replace phased-out `mc` command options with supported alternatives. Ensures compatibility with the latest MinIO version. Closes scylladb/scylladb#24363	2025-06-04 16:22:48 +03:00
Anna Stuchlik	8b989d7fb1	doc: add the upgrade guide from 2025.1 to 2025.2 This commit adds the upgrade guide from version 2025.1 to 2025.2. Also, it removes the upgrade guides existing for the previous version that are irrelevant in 2025.2 (upgrade from OSS 6.2 and Enterprise 2024.x). Note that the new guide does not include the "Enable Consistent Topology Updates" page, as users upgrading to 2025.2 have consistent topology updates already enabled. Fixes https://github.com/scylladb/scylladb/issues/24133 Fixes https://github.com/scylladb/scylladb/issues/24265 Closes scylladb/scylladb#24266	2025-06-04 14:00:05 +03:00
Szymon Malewski	5969809607	mapreduce_service: Prevent race condition In parallelized aggregation functions, the super-coordinator (the node performing the final merging step) receives and merges each partial result in parallel coroutines (`parallel_for_each`). Usually responses are spread over time and the actual merging is atomic. However, sometimes partial results are received at a similar time, and if an aggregate function (e.g. a lua script) yields, two coroutines can try to overwrite the same accumulator one after another, which leads to losing some of the results. To prevent this, in this patch each coroutine stores merging results in its own context and overwrites the accumulator atomically, only after it was fully merged. Compared to the previous implementation, the order of operands in the merging function is swapped, but the order of aggregation is not guaranteed anyway. Fixes #20662 Closes scylladb/scylladb#24106	2025-06-04 13:47:11 +03:00
Nadav Har'El	6cbcabd100	alternator: hide internal tags from users The "tags" mechanism in Alternator is a convenient way to attach metadata to Alternator tables. Recently we have started using it more and more for internal metadata storage: * UpdateTimeToLive stores the attribute in a tag system:ttl_attribute * CreateTable stores provisioned throughput in tags system:provisioned_rcu and system:provisioned_wcu * CreateTable stores the table's creation time in a tag called system:table_creation_time. We do not want any of these internal tags to be visible to a ListTagsOfResource request, because if they are visible (as before this patch), systems such as Terraform can get confused when they suddenly see a tag which they didn't set - and may even attempt to delete it (as reported in issue #24098). Moreover, we don't want any of these internal tags to be writable with TagResource or UntagResource: If a user wants to change the TTL setting they should do it via UpdateTimeToLive - not by writing directly to tags. So in this patch we forbid read or write to any tag that begins with the "system:" prefix, except one: "system:write_isolation". That tag is deliberately intended to be writable by the user, as a configuration mechanism, and is never created internally by Scylla. We should have perhaps chosen a different prefix for configurable vs. internal tags, or chosen more unique prefixes - but let's not change these historic names now. This patch also adds regression tests for the internal tags features, failing before this patch and passing after: 1. internal tags, specifically system:ttl_attribute, are not visible in ListTagsOfResource, and cannot be modified by TagResource or UntagResource. 2. system:write_isolation is not internal, and can be written by either TagResource or UntagResource, and read with ListTagsOfResource. This patch also fixes a bug in the test where we added more checks for system:write_isolation - test_tag_resource_write_isolation_values. This test forgot to remove the system:write_isolation tags from test_table when it ended, which would lead to other tests that run later to run with a non-default write isolation - something which we never intended. Fixes #24098. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24299	2025-06-03 20:40:50 +03:00
Pavel Emelyanov	37e6ff1a3c	Merge 'test.py: cql: run tests using bare pytest command' from Evgeniy Naydanov Create a custom pytest test collector for .cql files and move CQL test execution logic from `CQLApprovalTest` class and `pylib/cql_repl/cql_repl.py` file to `CqlTest.runtest()` method. In result, the only difference between CQLApproval and Python suite types is suffixes of test files. Also there is a separate commit to remove dead code: There is `write_junit_failure_report()` method in Test class which was used to generate a JUnitXML report. But it became a dead code after removal of `write_junit_report()` function in `1e1d213592` to avoid duplication of error reporting in Jenkins (see https://github.com/scylladb/scylladb/issues/23220.) This commit removes this method and all its implementations in subclasses. Closes scylladb/scylladb#24301 * github.com:scylladb/scylladb: test.py: cql: don't exit from pytest session on failed CQL test.py: cql: run tests using bare pytest command test.py: python: set test.id according to --run_id argument test.py: python: pass --tmpdir from test.py to all Python tests test.py: remove dead code after removing of write_junit_report()	2025-06-03 19:32:06 +03:00
Pavel Emelyanov	24f430c6d2	Merge 'test.py: dtest: port next_gating tests from auth_roles_test.py' from Evgeniy Naydanov Copy `auth_roles_test.py` from the scylla-dtest test suite, remove all non-next_gating tests from it, and make it work with `test.py`. As a part of the porting process, copy missing utility functions from scylla-dtest, and remove unused imports and markers. Enable the test in `suite.yaml` (run in dev mode only.) Closes scylladb/scylladb#24343 * github.com:scylladb/scylladb: test.py: dtest: make auth_roles_test.py run using test.py test.py: dtest: add wait_for_any_log() to tools/log_utils.py test.py: dtest: add part of tools/assertions.py test.py: dtest: pickup latest code for retrying.py from dtest test.py: dtest: copy unmodified auth_roles_test.py	2025-06-03 18:54:47 +03:00
Patryk Jędrzejczak	8756c233e0	test: test_raft_recovery_user_data: disable hinted handoff The test is currently flaky, writes can fail with "Too many in flight hints: 10485936". See scylladb/scylladb#23565 for more details. We suspect that scylladb/scylladb#23565 is caused by an infrastructure issue - slow disks on some machines we run CI jobs on. Since the test fails often and investigation doesn't seem to be easy, we first deflake the test in this patch by disabling hinted handoff. For replacing nodes, we provide `cfg` because there should have been `cfg` in the first place. The test was correct anyway because: - `tablets_mode_for_new_keyspaces` is set to `true` by default in test/cluster/suite.yaml, - `endpoint_snitch` is set to `GossipingPropertyFileSnitch` by default if the property file is provided in `ScyllaServer.__init__`. Ref scylladb/scylladb#23565 We should backport this patch to 2025.2 because this test is also flaky on CI jobs using 2025.2. Older branches don't have this test. Closes scylladb/scylladb#24364	2025-06-03 17:48:42 +02:00
Nadav Har'El	ac70e34de9	test/alternator: verify that DeleteItem returns an empty object A user on StackOverflow (https://stackoverflow.com/questions/79650278) reported that DeleteItem returns the appropriate response (an empty object) on DynamoDB, but doesn't on "DynamoDB Local" (Amazon's local mock of DynamoDB). I wrote the test in this patch to make sure that Alternator doesn't have this bug, and indeed it doesn't: When DeleteItem is used without any option that asks for additional output, its response is, as expected, an empty object. As usual, the new test passes on both Alternator and AWS DynamoDB. (I didn't actually test on DynamoDB Local, I have some problems with running that, but it doesn't matter, we have no intention of testing DynamoDB Local). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24359	2025-06-03 18:47:34 +03:00
Avi Kivity	744015cf26	test.py: allow cmake configuration and ./configure.py configuration to coexist Cmake emits its build.ninja into build/, while configure.py emits build.ninja into ./. test.py uses this difference to choose the directory structure to test. The problem is that vscode will randomly call cmake to understand the directory structure, so we end up with both build.ninja set up. Invert the logic to look for ./build.ninja to determine the mode (instead of build/build.ninja which can exist even if the user uses traditional configuration). It can still happen that a stray ./build.ninja exists (for example due to switching branches), but that is rarer than having vscode auto-create it. Closes scylladb/scylladb#24269	2025-06-03 16:46:41 +03:00
Piotr Dulikowski	f6669422e1	Merge 'test.py: refactor test facades for better error handling' from Andrei Chekun Switching to f-string formatting to simplify the code and to unify it with a general approach for formatting strings. If the log file is absent or empty, the test fails with an error regarding a missing boost log file; however, that is not helpful since it is not the root cause of the failure. Adding logic to log this issue as a warning in pytest's log file and continue with providing results to pytest itself. Closes scylladb/scylladb#24307 * github.com:scylladb/scylladb: test.py: enhance boost_facade missing log file handling test.py: switch using f-string instead format in facades	2025-06-03 14:03:07 +02:00
Pavel Emelyanov	96029c7c93	Update seastar submodule * seastar d7ff58f2...26badcb1 (22): > http/client: Skip HEAD reply body processing > httpd: Remove unused connection::_req member > httpd: Don't write body for HEAD replies > http: Move trailing chunk write into reply.cc > http_client: Add ECONNRESET to retryable errors > stall_detector: no backtrace if exception > http: Add test for "aborted" client > http: in the client, fix malforming of requests with zero-sized bodies > http: Track bytes read from a response > http: Add test for improper client handling of aborted requests > aio_storage_context: Rename iocb_pool::_iocb_pool to _all_iocbs > resource: Add some debug-level logging to memory allocation > resource: Rework sysconf memory fallback > resource: Indentation fix after previous patch > resource: Calculate available memory from NUMA nodes > resource: Move NUMA nodes vector evaluation up > reactor: Drop _reuseport boolean > reactor: Simplify network stack creation and initialization > reactor: Remove write-only _thread_id > reactor: Keep task-queues in std::array instead of static_vector > reactor: Mark _id and task_queue::_id const > memory: Report oversized alloc count as metric scylla-gdb update included: The reactor::_task_queues can be std::array or unique ptrs. Also check the tq_ptr for being nullptr, as array doesn't have "size" only "capacity" and can have non-registered groups. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24294	2025-06-03 13:47:05 +03:00
Nadav Har'El	e32559758a	test/alternator: any response from healthcheck means server is alive In the Alternator tests we check (in dynamodb_test_connect()) after every test that the server is still alive, so we can blame the test that just ran if it crashes the server. We check the server's health using a simple GET response, which works on both DynamoDB and Alternator, e.g., ``` $ curl http://dynamodb.us-east-2.amazonaws.com/ healthy: dynamodb.us-east-2.amazonaws.com ``` However, it turns out that new versions of DynamoDB Local - Amazon's local mock of DynamoDB, for some reason insists that all requests - including this health check - must be signed, so our unsigned health request is rejected with error 400, saying the request must be signed. So the current code, which insists that the response have error code 200, fails and the test incorrectly thinks that DynamoDB Local crashed during the test. The fix is trivial: Just don't check that the error code is 200. Any HTTP response from the server means it is still alive! If the server is not alive, we will get an exception, not any HTTP response, and this will lead the code to the "server has crashed" case. Refs #7775 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-06-03 12:25:51 +03:00
Nadav Har'El	9732545958	test/alternator: fall back to legal-looking access key id When the Alternator tests run against Scylla, they figure out (using CQL) the correct username and password needed to connect. When it can't, we fell back to some silly pair 'unknown_user', 'unknown_secret', assuming that the server won't check it anyway. It turns out that if we want to run tests against a new version of DynamoDB Local (Amazon's local mock of DynamoDB), it indeed doesn't check authentication, but starting in DynamoDB Local 2.0, it does check that the access key ID (the username) itself is valid, and considers "unknown_user" to be invalid because it contains an underscore - AWS_ACCESS_KEY_ID must only contain letters and numbers. See https://repost.aws/articles/ARc4hEkF9CRgOrw8kSMe6CwQ/ for Amazon's explanation for this change in DynamoDB Local 2. The trivial fix is to remove the underscore from the silly username. After this patch, Alternator tests can connect to DynamoDB Local. They still can't complete correctly - this will be fixed in the next patch. Refs #7775 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-06-03 12:25:51 +03:00
Evgeniy Naydanov	f0d283afd7	test.py: cql: don't exit from pytest session on failed CQL There is a fixture in `test/cql/conftest.py` which checks the CQL connection after each test and exits from the pytest session if the connection failed. For CQL tests it makes no difference whether to use pytest.exit() or pytest.fail(), because tests are executed one-by-one in separate pytest sessions. Change it to pytest.fail() for future integration into a single pytest session.	2025-06-03 07:54:51 +00:00
Evgeniy Naydanov	cdc4b520da	test.py: cql: run tests using bare pytest command Create a custom pytest test collector for .cql files and move CQL test execution logic from `CQLApprovalTest` class and `pylib/cql_repl/cql_repl.py` file to `CqlTest.runtest()` method. In result, the only difference between CQLApproval and Python suite types is suffixes of test files.	2025-06-03 07:54:51 +00:00
Evgeniy Naydanov	0fba0df4f6	test.py: python: set test.id according to --run_id argument test.py uses `Test.id` attribute to distinguish repeated tests in one run and pass it as `--run_id` CLI argument to pytest. Use this argument to set the test's `id` attribute inside pytest session to fix problem with paths to some test artifacts.	2025-06-03 07:54:51 +00:00
Michał Chojnowski	ea4d251ad2	compress: fix a use-after-free in `dictionary_holder::get_recommended_dict()` The function calls copy() on a foreign_ptr (stored in a map) which can be destroyed (erased from the map) before the copy() completes. This is illegal. One way to fix this would be to apply an rwlock to the map. Another way is to wrap the `foreign_ptr` in a `lw_shared_ptr` and extend its lifetime over the `copy()` call. This patch does the latter. Fixes scylladb/scylladb#24165 Fixes scylladb/scylladb#24174 Closes scylladb/scylladb#24175	2025-06-03 10:42:38 +03:00
Piotr Dulikowski	f5b18d275b	Merge 'test/boost: Adjust tests to RF-rack-valid keyspaces' from Dawid Mędrek This PR adjusts existing Boost tests so they respect the invariant introduced by enabling `rf_rack_valid_keyspaces` configuration option. We disable it explicitly in more problematic tests. After that, we enable the option by default in the whole test suite. Fixes scylladb/scylladb#23958 Backport: backporting to 2025.1 and 2025.2 to be able to test the implementation there too. Closes scylladb/scylladb#23802 * github.com:scylladb/scylladb: test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity	2025-06-03 08:43:34 +02:00
Evgeniy Naydanov	ac972231fa	test.py: python: pass --tmpdir from test.py to all Python tests `--tmpdir` CLI argument is used to point to the directory with logs and other test artifacts. It has default values both in test.py and pytest (`test/conftest.py`). These values are the same. But for non-default values it's required to pass it from test.py to pytest explicitly. This is done for Topology tests, but not for all Python test suites. The commit fixes the problem by adding the argument in the `_prepare_pytest_command()` method of the base `PythonTest` class.	2025-06-03 05:45:05 +00:00
Evgeniy Naydanov	17401aaf31	test.py: remove dead code after removing of write_junit_report() There is `write_junit_failure_report()` method in Test class which was used to generate a JUnitXML report. But it became a dead code after removal of `write_junit_report()` function in `1e1d213592` to avoid duplication of error reporting in Jenkins (see #23220.) This commit removes this method and all its implementations in subclasses.	2025-06-03 02:28:41 +00:00
Pavel Emelyanov	eb5160cb4d	api: Drop no longer used container_to_vec helper All callers are patched to use std::ranges. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-06-02 20:09:58 +03:00
Pavel Emelyanov	f6afc02951	api: Use std::ranges to stringify collections There are several endpoints that have collection of objects at hand and want a vector of corresponding strings. Use std::ranges library for conversion. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-06-02 20:09:56 +03:00
Pavel Emelyanov	b943902ff7	api: Use std::ranges to convert std::set<sstring> to std::vector<string> The column_family/get_sstables_for_key endpoint collects a set of sstable names and converts it to vector of strings using homebrew helper. The std::ranges convertor works just as nice. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-06-02 20:09:28 +03:00
Pavel Emelyanov	6809ab5198	api: Use db::config::data_file_directories()' vector directly The return value is std::vector<sstring>, there's no need to additionally convert it to std::vector<sstring>. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-06-02 20:06:33 +03:00
Pavel Emelyanov	06ee60c238	api: Coroutinize get_live_endpoint() To be symmetrical with its get_down_endpoint() peer and to make further patching simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-06-02 19:52:55 +03:00
Michał Chojnowski	dd878505ca	test: add test_sstable_compression_dictionaries_upgrade.py	2025-06-02 15:49:29 +02:00
Michał Chojnowski	d3cb873532	test.py: add --run-internet-dependent-tests Later, we will add upgrade tests, which need to download the previous release of Scylla from the internet. Internet access is a major dependency, so we want to make those tests opt-in for now.	2025-06-02 15:49:29 +02:00
Michał Chojnowski	5da19ff6a6	pylib/manager_client: add server_switch_executable Add an util for switching the Scylla executable during the test. Will be used for upgrade tests.	2025-06-02 15:03:08 +02:00
Michał Chojnowski	1ff7e09edc	test/pylib: in add_server, give a way to specify the executable and version-specific config This will be used for upgrade tests. The cluster will be started with an older executable and without configs specific to newer versions.	2025-06-02 15:03:08 +02:00
Michał Chojnowski	2ef0db0a6b	pylib: pass scylla_env environment variables to the topology suite I want to add an upgrade test under the topology suite. To work, it will have to know the path to the tested Scylla executable, so that it can switch the nodes to it. The path could be passed by various means and I'm not sure which method is appropriate. In some other places (e.g. the cql suite) we pass the path via the `SCYLLA` environment variable and this patch follows that example. `PythonTestSuite` (parent class of `TopologySuite`) already has that variable set in `self.scylla_env`, and passes it around. However, `TopologySuite` uses its own `run()`, and so it implicitly overrides the decision to pass `self.scylla_env` down. This patch changes that, and after the patch we apply the `self.scylla_env` to the environment for topology tests. This might have some unforeseen side effects for coverage measurement, because AFAICS the (only) other variable in `self.scylla_env` is `LLVM_PROFILE_FILE`. But topology tests don't run Scylla executables themselves (they only send commands to the cluster manager started externally), so I figure there should be no change.	2025-06-02 15:03:08 +02:00
Michał Chojnowski	34098fbd1f	test/pylib: add get_scylla_2025_1_executable() Adds a function which downloads and installs (in `~/.cache`) the Scylla 2025.1, for upgrade tests. Note: this introduces an internet dependency into pylib, AFAIK the first one. We already have some other code for downloading existing Scylla releases, written for different purposes, in `cqlpy/fetch_scylla.py`. I made zero effort to reuse that in any way. Note: hardcoding the package version might be uncool, but if we want "better" version selection (e.g. the newest patch version in the given branch), we should have a separate library (or web service) for that, and share it with CCM/SCT. If we add a separate automatic version selection mechanism here, we are going to end up with yet another half-broken Scylla version selector, with yet different syntax and semantics than the other ones. We never clear the downloaded and unpacked files. This could become a problem in the future. (At which point we can add some mechanism that deletes cached archives downloaded more than a week ago.)	2025-06-02 15:03:08 +02:00
Michał Chojnowski	cc7432888e	pylib/scylla_cluster: give a way to pass executable-specific options to nodes I'm trying to adapt pylib to multi-version tests. (Where the Scylla cluster is upgraded to a newer Scylla version during the test). Before this patch, the initial config (where "config" == yaml file + CLI args) of the nodes is hardcoded in scylla_cluster.py. The problem is that this config might not apply to past versions, so we need some way to give them a different config. (For example, with the config as it is before the patch, a Scylla 2025.1 executable would not boot up because it does not know the `group0_voter_handler` logger). In this patch, we create a way to attach version-specific config to the executable passed to ScyllaServer.	2025-06-02 15:03:08 +02:00
Michał Chojnowski	63218bb094	dbuild: mount "$XDG_CACHE_HOME/scylladb" We will use it to keep a cache of artifact downloads for upgrade tests, across dbuild invocations.	2025-06-02 15:03:08 +02:00
Andrei Chekun	738cbc07b5	test.py: enhance boost_facade missing log file handling If the log file is absent or empty, the test fails with an error regarding a missing boost log file; however, that is not helpful since it is not the root cause of the failure. Adding logic to log this issue as a warning in pytest's log file and continue with providing results to pytest itself.	2025-06-02 12:17:10 +02:00
Andrei Chekun	5f6740c1fa	test.py: switch using f-string instead format in facades Switching to f-string formatting to simplify the code and to unify it with a general approach for formatting strings.	2025-06-02 12:16:47 +02:00
Pavel Emelyanov	7fef2c4f61	Merge 'test.py: fix metrics gathering' from Andrei Chekun Move of the run_process done in https://github.com/scylladb/scylladb/pull/24091 was not fully correct. The method run_process was not overridden in the class ResourceGatherOn, so no metrics are collected at all. Additionally, fix metrics DB location second time. Closes scylladb/scylladb#24306 * github.com:scylladb/scylladb: test.py: fix metrics DB location test.py: fix the possibility to gather resource metrics for test	2025-06-02 13:12:42 +03:00
Botond Dénes	e82b0dff3e	Merge 'Move mutation_fragment_v2::kind into mutation_fragment_v2::data, mutation_fragment::kind into mutation_fragment::data' from Radosław Cybulski Move mutation_fragment_v2::kind field into mutation_fragment_v2::data. Move mutation_fragment::kind field into mutation_fragment::data. In both cases the move reduces size of the object by half (to 8 bytes). On top of the test suite, this patch was tested manually. First, patched scylla was run. A keyspace and a table were created, with columns TEXT, INT, DOUBLE, BOOLEAN and TIMESTAMP. One row was inserted, `select ` was executed to make sure it's there. Then scylla was terminated and non-patched scylla was run, another row was inserted and `select ` was run to verify both rows exist. After this, patched scylla was started again, a third row was inserted and a final `select ` was done to verify all three rows are there. This is a partial fix to the https://github.com/scylladb/scylla-enterprise/issues/5288 issue. Closes scylladb/scylladb#23452 github.com:scylladb/scylladb: Move mutation_fragment::kind into data object Make mutation_fragment::kind enum 1 byte size Move mutation_fragment_v2::kind into data object Make mutation_fragment_v2::kind enum 1 byte size	2025-06-02 10:57:17 +03:00
Evgeniy Naydanov	e780164a67	test.py: dtest: make auth_roles_test.py run using test.py As a part of the porting process, remove unused imports and markers, remove non-next_gating tests, and code for old ScyllaDB versions. Enable the test in suite.yaml (run in dev mode only)	2025-06-02 05:14:41 +00:00
Evgeniy Naydanov	145c2fed97	test.py: dtest: add wait_for_any_log() to tools/log_utils.py Copy wait_for_any_log() function from dtest tools/log_utils.py with few modifications: - Add type hints; - Change timeout for node.watch_log_for() calls from 0 to 0.1 because dtest shim's implementation uses asyncio.timeout() and 0 means not "one time" but "never run"; - Use set() instead of list() for `ret` variable; - Remove redundant `found` variable. - Remove `remaining` variable and use shallow copies to make the code more correct. As a side effect this makes the TimeoutError message more correct too; - Use f-string formatting for TimeoutError message;	2025-06-02 05:14:41 +00:00
Evgeniy Naydanov	ff2aea7e5b	test.py: dtest: add part of tools/assertions.py Copy few assertion functions from dtest tools/assertions.py: - assertion_exception() - assertion_invalid() - assertion_one() - assertion_all()	2025-06-02 05:14:41 +00:00
Evgeniy Naydanov	9d70b6307b	test.py: dtest: pickup latest code for retrying.py from dtest Sync retrying.py with dtest.	2025-06-02 05:14:41 +00:00
Evgeniy Naydanov	40464faef3	test.py: dtest: copy unmodified auth_roles_test.py The test is disabled in suite.yaml	2025-06-02 05:14:41 +00:00
Jenkins Promoter	7d562c24b1	Update pgo profiles - aarch64	2025-06-01 04:45:06 +03:00
Jenkins Promoter	75cf16afa2	Update pgo profiles - x86_64	2025-06-01 04:31:56 +03:00
Botond Dénes	c52aec3d2f	Merge 'tablets: fix missing data after tablet merge ' from Raphael Raph Carvalho Consider the following scenario: 1) let's assume tablet 0 has range [1, 5] (pre merge) 2) tablet merge happens, tablet 0 has now range [1, 10] 3) tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5] 4) during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time 5) replica service is asked to consume range [1, 10] of tablet 0 (post merge) We have two possible outcomes: With cache bypass: 1) cache reader is bypassed 2) sstable reader is created on range [1, 10] 3) unrefreshed tablet_sstable_set holds stale state, but correctly selects all sstables intersecting with range [1, 10] With cache: 1) cache reader is created 2) finds partition with token 5 is cached 3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0) 4) incremental selector consumes the pre-merge sstable spanning range [1, 5] 4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached 4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed. So with the set refreshed, the sstable set is aligned with tablet ranges and no premature EOS is signalled; a premature EOS would otherwise prevent the fast forward from happening and all data from being properly captured in the read. This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets. Additionally, includes a fix for range reads that span more than one tablet, which can happen during split execution. Fixes: https://github.com/scylladb/scylladb/issues/23313 This change needs to be backported to all supported versions which implement tablet merge. Closes scylladb/scylladb#24287 * github.com:scylladb/scylladb: replica: Fix range reads spanning sibling tablets test: add reproducer and test for mutation source refresh after merge tablets: trigger mutation source refresh on tablet count change	2025-05-30 15:37:29 +03:00
Anna Stuchlik	28cb5a1e02	doc: add OS support for ScyllaDB 2025.2 This commit adds the information about support for platforms in ScyllaDB version 2025.2. Fixes https://github.com/scylladb/scylladb/issues/24180 Closes scylladb/scylladb#24263	2025-05-30 12:23:59 +03:00
Calle Wilund	942477ecd9	encryption/utils: Move encryption httpclient to "general" REST client Fixed #24296 While the HTTP client used for REST calls in AWS/GCP KMS integration (EAR) is not general enough to be called a HTTP client as such, it is general enough to be called a REST client (limited to stateless, single-op REST calls). Other code, like general auth integrations (hello Azure) and similar could reuse this to lessen code duplication. This patch simply moves the httpclient class from encryption to "rest" namespace, and explicitly "limits" it to such usage. Making an alias in encryption to avoid touching more files than needed. Closes scylladb/scylladb#24297	2025-05-30 12:21:51 +03:00
Pavel Emelyanov	a65ffdd0df	test/result_utils: Do not assume map_reduce reducing order When map_reduce is called on a collection, one shouldn't expect that it processes the elements of the collection in any specific order. Current test of map-reduce over boost outcome assumes that if reduce function is the string concatenation, then it would concatenate the given vector of strings in the order they are listed. That requirement should be relaxed, and the result may have reversed concatenation. Fixes scylladb/scylladb#24321 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24325	2025-05-30 09:38:59 +02:00
Michael Litvak	3a1be33143	test_cdc_generation_publishing: fix to read monotonically The test test_multiple_unpublished_cdc_generations reads the CDC generation timestamps to verify they are published in the correct order. To do so it issues reads in a loop with a short sleep period and checks the differences between consecutive reads, assuming they are monotonic. However the assumption that the reads are monotonic is not valid, because the reads are issued with consistency_level=ONE, thus we may read timestamps {A,B} from some node, then read timestamps {A} from another node that didn't apply the write of the new timestamp B yet. This will trigger the assert in the test and fail. To ensure the reads are monotonic we change the test to use consistency level ALL for the reads. Fixes scylladb/scylladb#24262 Closes scylladb/scylladb#24272	2025-05-30 08:35:56 +02:00
Pavel Emelyanov	086777e5de	Merge 'test.py: python: run tests using bare pytest command' from Evgeniy Naydanov Main change is splitting logic of `PythonTest.run()` method into `PythonTest.run_ctx()` context manager and `PythonTest.run()` method itself and add the `host` fixture which uses `PythonTest.run_ctx()` context manager to setup and teardown ScyllaDB node if `--test-py-init` argument is used. Otherwise, this fixture returns a value of `--host` CLI argument. Use dynamic scope provided by `testpy_test_fixture_scope()` function instead of `session` to maintain compatibility with `test.py` and `./run` scripts. Other related changes: * Add utility `get_testpy_test()` function to `pylib.suite.base` which combines all required steps to create an instance of `Test` class and rework `testpy_test` fixture to use it. * Switch to use dynamic fixture scope controlled by `--test-py-init` CLI argument to improve compatibility with test.py. And because in test.py mode the scope is `session`, also change default event loop scope to `session`. * Convert `get_valid_alternator_role()` to fixture to have more control on the scope of the cache used. Additionally, function `new_dynamodb_session()` was also converted to a fixture, because it uses `get_valid_alternator_role()`. * Replace dups of `cql` and `this_dc` fixtures in `rest_api` and `pylib/cql_repl` with imports from `cqlpy`. * Change `build_mode` fixture to return "unknown" if no --mode arguments provided (this is mainly for alternator and cqlpy tests) * Create a parent directory for a test log file just before opening this file in `run_test()` function instead of having this as a side effect in `Test.__init__()`. 
And changes that remove pytest CLI argument duplicates to be able to run tests from different test suites in one pytest session: * Add 3 supplementary functions to `test.pylib.suite.python`: `add_host_option()` (which adds `--host` options to pytest session), `add_cql_connection_options()` (which adds `--port`, and `--ssl`), and `--add-s3-options` (which adds options related to S3 connection.) Each function decorated with `@cache` decorator to be executed once per pytest session and avoid CLI options duplication for runs which execute `alternator`, `cqlpy`, `rest_api`, or `broadcast_tables` in one pytest session. * Move `--auth_username` and `--auth_password` options from `cluster/conftest.py` to add_scylla_cql_connection_options() and slightly rework `cql` fixture to support these options. * Remove `--input`, `--output`, and `--keep-tmp` pytest CLI options from `cluster/object_store/conftest.py` because they are not used in this suite. * Remove `--omit-scylla-output` CLI option from pytest argparser. Instead, remove it from `sys.argv` in `cqlpy/run.py`. Also, no need to check this option in `alternator/run`. Closes scylladb/scylladb#23849 * github.com:scylladb/scylladb: test.py: python: run tests using bare pytest command test.py: rework testpy_test fixture test.py: alternator: convert get_valid_alternator_role() to fixture test.py: python: split logic of PythonTest.run() test.py: add credentials options to add_cql_connection_options() test.py: python: remove dups of cql and this_dc fixtures test.py: remove duplication of pytest CLI options test.py: remove unused CLI options test.py: remove `--omit-scylla-output` from pytest argparser test.py: set build_mode to "unknown" if no --mode argument test.py: create directory for test log in run_test()	2025-05-30 08:48:43 +03:00
Botond Dénes	7db956965e	mutation/mutation_compactor: cache regular/shadowable max-purgable in separate members Max purgeable has two possible values for each partition: one for regular tombstones and one for shadowable ones. Yet currently a single member is used to cache the max-purgeable value for the partition, so whichever kind of tombstone is checked first, its max-purgeable will become sticky and apply to the other kind of tombstones too. E.g. if the first can_gc() check is for a regular tombstone, its max-purgeable will apply to shadowable tombstones in the partition too, meaning they might not be purged, even though they are purgeable, as the shadowable max-purgeable is expected to be more lenient. The other way around is worse, as it will result in regular tombstone being incorrectly purged, permitted by the more lenient shadowable tombstone max-purgeable. Fix this by caching the two possible values in two separate members. A reproducer unit test is also added. Fixes: scylladb/scylladb#23272 Closes scylladb/scylladb#24171	2025-05-29 22:52:08 +03:00
Avi Kivity	f0ec9dd8f2	Merge 'utils/logalloc: enforce the max contiguous allocation size limit' from Michał Chojnowski This series fixes the only known violation of logalloc's allocation size limits (in `chunked_managed_vector`), and then it makes those limits hard. Before the series, LSA handles overly-large allocations by forwarding them to the standard allocator. After the series, an attempt to do an overly large allocation via LSA will trigger an `on_internal_error` instead. We do this because the allocator fallback logic turned out to have subtle and problematic accounting bugs. We could fix them, or we can remove the mechanism altogether. It's hard to say which choice is better. This PR arbitrarily makes the choice to remove the mechanism. This makes the logic simpler, at the risk of escalating some allocation size bugs to crashes. See the descriptions of individual commits for more details. Fixes scylladb/scylladb#23850 Fixes scylladb/scylladb#23851 Fixes scylladb/scylladb#23854 I'm not sure if any of this should be backported or not. The `chunked_managed_vector` fix could be backported, because it's a bugfix. It's an old bug, though, and we have never observed problems related to it. The changes to `logalloc` aren't supposed to be fixing any observable problem, so a backport probably has more risk than benefit in this case. Closes scylladb/scylladb#23944 * github.com:scylladb/scylladb: utils/logalloc: enforce LSA allocation size limits utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity()	2025-05-29 22:11:41 +03:00
Szymon Malewski	18d237a393	alternator/executor: Added checks in `batch_write_item` This patch adds checks validating 'BatchWriteItem' requests mostly to avoid ugly fallback message. It changes request's behaviour in case of an empty array of WriteRequests - previously such an array was ignored and whole request might succeed, now it raises ValidationException, following the documentation and behaviour of DynamoDB. Patch includes tests in test_manual_requests (`test_batch_write_item_invalid_payload`, `test_batch_write_item_empty_request_list`) testing with several offending cases. Fixes #23233 Closes scylladb/scylladb#23878	2025-05-29 20:33:57 +03:00
Patryk Jędrzejczak	c21692f3a6	Merge 'token_range_vector: fragment' from Avi Kivity token_range_vector is a sequence of intervals of tokens. It is used to describe vnodes or token ranges owned by shards. Since tokens are bloated (16 bytes instead of 8), and intervals are bloated (40 byte of overhead instead of 8), and since we have plenty of token ranges, such vectors can exceed our allocation unit of 128 kB and cause allocation stalls. This series fixes that by first generalizing some helpers and then changing token_range_vector to use chunked_vector. Although this touches IDL, there is no compatibility problem since the encoding for vector and chunked_vector are identical. There is no performance concern since token_range_vector is never used on any hot path (hot paths always contain a partition key). Fixes #3335. Fixes #24115. No backport: minor performance fix that isn't a regression. Closes scylladb/scylladb#24205 * https://github.com/scylladb/scylladb: dht: fragment token_range_vector partition_range_compat: generalize wrap/unwrap helpers	2025-05-29 18:45:13 +02:00
Robert Bindar	c570941692	Add nodetool refresh --scope option This change adds the --scope option to nodetool refresh. Like in the case of nodetool restore, you can pass either of: * node - On the local node. * rack - On the local rack. * dc - In the datacenter (DC) where the local node lives. * all (default) - Everywhere across the cluster. as scope. The feature is based on the existing load_and_stream paths, so it requires passing --load-and-stream to the refresh command. Also, it is not compatible with the --primary-replica-only option. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#23861	2025-05-29 16:12:09 +03:00
Evgeniy Naydanov	0ee0e3f14d	test.py: python: run tests using bare pytest command Add the `host` fixture which uses `PythonTest.run_ctx()` context manager to setup and teardown ScyllaDB node if `--test-py-init` argument is used. Otherwise, this fixture returns a value of `--host` CLI argument. Use dynamic scope provided by `testpy_test_fixture_scope()` function instead of `session` to maintain compatibility with test.py and ./run scripts.	2025-05-29 12:33:41 +00:00
Evgeniy Naydanov	b67048f3ee	test.py: rework testpy_test fixture Add utility `get_testpy_test()` function to `pylib.suite.base` which combines all required steps to create an instance of `Test` class. Remove redundant `testpy_testsuite` fixture. Switch to use dynamic fixture scope controlled by `--test-py-init` CLI argument to improve compatibility with test.py. And because in test.py mode the scope is `session`, also change default event loop scope to `session`. The fixture is None for test.py mode. test.py runs tests file-by-file as separate pytest sessions, so, `session` scope is effectively close to be the same as `module` (can be a difference in the order.) In case of running tests with bare pytest command, we need to use `module` scope to maintain same behavior as test.py, since we run all tests in one pytest session.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	b65cb517b8	test.py: alternator: convert get_valid_alternator_role() to fixture Convert `get_valid_alternator_role()` to fixture to have more control on the scope of the cache used. Additionally, function `new_dynamodb_session()` was also converted to a fixture, because it uses `get_valid_alternator_role()`.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	1f94a9c052	test.py: python: split logic of PythonTest.run() Split logic of `PythonTest.run()` method into `PythonTest.run_ctx()` context manager and `PythonTest.run()` method itself. Done this to reuse setup/teardown code with bare pytest command runs.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	27cbfc77fb	test.py: add credentials options to add_cql_connection_options() Move `--auth_username` and `--auth_password` options from `cluster/conftest.py` to add_cql_connection_options() and slightly rework `cql` fixture to support these options.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	2bba4acdea	test.py: python: remove dups of cql and this_dc fixtures Replace dups of `cql` and `this_dc` fixtures in `rest_api` and `pylib/cql_repl` with imports from `cqlpy`.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	6780461df8	test.py: remove duplication of pytest CLI options Add 3 supplementary functions to `test.pylib.suite.python`: `add_host_option()` (which adds `--host` options to pytest session), `add_cql_connection_options()` (which adds `--port`, and `--ssl`), and `--add-s3-options` (which adds options related to S3 connection.) Each function decorated with `@cache` decorator to be executed once per pytest session and avoid CLI options duplication for runs which execute `alternator`, `cqlpy`, `rest_api`, or `broadcast_tables` in one pytest session.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	056c5db829	test.py: remove unused CLI options Remove `--input`, `--output`, and `--keep-tmp` pytest CLI options from `cluster/object_store/conftest.py` because they are not used in this suite.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	b7b68355ef	test.py: remove `--omit-scylla-output` from pytest argparser Remove `--omit-scylla-output` CLI option from pytest argparser. Instead, remove it from `sys.argv` in `cqlpy/run.py`. Also, no need to check this option in `alternator/run`.	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	f262d4c323	test.py: set build_mode to "unknown" if no --mode argument Change `build_mode` fixture to return "unknown" if no --mode arguments provided (this is mainly for alternator and cqlpy tests)	2025-05-29 12:15:28 +00:00
Evgeniy Naydanov	30d542b8f1	test.py: create directory for test log in run_test() Create a parent directory for a test log file just before opening this file in `run_test()` function instead of having this as a side effect in `Test.__init__()`.	2025-05-29 12:15:28 +00:00
Piotr Dulikowski	c8d52a4318	Merge 'test.py: dtest: port bypass_cache_test.py' from Evgeniy Naydanov Copy bypass_cache_test.py from scylla-dtest test suite and make it work with test.py. As a part of the porting process, copy missed utility functions from scylla-dtest, remove unused imports and markers, and add missed `single_node` marker description to pytest.ini. Enable the test in suite.yaml (run in dev mode only.) Also add missed `ScyllaCluster.nodetool()` method in dtest shim code. Closes scylladb/scylladb#24230 * github.com:scylladb/scylladb: test.py: dtest: make bypass_cache_test.py run using test.py test.py: dtest: add missed ScyllaCluster.nodetool() test.py: dtest: copy unmodified bypass_cache_test.py	2025-05-29 13:48:10 +02:00
Michał Chojnowski	cb02d47b10	utils/logalloc: enforce LSA allocation size limits In order to guarantee a decent upper limit on fragmentation, LSA only handles allocations smaller than 0.1 of a segment. Allocations larger than this limit are permitted, but they are not placed in LSA segments. Instead, they are forwarded to the standard allocator. We don't really have any use case for this "fallback". As far as I can tell, it only exists for "historical" reasons, from times where there were some data structures which weren't fully adapted to LSA yet. We don't want the fallback to be used. Long-lived standard allocations are undesirable. They have higher internal fragmentation than LSA allocations, and they can cause external fragmentation in the standard allocator. So we want to eliminate them all. The only reason to keep the fallback is to soften the impact if some bug results in limit-exceeding LSA allocations happening in production. In principle, the fallback turns a crash (or something similarly drastic) into just a performance problem. However, it turns out that the fallback is buggy. Recently we had a bug which caused limit-exceeding LSA allocations to happen. And then it turned out that LSA reclaim doesn't deal fully correctly with evictable non-LSA allocations, and the dirty_memory_manager accounting for non-LSA allocations is completely wrong. This resulted in subtle, serious, and hard to understand stability problems in production. Arguably the biggest problem is that the "fallback" allocations weren't reported in any way. They were happening in some tests, but they were silently permitted, so nobody noticed that they should be eliminated. If we just had a rate-limited error log that reports fallback allocations, they would have never got into a release. So maybe we could fix the fallback, add more tests for it, add a warning for when it's used, and keep it. But this PR instead opts for removing the fallback mechanism altogether and failing fast. 
After the patch, if a non-conforming allocation happens, it will trigger an `on_internal_error`. With this, we risk a greater impact if some non-conforming allocations happen in production, but we make the system simpler. It's hard to say if it's a good tradeoff.	2025-05-29 13:05:08 +02:00
Piotr Dulikowski	555925c66b	Merge 'generic_server: transport: improve stats counting and shedding' from Marcin Maliszkiewicz The patch removes connection advertising functions and moves the logic to constructors and destructors, providing a more robust way of counting connections. This change was also necessary to allow skipping the connection process function during shedding, as the active connections counter needs to be decremented. The patch doesn't fix any active bug, just improves the flow. Backport: none, it's a cosmetic change Closes scylladb/scylladb#23890 * github.com:scylladb/scylladb: generic_server: make shutdown() return void generic_server: skip connection processing logic after shedding the connection transport: generic_server: remove no longer used connection advertising code transport: move new connection trace logs into connection class ctor/dtor transport: move cql connections counting into connection class ctor/dtor	2025-05-29 12:49:58 +02:00
Avi Kivity	c00824c7df	Merge 'transport: Implement SCYLLA_USE_METADATA_ID support' from Andrzej Jackowski Metadata id was introduced in CQLv5 to make metadata of prepared statement metadata consistent between driver and database. This commit introduces a protocol extension that allows to use the same mechanism in CQLv4. As CQLv5 is currently unsupported in ScyllaDb (as well as in some of the drivers), the motivation is to allow fixing https://github.com/scylladb/scylladb/issues/20860. This change: - Implement metadata::calculate_metadata_id() - Implement SCYLLA_USE_METADATA_ID protocol extension for CQLv4 - Added description of SCYLLA_USE_METADATA_ID in documentation - Add boost tests to confirm correctness of the function - Add python tests for table metadata change corner-cases Fixes scylladb/scylladb#20860 Also see related https://scylladb.atlassian.net/wiki/spaces/RND/pages/42238631/MetadataId+extension+in+CQLv4+Requirement+Document No backport needed (unless specifically requested by a customer), because there are existing workarounds for the issue Closes scylladb/scylladb#23292 * github.com:scylladb/scylladb: test: add tests for prepared statement metadata consistency corner cases transport: implement SCYLLA_USE_METADATA_ID support cql3: implement metadata::calculate_metadata_id()	2025-05-29 12:27:31 +03:00
Andrei Chekun	0c5676ffb4	test.py: fix metrics DB location This was already fixed, but unintentionally during rebases it was reverted and merged to master in the same PR.	2025-05-28 20:13:38 +02:00
Andrei Chekun	6e92791538	test.py: fix the possibility to gather resource metrics for test Move of the run_process done in #24091 was not fully correct. The method run_process was not overridden in the class ResourceGatherOn, so no metrics are collected at all.	2025-05-28 20:13:31 +02:00
Ran Regev	37854acc92	changed the string literals into the correct ones Fixes: #23970 use correct string literals: KMIP_TAG_CRYPTOGRAPHIC_LENGTH_STR --> KMIP_TAGSTR_CRYPTOGRAPHIC_LENGTH KMIP_TAG_CRYPTOGRAPHIC_USAGE_MASK_STR --> KMIP_TAGSTR_CRYPTOGRAPHIC_USAGE_MASK From https://github.com/scylladb/scylladb/issues/23970 description of the problem (emphasizes are mine): When transparent data encryption at rest is enabled with KMIP as a key provider, the observation is that before creating a new key, Scylla tries to locate an existing key with provided specifications (key algorithm & length), with the intention to re-use existing key, but the attributes sent in the request have minor spelling mistakes which are rejected by the KMIP server key provider, and hence scylla assumes that a key with these specifications doesn't exist, and creates a new key in the KMIP server. The issue here is that for every new table, ScyllaDB will create a key in the KMIP server, which could clutter the KMS, and make key lifecycle management difficult for DBAs. Closes scylladb/scylladb#24057	2025-05-28 13:52:30 +03:00
Pavel Emelyanov	2eed2e94ea	sstables_loader: Extend logging with recently added skip-cleanup When starting, the loader prints all its arguments into logs. Recently added skip-cleanup one is not included, but it's good to have one too. refs: #24139 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24206	2025-05-28 11:20:27 +03:00
David Garcia	9542bfd2b1	docs: enable ai chatbot docs: enable ai chatbot Closes scylladb/scylladb#24286	2025-05-28 11:04:25 +03:00
Yaron Kaikov	0831931fec	.github/workflows/conflict_reminder: reduce the amount of conflict reminder for every push event In order to avoid spamming PR author about conflicts, added a logic to verify during push events, that in case PR is already in draft mode, we will check when was the last notification, if it's less than 3 days, we will skip it Closes scylladb/scylladb#24289	2025-05-28 11:01:44 +03:00
Nadav Har'El	61581d458e	Merge 'vector_index: add custom index class' from Michał Hudobski This PR adds a class that allows for validation (and in the future creating and querying) of custom indexes and implements it for vector indexes. Currently custom vector_index creation runs a usual index creation process. This PR does not change that, however it adds validation of the parameters that need to have certain values for the actual creation of the vector index in the future. The only thing left for the vector_index feature to work as intended should be the integration with the Vector Store service. This is a continuation of https://github.com/scylladb/scylladb/pull/23720 Refs: [VS-55 ](https://scylladb.atlassian.net/browse/VS-55) (Support setting index parameters and similarity function in CREATE INDEX) Fixes: [VS-13](https://scylladb.atlassian.net/browse/VS-13) (Validate that the base type is numeric when creating the vector index) [VS-13]: https://scylladb.atlassian.net/browse/VS-13?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#24212 * github.com:scylladb/scylladb: test/cqlpy: remove xfail and add more vector tests vector_index: allow options when custom class is provided vector_index: add custom index and vector index classes	2025-05-28 10:42:29 +03:00
Raphael S. Carvalho	53df911145	replica: Fix range reads spanning sibling tablets We don't guarantee that coordinators will only emit range reads that span only one tablet. Consider this scenario: 1) split is about to be finalized, barrier is executed, completes. 2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet) 3) split is committed to group0, all replicas switch storage. 4) replica-side read is executed, uses a range which spans tablets. We could fix it with two-phase split execution. Rather than pushing the complexity to higher levels, let's fix incremental selector which should be able to serve all the tokens owned by a given shard. During split execution, either of sibling tablets aren't going anywhere since it runs with state machine locked, so a single read spanning both sibling tablets works as long as the selector works across tablet boundaries. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-05-27 22:39:40 -03:00
Michał Hudobski	195e6a82de	test/cqlpy: remove xfail and add more vector tests We have added validation for options and type of column for vector indexes. This commit adds tests for that validation.	2025-05-27 21:04:50 +02:00
Michał Hudobski	7a2b0179e8	vector_index: allow options when custom class is provided We have changed the validation for the custom index to not require the CUSTOM keyword when creating the index, only the custom class. Now we change the validation for options so that they match.	2025-05-27 21:04:50 +02:00
Michał Hudobski	3ab643a5de	vector_index: add custom index and vector index classes In this patch we add an abstract class, "custom_index", with a validate() method. Each CUSTOM INDEX class needs to implement a concrete subclass of custom_index which is used to validate if this type of custom index class may be used, and whether the optional parameters passed to it are valid. We change the existing CUSTOM INDEX validation code to use this new mechanism. Finally this patch implements one concrete subclass for vector index. Before this patch, the custom index type "vector_index" was allowed, but after this patch it gains more validation of its optional parameters (we support 4 specific parameters, with some rules on their values). Of course, the vector index isn't actually implemented in this patch, we are just improving the validation of the index creation statement.	2025-05-27 21:04:50 +02:00
Marcin Maliszkiewicz	7f057af1f2	replica: make non-preemptive keyspace create/update/delete functions public As those operations will be managed by schema_applier class. This will be implemented in following commit.	2025-05-27 20:01:35 +02:00
Marcin Maliszkiewicz	2daa630938	replica: split update keyspace into two phases - first phase is preemptive (prepare_update_keyspace) - second phase is non-preemptive (update_keyspace) This is done so that schema change can be applied atomically. Additionally, create keyspace code was changed to share common part with update keyspace flow. This commit doesn't yet change the behaviour of the code, as it doesn't guarantee atomicity, it will be done in following commits.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	fe0f4033ca	replica: split creating keyspace into two functions This is done so that in following commits insert_keyspace can be used to atomically change schema (as it doesn't yield).	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	aceb1f9659	db: rename create_keyspace_from_schema_partition It only creates keyspace metadata.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	f8fe51640a	db: decouple functions and aggregates schema change notification from merging code	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	52069d954f	db: store functions and aggregates change batch in schema_applier To be used in following commit.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	5fff3097a5	db: decouple tables and views schema change notifications from merging code As post_commit() can't be fully implemented at this stage, it was moved to interim place to keep things working. It will be moved back later.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	6f8579e242	db: store tables and views schema diff in schema_applier It will be used in subsequent commit for moving notifications code.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	b74c1e9ae4	db: decouple user type schema change notifications from types merging code Merging types code now returns generic affected_types structure which is used both for notifications and dropping types. New static function drop_types() replaces dropping lambda used before. While I think it's not necessary for dropping nor notifications to use per shard copies (like it's using before and after this patch) it could just use string parameters or something similar but this requires too many changes in other classes so it's out of scope here.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	3a95edd0d7	service: unify keyspace notification functions arguments Keyspace metadata is not used, only name is needed so we can remove those extra find_keyspace() calls. Moreover there is no need to copy the name.	2025-05-27 20:00:58 +02:00
Marcin Maliszkiewicz	d7202586ca	db: replica: decouple keyspace schema change notifications to a separate function In following commits we want to separate updating code from committing schema change (making it visible). Since notifications should be issued after change is visible we need to separate them and call after committing. In subsequent commits other notification types will be moved too. We change here order of notification calls with regards to rest of schema updating code. I.e. before keyspace notifications triggered before tables were updated, after the change they will trigger once everything is updated. There is no indication that notification listeners depend on this behaviour.	2025-05-27 19:59:47 +02:00
Marcin Maliszkiewicz	ddf9f7ae05	db: add class encapsulating schema merging This commit doesn't yet change how schema merging works but it prepares the ground for it. We split merging code into several functions. Main reasons for it are that: - We want to generalize and create some interface which each subsystem would use. - We need to pull mutation's apply() out of the code because raft will call it directly, and it will contain a mix of mutations from more than one subsystem. This is needed because we have the need to update multiple subsystems atomically (e.g. auth and schema during auto-grant when creating a table). In this commit do_merge_schema() code is split between prepare(), update(), commit(), post_commit(). The idea behind each of these phases is described in the comments. The last 2 phases are not yet implemented as it requires more code changes but adding schema_applier enclosing class will help to create some copied state in the future and implement commit() and post_commit() phases.	2025-05-27 19:33:02 +02:00
Marcin Maliszkiewicz	1eb580973c	generic_server: make shutdown() return void It's always immediately ready so no need to return future<>.	2025-05-27 19:31:09 +02:00
Marcin Maliszkiewicz	d76d1766ad	generic_server: skip connection processing logic after shedding the connection Since input and output descriptors are already closed at this point there is no need to call connection::process. This should make shedding use slightly less resources.	2025-05-27 19:31:09 +02:00
Marcin Maliszkiewicz	f7e5adaca3	transport: generic_server: remove no longer used connection advertising code	2025-05-27 19:31:09 +02:00
Marcin Maliszkiewicz	81f0e79dc0	transport: move new connection trace logs into connection class ctor/dtor This is a step towards replacing advertise_new_connection/unadvertise_connection by RAII which is less error prone. Advertising will be removed in subsequent commit.	2025-05-27 19:30:56 +02:00
Marcin Maliszkiewicz	371b959539	transport: move cql connections counting into connection class ctor/dtor This is a step towards replacing advertise_new_connection/unadvertise_connection by RAII which is less error prone. Advertising will be removed in subsequent commit.	2025-05-27 19:30:39 +02:00
Dawid Mędrek	c60035cbf6	test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default We've adjusted all of the Boost tests so they respect the invariant enforced by the `rf_rack_valid_keyspaces` configuration option, or explicitly disabled the option in those that turned out to be more problematic and will require more attention. Thanks to that, we can now enable it by default in the test suite.	2025-05-27 18:53:39 +02:00
Dawid Mędrek	237638f4d3	test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests Some of the tests in the file verify more subtle parts of the behavior of tablets and rely on topology layouts or using keyspaces that violate the invariant the `rf_rack_valid_keyspaces` configuration option is trying to enforce. Because of that, we explicitly disable the option to be able to enable it by default in the rest of the test suite in the following commit.	2025-05-27 18:53:36 +02:00
Anna Stuchlik	efce03ef43	doc: clarify RF increase issues for tablets vs. vnodes This commit updates the guidelines for increasing the Replication Factor depending on whether tablets are enabled or disabled. To present it in a clear way, I've reorganized the page. Fixes https://github.com/scylladb/scylladb/issues/23667 Closes scylladb/scylladb#24221	2025-05-27 17:47:50 +02:00
Dawid Mędrek	22d6c7e702	test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load	2025-05-27 16:01:14 +02:00
Dawid Mędrek	fa62f68a57	test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity We make sure that the keyspaces created in the test are always RF-rack-valid. To achieve that, we change how the test is performed. Before this commit, we first created a cluster and then ran the actual test logic multiple times. Each of those test cases created a keyspace with a random replication factor. That cannot work with `rf_rack_valid_keyspaces` set to true. We cannot modify the property file of a node (see commit: `eb5b52f598`), so once we set up the cluster, we cannot adjust its layout to work with another replication factor. To solve that issue, we also recreate the cluster in each test case. Now we choose the replication factor at random, create a cluster distributing nodes across as many racks as RF, and perform the rest of the logic. We perform it multiple times in a loop so that the test behaves as before these changes.	2025-05-27 15:52:38 +02:00
Dawid Mędrek	cd615c3ef7	test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity We distribute the nodes used in the test across two racks so we can run the test with `rf_rack_valid_keyspaces` set to true. We want to avoid cross-rack migrations and keep the test as realistic as possible. Since host3 is supposed to function as a new node in the cluster, we change the layout of it: now, host1 has 2 shards and resides in a separate rack. Most of the remaining test logic is preserved and behaves as before this commit. There is a slight difference in the tablet migrations. Before the commit, we were migrating a tablet between nodes of different shard counts. Now it's impossible because it would force us to migrate tablets between racks. However, since the test wants to simply verify that an ongoing migration doesn't interfere with load balancing and still leads to a perfect balance, that still happens: we explicitly migrate ONLY 1 tablet from host2 to host3, so to achieve the goal, one more tablet needs to be migrated, and we test that.	2025-05-27 15:41:27 +02:00
Ferenc Szili	1f9f724441	test: add reproducer and test for mutation source refresh after merge This change adds a reproducer and test for the fix where the local mutation source is not always refreshed after a tablet merge.	2025-05-27 15:18:36 +02:00
Ferenc Szili	d0329ca370	tablets: trigger mutation source refresh on tablet count change Consider the following scenario: - let's assume tablet 0 has range [1, 5] (pre merge) - tablet merge happens, tablet 0 has now range [1, 10] - tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5] - during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time - replica service is asked to consume range [1, 10] of tablet 0 (post merge) We have two possible outcomes: With cache bypass: 1) cache reader is bypassed 2) sstable reader is created on range [1, 10] 3) unrefreshed tablet_sstable_set holds stale state, but select correctly all sstables intersecting with range [1, 10] With cache: 1) cache reader is created 2) finds partition with token 5 is cached 3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0) 4) incremental selector consumes the pre-merge sstable spanning range [1, 5] 4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached 4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed. So with the set refreshed, sstable set is aligned with tablet ranges, and no premature EOS is signalled, which would otherwise prevent the fast forward from happening and all data from being properly captured in the read. This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets. Fixes: #23313	2025-05-27 15:15:43 +02:00
Wojciech Mitros	5074daf1b7	test: actually wait for tablets to distribute across nodes In test_tablet_mv_replica_pairing_during_replace, after we create the tables, we want to wait for their tablets to distribute evenly across nodes and we have a wait_for for that. But we don't await this wait_for, so it's a no-op. This patch fixes it by adding the missing await. Refs scylladb/scylladb#23982 Refs scylladb/scylladb#23997 Closes scylladb/scylladb#24250	2025-05-27 15:12:25 +02:00
Avi Kivity	844a49ed6e	dht: fragment token_range_vector token_range_vector is a linear vector containing intervals of tokens. It can grow quite large in certain places and so cause stalls. Convert it to utils::chunked_vector, which prevents allocation stalls. It is not used in any hot path, as it usually describes vnodes or similar things. Fixes #3335.	2025-05-27 14:47:24 +03:00
Avi Kivity	83c2a2e169	partition_range_compat: generalize wrap/unwrap helpers These helpers convert vectors of wrapped intervals to vectors of unwrapped intervals and vice versa. Generalize them to work on any sequence type. This is in preparation of moving from vectors to chunked_vectors.	2025-05-27 14:47:21 +03:00
Botond Dénes	542b2ed0de	Merge 'Remove req_params facility from API' from Pavel Emelyanov The class was introduced to facilitate path and query parameters parsing from requests, but in fact it's mostly dead code. First, the class introduces the concept of "mandatory" parameters which are seastar path params. If missing, the parameter validation throws, but in all cases where this option is used in scylla it's impossible to get empty path param -- if the parameter is missing seastar returns 404 (not found) before calling handler. Second, the req_params::get<T>() doesn't work for anything but string argument (or types such that optional<T> can be implicitly casted to optional<sstring>). And it's in fact only used to get sstrings, so it compiles and works so far. The remaining ability to parse bool from string is partially duplicated by the validate_bool() method. Using plain method to parse string to bool is less code than req_params introduce. One (arguably) useful thing req_params do it validate the incoming request _not_ to contain unknown query parameters. However, quite a few endpoints use this, most of them just cherry-pick parameters they want and ignore the others. There's already a comprehensive description of accepted parameters for each endpoint in api-doc/ and req_params duplicate it. Good validation code should rely on api-doc/, not on its partial copy. Having said that, this PR introduces validate_bool_x() helper to do req_params-like parsing of strings to bools, patches existing handlers to use existing parameters parsing facilities (such as validate_keyspace() and parse_table_infos()) and drops the req_params. 
Closes scylladb/scylladb#24159 * github.com:scylladb/scylladb: api: Drop class req_params api: Stop using req_params in parse_scrub_options api: Stop using req_params in tasks::force_keyspace_compaction_async api: Stop using req_params in ss::force_keyspace_compaction api: Stop using req_params in ss::force_compaction api: Stop using req_params in cf::force_major_compaction api: Add validate_bool_x() helper	2025-05-27 14:29:05 +03:00
Ernest Zaslavsky	7d0d3ec1c8	load_and_stream: Add abortion flow to mutation streaming * The new abort command explicitly represents the abortion flow in mutation streaming, clearly identifying operations that are intentionally aborted. This reduces ambiguity around failures in streaming operations. * In the error-handling section, aborted operations are now explicitly marked as the cause of the streaming failure. This allows us to differentiate them from genuine errors and appropriately adjust log severity to reduce unnecessary alarm caused by aborted streaming failures. * To avoid alarming users with excessive error logs, log severity for streaming failures caused by aborted operations has been downgraded. This helps keep logs cleaner and prevents unnecessary concerns. * A new feature has been added to ensure mixed clusters during updates do not receive unsupported RPC messages, improving compatibility and stability. fixes: https://github.com/scylladb/scylladb/issues/23076 Closes scylladb/scylladb#23214	2025-05-27 14:21:58 +03:00
Dawid Mędrek	1199c68bac	test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity We assign the nodes created by the test to separate racks. It has no impact on the test since the keyspace used in the test uses RF=2, so the tablet replicas will still be the same.	2025-05-27 13:18:11 +02:00
Dawid Mędrek	e4e3b9c3a1	test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity We distribute the nodes used in the test between two racks. Although that may affect how tablets behave in general, this change will not have any real impact on the test. The test verifies that load balancing eventually balances tablets in the cluster, which will still happen. Because of that, the changes in this commit are safe to apply.	2025-05-27 13:18:09 +02:00
Dawid Mędrek	6e2fb79152	test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity We distribute the nodes used in the test between two racks. Although that may have an impact on how tablets behave, it's orthogonal to what the test verifies -- whether the topology coordinator is continuously in the tablet migration track. Because of that, it's safe to make this change without influencing the test.	2025-05-27 13:18:07 +02:00
Botond Dénes	485df63fd5	Merge 'Extend compaction_history table with additional compaction statistics' from Łukasz Paszkowski Currently, the `system.compaction_history` table is missing information like the type of compaction (cleanup, major, resharding, etc), the sstable generations involved (in and out), shard's id the compaction was triggered on and statistics on purged tombstones to be collected during compaction. The series extends the table with the following columns: - "compaction_type" (text) - "shard_id" (int) - "sstables_in" (list<sstableinfo_type>) - "sstables_out" (list<sstableinfo_type>) - "total_tombstone_purge_attempt" (long) - "total_tombstone_purge_failure_due_to_overlapping_with_memtable" (long) - "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable" (long) with a user defined type `sstableinfo_type` that holds the information about sstable file - generation (uuid) - origin (text) - size (long) Additional statistics stored in the compaction_history have been incorporated in the API `/compaction_manager/compaction_history` and the `nodetool compactionhistory` command. No backport is required. It extends the existing compaction history output. 
Fixes https://github.com/scylladb/scylladb/issues/3791 Closes scylladb/scylladb#21288 * github.com:scylladb/scylladb: nodetool: Refactor of compactionhistory_operation nodetool: Add more stats into compactionhistory output api/compaction_manager: Extend compaction_history api compaction: Collect tombstone purge stats during compaction compacting_reader: Extend to accept tombstone purge statistics mutation_compactor: Collect tombstone purge attempts compaction_garbage_collector: Extend return type of max_purgeable_fn compaction: Extend compaction_result to collect more information system_keyspace: Upgrade compaction_history table system_keyspace: Create UDT: sstableinfo_type system_keyspace: Extract compaction_history struct system_keyspace: Squeeze update_compaction_history parameters compaction/compaction_manager: update_history accepts compaction_result as rvalue	2025-05-27 14:12:13 +03:00
Anna Stuchlik	b197d1a617	doc: update migration tools overview This commit updates the migration overview page: - It removes the info about migration from SSTable to CQL. - It updates the link to the migrator docs. Fixes https://github.com/scylladb/scylladb/issues/24247 Refs https://github.com/scylladb/scylladb/pull/21775 Closes scylladb/scylladb#24258	2025-05-27 14:07:35 +03:00
Michał Chojnowski	185a032044	utils/stream_compressor: allocate memory for zstd compressors externally The default and recommended way to use zstd compressors is to let zstd allocate and free memory for compressors on its own. That's what we did for zstd compressors used in RPC compression. But it turns out that it generates allocation patterns we dislike. We expected zstd not to generate allocations after the context object is initialized, but it turns out that it tries to downsize the context sometimes (by reallocation). We don't want that because the allocations generated by zstd are large (1 MiB with the parameters we use), so repeating them periodically stresses the reclaimer. We can avoid this by using the "static context" API of zstd, in which the memory for context is allocated manually by the user of the library. In this mode, zstd doesn't allocate anything on its own. The implementation details of this patch adds a consideration for forward compatibility: later versions of Scylla can't use a window size greater than the one we hardcoded in this patch when talking to the old version of the decompressor. (This is not a problem, since those compressors are only used for RPC compression at the moment, where cross-version communication can be prevented by bumping COMPRESSOR_NAME. But it's something that the developer who changes the window size must _remember_ to do). Fixes #24160 Fixes #24183 Closes scylladb/scylladb#24161	2025-05-27 12:43:11 +03:00
Jenkins Promoter	76dddb758e	Update pgo profiles - x86_64	2025-05-27 12:02:49 +03:00
Pavel Emelyanov	bd3bd089e1	sstables_loader: Fix load-and-stream vs skip-cleanup check The intention was to fail the REST API call in case --skip-cleanup is requested for --load-and-stream loading. The corresponding if expression is checking something else :( despite log message is correct. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24208	2025-05-27 12:01:01 +03:00
Jenkins Promoter	de9d9c9ece	Update pgo profiles - aarch64	2025-05-27 11:59:56 +03:00
Andrzej Jackowski	555d897a15	test: wait for normal state propagation in test_auth_v2_migration By default, cluster tests have skip_wait_for_gossip_to_settle=0 and ring_delay_ms=0. In tests with gossip topology, it may lead to a race, where nodes see different state of each other. In case of test_auth_v2_migration, there are three nodes. If the first node already knows that the third node is NORMAL, and the second node does not, the system_auth tables can return incomplete results. To avoid such a race, this commit adds a check that all nodes see other nodes as NORMAL before any writes are done. Refs: #24163 Closes scylladb/scylladb#24185	2025-05-27 11:41:09 +03:00
Nikos Dragazis	eaa2ce1bb5	sstables: Fix race when loading checksum component `read_checksum()` loads the checksum component from disk and stores a non-owning reference in the shareable components. To avoid loading the same component twice, the function has an early return statement. However, this does not guarantee atomicity - two fibers or threads may load the component and update the shareable components concurrently. This can lead to use-after-free situations when accessing the component through the shareable components, since the reference stored there is non-owning. This can happen when multiple compaction tasks run on the same SSTable (e.g., regular compaction and scrub-validate). Fix this by not updating the reference in shareable components, if a reference is already in place. Instead, create an owning reference to the existing component for the current fiber. This is less efficient than using a mutex, since the component may be loaded multiple times from disk before noticing the race, but no locks are used for any other SSTable component either. Also, this affects uncompressed SSTables, which are not that common. Fixes #23728. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#23872	2025-05-27 11:26:35 +03:00
Botond Dénes	2739eb49fd	Merge 'docs: remove API reference redirect' from David Garcia Fix for https://github.com/scylladb/scylladb/pull/24097 The stable branch does not contain the split API reference yet. This change fixes the 404 error raised when accessing the API reference on the stable branch due to the redirect. Closes scylladb/scylladb#24259 * github.com:scylladb/scylladb: docs: fix typo docs: remove API reference redirect	2025-05-27 11:24:27 +03:00
Nadav Har'El	8487d81c6e	Merge 'test: mark difference in handling IFs in LWT as scylla_only' from Andrzej Jackowski There is a difference how ScyllaDB and Cassandra handle conditional batches with different IF statements (such as "IF EXISTS" and "IF NOT EXISTS"). Cassandra tries to detect condition conflicts, and prints an error instead of silently failing the batch, but in ScyllaDB we considered this check to be inconsistent and unhelpful, and decided not to implement it. In this series, we extend the documentation of the ScyllaDB behaviour by extending the documents and improving relevant LWT tests. Fixes: https://github.com/scylladb/scylladb/issues/13011 Backport not needed, only docs and minor tests changes. Closes scylladb/scylladb#24086 * github.com:scylladb/scylladb: test: mark difference in handling IFs in LWT as scylla_only docs: cql: add explicit explanation how mixing IFs works in LWT docs: lwt: add two missing spaces	2025-05-27 09:35:41 +03:00
Evgeniy Naydanov	efdb2abdc6	test.py: dtest: make bypass_cache_test.py run using test.py As a part of the porting process, copy missed utility functions from scylla-dtest, remove unused imports and markers, and add single_node marker description to pytest.ini Enable the test in suite.yaml (run in dev mode only)	2025-05-27 05:48:26 +00:00
Evgeniy Naydanov	3a2410324c	test.py: dtest: add missed ScyllaCluster.nodetool() The method executes nodetool command on each running node in a cluster.	2025-05-27 05:48:26 +00:00
Evgeniy Naydanov	6105bb9530	test.py: dtest: copy unmodified bypass_cache_test.py Test is disabled in suite.yaml	2025-05-27 05:48:26 +00:00
Andrzej Jackowski	7dc0c4cf4f	test: close logfile/socket_dir for stopped servers in recycle_cluster PythonTestSuite::recycle_cluster is a function that releases resources of an old, dirty cluster to make it reusable. It closes log_file and maintenance_socket_dir for running nodes in a dirty cluster, however it doesn't do the same for stopped nodes. It leads to leakage of file descriptors of stopped nodes, which in turn can lead to hitting ulimit of open files (that is often 1024) if the leaking test is repeated with `./test.py --repeat ...`. The problem was detected when tests from `test/cluster/dtest/` directory were executed with high `repeat` value. This commit extends `recycle_cluster` to close and cleanup logfile and `socket_dir` for nodes that are stopped (because self.servers in ScyllaCluster is ChainMap of self.running and self.stopped). Closes scylladb/scylladb#24243	2025-05-27 08:37:43 +03:00
David Garcia	d99d1c315c	docs: remove [erno X] prefix from metrics logger Closes scylladb/scylladb#24246	2025-05-27 08:37:11 +03:00
David Garcia	3e331cfbbe	docs: fix typo	2025-05-26 21:34:23 +02:00
David Garcia	eefc9c33e8	docs: remove API reference redirect The stable branch does not contain the split API reference yet. This change fixes the 404 error raised when accessing the API reference on the stable branch.	2025-05-26 21:32:07 +02:00
Andrzej Jackowski	ea6ef5d0aa	test: mark difference in handling IFs in LWT as scylla_only There is a difference how ScyllaDB and Cassandra handle conditional batches with different IF statements (such as "IF EXISTS" and "IF NOT EXISTS"). Cassandra tries to detect condition conflicts, and prints an error instead of silently failing the batch, but in ScyllaDB we considered this check to be inconsistent and unhelpful, and decided not to implement it. This commit: - Make test_lwt_with_batch_conflict_1 scylla_only instead of xfail, change the scenario to pass with the current implementation. - Add test_lwt_with_batch_conflict_3 that shows how Cassandra fails batch statement with different conditions, even when the conditions are not contradictory. - Add test_lwt_with_batch_conflict_4/5 that shows how static rows are handled in conditional batches. Fixes: #13011	2025-05-26 15:47:11 +02:00
Andrzej Jackowski	2d4acb623e	docs: cql: add explicit explanation how mixing IFs works in LWT There is a difference how ScyllaDB and Cassandra handle conditional batches with different IF statements (such as "IF EXISTS" and "IF NOT EXISTS"). This commit explicitly documents the differences in the behavior. Refs: #13011	2025-05-26 15:13:01 +02:00
Piotr Dulikowski	4508823294	Merge 'test.py: dtest: few fixes missed in the initial implementation' from Evgeniy Naydanov There are few problems found in the dtest shim code after scylladb/scylladb#21580 was merged: - The call of `init_default_config()` method was missed in scylladb/scylladb#21580. It is required to handle dtest options and markers. - The implementation of dtest shim uses `server_id` to format a name of a node in a cluster. This is a difference in behavior with dtest. Some of dtests use code like `cluster.nodes()["node1"]` to get access to a node object. - Default timeout was missed in `ScyllaNode.wait_until_stopped()` method. Set it to 600 for debug mode or to 127 otherwise. Closes scylladb/scylladb#24225 * github.com:scylladb/scylladb: test.py: dtest: set default wait_seconds based on build mode test.py: dtest: name nodes in cluster using index starting from 1 test.py: dtest: initialize default config in dtest setup fixture	2025-05-26 13:37:12 +02:00
Yaron Kaikov	89ace09c18	[workflow]: add conflict_reminder to PRs based against `master` Today we send a reminder to PR's author when backport PRs have conflicts. Often, PR authors wait for their PR to be reviewed/merged, but the merge is not happening because the PR now conflicts with master and so maintainers won't merge it. This can lead to a stall, where maintainers wait for the author to rebase and authors are waiting for merge. In this PR we added the ability to notify the PR author as soon as base branch moved forward and rebase is required. Fixes: https://github.com/scylladb/scylla-pkg/issues/4955 Closes scylladb/scylladb#24209	2025-05-26 14:30:06 +03:00
David Garcia	6f722e8bc0	docs: split api reference in smaller files Closes scylladb/scylladb#24097	2025-05-26 12:06:59 +03:00
Radosław Cybulski	90ebea5ebb	Move mutation_fragment::kind into data object Move `mutation_fragment::kind` enum into data object, reducing size of the object from 16 to 8 bytes on current machines.	2025-05-26 11:06:54 +02:00
Radosław Cybulski	ef51bb9bd3	Make mutation_fragment::kind enum 1 byte size Add std::uint8_t as base to `mutation_fragment::kind` enum, making it one byte in size.	2025-05-26 11:06:54 +02:00
Radosław Cybulski	003e79ac9e	Move mutation_fragment_v2::kind into data object Move `mutation_fragment_v2::kind` enum into data object, reducing size of the object from 16 to 8 bytes on current machines.	2025-05-26 11:06:53 +02:00
Radosław Cybulski	d211119e49	Make mutation_fragment_v2::kind enum 1 byte size Add std::uint8_t as base to `mutation_fragment_v2::kind` enum, which will resize it to 1 byte.	2025-05-26 11:06:53 +02:00
David Garcia	bf9534e2b5	docs: fix \t (tab) is not rendered correctly Closes scylladb/scylladb#24096	2025-05-26 12:06:03 +03:00
Avi Kivity	29932a5af1	pgo: drop Java configuration Since `5e1cf90a51` ("build: replace tools/java submodule with packaged cassandra-stress") we run pre-packaged cassandra-stress. As such, we don't need to look for a Java runtime (which is missing on the frozen toolchain) and can rely on the cassandra-stress package finding its own Java runtime. Fix by just dropping all the Java-finding stuff. Note: Java 11 is in fact present on the frozen toolchain, just not in a way that pgo.py can find it. Fixes #24176. Closes scylladb/scylladb#24178	2025-05-26 10:16:03 +02:00
Avi Kivity	f195c05b0d	untyped_result_set: mark get_blob() as returning unfragmented data Blobs can be large, and unfragmented blobs can easily exceed 128k (as seen in #23903). Rename get_blob() to get_blob_unfragmented() to warn users. Note that most uses are fine as the blobs are really short strings. Closes scylladb/scylladb#24102	2025-05-26 09:40:34 +02:00
Michał Chojnowski	ff8a119f26	test/boost/sstable_compressor_factory_test: define a test suite name It seems that tests in test/boost/combined_tests have to define a test suite name, otherwise they aren't picked up by test.py. Fixes #24199 Closes scylladb/scylladb#24200	2025-05-26 09:35:30 +02:00
Anna Stuchlik	d303edbc39	doc: remove copyright from Cassandra Stress This commit removes the Apache copyright note from the Cassandra Stress page. It's a follow up to https://github.com/scylladb/scylladb/pull/21723, which missed that update (see https://github.com/scylladb/scylladb/pull/21723#discussion_r1944357143). Cassandra Stress is a separate tool with separate repo with the docs, so the copyright information on the page is incorrect. Fixes https://github.com/scylladb/scylladb/issues/23240 Closes scylladb/scylladb#24219	2025-05-26 09:35:30 +02:00
Pavel Emelyanov	2a253ace5e	Merge 'test.py: add coverage for boost with pytest execution' from Andrei Chekun This PR adds the possibility to gather coverage for the boost tests when they're executed with pytest. Since the pytest will be used as the main runner for boost tests as well, we need this before switching the runners. Closes scylladb/scylladb#24236 * github.com:scylladb/scylladb: test.py: add support for coverage for boost test test.py: get the temp dir from facade	2025-05-26 10:18:53 +03:00
Andrei Chekun	537054bfad	test.py: add support for coverage for boost test This PR adds the possibility to gather coverage for the boost tests when they're executed with pytest. Since the pytest will be used as the main runner for boost tests as well, we need this before switching the runners.	2025-05-23 12:54:54 +02:00
Andrei Chekun	c5a7f3415c	test.py: get the temp dir from facade No need to get the temp dir from the options when facade has this information already.	2025-05-23 12:54:48 +02:00
Nadav Har'El	d2844055ad	Merge 'index: implement schema management layer for vector search indexes' from null This pull request adds support for creating custom indexes (at a metadata level) as long as a supported custom class is provided (currently only vector search). The patch contains: - a change in CREATE INDEX statement that allows for the USING keyword to be present as long as one of the supported classes is used - support for describing custom indexes in the DESCRIBE statement - unit tests Co-authored by: @Balwancia Closes scylladb/scylladb#23720 * github.com:scylladb/scylladb: test/cqlpy: add custom index tests index: support storing metadata for custom indices	2025-05-22 12:19:36 +03:00
Pavel Emelyanov	a0d2e63303	Merge 'test.py: add the possibility to gather resource metrics for C++ tests' from Andrei Chekun Move the run_process method to resource gather instance, since we need to start a monitor to check memory consumption in the cgroup. Pytest has concept of the test, but it is completely different from test.py. Resource gather instance take test instance to save and extract information about the test. Additional method emulating test.py test instance added not to rewrite the resource gather instance. Finally, combining all these changes to have ability to get metrics for test in both runners: test.py and pytest. Closes scylladb/scylladb#24091 * github.com:scylladb/scylladb: test.py: add missing parameter for boost tests for pytest runner test.py: add support for boost_data_test_case in combined tests test.py: clean log files after a successful run test.py: attach output of the boost test to the report test.py: fix metrics DB location test.py: move run_process to resource_gather.py test.py: unify using constant for finding repo root directory test.py: refactor run_process in facade.py test.py: add the possibility to create a test alike object	2025-05-22 10:34:34 +03:00
Evgeniy Naydanov	8dc5413f54	test.py: dtest: set default wait_seconds based on build mode Default timeout was missed in `ScyllaNode.wait_until_stopped()` method. Set it to 600 for debug mode or to 127 otherwise.	2025-05-22 06:39:03 +00:00
Evgeniy Naydanov	eca5d52f1d	test.py: dtest: name nodes in cluster using index starting from 1 The current implementation of dtest shim use `server_id` to format a name of a node in a cluster. This is a difference in behavior with dtest. Some of dtests use code like `cluster.nodes()["node1"]` to get access to a node object. This commit changes it to be more consistent with dtest.	2025-05-22 06:34:03 +00:00
Evgeniy Naydanov	91e29a302a	test.py: dtest: initialize default config in dtest setup fixture The call of `init_default_config()` method was missed in #21580. It is required to handle dtest options and markers.	2025-05-22 06:22:04 +00:00
Andrei Chekun	8812b14078	test.py: add missing parameter for boost tests for pytest runner Since we are running tests with a pytest, we don't need a report at the end of the run.	2025-05-21 19:41:41 +02:00
Andrei Chekun	66b014621e	test.py: add support for boost_data_test_case in combined tests Change the parsing logic of combined tests to support a case when boost_data_test_case used that produced additional lines in the output.	2025-05-21 19:41:41 +02:00
Andrei Chekun	88d24d8ad5	test.py: clean log files after a successful run Clean different output files from the boost and unit tests. Move logs for boost test to the testlog directory instead of having additional directory pytest	2025-05-21 19:41:41 +02:00
Andrei Chekun	a956dd8770	test.py: attach output of the boost test to the report Added attaching the output of the test in case of fail to the Allure report	2025-05-21 19:41:39 +02:00
Andrei Chekun	ac86cc9f6d	test.py: fix metrics DB location Fix the issue introduced with scylladb/scylladb#22960. Suite log dir was changed, and the path for metrics DB was relying on it. As a result, DB is now located in the mode directory instead of the root of the testlog.	2025-05-21 15:37:15 +02:00
Andrei Chekun	b5b69710bd	test.py: move run_process to resource_gather.py Move the run_process method to the resource gather instance, since we need to start a monitor to check memory consumption in the cgroup. Since resource_gather needs the test.py test object, and pytest has no clue about it, add a simple namespace object to emulate such a test object. It is needed only to gather some information about the test to be able to add records to the DB. Since we have two facades that can share the same run process procedure, add a common method to handle this and avoid code duplication.	2025-05-21 15:34:34 +02:00
Andrei Chekun	3bcd6db718	test.py: unify using constant for finding repo root directory Finding the repo root directory dynamically, relative to the temp dir (which in most cases is inside the repo), fails if a non-default temp dir parameter is used. Additionally, switch to the constants to have a single source of truth for finding the repo root directory.	2025-05-21 15:34:34 +02:00
Andrei Chekun	4e18444831	test.py: refactor run_process in facade.py Add injecting environment variables to the process. Switch from print to a proper logger. Set buffer size to 1 to avoid losing any data from the boost test if the test collapses. Currently, run process logs and returns stdout and stderr, but boost tests use stderr only, so stderr is redirected to stdout. This helps with Jenkins as well, since we are reducing the number of files to store.	2025-05-21 15:34:34 +02:00
Andrei Chekun	38310975c5	test.py: add the possibility to create a test alike object resource_gather.py needs the test.py test object to work. It needs some information about the test to be able to write it to the DB with metrics. When running with pytest, there's no such test object, which is why make_test_object is added to mimic test.py's test object. Switch to getting the mode for constructing the path to the cgroup from the test instead of the suite. They are the same, but this means less has to be emulated in the make_test_object method.	2025-05-21 15:34:34 +02:00
Pavel Emelyanov	dac7589cef	Revert "encryption_test: Catch exact exception" This reverts commit `2d5c0f0cfd`. KMS tests became flaky after it: #24218 Need to revisit.	2025-05-20 13:52:14 +03:00
Petr Gusev	0443081b0d	build: fix merge-compdb.py for CMake 'output' attributes compile_commands.json is used by LSPs (e.g. `clangd` in VS Code) for code navigation. `merge-compdb.py`, called by `configure.py`, merges these files from Scylla, Seastar, and Abseil. The script filters entries by checking the output attribute against a given prefix. This is needed because Scylla’s compile_commands.json is generated by Ninja and includes all build modes, in case the user specified multiple ones in the call to configure.py. Seastar and Abseil databases, generated by CMake, used to omit the output attribute, so filtering did not apply. Starting with `CMake 3.20+`, output attributes are now included and do not match the expected prefix. For example, they could be of the form `absl/synchronization/CMakeFiles/synchronization.dir/internal/futex_waiter.cc.o`. This causes relevant entries from Seastar and Abseil to be filtered out. This patch refactors `merge-compdb.py` to allow specifying an optional prefix per input file, preserving the intent of applying the output filtering logic only for ninja-generated Scylla compdb file. Closes scylladb/scylladb#24211	2025-05-20 08:43:09 +03:00
Piotr Dulikowski	c15cf54e3d	Merge 'test.py: migrate alternator_tests.py from dtest suite' from Evgeniy Naydanov We have a significant number of tests in the scylla-dtest repository and I believe most of them can simply be copied to the test.py framework by adding a relatively small amount of shim code. In this PR I did that for 2 tests: [alternator_tests.py](https://github.com/scylladb/scylla-dtest/blob/next/alternator_tests.py) and [error_example_test.py](https://github.com/scylladb/scylla-dtest/blob/next/error_example_test.py) One of the problems is the async nature of the test.py framework versus the synchronous nature of scylla-dtest. It was resolved by using the universalasync third-party library. The other problem is ccmlib, and it's resolved by adding shim code (`test/dtest/ccmlib`). ccmlib has a lot of dead code and not all of its features are used by scylla-dtest; in this PR I added checks that we will not accidentally use some of them or miss something. When we're done with the migration, we can easily remove all unused parameters and these checks. `error_example_test.py` is copied as is (just a license preamble added), `alternator_tests.py` has small changes: 1. License preamble 2. Remove unused imports 3. 
Remove unneeded `skip_if` marker (I think it can be backported to dtest, or we can remove the test from dtest after merging this PR)

```diff
--- ../../../scylla-dtest/alternator_tests.py
+++ alternator_tests.py
@@ -1,17 +1,20 @@
+#
+# Copyright (C) 2025-present ScyllaDB
+#
+# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
+#
+
 import logging
 import operator
 import os
 import random
-import shutil
 import string
-import subprocess
 import tempfile
 import time
 from ast import literal_eval
 from concurrent.futures.thread import ThreadPoolExecutor
 from copy import deepcopy
 from decimal import Decimal
-from pathlib import Path
 from pprint import pformat
 
 import boto3.dynamodb.types
@@ -46,7 +49,6 @@
 )
 from dtest_class import get_ip_from_node, wait_for
 from tools.cluster import new_node
-from tools.marks import issue_open, with_feature
 from tools.misc import set_trace_probability
 from tools.retrying import retrying
 
@@ -168,7 +170,6 @@
         read_and_delete_set_elements_thread.join()
 
     @pytest.mark.next_gating
-    @pytest.mark.skip_if(with_feature("tablets") & issue_open("#18002"))
     def test_decommission_during_dynamo_load(self):
         self.prepare_dynamodb_cluster(num_of_nodes=3)
         node1, node2, node3 = self.cluster.nodelist()
```

Because all tests in this repo are considered to be "gating", I removed all non-next_gating tests and all dtest suite markers as a separate commit. To reduce test execution time, the tests run in dev mode only and some sleeps were made smaller. As a result, 23 tests were added in total (22 in `test_alternator.py` and 1 in `test_error_example`). The added tests will increase CI time by ~2x4 = 8 minutes. 
Closes scylladb/scylladb#21580 * github.com:scylladb/scylladb: test.py: dtest/alternator_tests.py: make sleep intervals smaller test.py: dtest/alternator_tests.py: remove not next_gating tests test.py: migrate alternator_tests.py from dtest test.py: initial implementation of dtest/ccm shim test.py: manager: add server_get_returncode() method test.py: manager: change CLI and env options on a node start test.py: REST API: add set_trace_probability() method test.py: REST API: add get_tokens() method test.py: rework log_browsing for dtest migration	2025-05-20 00:13:16 +02:00
Evgeniy Naydanov	e456f0ed7b	test.py: dtest/alternator_tests.py: make sleep intervals smaller	2025-05-19 12:27:32 +00:00
Evgeniy Naydanov	8dd86818a0	test.py: dtest/alternator_tests.py: remove not next_gating tests Remove all not next_gating tests and remove any dtest suites markers because all tests in this repo are considered to be "gating".	2025-05-19 12:27:32 +00:00
Evgeniy Naydanov	57c1035146	test.py: migrate alternator_tests.py from dtest The test is almost unmodified, except for removing the unneeded `skip_if` mark and unused imports.	2025-05-19 12:27:32 +00:00
Evgeniy Naydanov	ac1551892b	test.py: initial implementation of dtest/ccm shim Use the universalasync library to make test.py async code compatible with the synchronous code of dtest/ccm. Also, copy error_example_test.py unmodified from dtest as an example. Run the test in `dev` mode only.	2025-05-19 12:27:31 +00:00
Evgeniy Naydanov	2cb640f95c	test.py: manager: add server_get_returncode() method The method returns None if the Scylla process is still running, or its return code otherwise. If no Scylla process was launched, it raises a NoSuchProcess exception.	2025-05-19 11:50:55 +00:00
Evgeniy Naydanov	d874beb17f	test.py: manager: change CLI and env options on a node start Add parameters to the server_start() method to make it possible to change Scylla's CLI and env options on a node start. Also, add the `expected_server_up_state` parameter, as we have for the server_add() method.	2025-05-19 11:50:55 +00:00
Evgeniy Naydanov	5d3b54aa9b	test.py: REST API: add set_trace_probability() method	2025-05-19 11:50:55 +00:00
Evgeniy Naydanov	a16a4b6171	test.py: REST API: add get_tokens() method Get a list of the tokens for the specified node. Optional `endpoint` parameter can be provided.	2025-05-19 11:50:55 +00:00
Evgeniy Naydanov	f6e3fdd778	test.py: rework log_browsing for dtest migration Rework `ScyllaLogFile.wait_for()` method to make it easier to add required methods to ScyllaNode class of ccm-like shim. Also, added `ScyllaLogFile.grep_for_errors()` method and reworked `ScyllaLogFile.grep()`	2025-05-19 11:50:55 +00:00
Łukasz Paszkowski	0a2f0c6852	nodetool: Refactor of compactionhistory_operation Simplify code by using std::apply that unpacks std::array into separate items to pass further to a callable. This simplifies the code that looks: fmt::print(std::cout, fmt::runtime(header_row_format.c_str()), header_row[0], header_row[1], header_row[2], header_row[3], header_row[4], header_row[5], header_row[6], header_row[7], header_row[8], header_row[9], header_row[10], header_row[11], header_row[12], header_row[13]); into something like: std::apply(fh, header_row);	2025-05-16 20:00:00 +02:00
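The `std::apply` refactoring described in the commit above can be illustrated with a minimal sketch. It is not the nodetool code: `format_row` and the `" | "` separator are hypothetical, and the row is joined into a string instead of being passed to `fmt::print`.

```cpp
#include <array>
#include <cassert>  // assert(), for callers that want to check the result
#include <cstddef>
#include <string>
#include <tuple>    // std::apply

// Hypothetical helper (not from the ScyllaDB tree): joins the cells of a
// fixed-size row with " | ". std::apply unpacks the std::array into separate
// arguments, so the callable receives row[0], row[1], ... without the caller
// spelling out each index.
template <std::size_t N>
std::string format_row(const std::array<std::string, N>& row) {
    auto join = [](const auto&... cells) {
        std::string out;
        std::size_t i = 0;
        // Fold over the comma operator: cells are appended left to right.
        ((out += (i++ ? " | " : "") + cells), ...);
        return out;
    };
    return std::apply(join, row);
}
```

Adding a column then only changes the array size; the call site stays the same.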
Łukasz Paszkowski	edb666f461	nodetool: Add more stats into compactionhistory output Incorporate additional statistics stored in the compaction_history system table. Depending on the requested format type, the output has different form. Remove unnecessary duplicated history_entry struct and instead use extracted db::compaction_history_entry structure. Running the cql command: select * from system.compaction_history; prints sstable's generation type as UUID (e.g. 5a5cf800-b617-11ef-a97d-8438c36f0e31), see generation_type::data_value() which is different than its fmt format (e.g. 3glx_0srx_1pasg2ksepk902v8dt). Therefore, to unify the outputs, generation_type is converted to data_value before it is printed.	2025-05-16 20:00:00 +02:00
Łukasz Paszkowski	583cc675ce	api/compaction_manager: Extend compaction_history api Extend api of /compaction_manager/compaction_history to include newly added columns to the compaction history table from the previous patches.	2025-05-16 20:00:00 +02:00
Łukasz Paszkowski	2793369288	compaction: Collect tombstone purge stats during compaction Collect tombstone purge statistics like + total number of purge attempts + number of purge failures due to data overlapping with memtables + number of purge failures due to data overlapping with non-compacting sstables and expose them in the compaction_stats structure.	2025-05-16 20:00:00 +02:00
Łukasz Paszkowski	6b729fabc9	compacting_reader: Extend to accept tombstone purge statistics Extends the make_compacting_reader function and the constructor of the compacting_reader, in order to accept an optional pointer to the tombstone purge statistics structure that is later passed further down to compact_mutation_state.	2025-05-16 20:00:00 +02:00
Łukasz Paszkowski	546b2c191f	mutation_compactor: Collect tombstone purge attempts Let compact_mutation_state collect all tombstone purge attempts and failures. For this purpose a new statistics structure is created (tombstone_purge_stats) and the relevant stats are collected in the can_purge_tombstone method. The statistics are collected only for sstable compaction. An optional statistics structure can be passed in via the compact_mutation_state constructor.	2025-05-16 20:00:00 +02:00
Łukasz Paszkowski	503d4f014c	compaction_garbage_collector: Extend return type of max_purgeable_fn Currently, when a max purgeable timestamp is computed, there is no information about where it comes from and how the value was obtained. Take compaction: if there are memtables or other uncompacting sstables possibly shadowing data, the timestamp is decreased to ensure a tombstone is not purged, but the caller does not know why the timestamp has its value. In this patch, we extend the return type of max_purgeable_fn to contain not only a timestamp but also information on how it was computed. This information will be required to collect statistics on tombstone purge failures due to overlapping memtables/uncompacting sstables that come later in the series.	2025-05-16 19:59:54 +02:00
Anna Stuchlik	2d7db0867c	doc: fix the product name for version 2025.1 Starting with 2025.1, ScyllaDB versions are no longer called "Enterprise", but the OS support page still uses that label. This commit fixes that by replacing "Enterprise" with "ScyllaDB". This update is required since we've removed "Enterprise" from everywhere else, including the commands, so having it here is confusing. Fixes https://github.com/scylladb/scylladb/issues/24179 Closes scylladb/scylladb#24181	2025-05-16 12:16:00 +02:00
Avi Kivity	37f9cf6de6	dist: rpm: override %_sbindir for Fedora 42 Fedora 42 merged /usr/sbin into /usr/bin [1]. As part of that change the rpm macro %_sbindir was redefined from /usr/sbin to /usr/bin. As a result RPM build on Fedora 42 fails: install.sh places some files into /usr/sbin, while rpmbuild looks for them in /usr/bin. We could resolve this either by following the change and moving the files to /usr/bin as well, or fixing the spec to place the files in /usr/sbin. The former is more difficult: - what about Debian/Ubuntu? - what about older RPM-based distributions (like all RHEL distributions)? - what about scripts that hard-code /usr/sbin/<scylla utility>? So we pick the latter, and redefine %_sbindir to /usr/sbin. Since that directory still exists (as a symlink), installation on systems with merged /usr/bin and /usr/sbin will work. We'll have to address the problem later (likely by installing to either /usr/bin or /usr/sbin depending on context), but for now, this is a simple solution that works everywhere. [1] https://fedoraproject.org/wiki/Changes/Unify_bin_and_sbin Closes scylladb/scylladb#24101	2025-05-16 12:05:29 +02:00
Aleksandra Martyniuk	9c03255fd2	cql_test_env: main: move stream_manager initialization Currently, stream_manager is initialized after storage_service and so it is stopped before storage_service is. In its stop method, storage_service accesses stream_manager, which is uninitialized at that time. Move the stream_manager initialization above the storage_service initialization. Fixes: #23207. Closes scylladb/scylladb#24008	2025-05-15 17:17:35 +03:00
Avi Kivity	4f87362abb	compaction_manager: drop gratuitous conversion from interval to wrapped_interval The conversion is unnecessary and likely dates back from before the split between interval and wrapped_interval. It gets in the way of making the conversion explicit. Closes scylladb/scylladb#24164	2025-05-15 16:15:55 +03:00
Nadav Har'El	27ad772a66	test/cqlpy: fix "run --release 2025.1" This patch fixes "test/cqlpy/run --release 2025.1" which fails as follows on all tests with indexes or views: Secondary indexes are not supported on base tables with tablets test/cqlpy/run can run cqlpy (and alternator) tests on various official releases of Scylla which it knows how to download. When running old versions of Scylla, we need to change the configuration options to those that were needed on specific versions. On new versions of Scylla we need to pass --experimental-features=views-with-tablets to be able to test materialized views, but in older versions we need to remove that parameter because it didn't exist. We incorrectly removed it for any versions 2025.1 or earlier, but that's incorrect - it just needs to be removed for versions strictly earlier than 2025.1 - it is needed for 2025.1 (I tested it is indeed needed even in the earliest RCs). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#24144	2025-05-15 16:13:01 +03:00
Pavel Emelyanov	2f5b452c7c	api: Drop class req_params It's now unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:08:52 +03:00
Pavel Emelyanov	9628c3a4a5	api: Stop using req_params in parse_scrub_options The "keyspace" and "cf" pair of options are now parsed similarly to how the recently changed ss::force_keyspace_compaction handler does it. The "scrub_mode" query param is saved directly into an sstring variable and its presence is checked by an .empty() call. If the parameter is missing, request::get_query_param() returns an empty string, so the change is correct. The "skip_corrupted" is a boolean option; other options are already parsed by hand, without the help of req_params facilities. There's a test that validates the work of req_params::process() of the scrub endpoint -- it passes "invalid" options. This test is temporarily removed according to the PR description. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:07:57 +03:00
Pavel Emelyanov	fd0128849e	api: Stop using req_params in tasks::force_keyspace_compaction_async This handler in fact duplicates cf::force_major_compaction in how it parses its options, so the change is the same. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:07:53 +03:00
Pavel Emelyanov	09c9a5baa7	api: Stop using req_params in ss::force_keyspace_compaction The "keyspace" mandatory param and "cf" query one are used, respectively, to get and validate keyspace and to parse table infos. Both actions can be used with the corresponding parse_table_infos() overload. Other parameters are boolean query ones and can be parsed directly. By and large this change repeats the change in cf::force_major_compaction done previously. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:07:52 +03:00
Pavel Emelyanov	f7e8d6ba09	api: Stop using req_params in ss::force_compaction This handler only has two query parameters that can be parsed using the validate_bool_x helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:07:52 +03:00
Pavel Emelyanov	a320550bd1	api: Stop using req_params in cf::force_major_compaction The mandatory "name" parameter can be picked directly from request path params, as described in the PR description. The "split_output" is placeholder and is just checked for being there at all, without any parsing. Other parameters are query ones too, and are parsed with the help of recently introduced validate_bool_x helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:07:52 +03:00
Pavel Emelyanov	253c82f03a	api: Add validate_bool_x() helper There's validate_bool() one that converts "true" to true and "false" to false. This helper mimics the req_params' parser of bool and renders true from "true", "yes" or "1" and false from "false", "no" or "0" (all case insensitively). Unlike its prototype, which renders disengaged optional bool in case the parameter is empty, this helper returns the passed default value. Will replace the req_params eventually. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-15 11:07:52 +03:00
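The parsing rules listed for `validate_bool_x()` can be sketched as follows. This is an illustration under assumptions, not the ScyllaDB helper: the name `parse_bool_with_default`, the `std::string` parameter type, and throwing `std::invalid_argument` on unrecognized input are all hypothetical, since the commit message does not say how invalid values are reported.

```cpp
#include <algorithm>
#include <cassert>    // assert(), for callers that want to check the result
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the helper described above.
// Accepts "true"/"yes"/"1" and "false"/"no"/"0" case-insensitively and
// returns default_value when the parameter is empty (i.e. was not passed).
bool parse_bool_with_default(const std::string& param, bool default_value) {
    if (param.empty()) {
        return default_value;
    }
    std::string lower(param.size(), '\0');
    std::transform(param.begin(), param.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (lower == "true" || lower == "yes" || lower == "1") {
        return true;
    }
    if (lower == "false" || lower == "no" || lower == "0") {
        return false;
    }
    // Assumption: the error handling of the real helper is not described.
    throw std::invalid_argument("invalid boolean parameter: " + param);
}
```

Returning the caller's default instead of a disengaged optional keeps each handler's call site to a single expression.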
Botond Dénes	697945820b	Merge 'utils: chunked_vector: add some modifiers' from Avi Kivity chunked_vector is a replacement for std::vector that avoids large contiguous allocations. In this series, we add some missing modifiers and improve quality-of-life for chunked_vector users (the static_assert patch). Those modifiers were generally unused since they have O(n) complexity and therefore not useful for hot paths, but they are used in some control plane code on vectors which we'd like to replace with chunked_vectors. A candidate for such a replacement is token_range_vector (see #3335). This is a prerequisite for fixing some minor stalls; I don't expect we'll backport fixes to those stalls. Closes scylladb/scylladb#24162 * github.com:scylladb/scylladb: utils: chunked_vector: add swap() method utils: chunked_vector: add range insert() overloads utils: chunked_vector: relax static_assert utils: chunked_vector: implement erase() for single elements and ranges utils: chunked_vector: implement insert() for single-element inserts	2025-05-15 09:42:14 +03:00
Yaron Kaikov	f124b073b1	toolchain: set `scylla-driver` release based on tools/cqlsh In `install-dependencies.sh` we use a hardcoded `scylla-driver` release. This version should be identical to the `tools/cqlsh/requirements.txt` value. It's better to have one source for the `scylla-driver` version. Update `install-dependencies.sh` to use the release from `tools/cqlsh` directly. Remove the hardcoded `geomet` version. Also remove support for the `s390x` arch, as we never use it. Frozen toolchain regenerated. Optimized clang from * https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz * https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz Closes scylladb/scylladb#23841	2025-05-15 06:08:14 +03:00
Pavel Emelyanov	2e83b0367f	api: Use structured bindings in get_built_indexes() code Shorter this way Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24155	2025-05-14 19:03:13 +03:00
Wojciech Mitros	5920647617	mv: remove queue length limit from the view update read concurrency semaphore Each view update is correlated to a write that generates it (aside from view building which is throttled separately). These writes are limited by a throttling mechanism, which effectively works by performing the writes with CL=ALL if ongoing writes exceed some memory usage limit. When writes generate view updates, they usually also need to perform a read. This read goes through a read concurrency semaphore where it can get delayed or killed. The semaphore allows up to 100 concurrent reads and puts all remaining reads in a queue. If the number of queued reads exceeds a specific limit, the view update will fail on the replica, causing inconsistencies. This limit is not necessary. When a read gets queued on the semaphore, the write that's causing the view update is paused, so the write takes part in the regular write throttling. If too many writes get stuck on view update reads, they will get throttled, so their number is limited and the number of queued reads is also limited to the same amount. In this patch we remove the specified queue length limit for the view update read concurrency semaphore. Instead of this limit, the queue will now be limited indirectly, by the base write throttling mechanism. This may allow the queue to grow longer than with the previous limit, but it shouldn't ever cause issues - we only perform up to 100 actual reads at once, and the remaining ones that get queued use a tiny amount of memory, less than the writes that generated them and which are getting limited directly. Fixes https://github.com/scylladb/scylladb/issues/23319 Closes scylladb/scylladb#24112	2025-05-14 18:29:30 +03:00
Botond Dénes	700a5f86ed	tools/scylla-nodetool: status: handle negative load sizes Negative load sizes don't make sense, but we've seen a case in production, where a negative number was returned by ScyllaDB REST API, so be prepared to handle these too. Fixes: scylladb/scylladb#24134 Closes scylladb/scylladb#24135	2025-05-14 18:28:29 +03:00
Avi Kivity	70be73d036	Merge 'Refactor out code from `test_restore_with_streaming_scopes`' from Robert Bindar Lots of code from this test can be reused in PR #23861. I'm splitting it now in this change so we can merge it cleanly as a separate patch. Refs #23564 Closes scylladb/scylladb#24105 * github.com:scylladb/scylladb: Refactor out code from test_restore_with_streaming_scopes Refactor out code from test_restore_with_streaming_scopes Refactor out code from test_restore_with_streaming_scopes Refactor out code from test_restore_with_streaming_scopes Refactor out code from test_restore_with_streaming_scopes	2025-05-14 18:10:53 +03:00
Botond Dénes	9f8de9adc8	Merge 'Add ability to skip SSTables cleanup when loading them' from Pavel Emelyanov The non-streaming loading of sstables has recently started to perform cleanup [1]. For vnodes, unfortunately, cleanup is almost unavoidable, because of the nature of vnodes sharding, even if the sstable is already clean. This wastes IO and CPU for nothing. Skipping the cleanup in a smart way is possible, but requires too many changes in the code and in the on-disk data. However, the effort will not help existing SSTables and it's going to be obsoleted by tablets some time soon. That said, the easiest way to skip cleanup is the explicit --skip-cleanup option for nodetool and the respective skip_cleanup parameter for the API handler. New feature, no backport fixes #24136 refs #12422 [1] Closes scylladb/scylladb#24139 * github.com:scylladb/scylladb: nodetool: Add refresh --skip-cleanup option api: Introduce skip_cleanup query parameter distributed_loader: Don't create owned ranges if skip-cleanup is true code: Push bool skip_cleanup flag around	2025-05-14 16:47:34 +03:00
Avi Kivity	13a75ff835	utils: chunked_vector: add swap() method Following std::vector(), we implement swap(). It's a simple matter of swapping all the contents. A unit test is added.	2025-05-14 16:19:40 +03:00
Avi Kivity	24e0d17def	utils: chunked_vector: add range insert() overloads Inserts an iterator range at some position. Again we insert the range at the end and use std::rotate() to move the newly inserted elements into place, forgoing possible optimizations. Unit tests are added.	2025-05-14 16:19:40 +03:00
Avi Kivity	9425a3c242	utils: chunked_vector: relax static_assert chunked_vector is only implemented for types with a non-throwing move constructor; this greatly simplifies the implementation. We have a static_assert to enforce it (should really be a constraint, but chunked_vector predates C++ concepts). This static_assert prevents forward declarations from compiling: class forward_declared; using a = utils::chunked_vector<forward_declared>; `a` won't compile since the static_assert will be instantiated and will fail since forward_declared is an incomplete type. Using a constraint has the same problem. Fix by moving the static_assert to the destructor. The destructor won't be instantiated by the forward declaration, so it won't trigger. It will trigger when someone destroys the vector; at this point the types are no longer forward declared.	2025-05-14 16:19:40 +03:00
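The effect of moving the `static_assert` into the destructor can be shown with a small sketch. `boxed`, `owner`, and `make_and_destroy_owner` are hypothetical names for illustration; this is not the `chunked_vector` code. Member function bodies of a class template are instantiated only when used, so the assertion is not evaluated while `forward_declared` is still incomplete.

```cpp
#include <cassert>      // assert(), for callers that want to check the result
#include <type_traits>

// Hypothetical single-element container standing in for chunked_vector.
template <typename T>
class boxed {
    T* _ptr = nullptr;
public:
    boxed() = default;
    boxed(const boxed&) = delete;
    boxed& operator=(const boxed&) = delete;
    ~boxed() {
        // Evaluated only when the destructor body is instantiated, i.e. when
        // some code actually destroys a boxed<T>; T must be complete by then.
        static_assert(std::is_nothrow_move_constructible_v<T>,
                      "T must have a non-throwing move constructor");
        delete _ptr;
    }
    bool empty() const { return _ptr == nullptr; }
};

class forward_declared;

// Instantiates the class boxed<forward_declared>, but not its destructor
// body, so this compiles while forward_declared is incomplete.
struct owner {
    boxed<forward_declared> items;
};

class forward_declared {
public:
    int value = 0;
};

// Here the type is complete, so destroying the owner satisfies the assert.
inline bool make_and_destroy_owner() {
    owner o;
    return o.items.empty();
}
```

With the `static_assert` at class scope instead, the definition of `owner` would already fail to compile.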
Avi Kivity	d6eefce145	utils: chunked_vector: implement erase() for single elements and ranges Implement using std::rotate() and resize(). The elements to be erased are rotated to the end, then resized out of existence. Again we defer optimization for trivially copyable types. Unit tests are added. Needed for range_streamer with token_ranges using chunked_vector.	2025-05-14 16:19:37 +03:00
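The rotate-then-resize erase described above can be sketched on `std::vector`, used here only as a stand-in for `chunked_vector`; `erase_via_rotate` and its index-based interface are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical free function illustrating the technique on std::vector.
// Erases the elements in the index range [first, last).
template <typename T>
void erase_via_rotate(std::vector<T>& v, std::size_t first, std::size_t last) {
    assert(first <= last && last <= v.size());
    auto b = std::next(v.begin(), static_cast<std::ptrdiff_t>(first));
    auto m = std::next(v.begin(), static_cast<std::ptrdiff_t>(last));
    // Rotate the doomed elements to the end: [last, end) slides down to
    // `first`, and the erased range ends up as the tail of the vector.
    std::rotate(b, m, v.end());
    // Then resize the tail out of existence. Note: std::vector::resize
    // requires T to be default-constructible even when shrinking.
    v.resize(v.size() - (last - first));
}
```

This costs O(n) element moves, which matches the commit's note that the modifiers are meant for control-plane code rather than hot paths.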
Botond Dénes	b491ae1039	Merge 'raft_sys_table_storage: avoid temp buffer when deserializing log_entry' from Petr Gusev The get_blob method linearizes data by copying it into a single buffer, which can cause 'oversized allocation' warnings. In this commit we avoid copying by creating an input stream on top of the original fragmented managed bytes, returned by untyped_result_set_row::get_view. fixes scylladb/scylladb#23903 backport: no need, not a critical issue. Closes scylladb/scylladb#24123 * github.com:scylladb/scylladb: raft_sys_table_storage: avoid temporary buffer when deserializing log_entry serializer_impl.hh: add as_input_stream(managed_bytes_view) overload	2025-05-14 15:10:47 +03:00
Avi Kivity	5301f3d0b5	utils: chunked_vector: implement insert() for single-element inserts partition_range_compat's unwrap() needs insert if we are to use it for chunked_vector (which we do). Implement using push_back() and std::rotate(). emplace(iterator, args) is also implemented, though the benefit is diluted (it will be moved after construction). The implementation isn't optimal - if T is trivially copyable then using std::memmove() will be much faster than std::rotate(), but this complex optimization is left for later. Unit tests are added.	2025-05-14 14:54:59 +03:00
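The push_back-plus-rotate insert described above can be sketched on `std::vector`, again only as a stand-in for `chunked_vector`; `insert_via_rotate` and its index-based interface are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical free function illustrating the technique on std::vector.
// Inserts `value` before index `pos` and returns an iterator to it.
template <typename T>
typename std::vector<T>::iterator
insert_via_rotate(std::vector<T>& v, std::size_t pos, T value) {
    assert(pos <= v.size());
    // Append first, then take iterators: push_back may reallocate.
    v.push_back(std::move(value));
    auto where = std::next(v.begin(), static_cast<std::ptrdiff_t>(pos));
    // Rotate the new last element into place; [pos, end - 1) shifts right by one.
    std::rotate(where, std::prev(v.end()), v.end());
    return where;
}
```

When `pos` equals the old size, the rotate is a no-op and the call degenerates to a plain `push_back()`.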
Robert Bindar	548a1ec20a	Refactor out code from test_restore_with_streaming_scopes part 5: check_data_is_back Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-05-14 11:39:01 +03:00
Robert Bindar	29309ae533	Refactor out code from test_restore_with_streaming_scopes part 4: compute_scope Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-05-14 11:39:01 +03:00
Robert Bindar	a0f0580a9c	Refactor out code from test_restore_with_streaming_scopes part 3: create_dataset Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-05-14 11:38:59 +03:00
Robert Bindar	5171ca385a	Refactor out code from test_restore_with_streaming_scopes part 2: take_snapshot Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-05-14 11:31:19 +03:00
Robert Bindar	f09bb20ac4	Refactor out code from test_restore_with_streaming_scopes part 1: create_cluster Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-05-14 11:30:40 +03:00
Andrzej Jackowski	8b660f0af7	test: add tests for prepared statement metadata consistency corner cases Implement corner-cases of prepared statement metadata, as described in scylladb#20860. Although the purpose of the test was to verify the newly implemented SCYLLA_USE_METADATA_ID protocol extension, the test also passes with scylla-driver 3.29.3 that doesn't implement the support for this extension. That is because the driver doesn't implement support for skip_metadata flag, so fresh metadata are included in every prepared statement response, regardless of the metadata_id. This change: - Add test_changed_prepared_statement_metadata_columns to verify a scenario when a number of columns changes in a table used by a prepared statement - Add test_changed_prepared_statement_metadata_types to verify a scenario when a type of a column changes in a table used by a prepared statement - Add test_changed_prepared_statement_metadata_udt to verify a scenario when a UDT changes in a table used by a prepared statement I tested the code with a modified Python driver (ref. scylladb/python-driver#457): - If SKIP_METADATA is enabled (scylladb/python-driver@c1809c1) but no other changes are introduced, all three test cases fail. - If SKIP_METADATA is disabled (no scylladb/python-driver@c1809c1) all test cases pass because fresh metadata are included in each reply. - If SKIP_METADATA is enabled (scylladb/python-driver@c1809c1) and SCYLLA_USE_METADATA_ID extension is included (scylladb/python-driver@8aba164) all test cases pass, verifying the correctness of the implementation.	2025-05-14 09:59:19 +02:00
Andrzej Jackowski	086df24555	transport: implement SCYLLA_USE_METADATA_ID support Metadata id was introduced in CQLv5 to make metadata of prepared statement consistent between driver and database. This commit introduces a protocol extension that allows to use the same mechanism in CQLv4. This change: - Introduce SCYLLA_USE_METADATA_ID protocol extension for CQLv4 - Introduce METADATA_CHANGED flag in RESULT. The flag comes directly from CQLv5 binary protocol. In CQLv4, the bit was never used, so we assume it is safe to reuse it. - Implement handling of metadata_id and METADATA_CHANGED in RESULT rows - Implement returning metadata_id in RESULT prepared - Implement reading metadata_id from EXECUTE - Added description of SCYLLA_USE_METADATA_ID in documentation Metadata_id is wrapped in cql_metadata_id_wrapper because we need to distinguish the following situations: - Metadata_id is not supported by the protocol (e.g. CQLv4 without the extension is used) - Metadata_id is supported by the protocol but not set - e.g. PREPARE query is being handled: it doesn't contain metadata_id in the request but the reply (RESULT prepared) must contain metadata_id - Metadata_id is supported by the protocol and set, any number of bytes >= 0 is allowed, according to the CQLv5 protocol specification Fixes scylladb/scylladb#20860	2025-05-14 09:59:16 +02:00
Andrzej Jackowski	c32aba93b4	cql3: implement metadata::calculate_metadata_id() CQLv5 introduced metadata_id, which is a checksum computed from column names and types, to track schema changes in prepared statements. This commit introduces calculate_metadata_id to compute such id for given metadata. Please note that calculate_metadata_id() produces different hashes than Cassandra's computeResultMetadataId(). We use SHA256 truncated to 128 bits instead of MD5. There are also two smaller technical differences: calculate_metadata_id() doesn't add unneeded zeros and it adds a length of a string when an sstring is being fed to the hasher. The difference is intentional because MD5 has known vulnerabilities, moreover we don't want to introduce any dependency between our metadata_id and Cassandra's. This change: - Add cql_metadata_id_type - Implement metadata::calculate_metadata_id() - Add boost tests to confirm correctness of the function	2025-05-14 09:33:16 +02:00
Michał Hudobski	8ea862f1e8	test/cqlpy: add custom index tests Unit tests checking the behavior of the added support for create custom index statement	2025-05-14 09:32:01 +02:00
Michał Hudobski	05daa8dded	index: support storing metadata for custom indices Added function returning custom index class name. Added printing custom index class name when using DESCRIBE. Changed validation to reflect current support of indices.	2025-05-14 09:32:00 +02:00
Łukasz Paszkowski	0327964d57	compaction: Extend compaction_result to collect more information The compaction_result struct has been extended with the following properties: + id of the shard the compaction took place on + type of the compaction + time when the compaction started + list of sstable files to be compacted + list of sstable files generated by compaction	2025-05-14 08:32:07 +02:00
Łukasz Paszkowski	0490068982	system_keyspace: Upgrade compaction_history table Currently, the system.compaction_history table misses precious information like the type of compaction (cleanup, major, resharding, etc) or the sstable generations involved (in and out) used countless times to diagnose issues. Thus, the commit extends the current definition of the table by adding the following columns: + "compaction_type" (text) + "started_at" (int) + "shard_id" (int) + "sstables_in" (list<sstableinfo_type>) + "sstables_out" (list<sstableinfo_type>) + "total_tombstone_purge_attempt" (long) + "total_tombstone_purge_failure_due_to_overlapping_with_memtable" (long) + "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable" (long) Furthermore, the commit introduces a new feature flag in order to prevent nodes from writing data to new columns when a cluster is not fully upgraded.	2025-05-14 08:32:05 +02:00
Łukasz Paszkowski	28d0c98dab	system_keyspace: Create UDT: sstableinfo_type The new user defined type holds the following information on sstable: + generation uuid; + origin text; + size long; and will be used by the system.compaction_history table to keep track of compacted files and the files being the result of this compaction.	2025-05-14 08:31:40 +02:00
Łukasz Paszkowski	dc6f8881b8	system_keyspace: Extract compaction_history struct Move the compaction_history_entry struct to a separate file. The intent of this change is to later re-use it in scylla-nodetool as it currently defines its own structure that is very similar.	2025-05-14 08:31:40 +02:00
Łukasz Paszkowski	4c93b5292d	system_keyspace: Squeeze update_compaction_history parameters Since the number of statistics inserted into compaction_history table grows in time, the number of parameters in the method update_compaction_history grows as well. So instead, let's re-use the already existing compaction_history_entry structure to populate data from the compaction_manager to the system table.	2025-05-14 08:31:40 +02:00
Łukasz Paszkowski	342e9a3f5c	compaction/compaction_manager: update_history accepts compaction_result as rvalue The compaction_result struct holding compaction's results and statistics is obtained immediately before the update_history is called. Move it instead of passing a const reference.	2025-05-14 08:31:40 +02:00
Andrzej Jackowski	f8f710c95e	test: simplify pytest params in test_long_query_timeout_erm One of pytest parameters in test_long_query_timeout_erm.py was a CQL query containing spaces and special chars such as '', '(', ')', '{', '}'. After upgrading to Fedora 42, the test started to fail with the error "test.pylib.rest_client.HTTPError: HTTP error 404" with uri=`http://...[SELECT FROM {}-True-False].dev.1`. To prevent such errors, this commit changes the parameter to a string without spaces and such special characters. Fixes: scylladb/scylladb#24124 Closes scylladb/scylladb#24130	2025-05-13 21:44:15 +03:00
Benny Halevy	2ceecc9d2a	generic_server: server: do_accepts: prevent gate_closed_exception do_accepts might be called after `_gate` was closed. In this case it should just return early rather than throw gate_closed_exception, similar to how it breaks from the infinite for loop when the _gate is closed. With this change, do_accepts (and consequently, _listeners_stopped), should never fail as it catches and ignores all exceptions in the loop. Fixes #23775 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23818	2025-05-13 20:00:04 +03:00
Pavel Emelyanov	c0796244bb	nodetool: Add refresh --skip-cleanup option The option "conflicts" with load-and-stream. Tests and doc included. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-13 19:07:38 +03:00
Pavel Emelyanov	1b1f653699	api: Introduce skip_cleanup query parameter Just copy the load_and_stream and primary_replica_only logic, this new option is the same in this sense. Throw if it's specified with the load_and_stream one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-13 17:06:28 +03:00
Pavel Emelyanov	ed3ce0f6af	distributed_loader: Don't create owned ranges if skip-cleanup is true In order to make reshard compaction task run cleanup, the owner-ranges pointer is passed to it. If it's nullptr, the cleanup is not performed. So to do the skip-cleanup, the easiest (but not the most apparent) way is not to initialize the pointer and keep it nullptr. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-13 16:52:15 +03:00
Pavel Emelyanov	4ab049ac8d	code: Push bool skip_cleanup flag around Just put the boolean into the callstack between API and distributed loader to reduce the churn in the next patches. No functional changes, flag is false and unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-13 16:51:21 +03:00
Dawid Mędrek	9ebd6df43a	locator/production_snitch_base: Reduce log level when property file incomplete We're reducing the log level in case the provided property file is incomplete. The rationale behind this change is related to how CCM interacts with Scylla: * The `GossipingPropertyFileSnitch` reloads the `cassandra-rackdc.properties` configuration every 60 seconds. * When a new node is added to the cluster, CCM recreates the `cassandra-rackdc.properties` file for EVERY node. If those two processes start happening at about the same time, it may lead to Scylla trying to read a not-completely-recreated file, and an error will be produced. Although we would normally fix this issue and try to avoid the race, that behavior will no longer be relevant as we're making the rack and DC values immutable (cf. scylladb/scylladb#23278). What's more, trying to fix the problem in the older versions of Scylla could bring a more serious regression. Having that in mind, this commit is a compromise between making CI less flaky and having minimal impact when backported. We do the same for when the format of the file is invalid: the rationale is the same. We also do that for when there is a double declaration. Although it seems impossible that this can stem from the same scenario the other two errors can (since if the format of the file is valid, the error is justified; if the format is invalid, it should be detected sooner than a doubled declaration), let's stay consistent with the logging level. Fixes scylladb/scylladb#20092 Closes scylladb/scylladb#23956	2025-05-13 13:59:39 +03:00
Andrei Chekun	c33c0d62e1	test.py: change pattern for cleaning .log files in testlog directory Currently, test.py will recursively delete all .log files under the testlog directory instead of cleaning only the testlog directory itself. With this change it will not go deeper to delete log files. We still have a method for cleaning the log files in modes directories. The downside of this solution is that we will need to explicitly list all directories that we want to clean. Fixes: https://github.com/scylladb/scylladb/issues/24001 Closes scylladb/scylladb#24004	2025-05-13 13:58:36 +03:00
Anna Stuchlik	eed8373b77	doc: remove the redundant pages This commit removes two redundant pages and adds the related redirections. - The Tutorials page is a duplicate and is not maintained anymore. Having it in the docs hurts the SEO of the up-to-date Tutorials page. - The Contributing page is not helpful. Contributions-related information should be maintained in the project README file. Fixes https://github.com/scylladb/scylladb/issues/17279 Fixes https://github.com/scylladb/scylladb/issues/24060 Closes scylladb/scylladb#24090	2025-05-13 13:29:04 +03:00
Andrei Chekun	747f2b1301	docs: add more steps in installation of test.py Documentation for --gather-metric parameter was missing. This functionality can break regular flow of using test.py, because of possible misconfiguration of the cgroup on the local machine. Added explanation how to deal with potential issue of gathering metrics functionality and how to switch it off. Fixes: https://github.com/scylladb/scylladb/issues/20763 Closes scylladb/scylladb#24095	2025-05-13 13:08:18 +03:00
Ernest Zaslavsky	2d5c0f0cfd	encryption_test: Catch exact exception Apparently `test_kms_network_error` will succeed under any circumstances since most of our exceptions derive from `std::exception`, so whatever happens to the test, for whatever reason it throws, the test will be marked as passed. Start catching the exact exception that we expect to be thrown. Closes scylladb/scylladb#24065	2025-05-13 12:55:19 +03:00
Ernest Zaslavsky	4a7c847cba	database_test: Wait for the index to be created Just call `wait_until_built` for the index in question fix: https://github.com/scylladb/scylladb/issues/24059 Closes scylladb/scylladb#24117	2025-05-13 11:40:55 +03:00
Petr Gusev	f245b05022	raft_sys_table_storage: avoid temporary buffer when deserializing log_entry The get_blob() method linearizes data by copying it into a single buffer, which can trigger "oversized allocation" warnings. This commit avoids that extra copy by creating an input stream directly over the original fragmented managed bytes returned by untyped_result_set_row::get_view(). Fixes scylladb/scylladb#23903	2025-05-13 10:33:57 +02:00
Petr Gusev	6496ae6573	serializer_impl.hh: add as_input_stream(managed_bytes_view) overload It's useful to have it here so that people can find it easily.	2025-05-13 10:32:32 +02:00
Wojciech Mitros	bceb64fb5a	test_mv_tablets_replace: wait for tablet replicas to balance before working on them In the test test_tablet_mv_replica_pairing_during_replace we stop 2 out of 4 servers while using RF=2. Even though in the test we use exactly 4 tablets (1 for each replica of a base table and view), initially, the tablets may not be split evenly between all nodes. Because of this, even when we choose a server that hosts the view and a different server that hosts the base table, we sometimes stopped all replicas of the base or the view table because the node with the base table replica may also be a view replica. After some time, the tablets should be distributed across all nodes. When that happens, there will be no common nodes with a base and view replica, so the test scenario will continue as planned. In this patch, we add this waiting period after creating the base and view, and continue the test only when all 4 tablets are on distinct nodes. Fixes https://github.com/scylladb/scylladb/issues/23982 Fixes https://github.com/scylladb/scylladb/issues/23997 Closes scylladb/scylladb#24111	2025-05-12 16:17:48 +02:00
Nadav Har'El	248688473d	build: when compiling without -g, don't leave debugging information If Scylla is compiled without "-g" (this is, for example, the default in dev build mode), any static library that we link with it and contains any debugging information will cause the resulting executable to incorrectly look (e.g., to file(1) or to gdb) like it has debugging information. For more than three years now (see #10863 for historical context), the wasmtime.a library, which has debugging symbols, has caused this to happen. In this patch, if a certain build is compiled WITHOUT "-g", we add the "--strip-debug" option to the linker to remove the partial debugging information from the executable. Note that --strip-debug is not added in build modes which do use "-g", or if the user explicitly asked to add -g (e.g., "configure.py --cflags=-g"). Before this patch: $ file build/dev/scylla build/dev/scylla: ELF 64-bit LSB executable ... , with debug_info, not stripped After this patch: $ file build/dev/scylla build/dev/scylla: ELF 64-bit LSB executable ... , not stripped Fixes #23832. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23840	2025-05-12 15:42:17 +03:00
Ujjawal Kumar	35cd200789	ent/encryption/kms_host.cc: Change regex pattern to include hyphens in AWS profile names. Fixes #22430 Closes scylladb/scylladb#23805	2025-05-12 15:41:00 +03:00
Botond Dénes	746382257c	Merge 'compress: fix an internal error when a specific debug log is enabled' from Michał Chojnowski compress: fix an internal error when a specific debug log is enabled While iterating over the recent `69684e16d8`, series I shot myself in the foot by defining `algorithm_to_name(algorithm::none)` to be an internal error, and later calling that anyway in a debug log. (Tests didn't catch it because there's no test which simultaneously enables the debug log and configures some table to have no compression). This proves that `algorithm_to_name` is too much of a footgun. Fix it so that calling `algorithm_to_name(algorithm::none)` is legal. In hindsight, I should have done that immediately. Fixes #23624 Fix for recently-added code, no backporting needed. Closes scylladb/scylladb#23625 * github.com:scylladb/scylladb: test_sstable_compression_dictionaries: reproduce an internal error in debug logging compress: fix an internal error when a specific debug log is enabled	2025-05-12 15:40:12 +03:00
Calle Wilund	b28413890b	encryption_at_rest_test: Add test cases for bad KMIP config on reboot Refs scylladb/scylla-enterprise#5321 Adds two small test cases, for slight variations on KMIP host config being missing when rebooting a node, and table/sstable resolution failing due to this. Mainly to verify that we fail as expected, without crashing. Closes scylladb/scylladb#23544	2025-05-12 15:39:05 +03:00
Nadav Har'El	7c24e09b0d	test/alternator: add some Alternator-over-HTTPS tests This patch adds a few tests for Alternator over HTTPS (encrypted HTTP, a.k.a. TLS or SSL). The tests are skipped unless run with "--https", so they will not be run in CI. Nevertheless, they are useful to improve our understanding on how DynamoDB works over HTTPS and can be a basis for adding more tests for HTTPS support. The included tests pass on both Alternator and AWS DynamoDB. One test checks that both TLS 1.2 and TLS 1.3 are properly supported, and if chosen by the client, are actually honored. The same test also checks that TLS 1.1 is not supported, and results in a proper error if attempted. Both AWS DynamoDB and Alternator support the same protocols. Another test verifies that HTTP (unencrypted) requests cannot be sent over an HTTPS port. This is important for security - an installation that chooses to allow only HTTPS wants users to only use encrypted connections, and would not want users to continue sending unencrypted requests to the HTTPS port. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23493	2025-05-12 15:38:33 +03:00
Kefu Chai	8320d703cd	scripts/open-coredump.sh: Add substitute-path hint in prompt message Add a substitute-path rule hint in the greeting message displayed before launching dbuild. This helps developers debug coredumps by correctly mapping source files. Background: - Scylla's Jenkins builds typically occur in /jenkins/workspace/scylla-${branch}/next - When debugging locally, source paths need remapping to match the build environment - The substitute-path rule allows GDB to locate source files correctly This change improves developer experience by providing the appropriate path substitution command directly in the prompt. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23038	2025-05-12 15:37:59 +03:00
Kefu Chai	46f7ff6cfc	docs: nodetool: reference "nodetool task" page * Rewrite the documentation for the "nodetool restore" command. * Clarify the relationship between the `--nowait` flag and asynchronous operation. * Reference the "nodetool task" page for managing background tasks. Fixes scylladb#21888 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22023	2025-05-12 15:37:22 +03:00
Botond Dénes	dff7e2fc2f	Merge 'gossiper: failure_detector_loop_for_node: abort send_gossip_echo using abort_source' from Benny Halevy Currently send_gossip_echo has a 22 seconds timeout during which _abort_source is ignored. Use a function-local abort_source to abort send_gossip_echo either on timeout or if _abort_source requested abort, and co_return in the latter case. Closes scylladb/scylladb#12296 * github.com:scylladb/scylladb: gossiper: make send_gossip_echo cancellable gossiper: add send_echo helper idl, message: make with_timeout and cancellable verb attributes composable gossiper: failure_detector_loop_for_node: ignore abort_requested_exception gossiper: failure_detector_loop_for_node: check if abort_requested in loop condition	2025-05-12 15:35:30 +03:00
Pavel Emelyanov	5bd3df507e	sstables: Lazily access statistics for trace-level logging There's a message in sstable::get_gc_before_for_fully_expire() method that is trace-level and one of its arguments finds a value in sstable statistics. Finding the value is not quite cheap (makes a lookup in std::unordered_map) and for mostly-off trace messages is just a waste of cycles. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23910	2025-05-12 11:22:31 +03:00
Patryk Jędrzejczak	4d0538eecb	Merge 'test/cluster: Adjust tests to RF-rack-valid keyspaces' from Dawid Mędrek In this PR, we're adjusting most of the cluster tests so that they pass with the `rf_rack_valid_keyspaces` configuration option enabled. In most cases, the changes are straightforward and require little to no additional insight into what the tests are doing or verifying. In some, however, doing that does require a deeper understanding of the tests we're modifying. The justification for those changes and their correctness is included in the commit messages corresponding to them. Note that this PR does not cover all of the cluster tests. There are a few remaining ones, but they require a bit more effort, so we delegate that work to a separate PR. I tested all of the modified tests locally with `rf_rack_valid_keyspaces` set to true, and they all passed. Fixes scylladb/scylladb#23959 Backport: we want to backport these changes to 2025.1 since that's the version where we introduced RF-rack-valid keyspaces. Although the tests are not, by default, run with `rf_rack_valid_keyspaces` enabled yet, that will most likely change in the near future and we'll want to backport those changes too. The reason for this is that we want to verify that Scylla works correctly even with that constraint. 
Closes scylladb/scylladb#23661 * https://github.com/scylladb/scylladb: test/cluster/suite.yaml: Enable rf_rack_valid_keyspaces in suite test/cluster: Disable rf_rack_valid_keyspaces in problematic tests test/cluster/test_tablets: Divide rack into two to adjust tests to RF-rack-validity test/cluster/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity test/cluster/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair test/cluster/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity test/cluster/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity test/cluster/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity test/cluster/test_not_enough_token_owners.py: Adjust to RF-rack-validity test/cluster/test_multidc.py: Adjust to RF-rack-validity test/cluster/object_store/test_backup.py: Adjust to RF-rack-validity test/cluster: Adjust simple tests to RF-rack-validity	2025-05-12 09:41:07 +02:00
Aleksandra Martyniuk	2dcea5a27d	streaming: use host_id in file streaming Use host ids instead of ips in file-streaming. Fixes: #22421. Closes scylladb/scylladb#24055	2025-05-12 09:36:48 +03:00
Łukasz Paszkowski	113647550f	tools/scylla-nodetool: fix crash when rows_merged cells contain null Any empty object of the json::json_list type has its internal _set variable assigned to false which results in such objects being skipped by the json::json_builder. Hence, the json returned by the api GET /compaction_manager/compaction_history does not contain the field `rows_merged` if a cell in the system.compaction_history table is null or an empty list. In such cases, executing the command `nodetool compactionhistory` will result in a crash with the following error message: `error running operation: rjson::error (JSON assert failed on condition 'false'` The patch fixes it by checking if the json object contains the `rows_merged` element before processing. If the element does not exist, the nodetool will now produce an empty list. Fixes https://github.com/scylladb/scylladb/issues/23540 Closes scylladb/scylladb#23514	2025-05-12 09:00:48 +03:00
Avi Kivity	5e764d1de2	Merge 'Drop v2 and flat from reader and related names' from Botond Dénes Following a number of similar code cleanup PRs, this one aims to be the last one, definitely dropping flat from all reader and related names. Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant). The changes in this PR are entirely mechanical, mostly just search-and-replace. Code cleanup, no backport required. Closes scylladb/scylladb#24087 * github.com:scylladb/scylladb: test/boost/mutation_reader_another_test: drop v2 from reader and related names test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/ test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/ test/boost/mutation_test: s/consumer_v2/consumer/ test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/ readers/mutation_readers: s/generating_reader_v2/generating_reader/ readers/mutation_readers: s/delegating_reader_v2/delegating_reader/ readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/ readers/mutation_source: s/make_reader_v2/make_mutation_reader/ readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/ readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/ mutation/mutation_compactor: drop v2 from compactor and related names replica/table: s/make_reader_v2/make_mutation_reader/ mutation_writer: s/bucket_writer_v2/bucket_writer/ readers/queue: drop v2 from reader and related names readers/multishard: drop v2 from reader and related names readers/evictable: drop v2 from reader and related names readers/multi_range: remove flat from name	2025-05-11 22:22:35 +03:00
Botond Dénes	3ba5dd79e6	tools/scylla-nodetool: document exit codes in --help Closes scylladb/scylladb#24054	2025-05-11 22:18:29 +03:00
Dawid Mędrek	ee96f8dcfc	test/cluster/suite.yaml: Enable rf_rack_valid_keyspaces in suite Almost all of the tests have been adjusted to be able to be run with the `rf_rack_valid_keyspaces` configuration option enabled, while the rest, a minority, create nodes with it disabled. Thanks to that, we can enable it by default, so let's do that.	2025-05-10 16:30:51 +02:00
Dawid Mędrek	c4b32c38a3	test/cluster: Disable rf_rack_valid_keyspaces in problematic tests Some of the tests in the test suite have proven to be more problematic in adjusting to RF-rack-validity. Since we'd like to run as many tests as possible with the `rf_rack_valid_keyspaces` configuration option enabled, let's disable it in those. In the following commit, we'll enable it by default.	2025-05-10 16:30:49 +02:00
Dawid Mędrek	c8c28dae92	test/cluster/test_tablets: Divide rack into two to adjust tests to RF-rack-validity Three tests in the file use a multi-DC cluster. Unfortunately, they put all of the nodes in a DC in the same rack and because of that, they fail when run with the `rf_rack_valid_keyspaces` configuration option enabled. Since the tests revolve mostly around zero-token nodes and how they affect replication in a keyspace, this change should have zero impact on them.	2025-05-10 16:30:46 +02:00
Dawid Mędrek	04567c28a3	test/cluster/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity We reduce the number of nodes and the RF values used in the test to make sure that the test can be run with the `rf_rack_valid_keyspaces` configuration option. The test doesn't seem to be reliant on the exact number of nodes, so the reduction should not make any difference.	2025-05-10 16:30:43 +02:00
Dawid Mędrek	d3c0cd6d9d	test/cluster/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity The change boils down to matching the number of created racks to the number of created nodes in each DC in the auxiliary function `prepare_multi_dc_repair`. This way, we ensure that the created keyspace will be RF-rack-valid and so we can run the test file even with the `rf_rack_valid_keyspaces` configuration option enabled. The change has no impact on the tests that use the function; the distribution of nodes across racks does not affect how repair is performed or what the tests do and verify. Because of that, the change is correct.	2025-05-10 16:30:40 +02:00
Dawid Mędrek	5d1bb8ebc5	test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair We assign the newly created nodes to multiple racks. If RF <= 3, we create as many racks as the provided RF. We disallow the case of RF > 3 to avoid trying to create an RF-rack-invalid keyspace; note that no existing test calls `create_table_insert_data_for_repair` providing a higher RF. The rationale for doing this is we want to ensure that the tests calling the function can be run with the `rf_rack_valid_keyspaces` configuration option enabled.	2025-05-10 16:30:37 +02:00
Dawid Mędrek	92f7d5bf10	test/cluster/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity We assign the nodes to the same DC, but multiple racks to ensure that the created keyspace is RF-rack-valid and we can run the test with the `rf_rack_valid_keyspaces` configuration option enabled. The changes do not affect what the test does and verifies.	2025-05-10 16:30:34 +02:00
Dawid Mędrek	4c46551c6b	test/cluster/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity We simply assign the nodes used in the test to separate racks to ensure that the created keyspace is RF-rack-valid to be able to run the test with the `rf_rack_valid_keyspaces` configuration option set to true. The change does not affect what the test does and verifies -- it only depends on the type of nodes, whether they are normal token owners or not -- and so the changes are correct in that sense.	2025-05-10 16:30:31 +02:00
Dawid Mędrek	2882b7e48a	test/cluster/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity We parameterize the test so it's run with and without enforced RF-rack-valid keyspaces. In the test itself, we introduce a branch to make sure that we won't run into a situation where we're attempting to create an RF-rack-invalid keyspace. Since the `rf_rack_valid_keyspaces` option is not commonly used yet and because its semantics will most likely change in the future, we decide to parameterize the test rather than try to get rid of some of the test cases that are problematic with the option enabled.	2025-05-10 16:30:29 +02:00
Dawid Mędrek	73b22d4f6b	test/cluster/test_not_enough_token_owners.py: Adjust to RF-rack-validity We simply assign DC/rack properties to every node used in the test. We put all of them in the same DC to make sure that the cluster behaves as closely to how it would before these changes. However, we distribute them over multiple racks to ensure that the keyspace used in the test is RF-rack-valid, so we can also run it with the `rf_rack_valid_keyspaces` configuration option set to true. The distribution of nodes between racks has no effect on what the test does and verifies, so the changes are correct in that sense.	2025-05-10 16:30:26 +02:00
Dawid Mędrek	5b83304b38	test/cluster/test_multidc.py: Adjust to RF-rack-validity Instead of putting all of the nodes in a DC in the same rack in `test_putget_2dc_with_rf`, we assign them to different racks. The distribution of nodes in racks is orthogonal to what the test is doing and verifying, so the change is correct in that sense. At the same time, it ensures that the test never violates the invariant of RF-rack-valid keyspaces, so we can also run it with `rf_rack_valid_keyspaces` set to true.	2025-05-10 16:30:23 +02:00
Dawid Mędrek	9281bff0e3	test/cluster/object_store/test_backup.py: Adjust to RF-rack-validity We modify the parameters of `test_restore_with_streaming_scopes` so that it now represents a pair of values: topology layout and the value `rf_rack_valid_keyspaces` should be set to. Two of the already existing parameters violate RF-rack-validity and so the test would fail when run with `rf_rack_valid_keyspaces: true`. However, since the option isn't commonly used yet and since the semantics of RF-rack-valid keyspaces will most likely change in the future, let's keep those cases and just run them with the option disabled. This way, we still test everything we can without running into undesired failures that don't indicate anything.	2025-05-10 16:30:20 +02:00
Dawid Mędrek	dbb8835fdf	test/cluster: Adjust simple tests to RF-rack-validity We adjust all of the simple cases of cluster tests so they work with `rf_rack_valid_keyspaces: true`. It boils down to assigning nodes to multiple racks. For most of the changes, we do that by: * Using `pytest.mark.prepare_3_racks_cluster` instead of `pytest.mark.prepare_3_nodes_cluster`. * Using an additional argument -- `auto_rack_dc` -- when calling `ManagerClient::servers_add()`. In some cases, we need to assign the racks manually, which may be less obvious, but in every such situation, the tests didn't rely on that assignment, so that doesn't affect them or what they verify.	2025-05-10 16:30:18 +02:00
Botond Dénes	911aa64043	test/boost/mutation_reader_another_test: drop v2 from reader and related names For the test case test_mutation_reader_from_mutations_as_mutation_source, the v1/v2 distinction was hiding two identical test cases. One was removed.	2025-05-09 07:53:30 -04:00
Botond Dénes	466a8a2b64	test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/	2025-05-09 07:53:30 -04:00
Botond Dénes	30625a6ef7	test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/	2025-05-09 07:53:30 -04:00
Botond Dénes	1169ac6ac8	test/boost/mutation_test: s/consumer_v2/consumer/	2025-05-09 07:53:30 -04:00
Botond Dénes	17b667b116	test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/	2025-05-09 07:53:30 -04:00
Botond Dénes	5dd546ea2b	readers/mutation_readers: s/generating_reader_v2/generating_reader/	2025-05-09 07:53:30 -04:00
Botond Dénes	75fddbc078	readers/mutation_readers: s/delegating_reader_v2/delegating_reader/	2025-05-09 07:53:30 -04:00
Botond Dénes	2fc3e52b2b	readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/	2025-05-09 07:53:29 -04:00
Botond Dénes	674d41e3e6	readers/mutation_source: s/make_reader_v2/make_mutation_reader/	2025-05-09 07:53:29 -04:00
Botond Dénes	327867aa8a	readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/	2025-05-09 07:53:29 -04:00
Botond Dénes	efc48caea5	readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/	2025-05-09 07:53:29 -04:00
Botond Dénes	7af0690762	mutation/mutation_compactor: drop v2 from compactor and related names	2025-05-09 07:53:29 -04:00
Botond Dénes	b5170e27d0	replica/table: s/make_reader_v2/make_mutation_reader/	2025-05-09 07:53:29 -04:00
Botond Dénes	cc95dc8756	mutation_writer: s/bucket_writer_v2/bucket_writer/	2025-05-09 07:53:29 -04:00
Botond Dénes	3d2651e07c	readers/queue: drop v2 from reader and related names	2025-05-09 07:53:29 -04:00
Botond Dénes	ca7f557e86	readers/multishard: drop v2 from reader and related names	2025-05-09 07:53:29 -04:00
Botond Dénes	4d92bc8b2f	readers/evictable: drop v2 from reader and related names	2025-05-09 07:53:28 -04:00
Botond Dénes	7ba3c3fec3	readers/multi_range: remove flat from name	2025-05-09 07:53:25 -04:00
Avi Kivity	092a88c9b9	dist: drop the scylla-env package scylla-env was used to glue together support for older distributions. It hasn't been used for many years. Remove it. Closes scylladb/scylladb#23985	2025-05-09 14:10:00 +03:00
Raphael S. Carvalho	28056344ba	replica: Fix take_storage_snapshot() running concurrently to merge completion Some background: When merge happens, a background fiber wakes up to merge compaction groups of sibling tablets into main one. It cannot happen when rebuilding the storage group list, since token metadata update is not preemptable. So a storage group, post merge, has the main compaction group and two other groups to be merged into the main. When the merge happens, those two groups are empty and will be freed. Consider this scenario: 1) merge happens, from 2 to 1 tablet 2) produces a single storage group, containing main and two other compaction groups to be merged into main. 3) take_storage_snapshot(), triggered by migration post merge, gets a list of pointers to all compaction groups. 4) t__s__s() iterates first on main group, yields. 5) background fiber wakes up, moves the data into main and frees the two groups 6) t__s__s() advances to other groups that are now freed, since step 5. 7) segmentation fault In addition to memory corruption, there's also a potential for data to escape the iteration in take_storage_snapshot(), since data can be moved across compaction groups in background, all belonging to the same storage group. That could result in data loss. Readers should all operate on storage group level since it can provide a view on all the data owned by a tablet replica. The movement of sstable from group A to B is atomic, but iteration first on A, then later on B, might miss data that was moved from B to A, before the iteration reached B. By switching to storage group in the interface that retrieves groups by token range, we guarantee that all data of a given replica can be found regardless of which compaction group they sit on. Fixes #23162. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24058	2025-05-09 14:07:06 +03:00
Gleb Natapov	c6e1758457	topology coordinator: make decommissioning node non voter before completing the operation A decommissioned node is removed from a raft config after the operation is marked as completed. This is required since otherwise the decommissioned node will not see that decommission has completed (the status is propagated through raft). But right after the decommission is marked as completed a decommissioned node may terminate, so in case of a two node cluster, the configuration change that removes it from the raft will fail, because there will be no quorum. The solution is to mark the decommissioning node as non voter before reporting the operation as completed. Fixes: #24026 Backport to 2025.2 because it fixes a potential hang. Don't backport to branches older than 2025.2 because they don't have `8b186ab0ff`, which caused this issue. Closes scylladb/scylladb#24027	2025-05-09 12:43:31 +02:00
Tomasz Grabiec	be2c3ad6fd	Merge 'logalloc_test: don't test performance in test background_reclaim' from Michał Chojnowski The test is failing in CI sometimes due to performance reasons. There are at least two problems: 1. The initial 500ms (wall time) sleep might be too short. If the reclaimer doesn't manage to evict enough memory during this time, the test will fail. 2. During the 100ms (thread CPU time) window given by the test to background reclaim, the `background_reclaim` scheduling group isn't actually guaranteed to get any CPU, regardless of shares. If the process is switched out inside the `background_reclaim` group, it might accumulate so much vruntime that it won't get any more CPU again for a long time. We have seen both. This kind of timing test can't be run reliably on overcommitted machines without modifying the Seastar scheduler to support that (by e.g. using thread clock instead of wall time clock in the scheduler), and that would require an amount of effort disproportionate to the value of the test. So for now, to unflake the test, this patch removes the performance test part. (And the tradeoff is a weakening of the test). After the patch, we only check that the background reclaim happens eventually. Fixes https://github.com/scylladb/scylladb/issues/15677 Backporting this is optional. The test is flaky even in stable branches, but the failure is rare. Closes scylladb/scylladb#24030 * github.com:scylladb/scylladb: logalloc_test: don't test performance in test `background_reclaim` logalloc: make background_reclaimer::free_memory_threshold publicly visible	2025-05-09 11:35:02 +02:00
Patryk Jędrzejczak	be4532bcec	Merge 'Correctly skip updating node's own ip address due to outdated gossiper data' from Gleb Natapov Use host id to check if the update is for the node itself. Using IP is unreliable since if a node is restarted with different IP a gossiper message with previous IP can be misinterpreted as belonging to a different node. Fixes: #22777 Backport to 2025.1 since this fixes a crash. Older versions do not have the code. Closes scylladb/scylladb#24000 * https://github.com/scylladb/scylladb: test: add reproducer for #22777 storage_service: Do not remove gossiper entry on address change storage_service: use id to check for local node	2025-05-09 11:28:21 +02:00
Andrzej Jackowski	f53d733e89	docs: lwt: add two missing spaces Due to missing spaces, two example queries were not displayed in the rendered version of the document. As a result, the `SELECT * FROM movies.nowshowing;` query in step 6 returned 6 rows instead of the expected 8 rows.	2025-05-09 08:42:15 +02:00
Piotr Smaron	f740f9f0e1	cql: fix CREATE tablets KS warning msg Materialized Views and Secondary Indexes are two more features that keyspaces with tablets do not support, but these were not listed in the warning message returned to the user on a CREATE KEYSPACE statement. This commit adds the 2 missing features. Fixes: #24006 Closes scylladb/scylladb#23902	2025-05-08 17:18:43 +02:00
Tomasz Grabiec	fadfbe8459	Merge 'transport: storage_proxy: release ERM when waiting for query timeout' from Andrzej Jackowski Before this change, if a read executor had just enough targets to achieve query's CL, and there was a connection drop (e.g. node failure), the read executor waited for the entire request timeout to give drivers time to execute a speculative read in the meantime. Such behavior doesn't work well when a very long query timeout (e.g. 1800s) is set, because the unfinished request blocks topology changes. This change implements a mechanism to throw a new read_failure_exception_with_timeout in the aforementioned scenario. The exception is caught by CQL server which conducts the waiting, after ERM is released. The new exception inherits from read_failure_exception, because layers that don't catch the exception (such as mapreduce service) should handle the exception just as a regular read_failure. However, when CQL server catches the exception, it returns read_timeout_exception to the client because after additional waiting such an error message is more appropriate (read_timeout_exception was also returned before this change was introduced). This change: - Rewrite cql_server::connection::process_request_one to use seastar::futurize_invoke and try_catch<> instead of utils::result_try - Add new read_failure_exception_with_timeout and throw it in storage_proxy - Add sleep in CQL server when the new exception is caught - Catch local exceptions in Mapreduce Service and convert them to std::runtime_error. - Add get_cql_exclusive to manager_client.py - Add test_long_query_timeout_erm No backport needed - minor issue fix. Closes scylladb/scylladb#23156 * github.com:scylladb/scylladb: test: add test_long_query_timeout_erm test: add get_cql_exclusive to manager_client.py mapreduce: catch local read_failure_exception_with_timeout transport: storage_proxy: release ERM when waiting for query timeout transport: remove redundant references in process_request_one transport: fix the indentation in process_request_one transport: add futures in CQL server exception handling	2025-05-08 12:45:49 +02:00
Avi Kivity	2d2a2ef277	tools: toolchain: dbuild: support nested containers Pass through the local containers directory (it cannot be bind-mounted to /var/lib/containers since podman checks the path hasn't changed) with overrides to the paths. This allows containers to be created inside the dbuild container, so we can enlist pre-packaged software (such as opensearch) in test.py. If the container images are already downloaded in the host, they won't be downloaded again. It turns out that the container ecosystem doesn't support nested network namespaces well, so we configure the outer container to use host networking for the inner containers. It's useful anyway. The frozen toolchain now installs podman and buildah so there's something to actually drive those nested containers. We disable weak dnf dependencies to avoid installing qemu. The frozen toolchain is regenerated with optimized clang from https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz Closes scylladb/scylladb#24020	2025-05-08 13:00:16 +03:00
Botond Dénes	4a802baccb	Merge 'compress: make sstable compression dictionaries NUMA-aware' from Michał Chojnowski compress: distribute compression dictionaries over shards We don't want each shard to have its own copy of each dictionary. It would put unnecessary pressure on cache and memory. Instead, we want to share dictionaries between shards. Before this commit, all dictionaries live on shard 0. All other shards borrow foreign shared pointers from shard 0. There's a problem with this setup: dictionary blobs receive many random accesses. If shard 0 is on a remote NUMA node, this could pose a performance problem. Therefore, for each dictionary, we would like to have one copy per NUMA node, not one copy per the entire machine. And each shard should use the copy belonging to its own NUMA node. This is the main goal of this patch. There is another issue with putting all dicts on shard 0: it eats an asymmetric amount of memory from shard 0. This commit spreads the ownership of dicts over all shards within the NUMA group, to make the situation more symmetric. (Dict owner is decided based on the hash of dict contents). It should be noted that the last part isn't necessarily a good thing, though. While it makes the situation more symmetric within each node, it makes it less symmetric across the cluster, if different node sizes are present. If dicts occupy 1% of memory on each shard of a 100-shard node, then the same dicts would occupy 100% of memory on a 1-shard node. So for the sake of cluster-wide symmetry, we might later want to consider e.g. making the memory limit for dictionaries inversely proportional to the number of shards. New functionality, added to a feature which isn't in any stable branch yet. No backporting. Closes scylladb/scylladb#23590 * github.com:scylladb/scylladb: test: add test/boost/sstable_compressor_factory_test compress: add some test-only APIs compress: rename sstable_compressor_factory_impl to dictionary_holder compress: fix indentation compress: remove sstable_compressor_factory_impl::_owner_shard compress: distribute compression dictionaries over shards test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version test: remove sstables::test_env::do_with()	2025-05-08 09:52:46 +03:00
Botond Dénes	e5d944f986	Merge 'replica: Fix use-after-free with concurrent schema change and sstable set update' from Raphael Raph Carvalho When schema is changed, sstable set is updated according to the compaction strategy of the new schema (no changes to set are actually made, just the underlying set type is updated), but the problem is that it happens without a lock, causing a use-after-free when running concurrently to another set update. Example: 1) A: sstable set is being updated on compaction completion 2) B: schema change updates the set (it's non deferring, so it happens in one go) and frees the set used by A. 3) when A resumes, system will likely crash since the set is freed already. ASAN screams about it: SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ... Fix is about deferring update of the set on schema change to compaction, which is triggered after new schema is set. Only strategy state and backlog tracker are updated immediately, which is fine since strategy doesn't depend on any particular implementation of sstable set. Fixes #22040. Closes scylladb/scylladb#23680 * github.com:scylladb/scylladb: replica: Fix use-after-free with concurrent schema change and sstable set update sstables: Implement sstable_set_impl::all_sstable_runs()	2025-05-08 06:56:16 +03:00
Petr Gusev	e6c3f954f6	main: check if current process group controls stdin tty test.py doesn't override stdin when starting Scylla, so when tests are run from a terminal, isatty() returns true and parsed command line output is not printed, which is inconvenient. In this commit we add a check if the current process group controls the stdin terminal. This serves two purposes: * improves the "interactive mode" check from #scylladb/scylladb#18309, as only the controlling process group can interact with the terminal. * solves the test.py problem above, because test.py runs scylla in a new session/process group (it calls setsid after fork), and is now correctly not considered interactive. Closes scylladb/scylladb#24047	2025-05-08 06:52:48 +03:00
Michał Chojnowski	746ec1d4e4	test/boost/mvcc_test: fix an overly-strong assertion in test_snapshot_cursor_is_consistent_with_merging The test checks that merging the partition versions on-the-fly using the cursor gives the same results as merging them destructively with apply_monotonically. In particular, it tests that the continuity of both results is equal. However, there's a subtlety which makes this not true. The cursor puts empty dummy rows (i.e. dummies shadowed by the partition tombstone) in the output. But the destructive merge is allowed (as an exception to the general rule, for optimization reasons), to remove those dummies and thus reduce the continuity. So after this patch we instead check that the output of the cursor has continuity equal to the merged continuities of versions. (Rather than to the continuity of merged versions, which can be smaller as described above). Refs https://github.com/scylladb/scylladb/pull/21459, a patch which did the same in a different test. Fixes https://github.com/scylladb/scylladb/issues/13642 Closes scylladb/scylladb#24044	2025-05-08 00:41:01 +02:00
Pavel Emelyanov	0a9675de01	sstable: Use fmt::to_string(sstable::filename()) to get component file path The stream sink abort() method wants to remove component file by its path. For that the path is calculated from storage prefix and component basename, but there's a filename() method for it already. SStable filenames shouldn't be considered as on-disk paths (see #23194), but places that want it should be explicit and format the filename to string by hand. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24039	2025-05-07 22:25:58 +03:00
Pavel Emelyanov	36baeaeb57	sstable: Move update_info_for_opened_data() method to private: block The method is internally called by sstable itself to refresh its state after opening or assigning (from foreign info) data and index files. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24041	2025-05-07 20:58:34 +03:00
Pavel Emelyanov	c2ecc45db8	sstable: Remove validate argument from sstable::load_metadata() There are only two callers of the method and the one that wants validation (the sstable::load()) can do it on its own. This helps the other caller (schema loader) being simpler and shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24038	2025-05-07 20:57:37 +03:00
Michał Chojnowski	f075674ebe	test: add test/boost/sstable_compressor_factory_test Add a basic test for NUMA awareness of `default_sstable_compressor_factory`.	2025-05-07 14:43:20 +02:00
Michał Chojnowski	518f04f1c4	compress: add some test-only APIs Will be needed by the test added in the next patch.	2025-05-07 14:43:20 +02:00
Michał Chojnowski	66a454f61d	compress: rename sstable_compressor_factory_impl to dictionary_holder Since sstable_compressor_factory_impl no longer implements sstable_compressor_factory, the name can be misleading. Rename it to something closer to its new role.	2025-05-07 14:43:20 +02:00
Michał Chojnowski	e952992560	compress: fix indentation Purely cosmetic.	2025-05-07 14:43:20 +02:00
Michał Chojnowski	6b831aaf1b	compress: remove sstable_compressor_factory_impl::_owner_shard Before the series, sstable_compressor_factory_impl was directly accessed by multiple shards. Now, it's a part of a `sharded` data structure and is never directly accessed from other shards, so there's no need to check for that. Remove the leftover logic.	2025-05-07 14:43:20 +02:00
Michał Chojnowski	1bcf77951c	compress: distribute compression dictionaries over shards We don't want each shard to have its own copy of each dictionary. It would put unnecessary pressure on cache and memory. Instead, we want to share dictionaries between shards. Before this commit, all dictionaries live on shard 0. All other shards borrow foreign shared pointers from shard 0. There's a problem with this setup: dictionary blobs receive many random accesses. If shard 0 is on a remote NUMA node, this could pose a performance problem. Therefore, for each dictionary, we would like to have one copy per NUMA node, not one copy per the entire machine. And each shard should use the copy belonging to its own NUMA node. This is the main goal of this patch. There is another issue with putting all dicts on shard 0: it eats an asymmetric amount of memory from shard 0. This commit spreads the ownership of dicts over all shards within the NUMA group, to make the situation more symmetric. (Dict owner is decided based on the hash of dict contents). It should be noted that the last part isn't necessarily a good thing, though. While it makes the situation more symmetric within each node, it makes it less symmetric across the cluster, if different node sizes are present. If dicts occupy 1% of memory on each shard of a 100-shard node, then the same dicts would occupy 100% of memory on a 1-shard node. So for the sake of cluster-wide symmetry, we might later want to consider e.g. making the memory limit for dictionaries inversely proportional to the number of shards.	2025-05-07 14:43:18 +02:00
Michał Chojnowski	8649adafa8	test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version In next patches, make_sstable_compressor_factory() will have to disappear. In preparation for that, we switch to a seastar::thread-dependent replacement.	2025-05-07 14:43:04 +02:00
Aleksandra Martyniuk	2549f5e16b	test_tablet_repair_hosts_filter: change injected error test_tablet_repair_hosts_filter checks whether the host filter specified for tablet repair is correctly persisted. To check this, we need to ensure that the repair is still ongoing and its data is kept. The test achieves that by failing the repair on replica side - as the failed repair is going to be retried. However, if the filter does not contain any host (included_host_count = 0), the repair is started on no replica, so the request succeeds and its data is deleted. The test fails if it checks the filter after repair request data is removed. Fail repair on topology coordinator side, so the request is ongoing regardless of the specified hosts. Fixes: #23986. Closes scylladb/scylladb#24003	2025-05-07 15:30:05 +03:00
Michał Chojnowski	0e4d0ded8d	test: remove sstables::test_env::do_with() `sstable_manager` depends on `sstable_compressor_factory&`. Currently, `test_env` obtains an implementation of this interface with the synchronous `make_sstable_compressor_factory()`. But after this patch, the only implementation of that interface `sstable_compressor_factory&` will use `sharded<...>`, so its construction will become asynchronous, and the synchronous `make_sstable_compressor_factory()` must disappear. There are several possible ways to deal with this, but I think the easiest one is to write an asynchronous replacement for `make_sstable_compressor_factory()` that will keep the same signature but will be only usable in a `seastar::thread`. All other uses of `make_sstable_compressor_factory()` outside of `test_env::do_with()` already are in seastar threads, so if we just get rid of `test_env::do_with()`, then we will be able to use that thread-dependent replacement. This is the purpose of this commit. We shouldn't be losing much.	2025-05-07 13:19:21 +02:00
Nadav Har'El	7ccf77b84f	test/alternator: another test for UpdateExpression's SET I found on StackOverflow an interesting discussion about the fact that DynamoDB's UpdateExpression documentation "recommends" to use SET instead of ADD, and the rather convoluted expression that is actually needed to emulate ADD using SET: ``` SET #count = if_not_exists(#count, :zero) + :one ``` https://stackoverflow.com/questions/14077414/dynamodb-increment-a-key-value Although we do have separate tests for the different pieces of that idiom - a SET with missing attribute or item, the if_not_exists() function, etc. - I thought it would be nice to have a dedicated test that verifies that this idiom actually works, and moreover that the more naive "SET #count = #count + :one" does NOT work if the item or the attribute are missing. Unsurprisingly, the new test passes on both Alternator and DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23963	2025-05-07 13:57:50 +03:00
Nadav Har'El	b4a9fe9928	test/alternator: another test for expression with a lot of ORs We already have a test, test_limits.py::test_deeply_nested_expression_2, which checks that in the long condition expression a<b or (a<b or (a<b or (a<b or (....)))) with more than MAX_DEPTH (=400) repeats is rejected by Alternator, as part of commit `04e5082d52` which restricted the depth of the recursive parser to prevent crashing Scylla. However, I got curious what will happen without the parentheses: a<b or a<b or a<b or a<b or ... It turns out that our parser actually parses this syntax without recursion - it's just a loop (a "*" in the Antlr alternator/expressions.g allows reading more and more ORs in a loop). So Alternator doesn't limit the length of this expression more than the length limit of 4096 bytes which we also have. We can fit 584 repeats in the above expression in 4096 bytes, and it will not be rejected even though 584 > 400. This test confirms that this is indeed the case. The test is Scylla-only because on DynamoDB, this expression is rejected because it has more than 300 "OR" operators. Scylla doesn't have this specific limit - we believe the other limitations (on total expression length, and on depth) are better for protecting Scylla. Remember that in an expression like "(((((((((((((" there is a very high recursion depth of the parser but zero operators, so counting the operators does nothing to protect Scylla. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23973	2025-05-07 13:57:18 +03:00
Piotr Dulikowski	156ff8798b	topology_coordinator: silence ERROR messages on abort When the topology coordinator is shut down while doing a long-running operation, the current operation might throw a raft::request_aborted exception. This is not a critical issue and should not be logged with ERROR verbosity level. Make sure that all the try..catch blocks in the topology coordinator which: - May try to acquire a new group0 guard in the `try` part - Have a `catch (...)` block that prints an ERROR-level message ...have a pass-through `catch (raft::request_aborted&)` block which does not log the exception. Fixes: scylladb/scylladb#22649 Closes scylladb/scylladb#23962	2025-05-07 13:51:41 +03:00
Aleksandra Martyniuk	20c2d6210e	streaming: skip dropped tables Currently, stream_session::prepare throws when a table in requests or summaries is dropped. However, we do not want to fail streaming if the table is dropped. Delete table checks from stream_session::prepare. Further streaming steps can handle the dropped table and finish the streaming successfully. Fixes: #15257. Closes scylladb/scylladb#23915	2025-05-07 11:51:56 +03:00
Anna Mikhlin	73b4c35601	Update ScyllaDB version to: 2025.3.0-dev	2025-05-07 11:43:11 +03:00
Pavel Emelyanov	6389099dfb	Merge 'test/cluster/test_read_repair.py: improve trace logging test (again)' from Botond Dénes The test test_read_repair_with_trace_logging wants to test read repair with trace logging. Turns out that node restart + trace-level logging + debug mode is too much and even with 1 minute timeout, the read repair times out sometimes. Refactor the test to use injection point instead of restart. To make sure the test still tests what it is supposed to test, use tracing to assert that read repair did indeed happen. Fixes: scylladb/scylladb#23968 Needs backport to 2025.1 and 6.2, both have the flaky test Closes scylladb/scylladb#23989 * github.com:scylladb/scylladb: test/cluster/test_read_repair.py: improve trace logging test (again) test/cluster: extract execute_with_tracing() into pylib/util.py	2025-05-07 10:32:45 +03:00
Botond Dénes	0a9ca52cfd	replica/database: memtable_list: save ref to memtable_table_shared_data This is passed by reference to the constructor, but a copy is saved into the _table_shared_data member. A reference to this member is passed down to all memtable readers. Because of the copy, the memtable readers save a reference to the memtable_list's member, which goes away together with the memtable_list when the storage_group is destroyed. This causes use-after-free when a storage group is destroyed while a memtable read is still ongoing. The memtable reader keeps the memtable alive, but its reference to the memtable_table_shared_data becomes stale. Fix by saving a reference in the memtable_list too, so memtable readers receive a reference pointing to the original replica::table member, which is stable across tablet migrations and merges. The copy was introduced by `2a76065e3d`. There was a copy even before this commit, but in the previous vnode-only world this was fine -- there was one memtable_list per table and it was around until the table itself was. In the tablet world, this is no longer given, but the above commit didn't account for this. A test is included, which reproduces the use-after-free on memtable migration. The test is somewhat artificial in that the use-after-free would be prevented by holding on to an ERM, but this is done intentionally to keep the test simple. Migration -- unlike merge where this use-after-free was originally observed -- is easy to trigger from unit tests. Fixes: #23762 Closes scylladb/scylladb#23984	2025-05-06 22:13:17 +03:00
Michał Chojnowski	1c1741cfbc	logalloc_test: don't test performance in test `background_reclaim` The test is failing in CI sometimes due to performance reasons. There are at least two problems: 1. The initial 500ms (wall time) sleep might be too short. If the reclaimer doesn't manage to evict enough memory during this time, the test will fail. 2. During the 100ms (thread CPU time) window given by the test to background reclaim, the `background_reclaim` scheduling group isn't actually guaranteed to get any CPU, regardless of shares. If the process is switched out inside the `background_reclaim` group, it might accumulate so much vruntime that it won't get any more CPU again for a long time. We have seen both. This kind of timing test can't be run reliably on overcommitted machines without modifying the Seastar scheduler to support that (by e.g. using thread clock instead of wall time clock in the scheduler), and that would require an amount of effort disproportionate to the value of the test. So for now, to unflake the test, this patch removes the performance test part. (And the tradeoff is a weakening of the test).	2025-05-06 18:59:18 +02:00
Michał Chojnowski	c47f438db3	logalloc: make background_reclaimer::free_memory_threshold publicly visible Wanted by the change to the background_reclaim test in the next patch.	2025-05-06 18:59:18 +02:00
David Garcia	b1ee0e2a6a	docs: fix AttributeError with 'myst_enable_extensions' in publication workflow Rolled back some dependencies in `poetry.lock` to previous versions while we investigate how to make the extension `sphinx_scylladb_markdown` compatible with the latest versions. This should fix the error in https://github.com/scylladb/scylladb/actions/runs/14708656912/job/41275115239, which currently prevents publishing new versions of https://opensource.docs.scylladb.com/ Closes scylladb/scylladb#23969	2025-05-06 16:33:00 +03:00
Pavel Emelyanov	1b5bbc2433	Merge 'test.py: split boost pytest integration' from Andrei Chekun This PR contains changes that do not add new functionality, and have small refactoring of the existing code. The most significant change is the refactoring of resource gathering, so it will not create another cgroup to put itself in. So there will be no nested redundant 'initial' groups, e.g. `/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/initial/initial/initial.../initial` This is part two of splitting the original PR. This PR is an extraction of several commits from https://github.com/scylladb/scylladb/pull/22894 as reviewer https://github.com/scylladb/scylladb/pull/22894?notification_referrer_id=NT_kwDOACiLR7MxNDg0ODk2MDU1MjoyNjU3MDk1&notifications_query=reason%3Aparticipating#pullrequestreview-2778582278. Closes scylladb/scylladb#23882 * github.com:scylladb/scylladb: test.py: add awareness of extra_scylla_cmdline_options test.py: increase timeout for C++ tests in pytest test.py: switch method of finding the root repo directory test.py: move get_combined_tests to the correct facade test.py: add common directory for reports test.py: add the possibility to provide additional env vars test.py: move setup cgroups to the generic method test.py: refactor resource_gather.py	2025-05-06 16:22:49 +03:00
Raphael S. Carvalho	434c2c4649	replica: Fix use-after-free with concurrent schema change and sstable set update When schema is changed, sstable set is updated according to the compaction strategy of the new schema (no changes to set are actually made, just the underlying set type is updated), but the problem is that it happens without a lock, causing a use-after-free when running concurrently to another set update. Example: 1) A: sstable set is being updated on compaction completion 2) B: schema change updates the set (it's non deferring, so it happens in one go) and frees the set used by A. 3) when A resumes, system will likely crash since the set is freed already. ASAN screams about it: SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ... Fix is about deferring update of the set on schema change to compaction, which is triggered after new schema is set. Only strategy state and backlog tracker are updated immediately, which is fine since strategy doesn't depend on any particular implementation of sstable set, since patch "sstables: Implement sstable_set_impl::all_sstable_runs()". Fixes #22040. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-05-06 10:06:55 -03:00
Raphael S. Carvalho	628bec4dbd	sstables: Implement sstable_set_impl::all_sstable_runs() With upcoming change where table::set_compaction_strategy() might delay update of sstable set, ICS might temporarily work with sstable set implementations other than partitioned_sstable_set. ICS relies on all_sstable_runs() during regular compaction, and today it triggers bad_function_call exception if not overridden by set implementation. To remove this strong dependency between compaction strategy and a particular set implementation, let's provide a default implementation of all_sstable_runs(), such that ICS will still work until the set is updated eventually through a process that adds or removes an sstable. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-05-06 10:06:06 -03:00
Botond Dénes	3c3f6ca233	tools/scylla-sstable: scrub: use UUID sstable identifiers Much easier to avoid sstable collisions. Makes it possible to scrub multiple sstables, with multiple calls to scylla-sstable, reusing the same output directory. Previously, each new call to scylla-sstable scrub, would start from generation 0, guaranteeing collision. Remove the unit test for generation clash -- with UUID generations, this is no longer possible to reproduce in practice. Refs: #21387 Closes scylladb/scylladb#23990	2025-05-06 15:09:53 +03:00
Patryk Jędrzejczak	7f843e0a5c	Merge 'raft: make sure to retain the existing voters including the current leader (topology coordinator)' from Emil Maskovsky Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters. Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters. Improved the prioritization logic to account for the number of existing assigned voters in each data center and rack. Additionally, the limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the topology coordinator, triggering unnecessary Raft leader re-election. To address this, the topology coordinator's votership status is now preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the existing topology coordinator is prioritized for removal. This change ensures a more stable voter distribution and reduces unnecessary voter reassignments. The limited voters calculator is refactored to use a priority queue for sorting nodes by their priorities. This change simplifies the voter selection logic and makes it more extensible for future enhancements, such as supporting more complex priority calculations. Fixes: scylladb/scylladb#23950 Fixes: scylladb/scylladb#23588 Fixes: scylladb/scylladb#23786 No backport: The limited voters feature is currently only present in master. 
Closes scylladb/scylladb#23888 * https://github.com/scylladb/scylladb: raft: ensure topology coordinator retains votership raft: retain existing voters across data centers and racks raft: refactor limited voters calculator to prioritize nodes raft: replace pointer with reference for non-null output parameter raft: reduce code duplication in group0 voter handler raft: unify and optimize datacenter and rack info creation	2025-05-06 13:49:55 +02:00
Nadav Har'El	252c5b5c9d	Merge 'Alternator batch_write_item wcu' from Amnon Heiman This series adds support for WCU tracking in batch_write_item and tests it. The patches include: Switch the metrics (RCU and WCU) to count units vs half-units as they were, to make the metrics clearer for users. Adding a public static get_half_units function to wcu_consumed_capacity_counter for use by batch write item, which cannot directly use the counter object. Adding WCU calculation support to batch_write_item, based on item size for puts and a fixed 1 WCU for deletes. WCU metrics are updated, and consumed capacity is returned per table when requested. The return handling was refactored to be coroutine-like for easier management of the consumed capacity array. Adding tests that validate WCU calculation for batch put requests on a single table and across multiple tables, ensuring delete operations are counted correctly. Adding a test that validates that WCU metrics are updated correctly during batch write item operations, ensuring the WCU of each item is calculated independently. Need backport, WCU is partially supported, and is missing from batch_write_item Fixes #23940 Closes scylladb/scylladb#23941 * github.com:scylladb/scylladb: alternator/test_metrics.py: batch_write validate WCU alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU alternator/executor: add WCU for batch_write_items alternator/consumed_capacity: make wcu get_units public Alternator: Change the WCU/RCU to use units	2025-05-06 13:31:53 +03:00
Gleb Natapov	7403de241c	test: add reproducer for #22777 Add sleep before starting gossiper to increase a chance of getting old gossiper entry about yourself before updating local gossiper info with new IP address.	2025-05-06 11:21:17 +03:00
Botond Dénes	29eedaa0e5	test/cluster/test_read_repair.py: improve trace logging test (again) The test test_read_repair_with_trace_logging wants to test read repair with trace logging. Turns out that node restart + trace-level logging + debug mode is too much and even with 1 minute timeout, the read repair times out sometimes. Refactor the test to use injection point instead of restart. To make sure the test still tests what it is supposed to test, use tracing to assert that read repair did indeed happen.	2025-05-06 01:35:17 -04:00
Avi Kivity	fc2204cea0	Merge ' test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits' from Botond Dénes This test has multiple problems: * has 3 embedded loops to run different scenarios, ignores variable from 2 of these, running with hardcoded settings instead * initializes misses and lookups to 0 at the start of each scenario, this throws off per-page increment checks, when the previous scenario moved these metrics and they don't start from 0; this causes the test to sometimes fail * duplicate check of drops == 0 (just cosmetic) Fix all three problems, the second is especially important because it made the test flaky. Additionally, ensure the test will keep using vnodes in the future, by explicitly creating a vnodes keyspace for them. Fixes: #16794 Test fix, not a backport candidate normally, we can backport to 2025.1 if the test becomes too unstable there Closes scylladb/scylladb#23783 * github.com:scylladb/scylladb: test/boost/multishard_mutation_query_test: ensure test runs with vnodes test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits	2025-05-05 20:49:03 +03:00
Emil Maskovsky	24dfd2034b	raft: ensure topology coordinator retains votership The limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the current topology coordinator, triggering an unnecessary Raft leader re-election. This change ensures that the existing topology coordinator's votership status is preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the topology coordinator is prioritized for removal. This helps maintain stability in the cluster by avoiding unnecessary leader re-elections. Additionally, only the alive leader node is considered relevant for this logic. A dead existing leader (topology coordinator) is excluded from consideration, as it is already in the process of losing leadership. Fixes: scylladb/scylladb#23588 Fixes: scylladb/scylladb#23786	2025-05-05 16:58:34 +02:00
Emil Maskovsky	2ae59e8a87	raft: retain existing voters across data centers and racks Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters. Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters. Improved the prioritization logic to account for the number of existing voters in each data center and rack. This change ensures a more stable voter distribution and reduces unnecessary voter reassignments. Fixes: scylladb/scylladb#23950	2025-05-05 16:51:48 +02:00
Emil Maskovsky	018fb63305	raft: refactor limited voters calculator to prioritize nodes Refactor the limited voters calculator to use a priority queue for sorting nodes by their priorities. This change simplifies the voter selection logic and makes it more extensible for future enhancements, such as supporting more complex priority calculations. The priority value is determined based on the node's existing status, including whether it is alive, a voter, or any further criteria.	2025-05-05 16:36:17 +02:00
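The priority-queue selection described in commit 018fb63305 can be illustrated with a small Python sketch. This is not the ScyllaDB implementation (which is C++); the `Node` type, the scoring weights, and `select_voters` are hypothetical names chosen for illustration, and only the "alive" and "voter" criteria mentioned in the commit message are modeled.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    alive: bool
    voter: bool


def priority(node: Node) -> int:
    """Hypothetical scoring: a higher value means a stronger claim to a vote."""
    score = 0
    if node.alive:
        score += 2  # assumed weight: liveness matters most
    if node.voter:
        score += 1  # assumed weight: prefer keeping existing voters
    return score


def select_voters(nodes, max_voters):
    # heapq is a min-heap, so the priority is negated; the name breaks ties
    # deterministically.
    heap = [(-priority(n), n.name) for n in nodes]
    heapq.heapify(heap)
    chosen = []
    while heap and len(chosen) < max_voters:
        _, name = heapq.heappop(heap)
        chosen.append(name)
    return chosen
```

With this shape, adding a further criterion only means changing `priority()`, which is the extensibility the commit message refers to.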
Emil Maskovsky	26fdc7b8f8	raft: replace pointer with reference for non-null output parameter The output parameter cannot be `null`. Previously, a pointer was used to make it explicit that the parameter is an output parameter being modified. However, this is unnecessary, as references are more appropriate for parameters that cannot be `null`. Switching to a reference improves code readability and ensures the parameter's non-null constraint is enforced at the type level.	2025-05-05 16:12:00 +02:00
Emil Maskovsky	f0468860a3	raft: reduce code duplication in group0 voter handler Refactor the group0 voter handler by introducing a helper lambda to handle the common logic for adding a node. This eliminates unnecessary code duplication. This refactor does not introduce any functional changes but prepares the codebase for easier future modifications.	2025-05-05 16:09:53 +02:00
Botond Dénes	855411caad	test/boost/multishard_mutation_query_test: ensure test runs with vnodes All tests in this suite use the default "ks" keyspace from cql_test_env. This keyspace has tablet support and at any time we might decide to make it use tablets by default. This would make all these tests use the tablet path in multishard_mutation_query.cc. These tests were created to test the vastly more complex vnodes code path in said file. The tablet path is much simpler and it is only used by SELECT * FROM MUTATION_FRAGMENTS(), which has its own correctness tests. So explicitly create a vnodes keyspace and use it in all the tests to restore the test functionality.	2025-05-05 09:22:54 -04:00
Botond Dénes	1175e1ed49	test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits This test has multiple problems: * has 3 embedded loops to run different scenarios, ignores variable from 2 of these, running with hardcoded settings instead * initializes misses and lookups to 0 at the start of each scenario, this throws off per-page increment checks, when the previous scenario moved these metrics and they don't start from 0; this causes the test to sometimes fail * duplicate check of drops == 0 (just cosmetic) Fix all three problems, the second is especially important because it made the test flaky.	2025-05-05 09:22:53 -04:00
Emil Maskovsky	2ef654149f	raft: unify and optimize datacenter and rack info creation Refactor the code to use a consistent pattern for creating the datacenter info list and the rack info list. Both now use a map of vectors, which improves efficiency by reducing temporary conversions to maps/sets during node list processing. Also ensure the node descriptor is passed by reference instead of by copy, leveraging the guaranteed lifetime of the descriptors.	2025-05-05 15:15:17 +02:00
Pavel Emelyanov	cf1ffd6086	Merge 'sstables_loader: fix the racing between get_progress() and release_resources()' from Kefu Chai This change addresses a critical race condition in the sstables_loader where `get_progress()` could access invalid `progress_holder` instances after `release_resources()` destroyed them. Problem: - Progress tracking uses two components: `_progress_state` (tracks state) and `_progress_per_shard` (sharded service with actual progress data) - `get_progress()` first checks if `_progress_state` is initialized, then accumulates progress from `_progress_per_shard` - As both functions are coroutines, `get_progress()` could be preempted after state check but before accessing `_progress_per_shard` - If `release_resources()` runs during this preemption, it destroys the `progress_holder` instances in `_progress_per_shard`, causing `get_progress()` to access invalid memory. Solution: - Implemented shared/exclusive locking to protect access to both state and sharded progress data - Multiple `get_progress()` calls can execute in parallel (shared access) - `release_resources()` acquires exclusive access before modifying resources - This prevents potential memory corruption and ensures consistent progress reporting Fixes #23801 --- this change addresses a racing related to tracking the restore progress from S3 using scylla's native API, which is not used in production yet, hence no need to backport. Closes scylladb/scylladb#23808 * github.com:scylladb/scylladb: sstables_loader: fix the indent sstables_loader: fix the racing between get_progress() and release_resources()	2025-05-05 15:45:15 +03:00
Avi Kivity	e688e89430	tools: toolchain: clear .cache and .cargo directories The .cache and .cargo directories are used during pip and rust builds when preparing the toolchain, but aren't useful afterwards. Remove them to save a bit of space. Closes scylladb/scylladb#23955	2025-05-05 14:43:14 +03:00
Avi Kivity	4c1f4c419c	tools: toolchain: dbuild: run as root in container under podman Running as root enables nested containers under podman without trouble from uid remapping. Unlike docker, under podman uid 0 in the container is remapped to the host uid for bind mounts, so writes to the build directory do not end up owned by root on the host. Nested containers will allow us to consume opensearch, cassandra-stress, and minio as containers rather than embedding them into the frozen toolchain. Closes scylladb/scylladb#23954	2025-05-05 14:40:43 +03:00
Amnon Heiman	2ab99d7a07	alternator/test_metrics.py: batch_write validate WCU This patch adds a test that verifies the WCU metrics are updated correctly during a batch_write_item operation. It ensures that the WCU of each item is calculated independently. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-05-05 13:20:24 +03:00
Amnon Heiman	14570f1bb5	alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU This patch adds two tests: A test that validates WCU calculation for batch put requests on a single table. A test that validates WCU calculation for batch requests across multiple tables, including ensuring that delete operations are counted as 1 WCU. Both tests verify that the consumed capacity is reported correctly according to the WCU rules. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-05-05 13:20:23 +03:00
Amnon Heiman	68db77643f	alternator/executor: add WCU for batch_write_items This patch adds consumed capacity unit support to batch_write_item. It calculates the WCU based on an item's length (for put) or a static 1 WCU (for delete), for each item on each table. The WCU metrics are always updated. if the user requests consumed capacity, a vector of consumed capacity is returned with an entry for each of the tables. For code simplicity, the return part of batch_write_item was updated to be coroutine-like; this makes it easier to manage the life cycle of the returned consumed_capacity array. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-05-05 13:20:14 +03:00
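The per-item WCU accounting described in commit 68db77643f can be sketched in Python as follows. The sketch assumes DynamoDB's documented rule of one write capacity unit per started 1 KB of item size; how Alternator measures an item's length is not stated in the commit message, so `WCU_BLOCK_BYTES`, `put_wcu`, and `batch_write_wcu` are hypothetical names used only for illustration.

```python
import math

# Assumption: one WCU per started 1 KB, following DynamoDB's documented rule.
WCU_BLOCK_BYTES = 1024


def put_wcu(item_size_bytes: int) -> int:
    """WCU for a single put; every put costs at least one unit."""
    return max(1, math.ceil(item_size_bytes / WCU_BLOCK_BYTES))


def batch_write_wcu(requests):
    """Sum WCU per table.

    requests: iterable of (table, op, size_bytes), where op is 'put' or 'delete'.
    Each item is charged independently; deletes cost a fixed 1 WCU.
    """
    per_table = {}
    for table, op, size in requests:
        units = put_wcu(size) if op == "put" else 1
        per_table[table] = per_table.get(table, 0) + units
    return per_table
```

Because each item is rounded up on its own, a 1500-byte put and a 100-byte put to the same table cost 2 + 1 = 3 units, not the 2 units their combined size would suggest.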
Amnon Heiman	f2ade71f4f	alternator/consumed_capacity: make wcu get_units public This patch adds a public static get_units function to wcu_consumed_capacity_counter. It will be used by the batch write item implementation, which cannot use the wcu_consumed_capacity_counter directly. Signed-off-by: Amnon Heiman <amnon@scylladb.com> consume_capacity need merge	2025-05-05 13:19:04 +03:00
Amnon Heiman	5ae11746fa	Alternator: Change the WCU/RCU to use units This patch changes the RCU/WCU Alternator metrics to use whole units instead of half units. The change includes the following: Change the metrics documentation. Keep the RCU counter internally in half units, but return the actual (whole unit) value. Change the RCU name to be rcu_half_units_total to indicate that it counts half units. Change the WCU to count in whole units instead of half units. Update the tests accordingly. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-05-05 13:18:09 +03:00
Anna Stuchlik	851a433663	doc: add a link to the previous Enterprise documentation This commit adds a link to the docs for previous Enterprise versions at https://enterprise.docs.scylladb.com/ to the left menu. As we still support versions 2024.1 and 2024.2, we need to ensure easier access to those docs sets. Fixes https://github.com/scylladb/scylladb/issues/23870 Closes scylladb/scylladb#23945	2025-05-05 12:16:47 +03:00
Avi Kivity	04fb2c026d	config: decrease default large allocation warning threshold to 128k Back in 2017 (`5a2439e702`), we introduced a check for large allocations as they can stall the memory allocator. The warning threshold was set at 1 MB. Since then many fixes for large allocations went in and it is now time to reduce the threshold further. We reduce it here to 128 kB, the natural allocation size for the system. A quick run showed no warnings. Closes scylladb/scylladb#23975	2025-05-05 12:13:48 +03:00
Pavel Emelyanov	b56d6fbb84	Merge 'sstables: Fix quadratic space complexity in partitioned_sstable_set' from Raphael Raph Carvalho Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with. A trigger we observed was memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range. Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried. And for range scans, there can be an increase in memory usage, but not significant because the sstables span a wide range and would have been selected in the combined reader if the range of scan overlaps with them. Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly. Fixes #23634. We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so. Closes scylladb/scylladb#23806 * github.com:scylladb/scylladb: test: Verify partitioned set store split and unsplit correctly sstables: Fix quadratic space complexity in partitioned_sstable_set compaction: Wire table_state into make_sstable_set() compaction: Introduce token_range() to table_state dht: Add overlap_ratio() for token range	2025-05-05 11:28:38 +03:00
David Garcia	4ba7182515	docs: fix md redirections for multiversion support This change resolves an issue where selecting a version from the multiversion dropdown on Markdown pages (e.g. https://docs.scylladb.com/manual/stable/alternator/getting-started.html) incorrectly redirected users to the main page instead of the corresponding versioned page. The underlying cause was that the `multiversion` extension relies on `source_suffix` to identify available pages for URL mapping. Without this configuration, proper redirection fails for `.md` files. This fix should be backported to `2025.1` to ensure correct behavior. Otherwise, the fix will only take effect in future releases. Testing locally is non-trivial: clone the repository, apply the changes to each relevant branch, set `smv_remote_whitelist` to "", then run `make multiversionpreview`. Afterward, switch between versions in the dropdown to verify behavior. I've tested it locally, so the best next step is to merge and confirm that it works as expected in the live environment. Closes scylladb/scylladb#23957	2025-05-05 10:39:39 +03:00
Pavel Emelyanov	7b786d9398	topology_coordinator: Use this->_feature_service directly This dependency is already there, topology coordinator doesn't need to use database reference to get to the features. Previous patch of the same kind: `b79137eaa4` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23777	2025-05-05 09:37:29 +02:00
Piotr Dulikowski	05c797795f	Merge 'Simplify test/sstable_assertions class API' from Pavel Emelyanov It had recently been patched to re-use the sstables::test class functionality (scylladb/scylladb#23697), now it can be put on some more strict diet. Closes scylladb/scylladb#23815 * github.com:scylladb/scylladb: test: Remove sstable_assertions::get_stats_metadata() test: Add sstable_assertions::operator->()	2025-05-05 09:33:45 +02:00
Nadav Har'El	834107ae97	test/cqlpy,alternator: fix reporting of Scylla crash during test The cqlpy and alternator test frameworks use a single Scylla node started once for all tests to run on. In the distant past, we had a problem where if one test caused Scylla to crash, the result was a confusing report of hundreds of failed tests - all tests after the crash "failed" and it wasn't easy to find which test really caused the crash. Our old solution to this problem was to have an autouse fixture (called cql_test_connection or dynamodb_test_connection) which tested the connection at the end of each test, and if it detected Scylla has crashed - it used pytest.exit() to report the error and have pytest exit and therefore stop running any further tests (which would have led to all of them failing). This approach had two problems: 1. The pytest.exit() caused the entire cqlpy suite to report a failure, but not the individual test - the individual test might have failed as well, but that isn't guaranteed and in any case this test's output is missing the informative message that Scylla crashed during the test. This was fine when for each cqlpy failure we had two separate error logs in Jenkins - the specific failed function, and the failed file - but when we recently got rid of the duplication by removing the second one, we no longer see the "Scylla crashed" messages. 2. Exiting pytest will be the wrong thing to do if the same pytest run could run tests from different test suites. We don't do this today, but we plan to support this approach soon. This patch fixes both problems by replacing the pytest.exit() call by setting a "scylla_crashed" flag and using pytest.fail(). The pytest.fail() causes the current test - the one which caused Scylla to crash - to be reported as an "ERROR" and the "Scylla crashed" message will correctly appear in this test's log. The flag will cause all other tests in the same test suite to be skip()ed. 
But other tests in other directories, depending on different fixtures, might continue to run normally. Fixes #23287 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23307	2025-05-05 10:15:56 +03:00
Nadav Har'El	3ce7e250cc	alternator: fix schema "concurrent modification" errors In ScyllaDB, schema modification operations use "optimistic locking": A schema operation reads the current schema, decides what it wants to do and prepares changes to the schema, and then attempts to commit those changes - but only if the schema hasn't changed since the first read. If the schema has already been changed by some other node - we need to try again. In a loop. In Alternator, there are six operations that perform schema modification: CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and UpdateTimeToLive. All of them were missing this loop. We knew about this - and even had FIXME in all places. So any of these operations, when facing contention from concurrent schema modifications on different nodes, may fail with an error like: Internal server error: service::group0_concurrent_modification (Failed to apply group 0 change due to concurrent modification). This problem had very minor effect, if any, on real users because the DynamoDB SDK automatically retries operations that fail with retryable errors - like this "Internal server error" - and most likely the schema operation will succeed upon retry. However, as shown in issue #13152 these failures were annoying in our CI, where tests - which disable request retries - failed on these errors. This patch fixes all six operations (the last three operations all use one common function, db::modify_tags(), so are fixed by one change) to add the missing loop. The patch also includes reproducing tests for all these operations - the new tests all fail before this patch, and pass with it. These new tests are much more reliable reproducers than the dtests we had that only sometimes - very rarely - reproduced the problem. Moreover, the new tests reproduce the bug separately for each of the six operations, so if we forget to fix one of the six operations, one of the tests would have continued to fail. 
Of course I checked this during development. The new tests are in the test/cluster framework, not test/alternator, because this problem can only be reproduced in a multi-node cluster: On a single node, it serializes its schema modifications on its own; The collisions only happen when more than one node attempts schema modifications at the same time. Fixes #13152 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23827	2025-05-05 09:59:08 +03:00
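The optimistic-locking retry loop that commit 3ce7e250cc adds can be sketched in Python. This is a toy model under stated assumptions, not ScyllaDB's group 0 API: `SchemaStore`, `ConcurrentModification`, `create_table`, and the `before_commit` hook are all hypothetical names, and the hook exists only so a competing writer can be simulated deterministically.

```python
class ConcurrentModification(Exception):
    """Raised when the schema changed between read and commit."""


class SchemaStore:
    """Toy stand-in for a versioned schema; not ScyllaDB's API."""

    def __init__(self):
        self.version = 0
        self.tables = set()

    def read(self):
        # Return a snapshot: the version it was read at plus a copy of the state.
        return self.version, set(self.tables)

    def commit(self, expected_version, new_tables):
        # Commit only if nobody else committed since the snapshot was taken.
        if expected_version != self.version:
            raise ConcurrentModification()
        self.tables = new_tables
        self.version += 1


def create_table(store, name, max_attempts=10, before_commit=None):
    for _ in range(max_attempts):
        version, tables = store.read()   # 1. read the current schema
        tables.add(name)                 # 2. prepare the change
        if before_commit:
            before_commit()              # test hook: lets another writer race us
        try:
            store.commit(version, tables)  # 3. commit if still unchanged
            return
        except ConcurrentModification:
            continue                     # 4. somebody else won; start over
    raise RuntimeError("gave up after repeated concurrent modifications")
```

Without the loop, the first `ConcurrentModification` would surface to the caller, which corresponds to the "Internal server error" described in the commit message.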
Pavel Emelyanov	d40d6801b0	sstable_directory: Print ks.cf when moving unshared remote sstables When an sstable is identified by sstable_directory as remote-unshared, it will at some point be moved to the target shard. When it happens a log-message appears: sstable_directory - Moving 1 unshared SSTables to shard 1 Processing of tables by sstable_directory often happens in parallel, and messages from sstable_directory are intermixed. Having a message like above is not very informative, as it tells nothing about sstables that are being moved. Equip the message with ks:cf pair to make it more informative. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23912	2025-05-05 09:45:44 +03:00
Pavel Emelyanov	e0f30a30a7	sstable_directory: Print unshared remote sstable when sorting When collecting sstables, the sstable_directory may sort the collected descriptors into one of three buckets -- unshared local and remote, and shared ones. Unshared local and shared sstables' paths are logged (with trace level) while unshared remote is silently collected for further processing. Add log message for that case too, there's enough data to print the sstable path as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23913	2025-05-05 09:33:06 +03:00
Gleb Natapov	ecd14753c0	storage_service: Do not remove gossiper entry on address change When the gossiper indexed entries by IP, an old entry had to be removed on an address change, but the index is now ID-based, so even if the IP has changed the entry should stay. The gossiper simply updates the IP address there.	2025-05-04 17:59:07 +03:00
Gleb Natapov	a2178b7c31	storage_service: use id to check for local node IP may change and an old gossiper message with previous IP may be processed when it shouldn't. Fixes: #22777	2025-05-04 17:59:07 +03:00
Botond Dénes	51025de755	test/cluster: extract execute_with_tracing() into pylib/util.py To allow reuse in other tests.	2025-05-02 01:53:35 -04:00
Piotr Dulikowski	8ffe4b0308	utils::loading_cache: gracefully skip timer if gate closed The loading_cache has a periodic timer which acquires the _timer_reads_gate. The stop() method first closes the gate and then cancels the timer - this order is necessary because the timer is re-armed under the gate. However, the timer callback does not check whether the gate was closed but tries to acquire it, which might result in unhandled exception which is logged with ERROR severity. Fix the timer callback by acquiring access to the gate at the beginning and gracefully returning if the gate is closed. Even though the gate used to be entered in the middle of the callback, it does not make sense to execute the timer's logic at all if the cache is being stopped. Fixes: scylladb/scylladb#23951 Closes scylladb/scylladb#23952	2025-04-30 16:43:22 +03:00
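The fix in commit 8ffe4b0308 (enter the gate at the start of the timer callback and return quietly if it is already closed) can be sketched in Python. The real code uses Seastar gates in C++; the `Gate` and `Cache` classes and their method names below are hypothetical stand-ins that model only the ordering described in the commit message.

```python
class Gate:
    """Minimal gate: entry is refused once close() has been called."""

    def __init__(self):
        self._closed = False
        self._holders = 0

    def try_enter(self) -> bool:
        if self._closed:
            return False
        self._holders += 1
        return True

    def leave(self):
        self._holders -= 1

    def close(self):
        self._closed = True


class Cache:
    def __init__(self):
        self.gate = Gate()
        self.reloads = 0
        self.timer_armed = True

    def on_timer(self):
        # Enter the gate first; if stop() already closed it, skip all work
        # instead of raising an error from inside the callback.
        if not self.gate.try_enter():
            return
        try:
            self.reloads += 1        # the periodic work
            self.timer_armed = True  # re-arm while still holding the gate
        finally:
            self.gate.leave()

    def stop(self):
        # Close the gate before cancelling the timer: the callback re-arms the
        # timer under the gate, so the reverse order could leave it armed.
        self.gate.close()
        self.timer_armed = False
```

A callback that fires after `stop()` therefore becomes a no-op rather than an unhandled exception.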
Benny Halevy	4bd0845fce	gossiper: make send_gossip_echo cancellable Currently send_gossip_echo has a 22 seconds timeout during which _abort_source is ignored. Mark the verb as cancellable so it can be canceled on shutdown / abort. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:46:10 +03:00
Benny Halevy	fa1c3e86a9	gossiper: add send_echo helper Call send_gossip_echo using a centralized helper. A following patch will make it abortable. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:45:51 +03:00
Benny Halevy	0b97806771	idl, message: make with_timeout and cancellable verb attributes composable And define `send_message_timeout_cancellable` in rpc_protocol_impl.hh using the newly introduced rpc_handler entry point in seastar that accepts both timeout and cancellable params. Note that the interface to the user still uses abort_source while internally the function allocates a seastar::rpc::cancellable object. It is possible to provide an interface that will accept a rpc::cancellable& from the caller, but the existing messaging api uses abort_source. Changing it may be considered in the future. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:45:51 +03:00
Benny Halevy	e06d226d08	gossiper: failure_detector_loop_for_node: ignore abort_requested_exception Aborting the failure detector happens normally when the node shuts down. There's no need to log anything about it, as long as we abort the function cleanly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:05:24 +03:00
Benny Halevy	83c69642f7	gossiper: failure_detector_loop_for_node: check if abort_requested in loop condition The same as the loop condition in the direct_failure_detector. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-30 11:05:24 +03:00
Aleksandra Martyniuk	1f4edd8683	test_tablet_tasks: use injection to revoke resize Currently, test_tablet_resize_revoked tries to trigger split revoke by deleting some rows. This method isn't deterministic, so the test is flaky. Use error injection to trigger resize revoke. Fixes: #22570. Closes scylladb/scylladb#23966	2025-04-30 07:04:57 +03:00
Michał Chojnowski	9e2343ecb0	test_sstable_compression_dictionaries_autotrain: raise the timeout There were CI runs in which the training happened as planned, but it was too slow to fit within the timeout. Raise the timeout to pacify the CI. Fixes scylladb/scylladb#23964 Closes scylladb/scylladb#23965	2025-04-29 22:09:14 +03:00
Raphael S. Carvalho	d5bee4c814	test: Verify partitioned set store split and unsplit correctly Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	c77f710a0c	sstables: Fix quadratic space complexity in partitioned_sstable_set Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with. A trigger we observed was memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range. Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried. And for range scans, there can be an increase in memory usage, but not significant because the sstables span a wide range and would have been selected in the combined reader if the range of scan overlaps with them. Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly. Fixes #23634. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	21d1e78457	compaction: Wire table_state into make_sstable_set() This will be useful for feeding token range owned by compaction group into sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	59dad2121f	compaction: Introduce token_range() to table_state This provides a way for compaction layer to know compaction group's token range. It will be important for sstable set impl to know the token range of underlying group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	494ed6b887	dht: Add overlap_ratio() for token range Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Patryk Jędrzejczak	0cdcf82cd0	Merge 'topology coordinator: do not proceed further on invalid boostrap tokens' from Piotr Dulikowski When dht::boot_strapper::get_boostrap_tokens fails to parse the tokens, the topology coordinator handles the exception and schedules a rollback. However, the current code tries to continue with the topology coordinator logic even if an exception occurs, leaving bootstrap_tokens empty. This does not make sense and can actually cause issues, specifically in prepare_and_broadcast_cdc_generation_data, which implicitly expects that the bootstrap_tokens of the first node in the cluster will not be empty. Fix this by adding the missing break. Fixes: scylladb/scylladb#23897 From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them. Closes scylladb/scylladb#23914 * https://github.com/scylladb/scylladb: test: cluster: add test_bad_initial_token topology coordinator: do not proceed further on invalid boostrap tokens cdc: add sanity check for generating an empty generation	2025-04-28 12:45:33 +02:00
Michał Chojnowski	7f9152babc	utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity() `chunked_managed_vector` is a vector-like container which splits its contents into multiple contiguous allocations if necessary, in order to fit within LSA's max preferred contiguous allocation limits. Each limited-size chunk is stored in a `managed_vector`. `managed_vector` is unaware of LSA's size limits. It's up to the user of `managed_vector` to pick a size which is small enough. This happens in `chunked_managed_vector::max_chunk_capacity()`. But the calculation is wrong, because it doesn't account for the fact that `managed_vector` has to place some metadata (the backreference pointer) inside the allocation. In effect, the chunks allocated by `chunked_managed_vector` are just a tiny bit larger than the limit, and the limit is violated. Fix this by accounting for the metadata. Also, before the patch `chunked_managed_vector::max_contiguous_allocation`, repeats the definition of logalloc::max_managed_object_size. This is begging for a bug if `logalloc::max_managed_object_size` changes one day. Adjust it so that `chunked_managed_vector` looks directly at `logalloc::max_managed_object_size`, as it means to.	2025-04-28 12:30:13 +02:00
Botond Dénes	d582c436e5	Merge 'tasks: check whether a node is alive before rpc' from Aleksandra Martyniuk Check whether a node is alive before making an rpc that gathers children infos from the whole cluster in virtual_task::impl::get_children. Fixes: https://github.com/scylladb/scylladb/issues/22514. Needs backport to 2025.1 and 6.2 as they contain the bug. Closes scylladb/scylladb#23787 * github.com:scylladb/scylladb: test: add test for getting tasks children tasks: check whether a node is alive before rpc	2025-04-28 09:32:45 +03:00
Nadav Har'El	262530f27c	Merge 'mv: make base_info in view schemas immutable' from Wojciech Mitros Currently, the base_info may or may not be set in view schemas. Even when it's set, it may be modified. This necessitates extra checks when handling view schemas, as well as potentially causing errors when we forget to set it at some point. Instead, we want to make the base info an immutable member of view schemas (inside view_info). To achieve this, in this series we remove all base_info members that can change due to a base schema update, and we calculate the remaining values during view update generation, using the most up-to-date base schema version. To calculate the values that depend on the base schema version, we need to iterate over the view primary key and find the corresponding columns, which adds extra overhead for each batch of view updates. However, this overhead should be relatively small, as when creating a view update, we need to prepare each of its columns anyway. And if we need to read the old value of the base row, the relative overhead is even lower. After this change, the base info in view schemas stays the same for all base schema updates, so we'll no longer get issues with base_info being incompatible with a base schema version. Additionally, it's a step towards making the schema objects immutable, which we sometimes incorrectly assumed in the past (they're still not completely immutable yet, as some other fields in view_info other than base_info are initialized lazily and may depend on the base schema version). 
Fixes https://github.com/scylladb/scylladb/issues/9059 Fixes https://github.com/scylladb/scylladb/issues/21292 Fixes https://github.com/scylladb/scylladb/issues/22194 Fixes https://github.com/scylladb/scylladb/issues/22410 Closes scylladb/scylladb#23337 * github.com:scylladb/scylladb: test: remove flakiness from test_schema_is_recovered_after_dying mv: add a test for dropping an index while it's building base_info: remove the lw_shared_ptr variant view_info: don't re-set base_info after construction base_info: remove base_info snapshot semantics base_info: remove base schema from the base_info schema_registry: store base info instead of base schema for view entries base_info: make members non-const view_info: move the base info to a separate header view_info: move computation of view pk columns not in base pk to view_updates view_info: move base-dependent variables into base_info view_info: set base info on construction	2025-04-27 19:12:12 +03:00
David Garcia	cf7d846b9e	docs: update dependencies This is a mandatory dependency update to resolve a critical Dependabot alert. For more details, see the [Dependabot alerts](https://docs.github.com/en/code-security/dependabot/dependabot-alerts/viewing-and-updating-dependabot-alerts). Closes scylladb/scylladb#23918 Fixes #23935	2025-04-27 18:45:11 +03:00
Piotr Szymaniak	e588c8667f	alternator: Limit attribute name lengths Attribute names are now checked against DynamoDB-compatible length limits. When exceeded, Alternator emits an exception identical or similar to the DDB one. It might be worth noting that DDB emits more than a single kind of an exception string for some exceptions. The tests' catch clauses handle all the observed kinds of messages from DynamoDB. The validation differentiates between key and non-key attributes and applies the limit accordingly. AWS DDB raises exceptions with somewhat different contents when the get request contains ProjectionExpression, so this case needed separate treatment to emit the corresponding exception string. The length-validating function was declared and defined in expressions.hh/.cc respectively, because that's where the relevant parsing happens. ** Tests The following tests were validated when handling this issue: test_limit_attribute_length_nonkey_good, test_limit_attribute_length_nonkey_bad, test_limit_attribute_length_key_good, test_limit_attribute_length_key_bad, test_limit_attribute_length_gsi_lsi_good, test_limit_attribute_length_gsi_lsi_bad, test_limit_attribute_length_gsi_lsi_projection_bad. Some of the tests were expanded into being more granular. Namely, there is a new test function `test_limit_attribute_length_key_bad_incoherent_names` which groups tests with too long attribute names in the case of incorrect (incoherent) user requests. Similarly, there is a new test function `test_limit_attribute_length_gsi_lsi_bad_incoherent_names`. All the tests now cover each combination of the key/keys being too long. Both new functions contain tests that verify that ScyllaDB throws length-related exceptions (instead of coherency-related ones), similar to what DynamoDB does. The new test test_limit_gsiu_key_len_bad covers the case of too long attribute name inside GlobalSecondaryIndexUpdates. 
The new test test_limit_gsiu_key_len_bad_incoherent_names covers the case of incorrect (incoherent) user requests containing too long attribute names and GlobalSecondaryIndexUpdates. test_limit_attribute_length_key_bad was found to have contained an illegal KeySchema structure. Some of the tests had their match clause corrected. All the tests are stripped of the xfail flag except test_limit_attribute_length_key_bad, which has it changed since it still fails due to Projection in GSI and LSI not being implemented in Alternator. The xfail now points to #5036. Fixes scylladb/scylladb#9169 Closes scylladb/scylladb#23097	2025-04-27 18:39:20 +03:00
Piotr Dulikowski	82e1678fbe	test: mv: skip test_mv_tablets_empty_ip in debug mode This test shuts down a node and then replaces it with another one while continuously writing to the cluster. The test has been observed to take a lot of time in debug mode and time out on the replace operation. Replace takes very long because rebuilding tablets on the new node is very slow, and the slowest part is memtable flush which happens at the beginning of streaming. The slowness seems to be specific to the debug mode. Turn off the test in debug mode to deflake the CI. As a follow-up, the test is planned to be reworked into a quicker error injection test so that the code path tested by this test will be again exercised in debug unit tests (scylladb/scylladb#23898) Fixes: scylladb/scylladb#20316 Closes scylladb/scylladb#23900	2025-04-27 18:06:08 +03:00
Piotr Dulikowski	670a69007e	test: cluster: add test_bad_initial_token Adds a test which checks that rollback works properly in case when a bad value of the initial_token function is provided.	2025-04-25 12:25:15 +02:00
Piotr Dulikowski	845cedea7f	topology coordinator: do not proceed further on invalid boostrap tokens When dht::boot_strapper::get_boostrap_tokens fails to parse the tokens, the topology coordinator handles the exception and schedules a rollback. However, the current code tries to continue with the topology coordinator logic even if an exception occurs, leaving bootstrap_tokens empty. This does not make sense and can actually cause issues, specifically in prepare_and_broadcast_cdc_generation_data, which implicitly expects that the bootstrap_tokens of the first node in the cluster will not be empty. Fix this by adding the missing break. Fixes: scylladb/scylladb#23897	2025-04-25 11:30:01 +02:00
Piotr Dulikowski	66acaa1bf8	cdc: add sanity check for generating an empty generation It doesn't make sense to create an empty CDC generation because it does not make sense to have a cluster with no tokens. Add a sanity check to cdc::make_new_generation_description which fails if somebody attempts to do that (i.e. when the set of current tokens + optionally bootstrapping node's tokens is empty). The function does not work correctly if it is misused, as we saw in scylladb/scylladb#23897. While the function should not be misused in the first place, it's better to throw an exception rather than crash - especially that this crash could happen on the topology coordinator.	2025-04-25 11:25:07 +02:00
Aleksandra Martyniuk	76cd707b18	test: test_tablets: wait for cql Wait for cql after rolling restart in test_two_tablets_concurrent_repair_and_migration_repair_writer_level to prevent failing queries. Fixes: #23620. Closes scylladb/scylladb#23796	2025-04-24 21:25:29 +03:00
Patryk Jędrzejczak	2a8bb47cfb	test: test_zero_token_nodes_topology_ops: use host IDs for ignored nodes Providing IP of an ignored node during removenode made the test flaky. It could happen that the address map contained mappings of two nodes with the same IP: 1. the node being ignored, 2. the node that expectedly failed replacing earlier in the test. So, `address_map::find_by_addr()` called in `find_raft_nodes_from_hoeps` could return the host ID of the second node instead of the first node and cause removenode to fail. We fix flakiness in this patch by providing the host ID of the ignored node instead of its IP. We would have to do it anyway sooner or later because providing IP is deprecated. The bug in `find_raft_nodes_from_hoeps` is tracked by scylladb/scylladb#23846. The test became flaky because of `f0af3f261e`. That patch is not present in 2025.1, so the test isn't flaky outside master, and hence there is no reason to backport this patch. Fixes scylladb/scylladb#23499 Closes scylladb/scylladb#23863	2025-04-24 20:17:19 +03:00
Pavel Emelyanov	68a178eba9	Merge 'replica: skip flush of dropped table' from Aleksandra Martyniuk Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of dropped table instead. Fixes: #16095. Needs backport to 2025.1 and 6.2 as they contain the bug Closes scylladb/scylladb#23876 * github.com:scylladb/scylladb: test: test table drop during flush replica: skip flush of dropped table	2025-04-24 20:02:59 +03:00
Andrei Chekun	22ef09489d	test.py: add awareness of extra_scylla_cmdline_options test_config.yaml can have field extra_scylla_cmdline_options that previously was not added to the commandline to start Scylla. Now any extra options will be added to commandline to start tests	2025-04-24 14:05:50 +02:00
Andrei Chekun	2758c4a08e	test.py: increase timeout for C++ tests in pytest The current timeouts are not enough: tests failed randomly by hitting the timeout. This will allow tests to finish normally. As a downside, if the process hangs we will wait longer. These adjustments will be revisited once we have metrics on how long it takes tests to pass in each mode.	2025-04-24 14:05:50 +02:00
Andrei Chekun	f5c88e1107	test.py: switch method of finding the root repo directory Switch to using the constant defined in the __init__ file instead of getting the root directory from pytest's config. This will allow having only one source of truth for the root directory of the project, avoiding cases where the root directory is defined incorrectly. This change also simplifies potential future changes.	2025-04-24 14:05:50 +02:00
Andrei Chekun	06eca04370	test.py: move get_combined_tests to the correct facade Since get_combined_tests method is used only for boost tests and not all C++ tests, moving it into the correct place	2025-04-24 14:05:49 +02:00
Andrei Chekun	8cc9c0a53a	test.py: add common directory for reports When test.py executes Python tests, it executes them by mode and by file, so it can tell where the report should go based on the mode. With the new approach, pytest will execute the tests for all modes itself, and we can have only one report per pytest invocation. That's why we need a common directory for reports rather than one under the mode directory. It can later be used for simplification, so that every report goes there.	2025-04-24 14:05:49 +02:00
Andrei Chekun	b791af1f16	test.py: add the possibility to provide additional env vars This allows injecting any environment variable into the test; previously only the environment variables of the process were used. Also inject the ASAN and UBSAN variables into the tests	2025-04-24 14:05:49 +02:00
Andrei Chekun	3cb5838619	test.py: move setup cgroups to the generic method This change is needed for the later integration of pytest executing the C++ tests, so that resource metrics can be gathered.	2025-04-24 14:05:49 +02:00
Andrei Chekun	ca615af407	test.py: refactor resource_gather.py Refactor resource_gather.py to not create the initial cgroup when the process is already in it. This avoids going deeper and creating the same cgroup again and again with each test.py execution when the terminal isn't closed. Also create its own event loop in case one doesn't exist. This is needed to work both with test.py, which creates a loop, and with pytest, which does not.	2025-04-24 14:05:49 +02:00
Wojciech Mitros	ee5883770a	test: remove flakiness from test_schema_is_recovered_after_dying Due to the changes in creating schemas with base info, test_schema_is_recovered_after_dying seems to be flaky when checking that the schema is actually lost after 'grace_period'. We don't actually guarantee that the schema will be lost at that exact moment, so there's no reason to test this. To remove the flakiness, we remove the check and the related sleep, which should also slightly improve the speed of this test.	2025-04-24 01:09:35 +02:00
Wojciech Mitros	bf7bba9634	mv: add a test for dropping an index while it's building Dropping an index is a schema change of its base table and a schema drop of the index's materialized view. This combination of schema changes used to cause issues during view building, because when a view schema was dropped, it wasn't getting updated with the new version of the base schema, and while the view building was in progress, we would update the base schema for the base table mutation reader and try generating updates with a view schema that wasn't compatible with the base schema, failing on an `on_internal_error`. In this patch we add a test for this scenario. We create an index, halt its view building process using an injection, and drop it. If no errors are thrown, the test succeeds. The test was failing before https://github.com/scylladb/scylladb/pull/23337 and is passing afterwards.	2025-04-24 01:09:32 +02:00
Wojciech Mitros	d77f11d436	base_info: remove the lw_shared_ptr variant The base_dependent_view_info is no longer needed to be shared or modified in the view_info, so we no longer need to keep it as a shared pointer.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	d7bd86591e	view_info: don't re-set base_info after construction In the previous commits we made sure that the base info is not dependent on the base schema version, and the info dependent on the base schema version is calculated when it's needed. In this patch we remove the unnecessary re-setting of the base_info. The set_base_info method isn't removed completely, because it also has a secondary function - zeroing the view_info fields other than base_info. Because of this, in this patch we rename it accordingly and limit its use to the updates caused by a base schema change.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	ea462efa3d	base_info: remove base_info snapshot semantics The base info in view schemas no longer changes on base schema updates, so saving the base info with a view schema from a specific point in time doesn't provide any additional benefits. In this patch we remove the code using the base_and_view snapshots as it's no longer useful.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	ad55935411	base_info: remove base schema from the base_info The base info now only contains values which are not reliant on the base schema version. We remove the base schema from the base info to make it immutable regardless of base schema version. At the point of this patch it's also not needed anywhere - the new base info can replace the base schema in most places, and in the few (view_updates) where we need it, we pull the most recent base schema version from the database. After this change, the base info no longer changes in a view schema after creation, so we'll no longer get errors when we try generating view updates with a base_info that's incompatible with a specific base schema version. Fixes #9059 Fixes #21292 Fixes #22410	2025-04-24 01:08:39 +02:00
Wojciech Mitros	05fce91945	schema_registry: store base info instead of base schema for view entries In the following patch we plan to remove the base schema from the base_info to make the base_info immutable. To do that, we first prepare the schema registry for the change; we need to be able to create view schemas from frozen schemas there and frozen schemas have no information about the base table. Unless we do this change, after base schemas are removed from the base info, we'll no longer be able to load a view schema to the schema registry without looking up the base schema in the database. This change also required some updates to schema building: * we add a method for unfreezing a view schema with base info instead of a base schema * we make it possible to use schema_builder with a base info instead of a base schema * we add a method for creating a view schema from mutations with a base info instead of a base schema * we add a view_info constructor with a base info instead of a base schema * we update the naming in schema_registry to reflect the usage of base info instead of base schema	2025-04-24 01:08:39 +02:00
Wojciech Mitros	6e539c2b4d	base_info: make members non-const In the following patches we'll add the base info instead of the base schema to various places (schema building, schema registry). There, we'll sometimes need to update the base_info fields, which we can't do with const members. There's also a place (global_schema_ptr) where we won't be able to use the base_info_ptr (a shared pointer to the base_info), so we can't just use the base_info_ptr everywhere instead. In this patch we unmark these members as const. In the following patches we'll remove the methods for changing the base_info in the view schema, so it will remain effectively const.	2025-04-24 01:08:39 +02:00
Wojciech Mitros	32258d8f9a	view_info: move the base info to a separate header In the following commits the base_dependent_view_info will be needed in many more places. To avoid including the whole db/view/view.hh or forward declaring (where possible) the base info, we move it to a separate header which can be included anywhere at almost no cost.	2025-04-24 01:08:39 +02:00
Wojciech Mitros	a3d2cd6b5e	view_info: move computation of view pk columns not in base pk to view_updates In preparation for making the base_info immutable, we want to get rid of any base_dependent_view_info fields that can change when the base schema is updated. The _base_regular_columns_in_view_pk and _base_static_columns_in_view_pk fields hold the base column_ids of the corresponding base columns, and they can change (decrease) when an earlier column is dropped in the base table. view_updates is the only location where these values are used, and calculating them is not expensive compared to the overall work done while performing a view update - we iterate over all view primary key columns and look them up in the base table. With this in mind, we can just calculate them when creating a view_updates object, instead of keeping them in the base_info. We do that in this patch.	2025-04-24 01:08:39 +02:00
Wojciech Mitros	a33963daef	view_info: move base-dependent variables into base_info The has_computed_column_depending_on_base_non_primary_key and is_partition_key_permutation_of_base_partition_key variables in the view_info depend on the base table so they should be in the base_dependent_view_info instead of view_info.	2025-04-24 01:08:39 +02:00
Wojciech Mitros	900687c818	view_info: set base info on construction Currently, the base_info may or may not be set in view schemas. Even when it's set, it may be modified. This necessitates extra checks when handling view schemas, as well as potentially causing errors when we forget to set it at some point. Instead, we want to make the base info an immutable member of view schemas (inside view_info). The first step towards that is making sure that all newly created schemas have the base info set. We achieve that by requiring a base schema when constructing a view schema. Unfortunately, this adds complexity each time we're making a view schema - we need to get the base schema as well. In most cases, the base schema is already available. The most problematic scenario is when we create a schema from mutations: - when parsing system tables we can get the schema from the database, as regular tables are parsed before views - when loading a view schema using the schema loader tool, we need to load the base additionally to the view schema, effectively doubling the work - when pulling the schema from another node - in this case we can only get the current version of the base schema from the local database Additionally, we need to consider the base schema version - when we generate view updates the version of the base schema used for reads should match the version of the base schema in view's base info. This is achieved by selecting the correct (old or new) schema in `db::schema_tables::merge_tables_and_views` and using the stored base schema in the schema_registry.	2025-04-24 01:08:39 +02:00
Benny Halevy	f279625f59	test_tablets_cql: test_alter_dropped_tablets_keyspace: extend expected error The query may fail also on a no_such_keyspace exception, which generates the following cql error: ``` Error from server: code=2200 [Invalid query] message="Can\'t find a keyspace test_1745198244144_qoohq" ``` Extend the pytest.raises match expression to include this error as well. Fixes #23812 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23875	2025-04-23 18:54:22 +03:00
Benny Halevy	2bbdaeba1c	Update seastar submodule * seastar e44af9b0...d7ff58f2 (2): > rpc: client: support timeout and cancellation > doc/io-properties-file.md: correct a typo Closes scylladb/scylladb#23865	2025-04-23 16:10:51 +03:00
Aleksandra Martyniuk	c1618c7de5	test: test table drop during flush	2025-04-23 14:29:28 +02:00
Aleksandra Martyniuk	91b57e79f3	replica: skip flush of dropped table	2025-04-23 14:29:28 +02:00
Kefu Chai	0d7752b010	build: cmake: generalize update_cxx_flags() Refactor our CMake flag handling to make it more flexible and reduce repetition: - Rename update_cxx_flags() to update_build_flags() to better reflect its expanded purpose - Generate CMake variable names internally based on configuration type instead of requiring callers to specify full variable names - Follow CMake's standard naming conventions for configuration-specific flags, see https://cmake.org/cmake/help/latest/variable/CMAKE_LANG_FLAGS.html#variable:CMAKE_%3CLANG%3E_FLAGS - Prepare groundwork for handling linker flags in addition to compiler flags in future changes Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23842	2025-04-23 12:06:04 +03:00
Nadav Har'El	64a5eee6b9	test/cqlpy: insert test names into Scylla logs Both test.py and test/cqlpy/run run many test functions against the same Scylla process. In the resulting log file, it is hard to understand which log messages are related to which test. In this patch, we log a message (using the "/system/log" REST API) every time a test is started or ends. The messages look like this: INFO 2025-04-22 15:10:44,625 [shard 1:strm] api - /system/log: test/cqlpy: Starting test_lwt.py::test_lwt_missing_row_with_static ... INFO 2025-04-22 15:10:44,631 [shard 0:strm] api - /system/log: test/cqlpy: Ended test_lwt.py::test_lwt_missing_row_with_static We already had a similar feature in test/alternator, added three years ago in commit `b0371b6bf8`. The implementation is similar but not identical due to different available utility functions, and in any case it's very simple. While at it, this patch also fixes the has_rest_api() to timeout after one second. Without this, if the REST API is blocked in a way that a connection attempt just hangs, the tests can hang. With the new timeout, the test will hang for a second, realize the REST API is not available, and remember this decision (the next tests will not wait one second again). We had the same bug in Alternator, and fixed it in `758f8f01d7`. This one second "pause" will only happen if the REST API port is blocked - in the more typical case the REST API port is just not listening but not blocked, and the failure will be noticed immediately and won't wait a whole second. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23857	2025-04-23 12:04:14 +03:00
Piotr Dulikowski	3d73c79a72	test: mv: skip test_view_building_scheduling_group in debug The test populates a table with 50k rows, creates a view on that table and then compares the time spent in streaming vs. gossip scheduling groups. It only takes 10s in dev mode on my machine, but is much slower in debug mode in CI - building the view doesn't finish within 2 minutes. The bigger the view to build, the more accurate the measurement; moreover, the test scenario isn't interesting enough to be worth running it in debug mode as this should be covered by other tests. Therefore, just skip this test in debug mode. Fixes: scylladb/scylladb#23862 Closes scylladb/scylladb#23866	2025-04-23 11:29:35 +03:00
Pavel Emelyanov	a6ba535c3c	Merge 'test.py: refactoring before boost pytest integration' from Andrei Chekun This PR contains changes that do not add new functionality, and have small refactoring of the existing code. The most significant change though is switching the SQLite writer from a singleton to a thread locking mechanism that will be needed later on. This PR is an extraction of several commits from https://github.com/scylladb/scylladb/pull/22894 as reviewer [request](https://github.com/scylladb/scylladb/pull/22894?notification_referrer_id=NT_kwDOACiLR7MxNDg0ODk2MDU1MjoyNjU3MDk1&notifications_query=reason%3Aparticipating#pullrequestreview-2778582278). Closes scylladb/scylladb#23867 * github.com:scylladb/scylladb: test.py: move the readme file for LDAP tests to the correct location test.py: eliminate deprecation warning for xml.etree.ElementTree.Element test.py: align the behavior of max-failures parameter with pytest maxfail test.py: fix typo in toxiproxy name parameter test.py: add locking to the sqlite writer for resource gather test.py: add sqlite datetime adapter for resource gather test.py: change the parameter for get_modes_to_run()	2025-04-23 11:10:56 +03:00
Andrzej Jackowski	3c69340b8c	test: add test_long_query_timeout_erm This commit adds a test to verify that a query with long timeout doesn't block ERM on failure. The motivation for the test is fixing scylladb#21831. This commit: - add test_long_query_timeout_erm	2025-04-23 09:29:47 +02:00
Andrzej Jackowski	1f1e4f09cd	test: add get_cql_exclusive to manager_client.py This commit adds to ManagerClient a get_cql_exclusive function that allows creating a cql connection with WhiteListRoundRobinPolicy for a single server. Such a connection is useful in tests that kill nodes to make sure that the live node handles the queries. Before this commit, some tests used cluster_con from test/cluster/conftest.py, and after this commit tests can start to use a method from ManagerClient. This change: - Extend ManagerClient con_gen type to allow LoadBalancingPolicy arg - Implement get_cql_exclusive()	2025-04-23 09:29:47 +02:00
Andrzej Jackowski	9d53063a7e	mapreduce: catch local read_failure_exception_with_timeout Mapreduce Service exception handling differs for local and remote RPC calls of dispatch_to_shards. Whereas local exceptions are handled normally, the remote exceptions are converted to rpc::remote_verb_error by the framework. This is a substantial difference when read_failure_exception_with_timeout is thrown during mapreduce query execution - CQL server waits for the exception from the local call but not from the remote one. As we don't want to wait for the timeout in CQL server in either of the cases, this commit catches the local exception (especially read_failure_exception_with_timeout) and converts it to std::runtime_error (the one from which rpc::remote_verb_error inherits). Ideally, Mapreduce Service should execute dispatch_to_shards through RPC for both local and remote calls. However, such change negatively affects tens of Unit Tests that rely on the possibility to run local mapreduce service without any RPC. This change: - Catch local exceptions in Mapreduce Service and convert them to std::runtime_error.	2025-04-23 09:29:47 +02:00
Andrzej Jackowski	1fca994c7b	transport: storage_proxy: release ERM when waiting for query timeout Before this change, if a read executor had just enough targets to achieve query's CL, and there was a connection drop (e.g. node failure), the read executor waited for the entire request timeout to give drivers time to execute a speculative read in the meantime. Such behavior doesn't work well when a very long query timeout (e.g. 1800s) is set, because the unfinished request blocks topology changes. This change implements a mechanism to throw a new read_failure_exception_with_timeout in the aforementioned scenario. The exception is caught by CQL server which conducts the waiting, after ERM is released. The new exception inherits from read_failure_exception, because layers that don't catch the exception (such as mapreduce service) should handle the exception just as a regular read_failure. However, when CQL server catches the exception, it returns read_timeout_exception to the client because after additional waiting such an error message is more appropriate (read_timeout_exception was also returned before this change was introduced). This change: - Add new read_failure_exception_with_timeout exception - Add throw of read_failure_exception_with_timeout in storage_proxy - Add abort_source to CQL server, as well as to_stop() method for the correct abort handling - Add sleep in CQL server when the new exception is caught Refs #21831
Andrzej Jackowski	9b1f062827	transport: remove redundant references in process_request_one The references were added and used in previous commits to limit the number of line changes for a reviewer convenience. This commit removes the redundant references to make the code more clear and concise.	2025-04-23 09:29:47 +02:00
Andrzej Jackowski	9c0f369cf8	transport: fix the indentation in process_request_one Fix the indentation after the previous commit that intentionally had a wrong indent to limit the number of changed lines	2025-04-23 09:29:47 +02:00
Andrzej Jackowski	8a7454cf3e	transport: add futures in CQL server exception handling Prepare for the next commit that will introduce a seastar::sleep in handling of selected exception. This commit: - Rewrite cql_server::connection::process_request_one to use seastar::futurize_invoke and try_catch<> instead of utils::result_try. - The indentation is intentionally incorrect to reduce the number of changed lines. Next commits fix it.	2025-04-23 09:29:05 +02:00
Andrei Chekun	57b66e6b2e	test.py: move the readme file for LDAP tests to the correct location The README file was created in an incorrect location; it is now moved to the directory with the source files, where it was intended to be.	2025-04-22 19:03:28 +02:00
Andrei Chekun	cf4747c151	test.py: eliminate deprecation warning for xml.etree.ElementTree.Element Testing the truth value of an Element emits DeprecationWarning. This check is now done correctly.	2025-04-22 19:03:21 +02:00
Andrei Chekun	bc49cd5214	test.py: align the behavior of max-failures parameter with pytest maxfail This allows the existing max-failures values to be passed to pytest without any modification. As a downside, test.py's logic for handling these values changes slightly.	2025-04-22 19:03:08 +02:00
Andrei Chekun	5c3501e4bf	test.py: fix typo in toxiproxy name parameter Fix typo in toxiproxy name parameter. No functional changes, just a cosmetic fix.	2025-04-22 19:02:12 +02:00
Andrei Chekun	2c37a793d1	test.py: add locking to the sqlite writer for resource gather SQLite blocks the DB during writes, so it's not possible to write from several threads at once. To be able to gather metrics in several threads, we need a locking mechanism for threads during writes, so that a thread will not try to write metrics while another thread is performing writes.	2025-04-22 19:01:30 +02:00
Andrei Chekun	800710dc2c	test.py: add sqlite datetime adapter for resource gather Add sqlite datetime adapter for resource gather since default adapters are deprecated from 3.12	2025-04-22 18:59:49 +02:00
Andrei Chekun	bf2a9e267e	test.py: change the parameter for get_modes_to_run() Change the parameter for get_modes_to_run() from session to config to narrow the scope, and prepare it for later use in methods that do not have access to the session, but have access to the config object	2025-04-22 18:58:33 +02:00
Kefu Chai	7254c0c515	db/config.cc: correct a typo in option's description s/incomming/incoming/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23826	2025-04-22 16:55:04 +03:00
Pavel Emelyanov	65efd2b2f6	Merge 'Refactor and enhance s3_tests' from Ernest Zaslavsky This PR introduces a cleanup mechanism in s3_tests to remove uploaded objects after the test completes, ensuring a clean testing environment. Additionally, the recently added test has been refactored and split into smaller, more maintainable parts, improving readability and extending its coverage to include the "proxied" case. As these changes primarily improve code aesthetics and maintainability, backporting is not necessary. Refs: https://github.com/scylladb/scylladb/issues/23830 Closes scylladb/scylladb#23828 * github.com:scylladb/scylladb: s3_tests: Improve and extend copy object test coverage s3_tests: Implement post-test cleanup for uploaded objects	2025-04-22 16:40:37 +03:00
Nadav Har'El	5fd2eabd48	Merge 'Generalize the diversity of parse_table_infos() callers in API' from Pavel Emelyanov The helper in question is used in several different ways -- by handlers directly (most of the callers), as a part of wrap_ks_cf() helper and by one of its overloads that unpack the "cf" query parameter from request. This PR generalizes most of the described callers thus reducing the number of differently-looking ways API handlers parse "keyspace" and "cf" request parameters. Continuation of #22742 Closes scylladb/scylladb#23368 * github.com:scylladb/scylladb: api: Squash two parse_table_infos into one api: Generalize keyspaces:tables parsing a little bit more api: Provide general pair<keyspace, vector<table>> parsing api: Remove ks_cf_func and related code	2025-04-22 15:40:06 +03:00
Nadav Har'El	8d1a413357	test/scylla_gdb: better error message when running on dev build mode The test/scylla_gdb suite needs Scylla to have been built with debug symbols - which is NOT the case for the dev build. So the script test/scylla_gdb/run attempts to recognize when a developer runs it on an executable with the debug symbols missing - and prints a clear error. Unfortunately, as we noticed in #10863, and again in #23832, because wasmtime is compiled with debug symbols and linked with Scylla, build/dev/scylla "pretends" to have debug symbols, foiling the check in test/scylla_gdb/run. Reviewers rejected two solutions to this problem (pull requests #10865 and #10923), so in pull request #10937 I added a cosmetic solution just for test/scylla_gdb: in test/scylla_gdb/conftest.py we check that there are really debug symbols that interest us, and if not, exit immediately instead of failing each test separately. For some reason, the sys.exit() we used is no longer effective - it no longer exits pytest, so in this patch we use pytest.exit() instead. Fixes #23832 (sort of, we leave build/dev/scylla with the fake claim that it has debug symbols, but test/scylla_gdb will handle this situation more gracefully). Closes scylladb/scylladb#23834	2025-04-22 15:02:06 +03:00
Michael Litvak	5c1d24f983	test: test_mv_topology_change: increase timeout for remove_node The test `test_mv_write_to_dead_node` currently uses a timeout of 60 seconds for remove_node, after it was increased from 30 seconds to fix scylladb/scylladb#22953. Apparently it is still too low, and it was observed to fail in debug mode. Normally remove_node uses a default timeout of TOPOLOGY_TIMEOUT = 1000 seconds, but the test requires a timeout which is shorter than 5 minutes, because it is a regression test for an issue where MV updates hold topology changes for more than 5 minutes, and we want to verify in the test that the topology change completes in less than 5 minutes. To resolve the issue, we set the test to skip in debug mode, because the remove node operation is unpredictably slow, and we increase the timeout to 180 seconds which is hopefully enough time for remove_node in non-debug modes, and still sufficient to satisfy the test requirements. Fixes scylladb/scylladb#22530 Closes scylladb/scylladb#23833	2025-04-22 10:51:19 +02:00
Kefu Chai	a2b46cbf45	sstables_loader: fix the indent Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-04-22 12:05:55 +08:00
Kefu Chai	6b3ecad467	sstables_loader: fix the racing between get_progress() and release_resources() This change addresses a critical race condition in the sstables_loader where `get_progress()` could access invalid `progress_holder` instances after `release_resources()` destroyed them. Problem: - Progress tracking uses two components: `_progress_state` (tracks state) and `_progress_per_shard` (sharded service with actual progress data) - `get_progress()` first checks if `_progress_state` is initialized, then accumulates progress from `_progress_per_shard` - As both functions are coroutines, `get_progress()` could be preempted after state check but before accessing `_progress_per_shard` - If `release_resources()` runs during this preemption, it destroys the `progress_holder` instances in `_progress_per_shard`, causing `get_progress()` to access invalid memory. Solution: - Implemented shared/exclusive locking to protect access to both state and sharded progress data - Multiple `get_progress()` calls can execute in parallel (shared access) - `release_resources()` acquires exclusive access before modifying resources - This prevents potential memory corruption and ensures consistent progress reporting Fixes #23801 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-04-22 12:05:54 +08:00
Ernest Zaslavsky	edaa3f4bdd	s3_tests: Improve and extend copy object test coverage Refactored the copy object test to enhance readability and maintainability. The test was simplified and split into smaller, more focused parts. Additionally, a "proxied" variant of the test was introduced to expand coverage.	2025-04-21 20:54:14 +03:00
Ernest Zaslavsky	252a0a14af	s3_tests: Implement post-test cleanup for uploaded objects Ensure cleanup after tests by deleting objects uploaded to MinIO. This improves resource management and maintains a clean test environment.	2025-04-21 20:54:14 +03:00
Avi Kivity	2dcd2b21ae	Merge 'tablets: Equalize per-table balance when allocating tablets for a new table' from Tomasz Grabiec Fixes the following scenario: 1. Scale out adds new nodes to each rack 2. Table is created - all tablets are allocated to new nodes because they have low load 3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed We're wrong to try to equalize global load when allocating tablets, and we should equalize per-table load instead, and let background load balancing fix it in a fair way. It will add to the allocated storage imbalance, but: 1. The table is initially empty, so doesn't impact actual storage imbalance. 2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately. 3. If the table was created before imbalance was formed, we would end up in the same situation as in the problematic scenario after the patch. 4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in. Before we have CPU-aware tablet allocation, and thus can prove we have CPU capacity on the small nodes, we should respect per-table balance as this is the way in which we achieve full CPU utilization. Fixes #23631 Backport to 2025.1 because load imbalance is a serious problem in production. Closes scylladb/scylladb#23708 * github.com:scylladb/scylladb: tablets: Equalize per-table balance when allocating tablets for a new table load_sketch: Tolerate missing tablet_map when selecting for a given table tests: tablets: Simplify tests by moving common code to topology_builder	2025-04-21 17:06:30 +03:00
Pavel Emelyanov	eb5b52f598	Merge 'main: make DC and rack immutable after bootstrap' from Piotr Dulikowski Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278 Fixes: scylladb/scylladb#22869 Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga Closes scylladb/scylladb#23800 * github.com:scylladb/scylladb: doc: changing topology when changing snitches is no longer supported test: cluster: introduce test_no_dc_rack_change storage_service: don't update DC/rack in update_topology_with_local_metadata main: make dc and rack immutable after bootstrap test: cluster: remove test_snitch_change	2025-04-21 15:52:55 +03:00
Yaniv Michael Kaul	b374f94b15	pip installation: use --no-cache-dir There are three reasons we may want NOT to use caching of pip deps: 1. When building a container, unless we specifically clean it up, it'll remain, even when we squash the image layers later. 2. When building a container, that cache is not useful, as we squash our containers later (so that layer is not cached really). And our CI cleans up the layers repo anyway. 3. Caching sometimes isn't great, and doesn't ensure we pick up the exact version (or latest) that we wish to... This PR changes two locations in Scylla, both of which (also) build containers, so certainly relevant for 1, 2 above and possibly 3. No real need to backport. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#23822	2025-04-21 13:46:57 +03:00
Avi Kivity	0ba3ce1741	test: gdb: avoid using `file(1)` to determine if debug information is present The scylla_gdb tests verify, as a sanity check, that the executable was built with debug information. They do so via file(1). In Fedora 42, file(1) crashes on ELF files that have interpreter pathnames larger than 128 characters[1]. This was later fixed[2], but the fix is not in any release. Work around the problem by using objdump instead of file. [1] https://bugzilla.redhat.com/show_bug.cgi?id=2354970 [2] `b3384a1fbf` Closes scylladb/scylladb#23823	2025-04-21 13:29:27 +03:00
Andrei Chekun	441cee8d9c	test.py: fix gathering logs in case of fail Currently log files have information about run_id twice: cluster.object_store_test_backup.10.test_abort_restore_with_rpc_error.dev.10_cluster.log However, sometimes the first run_id can be incorrect: cluster.object_store_test_backup.1.test_abort_restore_with_rpc_error.dev.10_cluster.log Remove the first run_id from the name to avoid this issue, since it is redundant anyway. Remove the creation of an empty file for the scylla manager log, since it is redundant and was based on an incorrect assumption about the root cause of the failure. Add an extension to the stacktrace file, so that Jenkins opens it in a new browser tab instead of downloading it. Fixes: https://github.com/scylladb/scylladb/issues/23731 Closes scylladb/scylladb#23797	2025-04-21 13:12:35 +03:00
Pavel Emelyanov	09caad6147	test: Remove sstable_assertions::get_stats_metadata() It mirrors the sstable method of the same name, which is public. With -> operator, it's just as convenient to call it directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-04-18 18:53:41 +03:00
Pavel Emelyanov	294e56207d	test: Add sstable_assertions::operator->() ... and replace get_sstable() with it. It's more natural (despite having the only user) to consider the class to be yet another "pointer" to an sstable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-04-18 18:52:39 +03:00
Sergey Zolotukhin	2314feeae2	test: Ignore DEBUG,TRACE,INFO level messages when checking for failed mutations. Update the regular expression in `check_node_log_for_failed_mutations` to avoid false test failures when DEBUG-level logging is enabled. Fixes scylladb/scylladb#23688 Closes scylladb/scylladb#23658	2025-04-18 16:17:41 +03:00
Calle Wilund	4a44651fce	encryption_at_rest_test: Make fake_proxy read/write loop noexcept Fixes #23774 Test code falls into same when_all issue as http client did. Avoid passing exceptions through this, and instead catch and report in worker lambda. Closes scylladb/scylladb#23778	2025-04-18 16:17:41 +03:00
Pavel Emelyanov	324daac156	Merge 'Add CopyObject API implementation to S3 client' from Ernest Zaslavsky Implement the CopyObject API to directly copy S3 object from one location to another. This implementation consumes zero networking overhead on the client side since the object is copied internally by S3 machinery Usage example: Backup of tiered SSTables - you already have SSTables on S3, CopyObject is the ideal way to go No need to backport since we are adding new functionality for a future use Closes scylladb/scylladb#23779 * github.com:scylladb/scylladb: s3_client: implement S3 copy object s3_client: improve exception message s3_client: reposition local function for future use	2025-04-18 16:17:41 +03:00
Pavel Emelyanov	cc919b08c2	Merge 'backup: Optimize S3 throughput with shard-based upload' from Ernest Zaslavsky This PR enhances S3 throughput by leveraging every available shard to upload backup files concurrently. By distributing the load across multiple shards, we significantly improve the upload performance. Each shard retrieves an SSTable and processes its files sequentially, ensuring efficient, file-by-file uploads. To prevent uncontrolled fiber creation and potential resource exhaustion, the backup task employs a directory semaphore from the sstables_manager. This mechanism helps regulate concurrency at the directory level, ensuring stable and predictable performance during large-scale backup operations. Refs #22460 fixes: #22520 ``` =========================================== Release build, master, smp-16, mem-32GiB Bytes: 2342880184, backup time: 9.51 s =========================================== Release build, this PR, smp-16, mem-32GiB Bytes: 2342891015, backup time: 1.23 s =========================================== ``` Looks like it is faster at least x7.7 No backport needed since it (native backup) is still unused functionality Closes scylladb/scylladb#23727 * github.com:scylladb/scylladb: backup: Add test for invalid endpoint backup_task: upload on all shards backup_task: integrate sharded storage manager for upload	2025-04-18 16:17:41 +03:00
Avi Kivity	6b415cfd4b	Merge 'managed_bytes: in the copy constructor, respect the target preferred allocation size' from Michał Chojnowski Commit `14bf09f447` added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer. But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too. But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes. In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator). In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments. Consequences of the bug: 1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2. 2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though). 3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory. There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew. 
But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments. If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation. Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072 Fixes https://github.com/scylladb/scylladb/issues/22941 Fixes https://github.com/scylladb/scylladb/issues/22389 Fixes https://github.com/scylladb/scylladb/issues/23781 This is a regression fix, should be backported to all affected releases. Closes scylladb/scylladb#23782 * github.com:scylladb/scylladb: managed_bytes_test: add a reproducer for #23781 managed_bytes: in the copy constructor, respect the target preferred allocation size	2025-04-17 21:14:10 +03:00
Pavel Emelyanov	ca2cc5e826	Merge 'test/cluster/test_read_repair: make incremental test work with tablets' from Botond Dénes There are two tests which test incremental read repair: one with row tombstones, the other with partition tombstones. The tests currently force vnodes, by creating the test keyspace with {'enabled': false}. Even so, the tests were found to be flaky so one of them is marked for skip. This commit does the following changes: * Make the tests use tablets by creating the test keyspace with tablets. * Change the way the tests write data so it works with tablets: currently the tests use scylla-sstable write + upload but this won't work with tablets since upload with tablets implies --load-and-stream which means data is streamed to all replicas (no difference created between nodes). Switch to the classic stop-node + write to other replica with CL=ONE. * Remove the skip added to the partition-tombstone test variant. Fixes: #21179 Test improvement, no backport required. Closes scylladb/scylladb#23167 * github.com:scylladb/scylladb: wip test/cluster/test_read_repair: make incremental test work with tablets	2025-04-17 18:54:00 +03:00
Piotr Dulikowski	325a89638c	doc: changing topology when changing snitches is no longer supported Update the "How to Switch Snitches" document to indicate that changing topology (i.e. changing node's DC or rack) while changing the snitch is no longer supported. Remove a note which said that switching snitches is not supported with tablets. It was introduced because of the concern that switching a snitch might change DC or rack of the node, for which our current tablet load balancer is completely unprepared. Now that changing DC/rack is forbidden, there doesn't seem to be anything related to snitches which could cause trouble for tablets.	2025-04-17 16:22:58 +02:00
Piotr Dulikowski	796c8d1601	test: cluster: introduce test_no_dc_rack_change The test makes sure that changing the DC or rack in the snitch's configuration fails with an expected error.	2025-04-17 16:22:58 +02:00
Piotr Dulikowski	1791ae3581	storage_service: don't update DC/rack in update_topology_with_local_metadata The DC/rack are now immutable and cannot be changed after restart, so there is no need to update the node's system.topology entry with this information on restart.	2025-04-17 16:22:58 +02:00
Piotr Dulikowski	ce2fab7cce	main: make dc and rack immutable after bootstrap Changing DC or rack on a node which was already bootstrapped is, in case of vnodes, very unsafe (almost guaranteed to cause data loss or unavailability), and is outright not supported if the cluster has tablet-backed keyspaces. Moreover, the possibility of doing that makes it impossible to uphold some of the invariants promised by the RF-rack-valid flag, which is eventually going to become unconditionally enabled. Get rid of the above problems by removing the possibility of changing the DC / rack of a node. A node will now fail to start if its snitch reports a different DC or rack than the one that was reported during the first boot. Fixes: scylladb/scylladb#23278	2025-04-17 16:22:26 +02:00
Tomasz Grabiec	1e407ab4d2	tablets: Equalize per-table balance when allocating tablets for a new table Fixes the following scenario: 1. Scale out adds new nodes to each rack 2. Table is created - all tablets are allocated to new nodes because they have low load 3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed We're wrong to try to equalize global load when allocating tablets, and we should equalize per-table load instead, and let background load balancing fix it in a fair way. It will add to the allocated storage imbalance, but: 1. The table is initially empty, so doesn't impact actual storage imbalance. 2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately. 3. If the table was created before imbalance was formed, we would end up in the same situation in the problematic scenario after the patch. 4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in. Before we have CPU-aware tablet allocation, and thus can prove we have CPU capacity on the small nodes, we should respect per-table balance as this is the way in which we achieve full CPU utilization. Fixes #23631	2025-04-17 16:01:23 +02:00
Tomasz Grabiec	2597a7e980	load_sketch: Tolerate missing tablet_map when selecting for a given table To simplify future usage in network_topology_strategy::add_tablets_in_dc() which invokes populate() for a given table, which may be either new or preexisting.	2025-04-17 16:01:16 +02:00
Ernest Zaslavsky	b79ca5a1aa	backup: Add test for invalid endpoint * During the development phase, the backup functionality broke because we lacked a test that runs backup with an invalid endpoint. This commit adds a test to cover that scenario. * Add checking for the expected error to be propagated from failing/aborted backup	2025-04-17 16:31:43 +03:00
Benny Halevy	b7212620f9	backup_task: upload on all shards Use all shards to upload snapshot files to S3. By using the sharded sstables_manager_for_table infrastructure. Refs #22460 Quick perf comparison =========================================== Release build, master, smp-16, mem-32GiB Bytes: 2342880184, backup time: 9.51 s =========================================== Release build, this PR, smp-16, mem-32GiB Bytes: 2342891015, backup time: 1.23 s =========================================== Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Co-authored-by: Ernest Zaslavsky <ernest.zaslavsky@scylladb.com>	2025-04-17 16:31:42 +03:00
Piotr Dulikowski	dd2e507ece	test: cluster: remove test_snitch_change This test checked that it is possible to change DC/rack of a node during restart. This will become explicitly forbidden, so remove the test.	2025-04-17 13:51:22 +02:00
Aleksandra Martyniuk	e178bd7847	test: add test for getting tasks children Add test that checks whether the children of a virtual task will be properly gathered if a node is down.	2025-04-17 13:48:44 +02:00
Aleksandra Martyniuk	53e0f79947	tasks: check whether a node is alive before rpc Check whether a node is alive before making an rpc that gathers children infos from the whole cluster in virtual_task::impl::get_children.	2025-04-17 12:51:22 +02:00
Michał Chojnowski	6c1889f65c	managed_bytes_test: add a reproducer for #23781	2025-04-17 12:51:01 +02:00
Botond Dénes	8ac7c54d8b	Merge 'topology_coordinator: stop: await all background_action_holder:s' from Benny Halevy Add missing awaits for the rebuild_repair and repair background actions. Although the background actions hold the _async_gate which is closed in topology_coordinator::run(), stop() still needs to await all background action futures and handle any errors they may have left behind. Fixes #23755 * The issue exists since 6.2 Closes scylladb/scylladb#17712 * github.com:scylladb/scylladb: topology_coordinator: stop: await all background_action_holder:s topology_coordinator: stop: improve error messages topology_coordinator: stop: define stop_background_action helper	2025-04-17 12:10:29 +03:00
Kefu Chai	b0cbe86780	s3/client: define a constant for security credential resource Instead of repeating it, let's define a constant and reuse it. Less repetition this way. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23713	2025-04-17 11:51:15 +03:00
Kefu Chai	a33651b03e	db, service: do not include unused header these unused headers were flagged by clang-include-cleaner. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23735	2025-04-17 11:49:59 +03:00
Botond Dénes	33e383c557	scripts/pull_github_pr.sh: add argument parsing Instead of hardcoding PR_NUM=$1 and FORCE=$2. This current setup is not very flexible and one gets no feedback if the arguments are incorrect or not recognized. Add proper position-independent argument parsing using a classic while case loop. Closes scylladb/scylladb#23623	2025-04-17 11:49:15 +03:00
Nadav Har'El	84d4af1f0e	Merge 'Alternator batch rcu' from Amnon Heiman This series adds support for reporting consumed capacity in BatchGetItem operations in Alternator. It includes changes to the RCU accounting logic, exposing internal functionality to support batch-specific behavior, and adds corresponding tests for both simple and complex use cases involving multiple tables and consistency modes. Need backporting to 2025.1, as RCU and WCU are not fully supported Fixes #23690 Closes scylladb/scylladb#23691 * github.com:scylladb/scylladb: test_returnconsumedcapacity.py: test RCU for batch get item alternator/executor: Add RCU support for batch get items alternator/consumed_capacity: make functionality public	2025-04-17 10:08:16 +03:00
Botond Dénes	22a28ca1db	wip	2025-04-17 03:01:17 -04:00
Ernest Zaslavsky	a369dda049	s3_client: implement S3 copy object Add support for the CopyObject API to enable direct copying of S3 objects between locations. This approach eliminates networking overhead on the client side, as the operation is handled internally by S3.	2025-04-17 09:47:47 +03:00
Botond Dénes	19b4f10598	test/cluster/test_read_repair: make incremental test work with tablets There are two tests which test incremental read repair: one with row tombstones, the other with partition tombstones. The tests currently force vnodes, by creating the test keyspace with {'enabled': false}. Even so, the tests were found to be flaky so one of them is marked for skip. This commit does the following changes: * Make the tests use tablets by creating the test keyspace with tablets. * Change the way the tests write data so it works with tablets: currently the tests use scylla-sstable write + upload but this won't work with tablets since upload with tablets implies --load-and-stream which means data is streamed to all replicas (no difference created between nodes). Switch to the classic stop-node + write to other replica with CL=ONE. * Remove the skip added to the partition-tombstone test variant. Also add tracing to the read-repair query, to make debugging the test easier if it fails. Fixes: #21179	2025-04-17 02:01:17 -04:00
Michał Chojnowski	4e2f62143b	managed_bytes: in the copy constructor, respect the target preferred allocation size Commit `14bf09f447` added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer. But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too. But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes. In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator). In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments. Consequences of the bug: 1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2. 2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though). 3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory. There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew. But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments. If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation. Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072 Fixes https://github.com/scylladb/scylladb/issues/22941 Fixes https://github.com/scylladb/scylladb/issues/22389 Fixes https://github.com/scylladb/scylladb/issues/23781	2025-04-16 22:06:06 +02:00
Nadav Har'El	6db666a1c1	replica: fix 10-second pause during shutdown As noticed in issue #23687, if we shut down Scylla while a paged read is in progress - or even a paged read that the client had no intention of ever resuming - the shutdown pauses for 10 seconds. The problem was the stop() order - we must stop the "querier cache" before we can close sstables - the "querier cache" is what holds paged readers alive waiting for clients to resume those reads, and while a reader is alive it holds on to sstables so they can't be closed. The querier cache's querier_cache::default_entry_ttl is set to 10 seconds, which is why the shutdown was un-paused after 10 seconds. The fix in this patch is obvious: We need to stop the querier cache (and have it release all the readers it was holding) before we close the sstables. Fixes #23687 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23770	2025-04-16 20:35:44 +03:00
Avi Kivity	0206da5232	Merge 'readers: strip "flat" and "v2" from names' from Botond Dénes Continue the effort of normalizing reader names, stripping legacy qualifying terms like "flat" and "v2". Flat and v2 readers are the default now, we only need to add qualifying terms to readers which are different than the normal. One such reader remains: `make_generating_reader_v1()`. This PR contains mostly mechanical changes, done with a sed script. Commits which only contain such mechanical renames are marked as such in the commitlog. Code cleanup, no backport needed. Closes scylladb/scylladb#23767 * github.com:scylladb/scylladb: readers: mv reversing_v2.hh reversing.hh readers: mv generating_v2.hh generating.hh tree: s/make_generating_reader_v2/make_generating_reader/ readers: mv from_mutations_v2.hh from_mutations.hh tree: s/make_mutation_reader_from_mutations_v2/make_mutation_reader_from_mutations/s readers: mv from_fragments_v2.hh from_fragments.hh readers: mv forwardable_v2.hh forwardable.hh readers: mv empty_v2.hh empty.hh tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/ readers/empty_v2.hh: replace forward declarations with include of fwd header readers/mutation_reader_fwd.hh: forward declare reader_permit readers: mv delegating_v2.hh delegating.hh readers/delegating_v2.hh: move reader definition to _impl.hh file	2025-04-16 20:21:51 +03:00
Ernest Zaslavsky	8929cb324e	s3_client: improve exception message Clarify that the multipart upload was aborted due to a failure in parsing ETags.	2025-04-16 18:58:22 +03:00
Ernest Zaslavsky	993953016f	s3_client: reposition local function for future use The local function has been relocated higher in the code to prepare for its usage in upcoming implementations.	2025-04-16 18:46:31 +03:00
Ernest Zaslavsky	428f673ca2	backup_task: integrate sharded storage manager for upload Introduce the sharded storage manager and use it to instantiate upload clients. Full functionality will be implemented in subsequent changes.	2025-04-16 18:18:58 +03:00
Amnon Heiman	3acde5f904	test_returnconsumedcapacity.py: test RCU for batch get item This patch adds tests for consumed capacity in batch get item. It tests both the simple case and the multi-item, multi-table case that combines consistent and non-consistent reads.	2025-04-16 17:05:32 +03:00
Pavel Emelyanov	8b2cababb6	generic_server: Don't mess with db::config The db::config is top-level configuration of scylla, we generally try to avoid using it even in scylla components: each uses its own config initialized by the service creator out of the db::config itself. The generic_server is not an exception, all the more so, it already has its own config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23705	2025-04-16 17:02:30 +03:00
Amnon Heiman	88095919d0	alternator/executor: Add RCU support for batch get items This patch adds RCU support for batch get items. With batch requests, multiple objects are read from multiple tables. While the criterion for adding the units is per the batch request, the units are calculated per table—and so is the read consistency.	2025-04-16 16:53:22 +03:00
Amnon Heiman	0eabf8b388	alternator/consumed_capacity: make functionality public The consumed_capacity_counter is not completely applicable for batch operations. This patch makes some of its functionality public so that batch get item can use the components to decide if it needs to send consumed capacity in the reply, to get the half units used by the metrics and returned result, and to allow an empty constructor for the RCU counter.	2025-04-16 16:49:40 +03:00
Benny Halevy	7a0f5e0a54	topology_coordinator: stop: await all background_action_holder:s Add missing awaits for the rebuild_repair and repair background actions. Although the background actions hold the _async_gate which is closed in topology_coordinator::run(), stop() still needs to await all background action futures and handle any errors they may have left behind. Fixes #23755 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-16 15:23:02 +03:00
Benny Halevy	6de79d0dd3	topology_coordinator: stop: improve error messages "when cleanup" is ill-formed. Change "when XYZ" to "during XYZ" instead. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-16 15:20:58 +03:00
Benny Halevy	d624795fda	topology_coordinator: stop: define stop_background_action helper Refactor the code to use a helper to await background_action_holder and handle any errors by printing a warning. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-16 15:20:39 +03:00
Botond Dénes	6172ff501f	readers: mv reversing_v2.hh reversing.hh Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	c8563b9604	readers: mv generating_v2.hh generating.hh Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	dfd7f03463	tree: s/make_generating_reader_v2/make_generating_reader/ Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	c29c696780	readers: mv from_mutations_v2.hh from_mutations.hh Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	b104862702	tree: s/make_mutation_reader_from_mutations_v2/make_mutation_reader_from_mutations/s Completely mechanical change.	2025-04-16 04:46:07 -04:00
Anna Stuchlik	0b4740f3d7	doc: add info about Scylla Doctor Automation to the docs Fixes https://github.com/scylladb/scylladb/issues/23642 Closes scylladb/scylladb#23745	2025-04-16 11:44:35 +03:00
Botond Dénes	7547d0c6a9	readers: mv from_fragments_v2.hh from_fragments.hh Completely mechanical change.	2025-04-16 04:35:00 -04:00
Botond Dénes	f1bd2553ed	readers: mv forwardable_v2.hh forwardable.hh Completely mechanical change.	2025-04-16 04:33:50 -04:00
Botond Dénes	a9d75c4f9d	readers: mv empty_v2.hh empty.hh Completely mechanical change.	2025-04-16 04:32:56 -04:00
Botond Dénes	05829f98f3	tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/ Completely mechanical change.	2025-04-16 04:32:56 -04:00
Botond Dénes	0e33f0d09e	readers/empty_v2.hh: replace forward declarations with include of fwd header	2025-04-16 04:12:08 -04:00
Botond Dénes	d75936d989	readers/mutation_reader_fwd.hh: forward declare reader_permit It is commonly used as parameter to reader factory methods.	2025-04-16 04:12:08 -04:00
Botond Dénes	7d9b91a00e	readers: mv delegating_v2.hh delegating.hh Completely mechanical change.	2025-04-16 04:11:55 -04:00
Botond Dénes	c7f68a2649	readers/delegating_v2.hh: move reader definition to _impl.hh file The idea behind readers/ is that each reader has its minimal header with just a factory method declaration. The delegating reader is defined in the factory header because it has a derived class in row_cache_test.cc. Move the definition to delegating_impl.hh so users not interested in deriving from it don't pay the price in header include cost.	2025-04-16 03:47:57 -04:00
Pavel Emelyanov	70ac5828a8	Update seastar submodule * seastar 099cf616...e44af9b0 (19): > Add assertion to `get_local_service` > http_client: Improve handling of server response parsing errors > util: include used header > core: Fix module linkage by using `inline constexpr` for shared constants > build: fix P2582R1 detection for GCC compiler compatibility > app-template: remove production warning > ioinfo: Extend printed data a bit more > reactor: Fix indentation after previous patch > reactor: Configure multiple mountpoints per disk > io_queue, resource, reactor: Rename dev_t -> unsigned > resource: Rename mountpoint to disk in resources > reactor: Keep queues as shared_ptr-s > io_queue: Drop device ID > io_intent: Use unsigned queue id as a key > io_queue: Keep unsigned queue id on an io_queue > file: Keep device_id on posix file impl > io_queue: Print mountpoint in latency goal bump message > io_intent: Rename qid to cid > reactor: Move engine()._num_io_groups assignment and check Changes in io-queue call for scylla-gdb update as well -- now the reactor map of device to io-queue uses seastar::shared_ptr, not std::unique_ptr. Closes scylladb/scylladb#23733	2025-04-16 09:44:37 +03:00
Botond Dénes	f5125ffa18	Merge 'Ensure raft group0 RPCs use the gossip scheduling group.' from Sergey Zolotukhin Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore. Fixes scylladb/scylladb#21637 Backport: 6.2 and 6.1 Closes scylladb/scylladb#22779 * github.com:scylladb/scylladb: Ensure raft group0 RPCs use the gossip scheduling group Move RAFT operations verbs to GOSSIP group.	2025-04-16 09:11:29 +03:00
Lakshmipathi	42ed6a87bf	test: Test truncate during topology change Add a new node, issue a truncate call during the topology change, and verify that all nodes have empty data after tablet migration. Fixes: https://github.com/scylladb/scylla-dtest/issues/5317 Signed-off-by: Lakshmipathi Ganapathi <lakshmipathi.ganapathi@scylladb.com> Closes scylladb/scylladb#22595	2025-04-16 09:10:22 +03:00
Tomasz Grabiec	001d3b2415	Merge 'storage_service: preserve state of busy topology when transiting tablet' from Łukasz Paszkowski Commit `876478b84f` ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise. Restrict the state change to when the topology state machine is idle. In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way. Unit test: Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone. Fixes https://github.com/scylladb/scylladb/issues/20073. Commit `876478b84f` was first released in scylla-6.0.0, so we might want to backport this patch accordingly. Closes scylladb/scylladb#23751 * github.com:scylladb/scylladb: storage_service: add unit test for mid-decommission transit_tablet() storage_service: preserve state of busy topology when transiting tablet	2025-04-16 00:19:24 +02:00
Pavel Emelyanov	b79137eaa4	storage_service: Use this->_features directly This dependency is already there, storage service doesn't need to go rounds via database reference to get to the features. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23739	2025-04-15 21:11:12 +03:00
Tomasz Grabiec	d493a8d736	tests: tablets: Simplify tests by moving common code to topology_builder Reduces code duplication.	2025-04-15 16:05:41 +02:00
Laszlo Ersek	841ca652a0	storage_service: add unit test for mid-decommission transit_tablet() Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-04-15 15:15:25 +02:00
Michał Chojnowski	b3d951517d	test/scylla_gdb: generate a coredump when coro_task fails This test fails sometimes, but rarely and unreliably. We want to get a coredump from it the next time it fails. Sending a SIGSEGV should induce that. Refs https://github.com/scylladb/scylladb/issues/22501 Closes scylladb/scylladb#23256	2025-04-15 15:16:38 +03:00
Calle Wilund	abd2d8a58b	test_tools: Manual merge of local key gen tool test from enterprise Fixes scylladb/scylla-enterprise#5358 Transposed tool test for local file generator, originally java test. Then enterprise test. Now here. Closes scylladb/scylladb#23726	2025-04-15 15:14:08 +03:00
Laszlo Ersek	e1186f0ae6	storage_service: preserve state of busy topology when transiting tablet Commit `876478b84f` ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise. Restrict the state change to when the topology state machine is idle. In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-04-15 13:44:45 +02:00
Piotr Dulikowski	22e3b8eccd	Merge 'test/cqlpy: Adjust tests to RF-rack-valid keyspaces' from Dawid Mędrek In this PR, we adjust tests in the cqlpy test suite so they only use RF-rack-valid keyspaces. After that, we enable the configuration option `rf_rack_valid_keyspaces` in the suite by default. Refs scylladb/scylladb#23428 Backport: backporting to 2025.1 so we can test the option there too. Closes scylladb/scylladb#23489 * github.com:scylladb/scylladb: test/cqlpy: Enable rf_rack_valid_keyspaces by default test: Move test_alter_tablet_keyspace_rf to cluster suite test/cqlpy: Adjust tests to RF-rack-valid keyspaces test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces	2025-04-15 12:43:11 +02:00
Avi Kivity	b4d4e48381	scylla-gdb: small-objects: fix for very small objects Because of rounding and alignment, there are multiple pools for small sizes (e.g. 4 for size 32). Because the pool selection algorithm ignores alignment, different pools can be chosen for different object sizes. For example, an object size of 29 will choose the first pool of size 32, while an object size of 32 will choose the fourth pool of size 32. The small-objects command doesn't know about this and always considers just the first pool for a given size. This causes it to miss out on sister pools. While it's possible to adjust pool selection to always choose one of the pools, it may eat a precious cycle. So instead let's compensate in the small-objects command. Instead of finding one pool for a given size, find all of them, and iterate over all those pools. Fixes #23603 Closes scylladb/scylladb#23604	2025-04-15 11:16:52 +03:00
Emil Maskovsky	3930ee8e3c	raft: fix data center remaining nodes initialization The `_remaining_nodes` attribute of the data center information was not initialized correctly. The parameter was passed by value to the initialization function instead of by reference or pointer. As a result, `_remaining_nodes` was left initialized to zero, causing an underflow when decrementing its value. This bug did not significantly impact behavior because other safeguards, such as capping the maximum voters per data center by the total number of nodes, masked the issue. However, it could lead to inefficiencies, as the remaining nodes check would not trigger correctly. Fixes: scylladb/scylladb#23702 No backport: The bug is only present in the master branch, so no backport is required. Closes scylladb/scylladb#23704	2025-04-15 09:58:32 +02:00
Nadav Har'El	fbcf77d134	raft: make group0 Raft operation timeout configurable A recent commit `370707b111` (re)introduced a timeout for every group0 Raft operation. This timeout was set to 60 seconds, which, paraphrasing Bill Gates, "ought to be enough for anybody". However, one of the things we do as a group0 operation is schema changes, and we already noticed a few years ago, see commit `0b2cf21932`, that in some extremely overloaded test machines where tests run hundreds of times (!) slower than usual, a single big schema operation - such as Alternator's DeleteTable deleting a table and multiple of its CDC or view tables - sometimes takes more than 60 seconds. The above fix changed the client's timeout to wait for 300 seconds instead of 60 seconds, but now we also need to increase our Raft timeout, or the server can time out. We've seen this happening recently making some tests flaky in CI (issue #23543). So let's make this timeout configurable, as a new configuration option group0_raft_op_timeout_in_ms. This option defaults to 60000 (i.e., 60 seconds), the same as the existing default. The test framework overrides this default with a higher 300 second timeout, matching the client-side timeout. Before this patch, this timeout was already configurable in a strange way, using injections. But this was a misstep: We already have more than a dozen timeouts configurable through the normal configuration, and this one should have been configured in the same way. There is nothing "holy" about the default of 60 seconds we chose, and who knows maybe in the future we might need to tweak it in the field, just like we made the other timeouts tweakable. Injections cannot be used in release mode, but configuration options can. Fixes #23543 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23717	2025-04-15 10:57:39 +03:00
Kefu Chai	3e3f583b84	docs/dev/tombstone.md: fix a typo s/alwas/always/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23734	2025-04-15 10:54:42 +03:00
Avi Kivity	5e1cf90a51	build: replace tools/java submodule with packaged cassandra-stress We no longer use tools/java (scylladb/scylla-tools-java.git) for nodetool or cqlsh; only cassandra-stress. Since that is available in package form install that and excise the tools/java submodule from the source tree. pgo/ is adjusted to use the packaged cassandra-stress (and the cqlsh submodule). A few jmx references are dropped as well. Frozen toolchain regenerated. Optimized clang from https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz Closes scylladb/scylladb#23698	2025-04-15 10:11:28 +03:00
Jenkins Promoter	9699c3ded4	Update pgo profiles - aarch64	2025-04-15 04:45:34 +03:00
Jenkins Promoter	8472aa9e53	Update pgo profiles - x86_64	2025-04-15 04:29:24 +03:00
Pavel Emelyanov	b25cb5af0c	Merge 'Use named gates' from Benny Halevy Name the gates and phased barriers we use to make it easy to debug gate_closed_exception Refs https://github.com/scylladb/seastar/pull/2688 * Enhancement only, no backport needed Closes scylladb/scylladb#23329 * github.com:scylladb/scylladb: utils: loading_cache: use named_gate utils: flush_queue: use named_gate sstables_manager: use named gate sstables_loader: use named gate utils: phased_barrier, pluggable: use named gate utils: s3::client::multipart_upload: use named gate utils: s3::client: use named_gate transport: controller: use named gate tracing: trace_keyspace_helper: use named gate task_manager: module: use named gate topology_coordinator: use named gate storage_service: use named gate storage_proxy: wait_for_hint_sync_point: use named gate storage_proxy: remote: use named gate service: session: use named gate service: raft: raft_rpc: use named gate service: raft: raft_group0: use named gate service: raft: persistent_discovery: use named gate service: raft: group0_state_machine: use named gate service: migration_manager: use named gate replica: table: use named gate replica: compaction_group, storage_group: use named gate redis: query_processor: use named gate repair: repair_meta: use named gate reader_concurrency_semaphore: use named gate raft: server_impl: use named gate querier_cache: use named gate gms: gossiper: use named gate generic_server: use named gate db: sstables_format_listener: use named gate db: snapshot: backup_task: use named gate db: snapshot_ctl: use named gate hints: hints_sender: use named gate hints: manager: use named gate hints: hint_endpoint_manager: use named gate commitlog: segment_manager: use named gate db: batchlog_manager: use named gate query_processor: remote: use named gate compaction: compaction_state: use named gate alternator/server: use named_gate	2025-04-14 20:56:32 +03:00
Sergey Zolotukhin	e05c082002	Ensure raft group0 RPCs use the gossip scheduling group Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For Raft group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This commit adds a check to ensure that the raft group0 RPCs are executed with the `gossiper` scheduling group.	2025-04-14 17:10:46 +02:00
Sergey Zolotukhin	60f1053087	Move RAFT operations verbs to GOSSIP group. In order for RAFT operations to use the gossip system semaphore, move the RAFT verbs to the gossip group in `do_get_rpc_client_idx` (messaging_service). Fixes scylladb/scylladb#21637	2025-04-14 17:09:49 +02:00
Pavel Emelyanov	1bd991a111	test: Inherit sstable_assertions from sstables::test The latter class is invented to let tests access private fields of an sstable (mostly methods). The former is in fact an extended version of it that also does some checks. However, they don't inherit from each other, and the sstable_assertions partially duplicates some functionality of the test one. Add the inheritance, remove the duplicated methods from the child class, update the callers (the test class returns future<>s, the assertions one "knows" it runs in seastar thread) and mark sstable::read_toc() private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23697	2025-04-14 13:45:14 +03:00
Kefu Chai	b3f709bed7	s3: remove an extraneous space Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23714	2025-04-14 13:02:58 +03:00
Michał Chojnowski	6e2795a843	Update seastar submodule * seastar ed8952fb...099cf616 (10): > reactor: Disable hot polling if wakeup granularity is too high > smp: add shard_to_numa_node_mapping() > tests/unit/httpd_test: fix the handling of NUL bytes in the parser > fstream: skip allocation in no write_behinds case > `http`: add `xml` support to `http::mime_types::mappings` > Print incrementally in sigsegv handler > reactor: use 0x for hex addresses > tls: Make session resume key shared across credentials builders creds > build: fix CMAKE_REQUIRED_FLAGS format for sanitizer detection > reactor: Remove sched_debug() related code Closes scylladb/scylladb#23703	2025-04-14 12:54:19 +03:00
Andrei Chekun	8e33d7ab81	test.py: Make the testpy log files in pytest follow the same format Fix the incorrect log file names between conftest and scylla_manager. This regression was introduced in #22960. Currently, scylla manager outputs its logs to a file named with the following pattern: suite_name.path_to_the_test_file_with_subfolders.run_id.function_name.mode.run_id_cluster.log At the same time, pytest tries to find this log under the following name: suite_name.file_name_without_subfolders_path.py.run_id.function_name.mode.run_id_cluster.log Because of this inconsistency, when a test fails, the scylla manager log file is not copied to the failed_test directory and the test raises an exception on teardown. Closes scylladb/scylladb#23596	2025-04-14 12:52:48 +03:00
Evgeniy Naydanov	d6b64642c5	test.py: print out path to Scylla log for Python test suites Test suites with `type: Python` use a single Scylla node created by test.py, but it's handy to print the path to its log file in the pytest log too, to make the file easier to find on failures. Closes scylladb/scylladb#23683	2025-04-14 11:15:37 +03:00
Kefu Chai	69de816b1b	scylla-gdb.py: fix a typo in gdb command description replace "runnign" with "running". Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23716	2025-04-14 10:59:21 +03:00
Benny Halevy	8d7e4d6c36	utils: loading_cache: use named_gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:09 +03:00
Benny Halevy	46f2a24772	utils: flush_queue: use named_gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:02 +03:00
Benny Halevy	d665bb4f8b	sstables_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	7969293dcf	sstables_loader: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	e1fe82ed33	utils: phased_barrier, pluggable: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	d3f498ae59	utils: s3::client::multipart_upload: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	eea83464c7	utils: s3::client: use named_gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:46:51 +03:00
Benny Halevy	79e967e2f5	transport: controller: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:29:48 +03:00
Benny Halevy	3d87b67d0e	tracing: trace_keyspace_helper: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:29:48 +03:00
Benny Halevy	bfdd8a98ca	task_manager: module: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:29:48 +03:00
Benny Halevy	5e864b6277	topology_coordinator: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:29:46 +03:00
Benny Halevy	a67ed59399	storage_service: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	39f1175451	storage_proxy: wait_for_hint_sync_point: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	e228a112fe	storage_proxy: remote: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	0a1e7de6ea	service: session: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	747446cb25	service: raft: raft_rpc: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	01bb3980fc	service: raft: raft_group0: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	6118150d44	service: raft: persistent_discovery: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	e430df6332	service: raft: group0_state_machine: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	5f8b5724e6	service: migration_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	7342a57cbb	replica: table: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	52e1ce7f0d	replica: compaction_group, storage_group: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	aff6017e83	redis: query_processor: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	80b5089d0c	repair: repair_meta: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	679e73053f	reader_concurrency_semaphore: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	9724d87e86	raft: server_impl: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	5780599eec	querier_cache: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	cecfb6dfd7	gms: gossiper: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	bc69bc3de7	generic_server: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	5a71763d75	db: sstables_format_listener: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	da492231df	db: snapshot: backup_task: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	edf497c170	db: snapshot_ctl: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	c5d7272393	hints: hints_sender: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	1c1adb3d60	hints: manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	4c475a1905	hints: hint_endpoint_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	bdd5a61139	commitlog: segment_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	0672c9da5c	db: batchlog_manager: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	f8d5835cab	query_processor: remote: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	747ae5e1c4	compaction: compaction_state: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Benny Halevy	879811e0d2	alternator/server: use named_gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Dawid Mędrek	be0877ce69	test/cqlpy: Enable rf_rack_valid_keyspaces by default All of the tests in the suite have been adjusted so they only use RF-rack-valid keyspaces, so let's start enabling the option by default.	2025-04-11 14:55:13 +02:00
Dawid Mędrek	a59842257a	test: Move test_alter_tablet_keyspace_rf to cluster suite We move the test `test_alter_tablet_keyspace_rf` from the cqlpy to the cluster test suite. The reason behind the change is that the test cannot be run with `rf_rack_valid_keyspaces` turned on in the configuration. During the test, we make the keyspace RF-rack-invalid multiple times. Since RF-rack-validity is a very strong constraint, adjusting the test otherwise is impossible. By moving it to the cluster test suite, we're able to change the configuration of the node used in the test, and so the test can work again.	2025-04-11 14:55:11 +02:00
Dawid Mędrek	958eaec056	test/cqlpy: Adjust tests to RF-rack-valid keyspaces	2025-04-11 14:55:04 +02:00
Dawid Mędrek	6bde01bb59	test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces We adjust three existing Cassandra tests so that they don't create RF-rack-invalid keyspaces. We modify the replication factor used in the problematic tests. The changes don't affect the tests as the value of the RF is unrelated to what they verify. Thanks to that, we can run them now even with enforced RF-rack-valid keyspaces. The drawback is that the modified ALTER statements do not modify the RF at all. However, since the tests seem to verify that the code responsible for VALIDATING a request works as intended, that should have little to no impact on them.	2025-04-11 14:20:14 +02:00
Dawid Mędrek	10589e966f	test/cluster/mv: Adjust test to RF-rack-valid keyspaces We adjust the test in the directory so that all of the used keyspaces are RF-rack-valid throughout their execution. Refs scylladb/scylladb#23428 Closes scylladb/scylladb#23490	2025-04-11 14:03:21 +02:00
Karol Baryła	df64985a4e	Docs: Describe driver issue with tablet RF increase Current protocol extension that sends tablet info to drivers only does that if the driver selects a non-replica coordinator for a routable request. It works well if some node on the replica list is replaced by other node, or if some replicas are removed from the list. Driver will at some point send a request to stale replica, and receive new list in response. The issue is with extending the list with new replicas. In that case old replicas are all still correct, so driver will not select any wrong replica, and will not receive the new list. As far as I know, the only scenario where this could happen is an RF increase. It could be to some degree worked around in the drivers, but it would add significant complexity (definitely more than any other invalidations we introduced) while still not being an ideal solution. This scenario should be rare enough, and the consequences of not handling it minor enough (new replicas not being used as coordinators) that it does not warrant a driver-side solution. Instead this commit adds info about this to documentation, advising users to restart applications after replica lists are extended. It is worth noting that if new tablet feedback protocol extension is implemented then this problem goes away. See issue #21664. Closes scylladb/scylladb#23447	2025-04-11 13:48:40 +02:00
David Garcia	cf11d5eb69	fix: openapi not rendering in docs.scylladb.com/manual Closes scylladb/scylladb#23686	2025-04-10 17:47:58 +03:00
Patryk Jędrzejczak	07a7a75b98	Merge 'raft: implement the limited voters feature' from Emil Maskovsky Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of data centers (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). 
Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. The maximum number of voters of 5 has been chosen as the smallest "safe" value. We can lose majority when multiple nodes (possibly in different dcs and racks) die independently in a short time span. With less than 5 voters, we would lose majority if 2 voters died, which is very unlikely to happen but not entirely impossible. With 5 voters, at least 3 voters must die to lose majority, which can be safely considered impossible in the case of independent failures. Currently the limit will not be configurable (we might introduce configurable limits later if that would be needed/requested). Tests added: * boost/group0_voter_registry_test.cc: run time on CI: ~3.5s * topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC, the run time on CI: 1: ~20s. 3: ~40s, approx 1 min total Fixes: scylladb/scylladb#18793 No backport: This is a new feature that will not be backported. 
Closes scylladb/scylladb#21969 * https://github.com/scylladb/scylladb: raft: distribute voters by rack inside DC raft/test: fix lint warnings in `test_raft_no_quorum` raft/test: add the upgrade test for limited voters feature raft topology: handle on_up/on_down to add/remove node from voters raft: fix the indentation after the limited voters changes raft: implement the limited voters feature raft: drop the voter removal from the decommission raft/test: disable the `stop_before_becoming_raft_voter` test raft/test: stop the server less gracefully in the voters test	2025-04-10 15:29:15 +02:00
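The majority arithmetic behind the limited voters feature can be checked with a short Python sketch. This is only an illustration of the quorum reasoning in the message above, not the topology coordinator's algorithm; the helper names and the `limited` distribution are hypothetical.

```python
def quorum(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can die while a majority is still alive."""
    return voters - quorum(voters)


def survives_dc_loss(voters_per_dc: dict, lost_dc: str) -> bool:
    """True if the remaining DCs still hold a majority of all voters."""
    total = sum(voters_per_dc.values())
    remaining = total - voters_per_dc[lost_dc]
    return remaining >= quorum(total)


# Every node is a voter: 2 DCs with 5 nodes and 1 DC with 10 nodes.
all_voters = {"dc1": 5, "dc2": 5, "dc3": 10}

# At most 5 voters overall. Hypothetical distribution (an assumption) in which
# every DC holds strictly less than half of the voters.
limited = {"dc1": 2, "dc2": 1, "dc3": 2}

if __name__ == "__main__":
    print(survives_dc_loss(all_voters, "dc3"))  # large DC isolated
    print(all(survives_dc_loss(limited, dc) for dc in limited))
    print(tolerated_failures(5), tolerated_failures(3))
```

With all 20 nodes voting, isolating the 10-node DC leaves 10 voters against a quorum of 11, which is the asymmetric-DC problem the message describes.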
Avi Kivity	9559e53f55	Merge 'Adjust tablet-mon.py for capacity-aware load balancing' from Tomasz Grabiec After load-balancer was made capacity-aware it no longer equalizes tablet count per shard, but rather utilization of shard's storage. This makes the old presentation mode not useful in assessing whether balance was reached, since nodes with less capacity will get fewer tablets when in balanced state. This PR adds a new default presentation mode which scales tablet size by its storage utilization so that tablets which have equal shard utilization take equal space on the graph. To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about load balancer's view on per-node capacity. It can also serve as a debugging interface to get a view of current balance according to the load-balancer. Closes scylladb/scylladb#23584 * github.com:scylladb/scylladb: tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization tablet-mon.py: Center tablet id text properly in the vertical axis tablet-mon.py: Show migration stage tag in table mode only when migrating virtual-tables: Introduce system.load_per_node virtual_tables: memtable_filling_virtual_table: Propagate permit to execute() docs: virtual-tables: Fix instructions service: tablets: Keep load_stats inside tablet_allocator	2025-04-10 14:59:08 +03:00
Avi Kivity	885838fc46	Merge 'scylla-gdb.py: improve scylla repairs command' from Botond Dénes Make output more readable by: * group follower/master repair instances separately * split repair details into one line for repair summary, then one line for each host info * add indentation to make the output easier to follow Also add `-m\|--memory` option to calculate memory usage of repair buffers. Example output: (gdb) scylla repairs -m Repairs for which this node is leader: (repair_meta) 0x60503ab7f7b0: {id: 19197, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 30, memory: 48208512}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started (repair_meta) 0x60503717f7b0: {id: 19211, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 63863265}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: c4936a19-41da-4260-971e-651445d740fd, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished (repair_meta) 0x60502ddff7b0: {id: 19231, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::row_level_stop_started host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::row_level_stop_finished (repair_meta) 0x60501db3f7b0: {id: 19234, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, 
memory: 0}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_sync_boundary_started host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::get_sync_boundary_finished (repair_meta) 0x60501c81f7b0: {id: 19236, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 42696821}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started (repair_meta) 0x60503f65f7b0: {id: 19238, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 47785163}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished Repairs for which this node is follower: Closes scylladb/scylladb#23075 * github.com:scylladb/scylladb: scylla-gdb.py: improve scylla repairs commadn scylla-gdb.py: seastar_lw_shared_ptr: add __nonzero__ and __bool__ scylla-gdb.py: introduce managed_bytes	2025-04-10 14:52:43 +03:00
Dani Tweig	e92740cc2b	.github: update bug_report.yml Perform a yaml "face lift" on the old bug report md template, making bug reporting more efficient. - Add dedicated textarea fields for problem description and expected behavior - Include pre-filled placeholders to guide issue reporting - Add formatted log output section with shell syntax highlighting Closes: #21532	2025-04-10 14:26:00 +03:00
Pavel Emelyanov	88318d3b50	topology_coordinator: Use shorter fault-injection overloads There are a few places that want to pause until a message is received from the test. There's a convenient one-line sugar to do it. One test needs to update its expectations about the log message that appears when scylla steps on it and actually starts waiting. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23390	2025-04-10 14:05:46 +03:00
Botond Dénes	d67202972a	mutation/frozen_mutation: frozen_mutation_consumer_adaptor: fix end-of-partition handling This adaptor adapts a mutation reader pausable consumer to the frozen mutation visitor interface. The pausable consumer protocol allows the consumer to skip the remaining parts of the partition and resume the consumption with the next one. To do this, the consumer just has to return stop_iteration::yes from one of the consume() overloads for clustering elements, then return stop_iteration::no from consume_end_of_partition(). Due to a bug in the adaptor, this sequence leads to terminating the consumption completely -- so any remaining partitions are also skipped. This protocol implementation bug has user-visible effects, when the only user of the adaptor -- read repair -- happens during a query which has limitations on the amount of content in each partition. There are two such queries: select distinct ... and select ... per partition limit. When converting the repaired mutation to a query result, these queries will trigger the skip sequence in the consumer and due to the above described bug, will skip the remaining partitions in the results, omitting these from the final query result. This patch fixes the protocol bug, the return value of the underlying consumer's consume_end_of_partition() is now respected. A unit test is also added which reproduces the problem both with select distinct ... and select ... per partition limit. Follow-up work: * frozen_mutation_consumer_adaptor::on_end_of_partition() calls the underlying consumer's on_end_of_stream(), so when consuming multiple frozen mutations, the underlying's on_end_of_stream() is called for each partition. This is incorrect but benign. * Improve documentation of mutation_reader::consume_pausable(). Fixes: #20084 Closes scylladb/scylladb#23657	2025-04-10 13:19:57 +03:00
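The skip-and-resume sequence described in the message above can be modelled in a few lines of Python. This is a simplified model of the protocol, not the C++ adaptor: the class and function names are hypothetical, and the `respect_end_of_partition=False` branch only imitates the observable effect of the bug (the in-partition stop ends the whole stream).

```python
from enum import Enum


class Stop(Enum):
    NO = 0
    YES = 1


class FirstRowConsumer:
    """Keeps one row per partition, like a per-partition limit of 1."""

    def __init__(self):
        self.result = []

    def consume_row(self, pk, row):
        self.result.append((pk, row))
        return Stop.YES  # skip the rest of this partition

    def consume_end_of_partition(self):
        return Stop.NO  # but carry on with the next partition


def consume_pausable(partitions, consumer, respect_end_of_partition=True):
    for pk, rows in partitions:
        stopped = Stop.NO
        for row in rows:
            stopped = consumer.consume_row(pk, row)
            if stopped is Stop.YES:
                break
        end = consumer.consume_end_of_partition()
        if respect_end_of_partition:
            # Fixed behaviour: only the end-of-partition answer ends the stream.
            if end is Stop.YES:
                return
        elif stopped is Stop.YES:
            # Modelled bug: the in-partition stop terminates everything.
            return


def run(respect_end_of_partition):
    consumer = FirstRowConsumer()
    partitions = [("a", [1, 2]), ("b", [3]), ("c", [4, 5])]
    consume_pausable(partitions, consumer, respect_end_of_partition)
    return consumer.result
```

In this model, `run(True)` yields one row from each of the three partitions, while `run(False)` stops after the first partition, which mirrors the missing partitions in the query result.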
Pavel Emelyanov	4de48a9d24	encryption: Mark parts of encrypted_data_sink private Nowadays the whole class is public, but it's not in fact such. Remove the SUDDENLY unused private _flush_pos member to please the compiler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23677	2025-04-10 12:42:57 +03:00
Dawid Mędrek	0ed21d9cc1	test/cluster/test_tablets.py: Fix erroneous test indentation Some of the statements in the test are not indented properly and, as a result, are never run. It's most likely a small mistake, so let's fix it. Closes scylladb/scylladb#23659	2025-04-10 11:06:01 +03:00
Nadav Har'El	258213f73b	Merge 'Alternator batch count histograms' from Amnon Heiman This series adds a histogram for get and write batch sizes. It uses the estimated_histogram implementation which starts from 1 with 1.2 exponential factor, which works extremely tight to 20 but still covers all the way to 100. Histograms will be reported per node. Backport to 2025.1 so we'll have information about user batch size limitation Closes scylladb/scylladb#23379 * github.com:scylladb/scylladb: alternator: Add tests for the batch items histograms alternator: Add histogram for batch item count	2025-04-09 22:41:14 +03:00
Tomasz Grabiec	b5211cca85	Merge 'tablets: rebuild: use repair for tablet rebuild' from Aleksandra Martyniuk Currently, when we rebuild a tablet, we stream data from all replicas. This creates a lot of redundancy, wastes bandwidth and CPU resources. In this series, we split the streaming stage of tablet rebuild into two phases: first we stream tablet's data from only one replica and then repair the tablet. Fixes: https://github.com/scylladb/scylladb/issues/17174. Needs backport to 2025.1 to prevent out of space during streaming Closes scylladb/scylladb#23187 * github.com:scylladb/scylladb: test: add test for rebuild with repair locator: service: move to rebuild_v2 transition if cluster is upgraded locator: service: add transition to rebuild_repair stage for rebuild_v2 locator: service: add rebuild_repair tablet transition stage locator: add maybe_get_primary_replica locator: service: add rebuild_v2 tablet transition kind gms: add REPAIR_BASED_TABLET_REBUILD cluster feature	2025-04-09 21:35:37 +02:00
Avi Kivity	ed3e4f33fd	Merge 'generic_server: throttle and shed incoming connections according to semaphore limit' from Marcin Maliszkiewicz Adds new live updatable config: uninitialized_connections_semaphore_cpu_concurrency. It should help to reduce cpu usage by limiting cpu concurrency for new connections. As a last resort when those connections are waiting for initial processing too long (over 1m) they are shed. New connections_shed and connections_blocked metrics are added for tracking. Testing: - manually via simple program creating high number of connection and constantly re-connecting - added benchmark Following are benchmark results: Before: ``` > build/release/test/perf/perf_generic_server --smp=1 170101.41 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4695 insns/op, 3178 cycles/op, 0 errors) [...] throughput: mean=173850.06 standard-deviation=1844.48 median=174509.66 median-absolute-deviation=874.23 maximum=175087.49 minimum=170588.54 instructions_per_op: mean=4725.59 standard-deviation=13.35 median=4729.38 median-absolute-deviation=12.49 maximum=4738.61 minimum=4709.96 cpu_cycles_per_op: mean=3135.08 standard-deviation=32.13 median=3122.68 median-absolute-deviation=22.29 maximum=3179.38 minimum=3103.15 ``` After: ``` > build/release/test/perf/perf_generic_server --smp=1 167373.19 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4821 insns/op, 3371 cycles/op, 0 errors) [...] throughput: mean= 171199.55 standard-deviation=2484.58 median= 171667.06 median-absolute-deviation=2087.63 maximum=173689.11 minimum=167904.76 instructions_per_op: mean= 4801.90 standard-deviation=16.54 median= 4796.78 median-absolute-deviation=9.32 maximum=4830.71 minimum=4789.81 cpu_cycles_per_op: mean= 3245.26 standard-deviation=32.28 median= 3230.44 median-absolute-deviation=16.52 maximum=3297.39 minimum=3215.62 ``` The patch adds around 67 insns/op so it's effect on performance should be negligible. 
Fixes: https://github.com/scylladb/scylladb/issues/22844 Closes scylladb/scylladb#22828 * github.com:scylladb/scylladb: transport: move on_connection_close into connection destructor test: perf: make aggregated_perf_results formatting more human readable transport: add blocked and shed connection metrics generic_server: throttle and shed incoming connections according to semaphore limit generic_server: add data source and sink wrappers bookkeeping network IO generic_server: coroutinize part of server::do_accepts test: add benchmark for generic_server test: perf: add option to count multiple ops per time_parallel iteration generic_server: add semaphore for limiting new connections concurrency generic_server: add config to the constructor generic_server: add on_connection_ready handler	2025-04-09 21:41:38 +03:00
Tomasz Grabiec	5b5ada1743	tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization Per-node capacity is queried from system.load_per_node Tablet height in each node is scaled so that equal height = equal node utilization. The nominal height is assigned to the node which has the smallest capacity, so nodes with higher capacity will have smaller tablets than normal.	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	217184f16b	tablet-mon.py: Center tablet id text properly in the vertical axis Was too low due to not subtracting frame size from height	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	20cac72056	tablet-mon.py: Show migration stage tag in table mode only when migrating It's the gray bar at the top of the tablet. It's not showing useful information when tablet is not migrating.	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	0b9a75d7b6	virtual-tables: Introduce system.load_per_node Can be used to query per-node stats about load as seen by the load balancer. In particular, node's capacity will be used by tablet-mon.py to scale tablet columns so that equal height is equal node utilization.	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	668094dc58	virtual_tables: memtable_filling_virtual_table: Propagate permit to execute() So that population can access read's timeout and mark the permit as awaiting.	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	34beaa30b5	docs: virtual-tables: Fix instructions	2025-04-09 20:21:51 +02:00
Tomasz Grabiec	76bc11c78c	service: tablets: Keep load_stats inside tablet_allocator So that virtual tables can pick them up. It's a better place to keep them than in topology_coordinator.	2025-04-09 20:21:51 +02:00
Pavel Emelyanov	d9853efa7c	Merge '[Out-of-space prevention] db: backup: prioritize sstables that were deleted from the table' from Benny Halevy The motivation behind this change is to free up disk space as early as possible. The reason is that a snapshot locks the space of all SSTables in the snapshot, and deleting them from the table, for example, by compaction, or tablet migration, won't free up their capacity until they are uploaded to object storage and deleted from the snapshot. This series adds prioritization of deleted sstables in two cases: First, after the snapshot dir is processed, the list of SSTable generations is cross-referenced with the list of SSTables presently in the table and any generation that is not in the table is prioritized to be uploaded earlier. In addition, a subscription mechanism was added to sstables_manager and it is used in backup to prioritize SSTables that get deleted from the table directory during backup. This is particularly important when backup happens during high disk utilization (e.g. 90%). Without it, even if the cluster is scaled up and tablets are migrated away from the full nodes to new nodes, tablet cleanup might not free any space if all the tablet sstables are hardlinked to the snapshot taken for backup. 
* Enhancement, no backport needed Closes scylladb/scylladb#23241 * github.com:scylladb/scylladb: db: snapshot: backup_task: prioritize sstables deleted during upload sstables_manager: add subscriptions db: snapshot: backup_task: limit concurrency sstables: directory_semaphore: expose get_units db: snapshot: backup_task: add sharded sstables_manager database: expose get_sstables_manager(schema) db: snapshot: backup_task: do_backup: prioritize sstables that are already deleted from the table db: snapshot-ctl: pass table_id to backup_task db: snapshot-ctl: expose sharded db() getter db: snapshot: backup_task: do_backup: organize components by sstable generation db: snapshot: coroutinize backup_task db: snapshot: backup_task: refactor backup_file out of uploads_worker db: snapshot: backup_task: refactor uploads_worker out of do_backup db: snapshot: backup_task: process_snapshot_dir: initialize total progress utils/s3: upload_progress: init members to 0 db: snapshot: backup_task: do_backup: refactor process_snapshot_dir db: snapshot: backup_task: keep expection as member	2025-04-09 15:32:11 +03:00
Marcin Maliszkiewicz	ce18909688	transport: move on_connection_close into connection destructor To make the code more robust by ensuring closing code is always executed.	2025-04-09 13:50:19 +02:00
Pavel Emelyanov	35dfc8c782	Merge 'audit: add semaphore to audit_syslog_storage_helper' from Andrzej Jackowski audit_syslog_storage_helper::syslog_send_helper uses Seastar's net::datagram_channel to write to syslog device (usually /dev/log). However, datagram_channel.send() is not fiber-safe (ref seastar#2690), so unserialized use of send() results in packets overwriting its state. This, in turn, causes a corruption of audit logs, as well as assertion failures. To work around the problem, a new semaphore is introduced in audit_syslog_storage_helper. As storage_helper is a member of sharded audit service, the semaphore allows for one datagram_channel.send() on each shard. Each audit_syslog_storage_helper stores its own datagram_channel, therefore concurrent sends to datagram_channel are eliminated. This change: - Moved syslog_send_helper to audit_syslog_storage_helper - Coroutinize audit_syslog_storage_helper - Introduce semaphore with count=1 in audit_syslog_storage_helper. See https://github.com/scylladb/scylla-dtest/pull/5749 for related dtest Fixes: scylladb#22973 Backport to 2025.1 should be considered, as https://github.com/scylladb/scylladb/issues/22973 is known to cause crashes of 2025.1. Closes scylladb/scylladb#23464 * github.com:scylladb/scylladb: audit: add semaphore to audit_syslog_storage_helper audit: corutinize audit_syslog_storage_helper audit: moved syslog_send_helper to audit_syslog_storage_helper	2025-04-09 12:39:06 +03:00
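The count=1 semaphore idea from the message above can be sketched with Python's asyncio instead of Seastar. `FakeChannel` and `SyslogHelper` are hypothetical stand-ins, not the real `datagram_channel` or `audit_syslog_storage_helper`; the sketch only shows how serializing calls protects a sender whose state is not safe to share between concurrent callers.

```python
import asyncio


class FakeChannel:
    """Stand-in for a sender that must not be called concurrently."""

    def __init__(self):
        self._buf = None
        self.sent = []

    async def send(self, msg):
        self._buf = msg         # state shared by all callers
        await asyncio.sleep(0)  # suspension point: another send() may run here
        self.sent.append(self._buf)


class SyslogHelper:
    """Owns one channel and allows a single send() in flight at a time."""

    def __init__(self, channel):
        self._channel = channel
        self._sem = asyncio.Semaphore(1)

    async def write(self, msg):
        async with self._sem:
            await self._channel.send(msg)


async def run(serialized):
    channel = FakeChannel()
    helper = SyslogHelper(channel)
    send = helper.write if serialized else channel.send
    await asyncio.gather(*(send(f"m{i}") for i in range(3)))
    return channel.sent


if __name__ == "__main__":
    print(asyncio.run(run(serialized=False)))  # messages overwrite each other
    print(asyncio.run(run(serialized=True)))   # every message is delivered
```

Because each helper owns its own channel and its own semaphore, the serialization is local to that helper, which corresponds to the per-shard arrangement the message describes.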
Marcin Maliszkiewicz	619944555f	test: perf: make aggregated_perf_results formatting more human readable Before: throughput: mean=170728.58 standard-deviation=1921.76 median=171084.16 median-absolute-deviation=1501.58 maximum=172913.36 minimum=167288.97 instructions_per_op: mean=4685.89 standard-deviation=12.46 median=4683.92 median-absolute-deviation=9.68 maximum=4706.53 minimum=4666.70 cpu_cycles_per_op: mean=3090.94 standard-deviation=52.69 median=3103.43 median-absolute-deviation=24.55 maximum=3192.99 minimum=3003.00 After: throughput: mean= 168224.81 standard-deviation=854.48 median= 168829.02 median-absolute-deviation=604.21 maximum=168829.02 minimum=167620.60 instructions_per_op: mean= 4837.02 standard-deviation=20.89 median= 4851.79 median-absolute-deviation=14.77 maximum=4851.79 minimum=4822.24 cpu_cycles_per_op: mean= 3271.42 standard-deviation=46.29 median= 3304.16 median-absolute-deviation=32.73 maximum=3304.16 minimum=3238.69	2025-04-09 10:49:20 +02:00
Marcin Maliszkiewicz	599f4d312b	transport: add blocked and shed connection metrics This adds some visibility into connection storm mitigations added in following commits.	2025-04-09 10:49:18 +02:00
Marcin Maliszkiewicz	26518704ab	generic_server: throttle and shed incoming connections according to semaphore limit If we have uninitialized_connections_semaphore_cpu_concurrency (default 2) connections being processed we start delaying the acceptance of new connections. Connections which are in network IO state are not counted towards this limit and they can go to cpu phase without blocking. So it can happen that we process more concurrent new connections but that's a necessary tradeoff to make progress during storm without implementing more advanced machinery (i.e. priority queue).	2025-04-09 10:48:51 +02:00
Marcin Maliszkiewicz	9f5de2c256	generic_server: add data source and sink wrappers bookkeeping network IO They release semaphore units when we start network IO and acquire them when we enter cpu intensive phase. We use consume() so it doesn't block because we don't want connections we started processing to compete with new incoming connections. Otherwise during connection storm we wouldn't make much progress. There will be a simplification here as we'll treat disk IO (if there is any) as cpu work.	2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz	c56116372e	generic_server: coroutinize part of server::do_accepts	2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz	719d04d501	test: add benchmark for generic_server Changes in configure.py are needed because we don't want to embed this benchmark in scylla binary as perf_simple_query or perf_alternator, it doesn't directly translate to Scylla performance but we want to use aggregated_perf_results for precise cpu measurements so we need different dependencies.	2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz	b957cedace	test: perf: add option to count multiple ops per time_parallel iteration	2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz	ed82bede39	generic_server: add semaphore for limiting new connections concurrency It will be used in following commits.	2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz	33122d3f93	generic_server: add config to the constructor	2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz	474e84199c	generic_server: add on_connection_ready handler This patch cleans the code a bit so that ready state is set in a single place. And adds handler which will allow adding logic when connection is made ready, this will be added in the following commits.	2025-04-09 10:30:58 +02:00
Benny Halevy	1ab3ec061b	db: snapshot: backup_task: prioritize sstables deleted during upload subscribe on each shard's sstables_manager to get callback notifications and keep the generation numbers of deleted sstables in a vector so they can be prioritized first to free up their disk space as soon as possible. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	d8b0c661e4	sstables_manager: add subscriptions Allow other submodules to subscribe for added/deleted notifications. This will be used in a later patch to prioritize unlinked sstables for backup. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	d3b4874ec3	db: snapshot: backup_task: limit concurrency Otherwise, once all the background tasks are created we have no way to reorder the queue. Fixes #23239 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	e60fcc58b7	sstables: directory_semaphore: expose get_units To be used by a following patch for backup concurrency control. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	b7807ec165	db: snapshot: backup_task: add sharded sstables_manager Get a reference to the table's sstables_manager on each shard. This will be used by later patches to limit concurrency and to subscribe for notifications. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	b270d552fb	database: expose get_sstables_manager(schema) Return either the system or user sstables manager. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	9a4b4afade	db: snapshot: backup_task: do_backup: prioritize sstables that are already deleted from the table Detect SSTables that are already deleted from the table in process_snapshot_dir when their number_of_links is equal to 1. Note that the SSTable may be hard-linked by more than one snapshot, so even after it is deleted from the table, its number of links would be greater than one. In that case, however, uploading it earlier won't help to free-up its capacity since it is still held by other snapshots. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
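The hard-link check described in the message above can be illustrated with plain filesystem calls. This is a minimal sketch assuming a POSIX filesystem with hard links; the directory layout, file name, and function names are hypothetical and are not taken from backup_task.

```python
import os
import tempfile


def deleted_from_table(snapshot_file: str) -> bool:
    """A snapshot file with a single link is no longer present in the table
    directory (and is not held by any other snapshot either)."""
    return os.stat(snapshot_file).st_nlink == 1


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as root:
        table_dir = os.path.join(root, "table")
        snap_dir = os.path.join(root, "table", "snapshots", "backup")
        os.makedirs(snap_dir)

        # Hypothetical component file name, used only for illustration.
        table_file = os.path.join(table_dir, "me-1-big-Data.db")
        snap_file = os.path.join(snap_dir, "me-1-big-Data.db")
        with open(table_file, "wb") as f:
            f.write(b"sstable bytes")

        os.link(table_file, snap_file)  # taking a snapshot hard-links the file
        before = deleted_from_table(snap_file)

        os.unlink(table_file)           # e.g. compaction removes it from the table
        after = deleted_from_table(snap_file)
        return before, after
```

As the message notes, a file that is also linked from another snapshot keeps a link count above one after it leaves the table, so this check would not prioritize it.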
Benny Halevy	4b8699e278	db: snapshot-ctl: pass table_id to backup_task To be used by the following patches to get to the table's sstables_manager for concurrency control and for notifications (TBD). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	d646603bfd	db: snapshot-ctl: expose sharded db() getter Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Benny Halevy	63bc1d4626	db: snapshot: backup_task: do_backup: organize components by sstable generation Do not rely on the snapshot directory listing order. This will become useful for prioritizing unlinked sstables in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:06 +03:00
Benny Halevy	a731c1b33d	db: snapshot: coroutinize backup_task Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:49:53 +03:00
Benny Halevy	189075b885	db: snapshot: backup_task: refactor backup_file out of uploads_worker Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:49:53 +03:00
Benny Halevy	e3ba425c2b	db: snapshot: backup_task: refactor uploads_worker out of do_backup Let do_backup deal only with the high level coordination. A future patch will follow this structure to run uploads_worker on each shard. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:49:53 +03:00
Benny Halevy	ff25b4c97f	db: snapshot: backup_task: process_snapshot_dir: initialize total progress Now we can calculate in advance how much data we intend to upload before we start uploading it. This will also be used later when uploading in parallel on all shards, so we can collect the progress from all shards in get_progress(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:49:51 +03:00
Benny Halevy	6da215e8af	utils/s3: upload_progress: init members to 0 For default construction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:44:52 +03:00
Benny Halevy	70307e8120	db: snapshot: backup_task: do_backup: refactor process_snapshot_dir Do preliminary listing of the snapshot dir. While at it, simplify the loop as follows: The optional directory_entry returned by snapshot_dir_lister.get() can be checked as part of the loop condition expression, and with that, error handling can be simplified and moved out of the loop body. A followup patch will organize the component files by their sstable generation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> db: snapshot: backup_task: process_snapshot_dir: simplify loop Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:44:52 +03:00
Benny Halevy	8a4b6b9614	db: snapshot: backup_task: keep exception as member As part of refactoring do_backup(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:44:52 +03:00
Botond Dénes	b65a76ab6f	Merge 'nodetool: cluster repair: add a command to repair tablet keyspaces' from Aleksandra Martyniuk Add a new nodetool cluster super-command. Add nodetool cluster repair command to repair tablet keyspaces. It uses the new /storage_service/tablets/repair API. The nodetool cluster repair command allows you to specify the keyspace and tables to be repaired. A cluster repair of many tables will request /storage_service/tablets/repair and wait for the result synchronously for each table. The nodetool repair command, which was previously used to repair keyspaces of any type, now repairs only vnode keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/22409. Needs backport to 2025.1 that introduces the new tablet repair API Closes scylladb/scylladb#22905 * github.com:scylladb/scylladb: docs: nodetool: update repair and add tablet-repair docs test: nodetool: add tests for cluster repair command nodetool: add cluster repair command nodetool: repair: extract getting hosts and dcs to functions nodetool: repair: warn about repairing tablet keyspaces nodetool: repair: move keyspace_uses_tablets function	2025-04-09 08:20:34 +03:00
Botond Dénes	5f697d373f	test/cqlpy/test_tools.py: use AIO backend in scylla-sstable query tests These tests seem to be hitting the io-uring bug in the kernel from time-to-time, making CI flaky. Force the use of the AIO backend in these tests, as a workaround until fixed kernels (>=6.8.13) are available. Fixes: #23517 Fixes: #23546 Closes scylladb/scylladb#23648	2025-04-08 20:29:58 +03:00
Benny Halevy	dfdca2d84e	locator: topology: drop unused calculate_datacenters Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23647	2025-04-08 19:04:56 +03:00
Tomasz Grabiec	06b49bdf69	Merge 'row_cache: don't garbage-collect tombstones which cover data in memtables' from Botond Dénes The row cache can garbage-collect tombstones in two places: 1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it; 2) During reads - reads now compact data including garbage collection; In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables. This PR includes fixes for (2), which were not handled at all currently. (1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included. Fixes: https://github.com/scylladb/scylladb/issues/23291 Fixes: https://github.com/scylladb/scylladb/issues/23252 The fix will need backport to all live release. Closes scylladb/scylladb#23255 * github.com:scylladb/scylladb: test/boost/row_cache_test: add memtable overlap check tests replica/table: add error injection to memtable post-flush phase utils/error_injection: add a way to set parameters from error injection points test/cluster: add test_data_resurrection_in_memtable.py test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts replica/mutation_dump: don't assume cells are live replica/database: do_apply() add error injection point replica: improve memtable overlap checks for the cache replica/memtable: add is_merging_to_cache() db/row_cache: add overlap-check for cache tombstone garbage collection mutation/mutation_compactor: copy key passed-in to consume_new_partition()	2025-04-08 17:26:58 +02:00
Andrzej Jackowski	c12f976389	audit: add semaphore to audit_syslog_storage_helper audit_syslog_storage_helper::syslog_send_helper uses Seastar's net::datagram_channel to write to syslog device (usually /dev/log). However, datagram_channel.send() is not fiber-safe (ref seastar#2690), so unserialized use of send() results in packets overwriting its state. This, in turn, causes a corruption of audit logs, as well as assertion failures. To work around the problem, a new semaphore is introduced in audit_syslog_storage_helper. As storage_helper is a member of sharded audit service, the semaphore allows for one datagram_channel.send() on each shard. Each audit_syslog_storage_helper stores its own datagram_channel, therefore concurrent sends to datagram_channel are eliminated. This change: - Introduce semaphore with count=1 in audit_syslog_storage_helper. - Add 1 hour timeout to the semaphore, so semaphore stalls are failed just as all other syslog auditing failures. Fixes: scylladb#22973	2025-04-08 16:24:42 +02:00
Andrzej Jackowski	889fd5bc9f	audit: coroutinize audit_syslog_storage_helper This change: - Coroutinize audit_syslog_storage_helper::syslog_send_helper - Coroutinize audit_syslog_storage_helper::start - Coroutinize audit_syslog_storage_helper::write	2025-04-08 16:24:42 +02:00
Andrzej Jackowski	dbd2acd2be	audit: moved syslog_send_helper to audit_syslog_storage_helper This change: - Make syslog_send_helper() a method of audit_syslog_storage_helper, so syslog_send_helper() can access private members of audit_syslog_storage_helper in the next commits. - Remove unneeded syslog_send_helper() arguments that now are class members.	2025-04-08 16:24:42 +02:00
Benny Halevy	f702adf6a5	main: fix typo in tablet allocator checkpoint message Introduced in `b6705ad48b` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23211	2025-04-08 17:19:41 +03:00
Botond Dénes	583a813d17	docs/dev/tombstone.md: fix link to ddl.html Closes scylladb/scylladb#23622	2025-04-08 16:18:50 +03:00
Anna Stuchlik	93a7b3ac1d	doc: add enabling consistent topology updates to the 2025.1 upgrade guide-from-2024 This commit adds the procedure to enable consistent topology updates for upgrades from 2024.1 to 2025.1 (or from 2024.2 to 2025.1 if the feature wasn't enabled after upgrading from 2024.1 to 2024.2). Fixes https://github.com/scylladb/scylladb/issues/23650 Closes scylladb/scylladb#23651	2025-04-08 15:38:00 +03:00
Robert Bindar	4e3eb2fdac	Move direct_failure_detector from root to service/ direct_failure_detector used to be used by gms/ as well, but that's not the case anymore, so raft/ is the only user. Fixes #23133 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#23248	2025-04-08 13:03:24 +03:00
Aleksandra Martyniuk	372b562f5e	test: add test for rebuild with repair	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	acd32b24d3	locator: service: move to rebuild_v2 transition if cluster is upgraded If cluster is upgraded to version containing rebuild_v2 transition kind, move to this transition kind instead of rebuild.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	eb17af6143	locator: service: add transition to rebuild_repair stage for rebuild_v2 Modify write_both_read_old and streaming stages in rebuild_v2 transition kind: write_both_read_old moves to rebuild_repair stage and streaming stage streams data only from one replica.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	4a847df55c	locator: service: add rebuild_repair tablet transition stage Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. rebuild_repair is a stage that will be used to perform the repair phase. It executes the tablet repair on tablet_info::replicas. A primary replica out of migration_streaming_info::read_from is the repair master. If the repair succeeds, we move to streaming tablet transition stage, and to cleanup_target - if it fails. The repair bypasses the tablet repair scheduler and it does not update the repair_time. A transition to the rebuild_repair stage will be added in the following patches.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	5d6041617b	locator: add maybe_get_primary_replica Add maybe_get_primary_replica to choose a primary replica out of custom replica set.	2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk	ed7b8bb787	locator: service: add rebuild_v2 tablet transition kind Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. To differentiate the two streaming methods, a new tablet transition kind - rebuild_v2 - is added. The transitions and stages for rebuild_v2 transition kind will be added in the following patches.	2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk	b80e957a40	gms: add REPAIR_BASED_TABLET_REBUILD cluster feature	2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk	9769d7a564	docs: nodetool: update repair and add tablet-repair docs	2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk	02fb71da42	test: nodetool: add tests for cluster repair command	2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk	8bbc5e8923	nodetool: add cluster repair command Add a new nodetool cluster repair command that repairs tablet keyspaces. Users may specify keyspace and tables that they want to repair. If the keyspace and tables are not specified, all tablet keyspaces are repaired. The command calls the new tablet repair API /storage_service/tablets/repair.	2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk	aa3973c850	nodetool: repair: extract getting hosts and dcs to functions	2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk	b81c81c7f4	nodetool: repair: warn about repairing tablet keyspaces Warn about an attempt to repair tablet keyspace with nodetool repair. A nodetool cluster repair command to repair tablet keyspaces will be added in the following patches.	2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk	cbde835792	nodetool: repair: move keyspace_uses_tablets function	2025-04-08 09:13:14 +02:00
Yaron Kaikov	2dc7ea366b	.github: Make "make-pr-ready-for-review" workflow run in base repo in `57683c1a50` we fixed the `token` error, but removed the checkout part, which now causes the following error ``` failed to run git: fatal: not a git repository (or any of the parent directories): .git ``` Adding the repo checkout stage to avoid this error Fixes: https://github.com/scylladb/scylladb/issues/22765 Closes scylladb/scylladb#23641	2025-04-08 09:30:18 +03:00
Raphael S. Carvalho	0f59deffaa	replica: Fix truncate and drop table after tablet migration happens When running those operations after a tablet replica is migrated away from a shard, an assert can fail resulting in a crash. Status quo (around the assert in truncate procedure): 1) Highest RP seen by table is saved in low_mark, and the current time in low_mark_at. 2) Then compaction is disabled in order to not mix data written before truncate, and data written later. 3) Then memtable is flushed in order for the data written before truncate to be available in sstables and then removed. 4) Now, current time is saved in truncated_at, which is supposedly the time of truncate to decide which sstables to remove. Note: truncated_at is likely above low_mark_at due to steps 2 and 3. The interesting part of the assert is: (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp) Note: RP in the assert above is the highest RP among all sstables generated before truncated_at. RP is retrieved by table::discard_sstables(). If truncated_at > low_mark_at, maybe newer data was written during steps 2 and 3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with RP > low_mark. So assert's 2nd condition is there to defend against the scenario above. truncated_at and low_mark_at uses millisecond granularity, so even if truncated_at == low_mark_at, data could have been written in steps 2 and 3 (during same MS window), failing the assert. This is fragile. Reproducer: To reproduce the problem, truncated_at must be > low_mark_at, which can easily happen with both drop table and truncate due to steps 2 and 3. If a shard has 2 or more tablets, the table's highest RP refer to just one tablet in that shard. If the tablet with the highest RP is migrated away, then the sstables in that shard will have lower RP than the recorded highest RP (it's a table wide state, which makes sense since CL is shared among tablets). 
So when either drop table or truncate runs, low_mark will be potentially bigger than highest RP retrieved from sstables. Proposed solution: The current assert is hacked to not fail if writes sneak in, during steps 2 and 3, but it's still fragile and seems not to serve its real purpose, since it's allowing for RP > low_mark. We should be able to say that low_mark >= RP, as a way of asserting we're not leaving data targeted by truncate behind (or that we're not removing the wrong data). But the problem is that we're saving low_mark in step 1, before preparation steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying all data written so far is targeted for removal. But as of today, low_mark refers to all data written up to step 1. So low_mark is now only one set before issuing flush, and also accounts for all potentially flushed data. Fixes #18059. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23560	2025-04-08 07:32:58 +03:00
Botond Dénes	0d39091df2	test/boost/row_cache_test: add memtable overlap check tests Similar to test/cluster/test_data_resurrection_in_memtable.py but works on a single node and uses more low-level mechanism. These tests can also reproduce more advanced scenarios, like concurrent reads, with some reading from flushed memtables.	2025-04-08 00:11:36 -04:00
Botond Dénes	6c1f6427b3	replica/table: add error injection to memtable post-flush phase After the memtable was flushed to disk, but before it is merged to cache. The injection point will only be active for the table specified in the "table_name" injection parameter.	2025-04-08 00:11:36 -04:00
Botond Dénes	f7938e3f8b	utils/error_injection: add a way to set parameters from error injection points With this, now it is possible to have two-way communication between the error injection point and its enabler. The test can enable the error injection point, then wait until it is hit, before proceeding.	2025-04-08 00:11:36 -04:00
Botond Dénes	34b18d7ef4	test/cluster: add test_data_resurrection_in_memtable.py Reproducers for #23252 and #23291 -- cache garbage collecting tombstones resurrecting data in the memtable.	2025-04-08 00:11:36 -04:00
Botond Dénes	e5afd9b5fb	test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts Such that a given index in the returned hosts refers to the same underlying Scylla instance, as the same index in the passed-in nodes list. This is what users of this method intuitively expect, but currently the returned hosts list is unordered (has random order).	2025-04-08 00:11:36 -04:00
Botond Dénes	df09b3f970	replica/mutation_dump: don't assume cells are live Currently the dumper unconditionally extracts the value of atomic cells, assuming they are live. This doesn't always hold of course and attempting to get the value of a dead cell will lead to marshalling errors. Fix by checking is_live() before attempting to get the cell value. Fix for both regular and collection cells.	2025-04-08 00:11:36 -04:00
Botond Dénes	cb76cafb60	replica/database: do_apply() add error injection point So writes (to user tables) can be failed on a replica, via error injection. Should simplify tests which want to create differences in what writes different replicas receive.	2025-04-08 00:11:35 -04:00
Botond Dénes	d126ea09ba	replica: improve memtable overlap checks for the cache The current memtable overlap check that is used by the cache -- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only checks the active memtable, so memtables which are either being flushed or are already flushed and also have active reads against them do not participate in the overlap check. This can result in temporary data resurrection, where a cache read can garbage-collect a tombstone which still covers data in a flushing or flushed memtable, which still has active reads against it. To prevent this, extend the overlap check to also consider all of the memtable list. Furthermore, memtable_list::erase() now places the removed (flushed) memtable in an intrusive list. These entries are alive only as long as there are readers still keeping an `lw_shared_ptr<memtable>` alive. This list is now also consulted on overlap checks.	2025-04-08 00:11:35 -04:00
Botond Dénes	7e600a0747	replica/memtable: add is_merging_to_cache() And set it when the memtable is merged to cache.	2025-04-08 00:11:35 -04:00
Botond Dénes	6b5b563ef7	db/row_cache: add overlap-check for cache tombstone garbage collection The cache should not garbage-collect tombstones which cover data in the memtable. Add overlap checks (get_max_purgeable) to garbage collection to detect tombstones which cover data in the memtable and to prevent their garbage collection.	2025-04-08 00:11:35 -04:00
Botond Dénes	c2518cdf1a	mutation/mutation_compactor: copy key passed-in to consume_new_partition() This doesn't introduce additional work for single-partition queries: the key is copied anyway on consume_end_of_stream(). Multi-partition reads and compaction are not that sensitive to the additional copy added. This change fixes a bug in the compacting_reader: currently the reader passes _last_uncompacted_partition_start.key() to the compactor's consume_new_partition(). When the compactor emits enough content for this partition, _last_uncompacted_partition_start is moved from to emit the partition start; this makes the key reference passed to the compactor corrupt (it refers to a moved-from value). This in turn means that subsequent GC checks done by the compactor will be done with a corrupt key and therefore can result in tombstones being garbage-collected while they still cover data elsewhere (data resurrection). The compacting reader is violating the API contract and normally the bug should be fixed there. We make an exception here because doing the fix in the mutation compactor better aligns with our future plans: * The fix simplifies the compactor (gets rid of _last_dk). * Prepares the way to get rid of the consume API used by the compactor.	2025-04-08 00:11:35 -04:00
Avi Kivity	8d2a41db82	Merge "Fixes for gossiper conversion to host id" from Gleb " The series contains fixes to gossiper conversion to host id. There are two fixes where we could erroneously send outdated entry in a gossiper message and a fix for force_remove_endpoint which was not converted to work on host id and this caused it to not delete the entry in some cases (in replace with the same ip case). " * 'gleb/host-id-fixes' of github.com:scylladb/scylla-dev: gossiper: send newest entry in a digest message gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter gossiper: move force_remove_endpoint to work on host id gossiper: do not send outdated endpoint in gossiper round	2025-04-07 17:04:28 +03:00
Michał Chojnowski	827d774241	test_sstable_compression_dictionaries: reproduce an internal error in debug logging Extend one of the tests so that it reproduces #23624, by creating a situation where no-compression SSTables are handled with debug logging enabled.	2025-04-07 13:05:04 +02:00
Michał Chojnowski	056da4b326	compress: fix an internal error when a specific debug log is enabled While iterating over the recent `69684e16d8` series, I shot myself in the foot by defining `algorithm_to_name(algorithm::none)` to be an internal error, and later calling that anyway in a debug log. (Tests didn't catch it because there's no test which simultaneously enables the debug log and configures some table to have no compression). This proves that `algorithm_to_name` is too much of a footgun. Fix it so that calling `algorithm_to_name(algorithm::none)` is legal. In hindsight, I should have done that immediately.	2025-04-07 13:05:03 +02:00
dependabot[bot]	a899cae158	build(deps): bump sphinx-scylladb-theme from 1.8.5 to 1.8.6 in /docs Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.5 to 1.8.6. - [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases) - [Commits](https://github.com/scylladb/sphinx-scylladb-theme/compare/1.8.5...1.8.6) --- updated-dependencies: - dependency-name: sphinx-scylladb-theme dependency-version: 1.8.6 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#23537	2025-04-07 13:42:19 +03:00
Emil Maskovsky	76ceaf129b	raft: distribute voters by rack inside DC Distribute the voters evenly across racks in the datacenters. When distributing the voters across datacenters, the datacenters with more racks will be preferred in case of a tie. Also, in case of asymmetric voter distribution (2 DCs), the DC with more racks will have more voters (if the node counts allow it). In case of a single datacenter, the voters will be distributed across racks evenly (in the similar manner as done for the whole datacenters). The intention is that similar to losing a datacenter, we want to avoid losing the majority if a rack goes down - so if there are multiple racks, we want to distribute the voters across them in such a way that losing the whole rack will not cause the majority loss (if possible).	2025-04-07 12:31:37 +02:00
Emil Maskovsky	831fae4bff	raft/test: fix lint warnings in `test_raft_no_quorum` Code cleanup - fixed lint warnings in `test_raft_no_quorum` test.	2025-04-07 12:31:37 +02:00
Emil Maskovsky	92f6662cd1	raft/test: add the upgrade test for limited voters feature We test the upgrade scenario of the limited voters feature - first we start the cluster with the limited voters feature disabled ("old code"), then we upgrade the cluster to the version with the limited voters feature enabled ("new code"). The nodes are being upgraded one by one and we test that the cluster still works (doesn't e.g. lose the majority).	2025-04-07 12:31:37 +02:00
Emil Maskovsky	a740623fa1	raft topology: handle on_up/on_down to add/remove node from voters Adding and removing the voters based on the node up/down events. This improves the availability of the system by automatically adjusting the number of voters in the system to use the alive nodes in precedence. We can then also drop the voter removal from the `write_both_read_old` to further simplify the code - the node will be removed from the voters when it goes down. However we only can do that in case the feature is enabled.	2025-04-07 12:31:37 +02:00
Emil Maskovsky	dc6afd47b7	raft: fix the indentation after the limited voters changes Fix the indentation that needs to be changed because of the added condition. This is done separately to make it easier to review the main commit with the functional changes.	2025-04-07 12:31:37 +02:00
Emil Maskovsky	1d06ea3a5a	raft: implement the limited voters feature Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated). The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks. After each node addition or removal the voters are recalculated and rebalanced if necessary. That means: * When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked). * When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs). * If a node addition or removal causes a change in number of datacenters (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution) Special conditions for various number of DCs: * 1 DC: Can have up to the maximum allowed number of voters (5 - see below) * 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose the majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). 
Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters. * 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution. At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR. Currently the voter limits will not be configurable (we might introduce configurable limits later if that would be needed/requested). The feature is enabled by the `group0_limited_voters` feature flag to avoid issues with cluster upgrade (the feature will be only enabled once all nodes in the cluster are upgraded to the version supporting the feature). Fixes: scylladb/scylladb#18793	2025-04-07 12:31:18 +02:00
Lakshmi Narayanan Sreethar	750f4baf44	replica/table::do_apply : do not check for async gate's closure The `table::do_apply()` method verifies if the compaction group's async gate is open to determine if the compaction group is active. Closing this async gate prevents any new operations but waits for existing holders to exit, allowing their operations to complete. When holding a gate, holders will observe the gate as closed when it is being closed, but this is irrelevant as they are already inside the gate and are allowed to complete. All the callers of `table::do_apply()` already enter the gate before calling the method. So, the async gate check inside `table::do_apply()` will erroneously throw an exception when the compaction group is closing despite holding the gate. This commit removes the check to prevent this from happening. Fixes #23348 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#23579	2025-04-07 13:27:22 +03:00
Emil Maskovsky	8b186ab0ff	raft: drop the voter removal from the decommission In the particular case of node decommission, this code doesn't really matter in production and only confuses us. Losing majority is an extremely rare event, and for this code to help one would have to lose majority in a very specific way (exactly half of the nodes die in a short time window during decommission), which is unrealistic. In addition, this code will be completely irrelevant (and would never be executed) once we implement #23266. Refs: scylladb/scylladb#23266	2025-04-07 12:23:25 +02:00
Emil Maskovsky	00794af94d	raft/test: disable the `stop_before_becoming_raft_voter` test The workflow of becoming a voter changes with the "limited voters" feature, as the node will no longer become a voter on its own, but the votership is being managed by the topology coordinator. This therefore breaks the `stop_before_becoming_raft_voter` test, as that injection relies on the old behavior. We will disable the test for this particular case for now and address either fixing or complete removal of the test in a follow-up task. Refs: scylladb/scylladb#23418	2025-04-07 12:23:25 +02:00
Emil Maskovsky	57df5d013e	raft/test: stop the server less gracefully in the voters test Stopping the test gracefully might hide some issues, therefore we want to stop it forcefully to make sure that the code can handle it. Added a parameter to stop gracefully or less gracefully (so that we test both cases).	2025-04-07 12:22:19 +02:00
Pavel Emelyanov	10376b5b85	db: Re-use database::snapshot_table_on_all_shards() There are two snapshot-on-all-shards methods on the database -- the one that snapshots a keyspace and the one that snapshots a vector of tables. The latter snapshots a single table with a neat helper, while the former has the helper open-coded. Re-using the helper in keyspace snapshot is worth it, but needs to patch the helper to work on uuid, rather than ks:cf pair of strings. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23532	2025-04-07 11:55:43 +02:00
Nadav Har'El	84fd52315f	alternator: in GetRecords, enforce Limit to be <= 1000 Alternator Streams' "GetRecords" operation has a "Limit" parameter on how many records to return. The DynamoDB documentation says that the upper limit on this Limit parameter is 1000 - but Alternator didn't enforce this. In this patch we begin enforcing this highest Limit, and also add a test for verifying this enforcement. As usual, the new test passes on DynamoDB, and after this patch - also on Alternator. The reason why it's useful to have some upper limit on Limit is that the existing executor::get_records() implementation does not really have preemption points in all the necessary places. In particular, we have a loop on all returned records without preemption points. We also store the returned records in a RapidJson vector, which requires a contiguous allocation. Even before this patch, GetRecords had a hard limit of 1 MB of results. But still, in some cases 1 MB of results may be a lot of results, and we can see stalls in the aforementioned places being O(number of results). Fixes #23534 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23547	2025-04-07 12:52:03 +03:00
Kefu Chai	55777812d4	s3/client: Optimize file streaming with zero-copy multipart uploads When streaming files using multipart upload, switch from using `output_stream::write(const char*, size_t)` to passing buffer objects directly to `output_stream::write()`. This eliminates unnecessary memory copying that occurred when the original implementation had to defensively copy data before sending. The buffer objects can now be safely reused by the output stream instead of creating deep copies, which should improve performance by reducing memory operations during S3 file uploads. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23567	2025-04-07 12:50:06 +03:00
Avi Kivity	ac3d25eb44	sstable_set: incremental_reader_selector: be more careful when filtering out already engaged sstables The incremental reader selector maintains an unordered_set of sstables that are already engaged, and uses std::views::filter to filter those out. It adds the sstable under consideration to the set, and if addition failed (because it's already in) then it filters it out. This breaks if the filter view is executed twice - the first pass will add every sstable to the set, and the second will consider every sstable already filtered. This is what happens with libstdc++ 15 (due to the addition of vector(from_range_t) constructor), which uses the first pass to calculate the vector size and the second pass to insert the elements into a correctly-sized vector. Fix by open-coding the loop. Closes scylladb/scylladb#23597	2025-04-07 12:49:04 +03:00
Gleb Natapov	a982db326e	gossiper: send newest entry in a digest message When two entries have the same ip address, send information only for the newest one. Currently we send both, which makes the receiver use one of them at random, and it may be the outdated one (though this should only cause more data than needed to be requested).	2025-04-06 18:39:24 +03:00
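The idea of keeping only the newest entry per ip can be sketched as below. The struct, its fields, and the use of the generation number to decide which entry is newest are illustrative assumptions, not the gossiper's actual types.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified view of a gossip endpoint entry.
struct endpoint_entry {
    std::string host_id;
    std::string ip;
    std::int64_t generation;  // assumption: a higher generation means a newer entry
};

// Keeps one entry per ip, preferring the one with the highest generation.
inline std::vector<endpoint_entry> newest_per_ip(const std::vector<endpoint_entry>& entries) {
    std::map<std::string, endpoint_entry> by_ip;
    for (const auto& e : entries) {
        auto [it, inserted] = by_ip.try_emplace(e.ip, e);
        if (!inserted && e.generation > it->second.generation) {
            it->second = e;
        }
    }
    std::vector<endpoint_entry> out;
    for (const auto& [ip, e] : by_ip) {
        out.push_back(e);
    }
    return out;
}
```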
Gleb Natapov	8d534ee68e	gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter	2025-04-06 18:39:24 +03:00
Gleb Natapov	6f53611337	gossiper: move force_remove_endpoint to work on host id Since the gossiper works on host ids now, it is incorrect to leave this function working on ip. It makes it impossible to delete an outdated entry, since the "gossiper.get_host_id(endpoint) != id" check will always be false for such entries (get_host_id() always returns the most up-to-date mapping).	2025-04-06 18:39:24 +03:00
Amnon Heiman	b55f24c14d	alternator: Add tests for the batch items histograms This patch adds a test for the batch‑items histogram for both get and write operations. It updates the check_increases_metric_exact helper function so that it takes a list of expected values and labels (labels can be None). This makes it easy to test multiple buckets in a histogram. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-04-06 18:22:23 +03:00
Amnon Heiman	c060c0b867	alternator: Add histogram for batch item count This patch adds an estimated_histogram for alternator batch item count. estimated_histogram can be used with values starting from 1 with an exponential factor of 1.2, which nicely covers values up to 20, but with only 22 buckets it can reach all the way to 100 (plus infinity). Aside from the new histograms for get and write batches, a helper function was added to return the histogram in the metric format without changing its resolution (which is the metric’s default behaviour). The histogram will be reported once per node rather than once per shard. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-04-06 18:22:13 +03:00
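The bucket arithmetic mentioned in this commit can be checked with a short sketch. It assumes the Cassandra-style rule for bucket bounds (multiply the previous bound by 1.2, round, and always advance by at least 1); the actual estimated_histogram code may differ in details, and the function name is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds `count` bucket upper bounds starting at 1 with a growth factor of 1.2.
// Assumption: bounds follow the Cassandra-style offset rule described above.
inline std::vector<std::uint64_t> bucket_offsets(std::size_t count) {
    std::vector<std::uint64_t> offsets;
    std::uint64_t last = 1;
    while (offsets.size() < count) {
        offsets.push_back(last);
        std::uint64_t next = (last * 12 + 5) / 10;  // round(last * 1.2) in integer math
        if (next == last) {
            ++next;  // small values would otherwise repeat
        }
        last = next;
    }
    return offsets;
}
```

Under this rule, 22 buckets give the bounds 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 17, 20, 24, 29, 35, 42, 50, 60, 72, 86, 103, which matches the commit's remark that 22 buckets reach about 100.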
Marcin Maliszkiewicz	b94acfb37b	test: remove alternator code from perf-simple-query This kind of benchmark was superseded by perf-alternator which has more options, workflows and most importantly measures overhead of http server layer (including json parsing). There is no need to maintain additional code in perf-simple-query. Closes scylladb/scylladb#23474	2025-04-06 18:15:16 +03:00
Pavel Emelyanov	d4f3a3ee4f	cql: Remove unused "initial_tablets" mention from guardrails All tablets configuration was moved into its own "with tablets" section, this option name cannot be met among replication factors. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23555	2025-04-06 16:52:07 +03:00
Gleb Natapov	df6cd87bcc	gossiper: do not send outdated endpoint in gossiper round Now that the gossiper map is id based, there can be a situation where two entries have the same ip. The shadow round should send the newest one in this case. The patch makes it so. Fixes: #23553	2025-04-06 15:08:03 +03:00
Nadav Har'El	431de48df9	test/alternator: test for item with many attributes A user complained that he couldn't read or write an item with more than 16 attributes (!) in Alternator. This isn't true, but I realized that we don't have a simple test for this case - all tests use just a few attributes. So let's add such a test, doing PutItem, UpdateItem and GetItem with 400 attributes. Unsurprisingly, the test passes. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23568	2025-04-03 22:35:49 +03:00
Nadav Har'El	a9a6f9eecc	test/alternator: increase timeout in Alternator RBAC test On our testing infrastructure, tests often run a hundred times (!) slower than usual, for various reasons that we can't always avoid. This is why all our test frameworks drastically increase the default timeouts. We forgot to increase the timeout in one place - where Alternator tests use CQL. This is needed for the Alternator role-based access control (RBAC) tests: RBAC is configured via CQL, so these Alternator tests unusually use CQL. So in this patch we increase the timeout of the CQL driver used by Alternator tests to the same high timeouts (60-120 seconds) used by the regular CQL tests. As the famous saying goes, these timeouts should be enough for anyone. Fixes #23569. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23578	2025-04-03 22:31:08 +03:00
Benny Halevy	cdf9fe9e50	Update seastar submodule * seastar 2f13c461...ed8952fb (24): > file: explain dsync check in flush method > gate: add named_gate > tests: unit: add gate_test > reactor: Remove global task_quota extern declaration > future: Move report_failed_future to internal namespace > update boost cooking URL > smp: prefault: clear memory map after threads join > change format to sesatar::format > Prevent move / copy constructor / assignment on backtrace_buffer > Remove unnecesary flush calls from backtrace_buffer usage points > Make backtrace_buffer flush on destruction > Add `backtrace_buffer&` param to maybe_report_kernel_trace function > Prevent empty kernel callstack messages > Make cpu_stall_detector_linux_perf_event::maybe_report_kernel_trace function protected. > iotune: Add cli flag to force io depth > smp: prefault: decouple _stop_request from join_threads > reactor: more info, robustness on segfault > net/udp: fix ipv4_udp::next_port calculation > map_reduce: prevent mapper or reducer exception from poisoning state > build: Re-enable ASan's verify_asan_link_order check > tests: enable/disable internet-dependent tests at runtime > test: tls_test: rename test_simple_x509_client variants to avoid naming conflicts > tests: extend test.py to accept arbitrary ctest parameters from positional args > tests: add a handle for building tests in "offline" mode Closes scylladb/scylladb#23566	2025-04-03 19:45:37 +03:00
Botond Dénes	1198213000	Merge 'tablets: Make tablet allocation equalize per-shard load ' from Tomasz Grabiec Before, it was equalizing per-node load (tablet count), which is wrong in heterogeneous clusters. Nodes with fewer shards will end up with overloaded shards. Refs #23378 Closes scylladb/scylladb#23478 * github.com:scylladb/scylladb: tablets: Make tablet allocation equalize per-shard load tablets: load_balancer: Fix reporting of total load per node	2025-04-03 16:32:53 +03:00
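A small sketch of why per-node and per-shard balancing disagree in a heterogeneous cluster. The struct and function names are hypothetical and this is not the load balancer's code; it only compares the two selection criteria.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary of a node: shard count (assumed > 0) and tablets it holds.
struct node_load {
    std::uint64_t shards;
    std::uint64_t tablets;
};

// Old criterion: pick the node holding the fewest tablets in total.
inline std::size_t least_loaded_per_node(const std::vector<node_load>& nodes) {
    assert(!nodes.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < nodes.size(); ++i) {
        if (nodes[i].tablets < nodes[best].tablets) {
            best = i;
        }
    }
    return best;
}

// New criterion: pick the node with the fewest tablets per shard.
// The ratios are compared by cross-multiplication to stay in integer math.
inline std::size_t least_loaded_per_shard(const std::vector<node_load>& nodes) {
    assert(!nodes.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < nodes.size(); ++i) {
        if (nodes[i].tablets * nodes[best].shards < nodes[best].tablets * nodes[i].shards) {
            best = i;
        }
    }
    return best;
}
```

For a 4-shard node with 8 tablets and a 16-shard node with 16 tablets, the per-node criterion picks the small node (2 tablets per shard already), while the per-shard criterion picks the large one (1 tablet per shard).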
Botond Dénes	fcdae20fd1	Merge 'Add tablet enforcing option' from Benny Halevy This series adds a new config option: `tablets_mode_for_new_keyspaces` that replaces the existing `enable_tablets` option. It can be set to the following values: disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option enabled: New keyspaces use tablets by default, unless disabled by the tablets={'enabled':false} option enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option `tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether tablets are disabled or enabled by default for new keyspaces, respectively. In either case, tablets can be opted in or out using the `tablets={'enabled':...}` keyspace option, when the keyspace is created. `tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`. However, it does not allow opting out when creating new keyspaces by setting `tablets = {'enabled': false}` Refs scylladb/scylla-enterprise#4355 * Requires backport to 2025.1 Closes scylladb/scylladb#22273 * github.com:scylladb/scylladb: boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option db/config: add tablets_mode_for_new_keyspaces option	2025-04-03 16:32:19 +03:00
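The three modes can be summarized as a small decision function. This is a sketch of the semantics as described in the commit message; the enum, the function name, and the exception type are assumptions rather than the real implementation.

```cpp
#include <optional>
#include <stdexcept>

// Hypothetical mirror of the values of tablets_mode_for_new_keyspaces.
enum class tablets_mode { disabled, enabled, enforced };

// Decides whether a new keyspace uses tablets.
// `ks_option` models the optional tablets={'enabled': ...} keyspace option.
inline bool uses_tablets(tablets_mode mode, std::optional<bool> ks_option) {
    switch (mode) {
    case tablets_mode::disabled:
        return ks_option.value_or(false);  // vnodes unless explicitly opted in
    case tablets_mode::enabled:
        return ks_option.value_or(true);   // tablets unless explicitly opted out
    case tablets_mode::enforced:
        if (ks_option.has_value() && !*ks_option) {
            throw std::invalid_argument("tablets cannot be disabled in enforced mode");
        }
        return true;
    }
    throw std::logic_error("unknown tablets mode");
}
```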
Kefu Chai	3760a1c85e	cql3: Remove unnecessary 'virtual' specifiers from final class methods Remove 'virtual' specifiers from member functions in final classes where they can never be overridden. This addresses Clang errors like: ``` /home/kefu/dev/scylladb/cql3/column_identifier.hh:85:21: error: virtual method 'to_string' is inside a 'final' class and can never be overridden [-Werror,-Wunnecessary-virtual-specifier] 85 \| virtual sstring to_string() const; \| ^ 1 error generated. ``` This change improves code clarity and maintainability by eliminating redundant modifiers that could cause confusion. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23570	2025-04-03 13:51:42 +03:00
Tomasz Grabiec	fe8187e594	Merge 'repair: release erm in repair_writer_impl::create_writer when possible' from Aleksandra Martyniuk Currently, repair_writer_impl::create_writer keeps erm to ensure that a sharder is valid. If we repair a tablet, erm blocks the state machine and no operation on any tablet of this table might be performed. Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked. Fixes: #23453. Needs backport to 2025.1 that introduces the tablet repair scheduler. Closes scylladb/scylladb#23455 * github.com:scylladb/scylladb: test: add test to check concurrent migration and repair of two different tablets repair: release erm in repair_writer_impl::create_writer when possible	2025-04-03 11:15:08 +02:00
Botond Dénes	7bbfa5293f	test/cluster/test_read_repair.py: increase read request timeout This test enables trace-level logging for the mutation_data logger, which seems to be too much in debug mode and the test read times out. Increase timeout to 1 minute to avoid this. Fixes: #23513 Closes scylladb/scylladb#23558	2025-04-03 10:42:11 +03:00
Botond Dénes	07510c07a0	readers/mutation_readers: queue_reader_handle_v2::push_end_of_stream() raise _ex if set Instead of raising std::runtime_error("Dangling queue_reader_handle_v2") unconditionally. push() already raises _ex if set, best to be consistent. Unconditionally raising std::runtime_error can cause an error to be logged, when aborting an operation involving a queue reader. Although the original exception passed to queue_reader_handle_v2::abort() is most likely handled by higher level code (not logged), the generic std::runtime_error raised is not and therefore is logged. Fixes: #23550 Closes scylladb/scylladb#23554	2025-04-03 10:39:56 +03:00
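The behavioural change can be illustrated with a stripped-down handle. The class below is a hypothetical stand-in, not queue_reader_handle_v2 itself; in particular, how the real handle tracks detachment is an assumption.

```cpp
#include <exception>
#include <stdexcept>
#include <utility>

// Hypothetical, simplified stand-in for a queue reader handle.
class queue_handle_sketch {
    bool _attached = true;
    bool _end_of_stream = false;
    std::exception_ptr _ex;
public:
    // Aborting stores the caller's exception and detaches the handle.
    void abort(std::exception_ptr ex) {
        _ex = std::move(ex);
        _attached = false;
    }
    // After the change: a detached handle rethrows the stored exception if one
    // is set, and only falls back to the generic error otherwise.
    void push_end_of_stream() {
        if (!_attached) {
            if (_ex) {
                std::rethrow_exception(_ex);
            }
            throw std::runtime_error("Dangling queue_reader_handle_v2");
        }
        _end_of_stream = true;
    }
    bool end_of_stream() const { return _end_of_stream; }
};
```

Callers that already handle the abort exception then see that same exception from push_end_of_stream(), instead of an unrelated std::runtime_error that ends up in the log.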
Pavel Emelyanov	3bf4768205	Merge 'Unify http transport in EAR to use seastar http client' from Calle Wilund Fixes #22925 Refs #22885 Some providers in EAR were written before seastar got its own native http connector (as it is). Thus hand-made connectivity is used there. This PR unifies the code paths, and also extracts some abstraction between providers where possible. One big reason for this is the handling of abrupt disconnects and retries; Seastar has some handling of things like EPIPE and ECONNRESET situations, that can be safely ignored in a REST call iff data was in fact transferred etc. This PR mainly takes the usage of seastar httpclient from gcp connector, makes a wrapper matching most of the usage of local client in kms connector, ensures common functionality and then replaces the code in the individual connectors. Closes scylladb/scylladb#22926 * github.com:scylladb/scylladb: encryption::gcp: Use seastar http client wrapper encryption::kms: Drop local http client and use seastar wrapper encryption: Break out a "httpclient" wrapper for seastar httpclient	2025-04-03 10:35:14 +03:00
Kefu Chai	0cd6cf1dc5	main: Remove unused member variable `_sys_ks` Fixes a Clang error by removing the unused private field `sstable_dict_deleter::_sys_ks` that was flagged with: [-Werror,-Wunused-private-field] ``` /home/kefu/.local/bin/clang++ -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_PROGRAM_OPTIONS_NO_LIB -DSCYLLA_BUILD_MODE=release -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/build -isystem /home/kefu/dev/scylladb/seastar/include -isystem /home/kefu/dev/scylladb/build/RelWithDebInfo/seastar/gen/include -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -I/usr/include/p11-kit-1 -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++23 -flto=thin -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/= -ffile-prefix-map=/home/kefu/dev/scylladb/build=. 
-ffile-prefix-map=/home/kefu/dev/scylladb/build/=build -march=westmere -Xclang -fexperimental-assignment-tracking=disabled -mllvm -inline-threshold=2500 -fno-slp-vectorize -ffat-lto-objects -std=gnu++23 -Werror=unused-result -DSEASTAR_API_LEVEL=7 -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=19 -DSEASTAR_LOGGER_TYPE_STDOUT -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_THREAD_NO_LIB -DBOOST_THREAD_DYN_LINK -DFMT_SHARED -MD -MT CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o -MF CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o.d -o CMakeFiles/scylla.dir/RelWithDebInfo/main.cc.o -c /home/kefu/dev/scylladb/main.cc /home/kefu/dev/scylladb/main.cc:1660:38: error: private field '_sys_ks' is not used [-Werror,-Wunused-private-field] 1660 \| db::system_keyspace& _sys_ks; \| ^ ``` The member variable is not referenced anywhere in the code, so removing it improves maintainability without affecting functionality. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23545	2025-04-02 20:07:39 +03:00
Evgeniy Naydanov	84a5037056	test.py: cluster/suite.yaml: update test filters After switching to subfolders the filter `run_in_debug` for random failures test was just copied as is, but need to include the subfolder, actually. Also, `test_old_ip_notification_repro` was deleted, so, we don't need it in the `skip_in_debug` list. Closes scylladb/scylladb#23492	2025-04-02 19:29:27 +03:00
Kefu Chai	a09ec9d60d	.github: add delay before checking for required PR labels Improve the GitHub workflow to prevent premature email notifications about missing labels. Previously, contributors without write permissions to the scylladb repo would receive immediate notification emails about missing required backport labels, even if they were in the process of adding them. This change introduces a 1-minute grace period before checking for required labels, giving contributors sufficient time to add necessary labels (like backport labels) to their pull requests before any warning notifications are sent. The delay makes the experience more user-friendly for non-maintainer contributors while maintaining the labeling requirements. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23539	2025-04-02 19:28:15 +03:00
Aleksandra Martyniuk	bae6711809	test: add test to check concurrent migration and repair of two different tablets	2025-04-02 15:30:17 +02:00
Radosław Cybulski	c36614e16d	alternator: add size check to BatchWriteItem Add a size check for the BatchWriteItem command - if the item count is bigger than the configuration value `alternator_maximum_batch_write_size`, an error will be raised and no modification will happen. This is done to synchronize with DynamoDB, where the maximum size of BatchWriteItem is 25. To avoid complaints from clients who rely on our BatchWriteItem being limitless, we set the default value to 100. Fixes #5057 Closes scylladb/scylladb#23232	2025-04-02 14:48:00 +03:00
Avi Kivity	882f405eed	Merge "Convert gossiper's endpoint state map to be host id based" from Gleb " The series makes endpoint state map in the gossiper addressable by host id instead of ips. The transition has implication outside of the gossiper as well. Gossiper based topology operations are affected by this change since they assume that the mapping is ip based. On wire protocol is not affected by the change as maps that are sent by the gossiper protocol remain ip based. If old node sends two different entries for the same host id the one with newer generation is applied. If new node has two ids that are mapped to the same ip the newer one is added to the outgoing map. Interoperability was verified manually by running mixed cluster. The series concludes the conversion of the system to be host id based. " * 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev: gossiper: make examine_gossiper private gossiper: rename get_nodes_with_host_id to get_node_ip treewide: drop id parameter from gossiper::for_each_endpoint_state treewide: move gossiper to index nodes by host id gossiper: drop ip from replicate function parameters gossiper: drop ip from apply_new_states parameters gossiper: drop address from handle_major_state_change parameter list gossiper: pass rpc::client_info to gossiper_shutdown verb handler gossiper: add try_get_host_id function gossiper: add ip to endpoint_state serialization: fix std::map de-serializer to not invoke value's default constructor gossiper: drop template from wait_alive_helper function gossiper: move get_supported_features and its users to host id storage_service: make candidates_for_removal host id based gossiper: use peers table to detect address change storage_service: use std::views::keys instead of std::views::transform that returns a key gossiper: move _pending_mark_alive_endpoints to host id gossiper: do not allow to assassinate endpoint in raft topology mode gossiper: fix indentation after previous patch 
gossiper: do not allow to assassinate non existing endpoint	2025-04-02 12:30:00 +03:00
Pavel Emelyanov	832d83ae4b	sstables_loader: Do not stop sharded<progress_monitor> unconditionally The member in question is unconditionally .stop()-ed in task's release_resources() method, however, it may happen that the thing wasn't .start()-ed in the first place. Start happens in the middle of the task's .run() method and there can be several reasons why it can be skipped -- e.g. the task is aborted early, or collecting sstables from S3 throws. fixes: #23231 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23483	2025-04-02 12:09:02 +03:00
Kefu Chai	6da758d74c	config: mark uuid_sstable_identifiers_enabled unused the option of `uuid_sstable_identifier_enabled` was introduced in `f014ccf3` . the first version which has this change was 5.4, and 6.1 has been branched. during the discussion of backup and restore, we realized that we've been taking efforts to address problems which could have been addressed with the sstable with UUID-based identifier. see also #10459 which is the issue which proposed to implement UUID-v1 based sstable identifier. now that two major releases passed, we should have the luxury to mark this option "unused". this option which was previously introduced to keep the backward compatibility, and to allow user to opt-out of the feature for some reasons. so in this change, mark the option unused, so that if any user still sets this option with command line, they will get a clear error. but we still parse and handle this setting in `scylla.yaml`, so that this option is still respected for existing settings, and for existing tests, which are not yet prepared for the uuid-based sstable identifiers. Refs #10459 Fixes #20337 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#20341	2025-04-01 20:21:47 +03:00
Botond Dénes	3bad46a6e2	docs/dev: add tombstone.md An exhaustive document on the tombstone related internal logic as well as the user-facing aspects. Closes scylladb/scylladb#23454	2025-04-01 20:17:57 +03:00
Botond Dénes	a0d8102a1f	replica/memtable: s/make_flat_reader/make_mutation_reader/ Following the recent refactoring of removing "flat" and "v2" from reader names, replacing all the fully qualified names with simply "mutation_reader". Closes scylladb/scylladb#23346	2025-04-01 17:58:13 +03:00
Artsiom Mishuta	032b28d793	test.py: remove pylib_test from test.py/CI run pylib_test contains one pure Python test. This test does not test Scylla. This test is not deleted because it can be useful to run during pre-commit, for example, but it definitely should not be run in CI in modes with 3 repeats each. It does not make sense. It is a Unit test for test.py framework. Note: test still can be easily run by pytest via the command: ./tools/toolchain/dbuild pytest test/pylib_test Closes scylladb/scylladb#23181	2025-04-01 16:43:45 +03:00
Pavel Emelyanov	2ee9cec1d3	Merge 'Remove object_storage.yaml and move the endpoints to scylla.yaml' from Robert Bindar Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Also, the `object_storage_config_file` option is moved to a deprecated state as it's no longer needed. This PR depends on #22951, the reviewers should review patch 393e1ac0ec066475ca94094265a5f88dbbdb1a1f Refs https://github.com/scylladb/scylladb/issues/22428 Closes scylladb/scylladb#22952 * github.com:scylladb/scylladb: Remove db::config::object_storage_config Move `object_storage.yaml` endpoints to `scylla.yaml`	2025-04-01 16:01:44 +03:00
Avi Kivity	69684e16d8	Merge 'sstables: add SSTable compression with shared dictionaries ' from Michał Chojnowski This PR extends Scylla's SSTable compression with the ability to use compression dictionaries shared across compression chunks. This involves several changes: - We refactor `compression_parameters` and friends (`compressor`, `sstables::local_compression`, `sstables::compression`) to prepare for making the construction of `compressor`s asynchronous, to enable sharing pieces of compressors (the dictionaries) across shards. - We introduce the notion of "hidden compression options" which are written to `CompressionInfo.db` and used to construct decompressors, like regular options, but don't appear in the schema. (We later stuff the SSTable's dictionary into `CompressionInfo.db` using a sequence of such options). - We add a cluster feature which guards the creation of dictionary-compressed SSTables. - We introduce a central "compressor factory" (one instance shared by all shards), which from this point onward is used to construct all `compressor` objects (one per SSTable) used to process the SSTables. When constructing a compressor for writing, it uses the "current"/"recommended" dictionary (which is passed to the factory from the actively-observed contents of the group0-managed `system.dicts`). When constructing a compressor for reading, it uses the dictionary written in the hidden compression options in CompressionInfo.db. And it keeps dictionaries deduplicated, so that each unique live dictionary blob has only one instance in memory, shared across shards. - We teach the relevant `lz4` and `zstd` compressor wrappers about the dictionaries. - We add a HTTP API call which samples pieces of the given table (i.e. the Data.db files) from across the cluster, trains a dictionary on it, and publishes it via `system.dicts` as the new current dictionary for that table. (And we add some RPC verbs to support that). 
- We add a HTTP API call which estimates the impact of various available compression configurations on the compression ratio. - We add an autotrainer fiber which periodically retrains dicts for dict-aware tables and publishes them if they seem to be a significant improvement. Known imperfections: - The factory currently keeps one dictionary instance on the entire node, but we probably want one copy per NUMA node. I didn't do that because exposing NUMA knowledge to Scylla seems to require some changes in Seastar first. New feature, no backporting involved. Closes scylladb/scylladb#23025 * github.com:scylladb/scylladb: docs: add user-facing documentation for SSTable compression with shared dicts docs/dev: add sstable-compression-dicts.md test: add test_sstable_compression_dictionaries_autotrain.py test: add test_sstable_compression_dictionaries_basic.py test/pylib/rest_client: add `keyspace_upgrade_sstables` helper main: run a sstable_dict_autotrainer api: add the estimate_compression_ratios API call dict_autotrainer: introduce sstable_dict_autotrainer db/system_keyspace: add query_dict_timestamp compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor main: clean up sstable compression dicts after table drops sstables/compress: discard hidden compression options after the decompressor is created compress: change compressor_ptr from shared_ptr to unique_ptr api: add the retrain_dict API call storage_service: add some dict-related routines main: in compression_dict_updated_callback, recognize and use SSTable compression dicts storage_service: add do_sample_sstables() messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict raft/group0_state_machine: on `system.dicts` mutations, pass the affected partitition keys to the callback database: add sample_data_files() database: add take_sstable_set_snapshot() compress: teach `lz4_processor` about 
dictionaries compress: teach `zstd_processor` about dictionaries sstables: delegate compressor creation to the compressor factory sstables: plug an `sstable_compressor_factory` into `sstables_manager` sstables: introduce sstable_compressor_factory utils/hashers: add get_sha256() gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature compress: add hidden dictionary options compress: remove `compression_parameters::get_compressor()` sstables/compress: remove get_sstable_compressor() sstables/compress: move ownership of `compressor` to `sstable::compression` compress: remove compressor::option_names() compress: clean up the constructor of zstd_processor compress: squash zstd.cc into compress.cc sstables/compress: break the dependency of `compression_parameters` on `compressor` compress.hh: switch compressor::name() from an instance member to a virtual call bytes: adapt fmt_hex to std::span<const std::byte>	2025-04-01 12:47:34 +03:00
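The deduplication property described for the compressor factory (each unique live dictionary blob has only one instance in memory) can be sketched with weak pointers. This is a single-threaded illustration with hypothetical names; the real factory is shared across shards and, judging by the get_sha256() helper in the series, likely keys dictionaries by hash rather than by content.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical registry that hands out one shared instance per unique blob.
class dict_registry_sketch {
    // Keyed by blob contents for simplicity (assumption: the real code uses a hash).
    std::map<std::string, std::weak_ptr<const std::string>> _live;
public:
    std::shared_ptr<const std::string> get(const std::string& blob) {
        auto& slot = _live[blob];
        if (auto existing = slot.lock()) {
            return existing;  // reuse the instance that is still alive
        }
        auto fresh = std::make_shared<const std::string>(blob);
        slot = fresh;
        return fresh;
    }
    // Number of dictionaries that still have at least one user.
    std::size_t live_count() const {
        std::size_t n = 0;
        for (const auto& [key, ptr] : _live) {
            if (!ptr.expired()) {
                ++n;
            }
        }
        return n;
    }
};
```

Two readers that ask for the same blob get the same pointer, and the instance is freed once the last user drops it.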
Aleksandra Martyniuk	1dc29ddc86	repair: release erm in repair_writer_impl::create_writer when possible Currently, repair_writer_impl::create_writer keeps erm to ensure that a sharder is valid. If we repair a tablet, erm blocks the state machine and no operation on any tablet of this table might be performed. Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked. Fixes: #23453.	2025-04-01 11:34:21 +02:00
Calle Wilund	c6674619b7	encryption::gcp: Use seastar http client wrapper Refs #22925 Remove direct usage of seastar http client, and instead share this with other connectors via the http client wrapper type.	2025-04-01 08:18:05 +00:00
Calle Wilund	491748cde3	encryption::kms: Drop local http client and use seastar wrapper Fixes #22925 Removes the boost based http client in favour of our seastar wrapper.	2025-04-01 08:18:05 +00:00
Calle Wilund	878f76df1f	encryption: Break out a "httpclient" wrapper for seastar httpclient Refs #22925 Adds some wrapping and helpers for the kind of REST operations we expect to perform. Some things, like stream formatting, are redundant vis-a-vis seastar, but on that level we only have \r\n encoded writing to output_stream and similar, which is less useful for things like logging.	2025-04-01 08:18:05 +00:00
Piotr Smaron	370707b111	service: restore default timeout in `announce_with_raft` This restored timeout seems to have been accidentally removed in `7081215552 (r2005352424)`. Without it, `raft_server_with_timeouts::run_with_timeout` will get `std::nullopt` as a value of the `timeout` parameter and perform an operation without any timeout, whereas previously it would have waited for the default timeout specified in `raft_server_for_group::default_op_timeout`. Closes scylladb/scylladb#23380	2025-04-01 10:20:16 +03:00
David Garcia	6e61fc323b	docs: redirect to docs.scylladb.com/manual/ Define a custom alert to redirect users to the latest version of the docs in https://docs.scylladb.com/manual/ Closes scylladb/scylladb#22636	2025-04-01 09:22:56 +03:00
Botond Dénes	bd9f51a29c	Merge 'transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing' from Vladislav Zolotarov A default timestamp (not to be confused with the timestamp passed via 'USING TIMESTAMP' query clause) can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver. However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to set it back then). This patch fixes this. Fixes #23173 The issue fixed by this PR is not critical but the fix is simple and safe enough so we should backport it to all live releases. Closes scylladb/scylladb#23174 * github.com:scylladb/scylladb: CQL Tracing: set common query parameters in a single function transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing	2025-04-01 09:16:02 +03:00
Pavel Emelyanov	b5a124f60c	sstable_directory: Move highest_generation_seen() to distributed_loader.cc This method is only used by the loader code (and tests). Also, its peer highest_version_seen() already sits in the loader code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23324	2025-04-01 09:15:14 +03:00
Pavel Emelyanov	eafc767cc6	sstable/filesystem: Add convenience helper to generate filename In its operations the fs storage carefully generates full filename from all sstable parameters -- version, format, generation, keyspace and table names and component type or name. However, in all of the cases format, version and keyspace:table names are inherited from the sstable being operated on. This calls for a filename generation helper that wraps most of the arguments thus making the lines shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23384	2025-04-01 09:14:44 +03:00
Botond Dénes	0fdf2a2090	Merge 'test/pylib: servers_add: support list of property_files' from Benny Halevy So that a multi-dc/multi-rack cluster can be populated in a single call. * Enhancement, no backport required Closes scylladb/scylladb#23341 * github.com:scylladb/scylladb: test/pylib: servers_add: add auto_rack_dc parameter test/pylib: servers_add: support list of property_files	2025-04-01 09:14:20 +03:00
Botond Dénes	94e8971308	scylla-gdb.py: improve scylla repairs command Make output more readable by: * group follower/master repair instances separately * split repair details into one line for repair summary, then one line for each host info * add indentation to make the output easier to follow Also add -m\|--memory option to calculate memory usage of repair buffers. Example output: (gdb) scylla repairs -m Repairs for which this node is leader: (repair_meta) 0x60503ab7f7b0: {id: 19197, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 30, memory: 48208512}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started (repair_meta) 0x60503717f7b0: {id: 19211, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 63863265}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: c4936a19-41da-4260-971e-651445d740fd, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished (repair_meta) 0x60502ddff7b0: {id: 19231, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::row_level_stop_started host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::row_level_stop_finished (repair_meta) 0x60501db3f7b0: {id: 19234, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 0, memory: 0}, same_shard: 
True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_sync_boundary_started host: 039494b6-9d35-4f34-82c4-3c79c1d97175, shard: 4294967295, state: repair_state::get_sync_boundary_finished (repair_meta) 0x60501c81f7b0: {id: 19236, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 42696821}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::put_row_diff_with_rpc_stream_started (repair_meta) 0x60503f65f7b0: {id: 19238, table: large_collection_test.table_with_large_collection, reason: decommission, row_buf: {len: 0, memory: 0}, working_row_buf: {len: 28, memory: 47785163}, same_shard: True, tablet: False} host: 496e8b0c-50bf-4ada-b8f9-3d167138e908, shard: 5, state: repair_state::get_combined_row_hash_finished host: ce4413ab-33d9-40f8-b13e-d14af8511dda, shard: 4294967295, state: repair_state::get_row_diff_with_rpc_stream_finished Repairs for which this node is follower:	2025-04-01 01:53:35 -04:00
Botond Dénes	47c62a4cf2	scylla-gdb.py: seastar_lw_shared_ptr: add __nonzero__ and __bool__ There is currently no easy way to null-check seastar_lw_shared_ptr. Comparing get() against 0 doesn't work, if _p is null, get() will return an illegal pointer. So add methods to allow for easy null-checks by comparing _p with 0 instead.	2025-04-01 01:53:34 -04:00
Botond Dénes	f84bf43c96	scylla-gdb.py: introduce managed_bytes Extracted from managed_bytes_printer. Make working with managed_bytes easier. Abstracts how size and content is obtained.	2025-04-01 01:53:34 -04:00
Jenkins Promoter	6c528f5027	Update pgo profiles - aarch64	2025-04-01 04:45:44 +03:00
Jenkins Promoter	3c12029584	Update pgo profiles - x86_64	2025-04-01 04:27:11 +03:00
Michał Chojnowski	36be9d1c9b	docs: add user-facing documentation for SSTable compression with shared dicts	2025-04-01 00:07:31 +02:00
Michał Chojnowski	d33ffb221b	docs/dev: add sstable-compression-dicts.md	2025-04-01 00:07:31 +02:00
Michał Chojnowski	f851efd4fa	test: add test_sstable_compression_dictionaries_autotrain.py Adds a test which checks that sstable compression dict autotraining does its job.	2025-04-01 00:07:31 +02:00
Michał Chojnowski	62da3d8363	test: add test_sstable_compression_dictionaries_basic.py Add a basic integration test for SSTable compression with shared dictionaries.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	7b0eeefd79	test/pylib/rest_client: add `keyspace_upgrade_sstables` helper	2025-04-01 00:07:30 +02:00
Michał Chojnowski	3f7969313f	main: run a sstable_dict_autotrainer Create an instance of `sstable_dict_autotrainer` in `scylla_main` and run it.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	a19d6d95f7	api: add the estimate_compression_ratios API call Add an API call which estimates the effectiveness of possible compression config changes. This can be used to make an informed decision about whether to change the compression method, without actually recompressing any SSTables.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	4f0d453acf	dict_autotrainer: introduce sstable_dict_autotrainer Add a fiber responsible for periodic re-training of compression dictionaries (for tables which opted into dict-aware compression). As of this patch, it works like this: every `$tick_period` (15 minutes), if we are the current Raft leader, we check for dict-aware tables which have no dict, or a dict older than `$retrain_period`. For those tables, if they have enough data (>1GiB) for a training, we train a new dict and check if it's significantly better than the current one (provides ratio smaller than 95% of current ratio), and if so, we update the dict.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	9d02e2c005	db/system_keyspace: add query_dict_timestamp Adds a helper method which queries the creation timestamp of a given dict in `system.dicts`. We will later use the age of the current SSTable compression dict to decide if another training should be done already.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	cb1b291051	compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor Add new compressor names to `sstable_compression`. When those names are configured in the schema, new SSTables will be compressed with dict-aware Zstd or LZ4 respectively.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	bea866a46f	main: clean up sstable compression dicts after table drops When a table is dropped, its corresponding dictionary in `system.dicts` -- if any -- should be deleted, otherwise it will remain forever as garbage. This commit implements such cleanup.	2025-04-01 00:07:30 +02:00
Michał Chojnowski	cee504f66f	sstables/compress: discard hidden compression options after the decompressor is created Dictionary contents are kept in the list of "compression options" in the header of `CompressionInfo.db`, and they are loaded from disk into memory when the `sstable::compression` object is populated. After the decompressor for the SSTable is created based on those dict contents, they are not needed in RAM anymore. And since they take up a sizeable amount of memory, we would like to free them. In this patch, we discard all "hidden compression options" (currently: only the dictionary contents) from the `sstable::compression` object right after the decompressor is created. (Those options are not supposed to be used for anything else anyway).	2025-04-01 00:07:30 +02:00
Michał Chojnowski	10fa4abde7	compress: change compressor_ptr from shared_ptr to unique_ptr Cleanup patch. After we moved the ownership of compressors to sstables, compressor objects never have shared lifetime. `unique_ptr` is more appropriate for them than `shared_ptr` now. (And besides expressing the intent better, using `unique_ptr` prevents an accidental cross-shard `shared_ptr` copy).	2025-04-01 00:07:29 +02:00
Michał Chojnowski	58ae278d10	api: add the retrain_dict API call Add an API call which will retrain the SSTable compression dictionary for a given table. Currently, it needs all nodes to be alive to succeed. We can relax this later.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	4115a6fece	storage_service: add some dict-related routines storage_service will be the interface between the API layer (or the automatic training loop) and the dict machinery. This commit implements the relevant interface for that. It adds methods that: 1. Take SSTable samples from the cluster, using the new RPC verbs. 2. Train a dict on the sample. (The trainer will be plugged in from `main`). 3. Publish the trained dictionary. (By adding mutations to Raft group 0). Perhaps this should be moved to a separate "service". But it's not like `storage_service` has a clear purpose anyway.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	94d244ab49	main: in compression_dict_updated_callback, recognize and use SSTable compression dicts Currently, there is at most one dictionary in `system.dicts`: named "general", used by RPC compression. So the callback called on `system.dicts` just always refreshes the RPC compression dict. In a follow-up commit, we will publish SSTable compression dicts to `system.dicts` rows with a name in the "sstables/{table_uuid}" format. We want modification to such rows to be passed as new dictionary recommendations to the SSTable compressor factory. This commit teaches the `system.dicts` modification callback to recognize such modifications and forward them to the compressor factory.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	380f409c46	storage_service: add do_sample_sstables() Adds a helper which uses ESTIMATE_SSTABLE_VOLUME and SAMPLE_SSTABLES RPC calls to gather a combined sample of SSTable Data files for the given table from the entire cluster.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	94c33b6760	messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs Add two verbs needed to implement dictionary training for SSTable compression. SAMPLE_SSTABLES returns a list of randomly-selected chunks of Data files with a given cardinality and using a given chunk size, for the given table. ESTIMATE_SSTABLE_VOLUME returns the total uncompressed size of all Data files of the given table.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	4856f4acca	db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict Extend the `system.dicts` helper for querying and modifying `system.dicts` with an ability to use names other than "general". We will use that in later commits to publish dictionaries for SSTable compression.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	b77c611c00	raft/group0_state_machine: on `system.dicts` mutations, pass the affected partition keys to the callback Before this patch, `system.dicts` contains only one dictionary, for RPC compression, with the fixed name "general". In later parts of this series, we will add more dictionaries to system.dicts, one per table, for SSTable compression. To enable that, this patch adjusts the callback mechanism for group0's `write_mutations` command, so that the mutation callbacks for group0-managed tables can see which partition keys were affected. This way, the callbacks can query only the modified partitions instead of doing a full scan. (This is necessary to prevent quadratic behaviours.) For now, only the `system.dicts` callback uses the partition keys.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	d920ab5366	database: add sample_data_files() Add a helper for sampling the Data files for a given table. We will use it to take samples for dictionary training.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	48c06c7e4b	database: add take_sstable_set_snapshot() We want a method that will allow us to take a stable snapshot of SSTables, to asynchronously compute some stats on them. But `take_storage_snapshot` is overly invasive for that, because it flushes memtables on each call. (If `take_storage_snapshot` was, for example, called repetitively, it could create a ton of small memtables and lead to trouble). This commit adds a weaker version which only takes a snapshot of existing SSTables, and doesn't flush memtables by itself. This will be useful for dictionary training, which doesn't care about the semantics of SSTables, only their rough statistical properties.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	64f3d7e364	compress: teach `lz4_processor` about dictionaries Extend `lz4_processor` with the ability to use dictionaries. We won't use this ability yet. It will be used when new compressor names are added.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	b65101b371	compress: teach `zstd_processor` about dictionaries Extend `zstd_processor` with the ability to use dictionaries. We won't use this ability yet. It will be used when new compressor names are added.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	b18ddcb92e	sstables: delegate compressor creation to the compressor factory Remove `compressor::create()`. This enforces that compressors are only created through the `sstable_compressor_factory`. Unlike the synchronous `compressor::create()`, the factory will be able to create dict-aware compressors.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	30a9d471fa	sstables: plug an `sstable_compressor_factory` into `sstables_manager` Create a `sstable_compressor_factory_impl` in `scylla_main`, and pipe it through constructors into `sstables_manager`. In next commits, the factory available through the `sstables_manager` will be used to create compressors for SSTable readers and writers.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	ebf02913a2	sstables: introduce sstable_compressor_factory Before this commit, `compressor` objects are synchronously created, during the creation or opening of SSTables, from `compression_parameters` objects. But we want to add compression dictionaries to SSTables and we want to share dictionary contents across shards. To do that, we need to make the creation of `compressor` objects asynchronous, and give it access to a global dictionary registry. We encapsulate that in a `sstable_compressor_factory`. Instead of calling `compressor::create()` on SSTable opening or creation, we will ask the factory, asynchronously, for a new compressor, and it will return a compressor with a deduplicated, up-to-date dictionary. This commit introduces such a factory. It's not used anywhere yet, and the compressors it produces don't use the provided dictionaries yet.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	2bd393849c	utils/hashers: add get_sha256() Add a helper function which computes the SHA256 for a blob. We will use it to compute identifiers for SSTable compression dictionaries later.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	61316e29df	gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature This feature will guard against writing SSTables containing compression dictionaries before the entire cluster is able to understand them.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	dd932ebb2f	compress: add hidden dictionary options Before this commit, "compression options" written into CompressionInfo.db (and used to construct a decompressor) have a 1:1 correspondence to "compression options" specified in the schema. But we want to add a new "compression option" -- the compression dictionary -- which will be written into CompressionInfo.db and used to construct decompressors, but won't be specified in the schema. To reconcile that, in this commit we introduce the notion of a "hidden option". If an option name in `CompressionInfo.db` begins with a dot, then this option will be used to construct decompressors, but won't be visible for other uses. (I.e. for the `sstable_info` API call and for recovering a fake `schema` from `CompressionInfo.db` in the `scylla sstable` tool). Then, we introduce the hidden `.dictionary.{0,1,2,..}` options, which hold the contents of the dictionary blob for this SSTable. (The dictionary is split into several parts because the SSTable format limits the length of a single option value to 16 bits, and dictionaries usually have a length greater than that). This commit only introduces helpers which translate dictionary blobs into "options" for CompressionInfo.db, and vice-versa, but it doesn't use those helpers yet. They will be used in later commits.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	11be7c0704	compress: remove `compression_parameters::get_compressor()` Following up on the previous commits, we avoid constructing compressors where not necessary, by checking things directly on `compression_parameters` instead.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	006c631642	sstables/compress: remove get_sstable_compressor() Following up on the previous commit, we avoid constructing a compressor in the `sstable_info` API call, and we instead read the compression options from the `sstable::compression`.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	8e611536b0	sstables/compress: move ownership of `compressor` to `sstable::compression` SSTable readers and writers use `compressor` objects to compress and decompress chunks of SSTable data files. `compressor` objects are read-only, so only one of them is needed for each SSTable. Before this commit, each reader and writer has its own `compressor` object. This isn't necessary, but it's okay. But later in this series it will stop being okay, because the creation of a `compressor` will become an expensive cross-shard operation (because it might require sharing a compression dictionary from another shard). So we have to adjust the code so that there is only one `compressor` per sstable, not one per reader/writer. We stuff the ownership of this compressor into `sstable::compression`. To make the ownership clear, we remove `compression_ptr` shared pointers from readers and writers, and make them access the compressor via the `sstable::compression` instead.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	7bdcd5e8c1	compress: remove compressor::option_names() It used to be used by `compression_parameters` validation logic to ask the created `compressor` for compressor-specific option names. Since we no longer delegate this to `compressor`, but we just put the knowledge of those options directly into `compression_parameters`, it's dead code now.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	3b0ab8e1ee	compress: clean up the constructor of zstd_processor Since we now parse and validate the compression level during the construction of `compression_parameters`, we can just pass the structured params to `zstd_processor` instead of passing a raw string map.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	6470035a74	compress: squash zstd.cc into compress.cc Unlike all other implementations of `compressor`, `zstd_processor` has its own special object file and its own special late binding mechanism (via the `class_registry`). It doesn't need either. Let's squash it into `compress.cc`. Keeping `zstd_processor` a separate "module" would require adding even more headers and source files later in the series (when adding dictionaries), and there's no benefit in being so granular. All `compressor` logic can be in `compress.cc` and it will still be small enough. This commit also gets rid of the pointless `class_registry` late binding mechanism and just constructs the `zstd_processor` in `compressor::create()` with a regular constructor call.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	cfe69e057f	sstables/compress: break the dependency of `compression_parameters` on `compressor` Note: this commit is meant to be a code refactoring only and is not intended to change the observable behaviour. Today `schema` contains a `compression_parameters`. `compression_parameters` contains an instance of `compressor`, and SSTable writers just share that instance. This is fine because `compressor` is a stateless object, functionally dependent on the schema. But in later parts of the series, we will break this functional dependency by adding dictionaries to compressors. Two writers for the same schema might have different dictionaries, so they won't be able to just share a single instance contained in the schema. And when that happens, having a `compressor` instance in the `schema`/`compression_parameters` will become awkward, since it won't be actually used. It will be only a container for options. In addition, for performance reasons, we will want to share some pieces of compressors across shards, which will require -- in the general case -- a construction of a compressor to be asynchronous, and therefore not possible inside the constructor of `compression_parameters`. This commit modifies `compression_parameters` so that it doesn't hold or construct instances of `compressor`. Before this patch, the `compressor` instance constructed in `compression_parameters` has an additional role of validating and holding compressor-specific options. (Today the only such option is the zstd compression level). This means that the pieces of logic responsible for compressor-specific options have to be rewritten. That ends up being the bulk of this commit.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	f4ca94d13b	compress.hh: switch compressor::name() from an instance member to a virtual call Before this patch, `compressor` is designed to be a proper abstract class, where the creator of a compressor doesn't even know what he's creating -- he passes a name, and it gets turned into a `compressor` behind the scenes. But later, when creation of compressors will involve looking up dictionaries, this abstraction will only get in the way. So we give up on keeping `compressor` abstract, and instead of using "opaque" names we turn to an explicit enum of possible compressor types. The main point of this patch is to add the `algorithm` enum and the `algorithm_to_name()` function. The rest of the patch switches the `compressor::name()` function to use `algorithm_to_name()` instead of the passed-by-constructor `compressor::_name`, to keep a single source of truth for the names.	2025-04-01 00:07:27 +02:00
Michał Chojnowski	4f634de2e9	bytes: adapt fmt_hex to std::span<const std::byte> This allows us to hexdump things other than `bytes_view`. (That is, without reinterpret_casting them to `bytes_view`, which -- aside from the inconvenience -- isn't quite legal. In contrast, any span can be legally cast to `std::span<const std::byte>`).	2025-04-01 00:07:27 +02:00
Robert Bindar	b647196121	Remove db::config::object_storage_config That map became redundant once we added object_storage_endpoints in the config, this patch removes it and switches all the user code to use the new option. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 17:15:12 +03:00
Gleb Natapov	3abe5de8bf	gossiper: make examine_gossiper private	2025-03-31 16:50:50 +03:00
Gleb Natapov	afdfde8300	gossiper: rename get_nodes_with_host_id to get_node_ip Also change it to return std::optional instead of std::set since now there can be only one ip mapped to an id.	2025-03-31 16:50:50 +03:00
Gleb Natapov	28fb84117d	treewide: drop id parameter from gossiper::for_each_endpoint_state We have it in endpoint_state anyway, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	4609bbbbb2	treewide: move gossiper to index nodes by host id This patch changes gossiper to index nodes by host ids instead of ips. The main data structure that changes is _endpoint_state_map, but this results in a lot of changes since everything that uses the map directly or indirectly has to be changed. The big victim of this outside of the gossiper itself is topology over gossiper code. It works on IPs and assumes the gossiper does the same and both need to be changed together. Changes to other subsystems are much smaller since they already mostly work on host ids anyway.	2025-03-31 16:50:50 +03:00
Gleb Natapov	19ac05b0ba	gossiper: drop ip from replicate function parameters We have it in endpoint_state now, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	c5b8429bec	gossiper: drop ip from apply_new_states parameters We have it in endpoint_state now, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	6da5f541a2	gossiper: drop address from handle_major_state_change parameter list We have it in endpoint_state now, so no need to pass both.	2025-03-31 16:50:50 +03:00
Gleb Natapov	5e06bf76e0	gossiper: pass rpc::client_info to gossiper_shutdown verb handler It will be needed later to obtain host id of the peer.	2025-03-31 16:50:50 +03:00
Gleb Natapov	704580b197	gossiper: add try_get_host_id function The function returns unengaged std::optional if id is not found instead of throwing like get_host_id does.	2025-03-31 16:50:45 +03:00
Tomasz Grabiec	29d1c2adc6	Merge 'Finalize tablet splits earlier' from Lakshmi Narayanan Sreethar Resize finalization is executed in a separate topology transition state, `tablet_resize_finalization`, to ensure it does not overlap with tablet transitions. The topology transitions into the `tablet_resize_finalization` state only when no tablet migrations are scheduled or being executed. If there is a large load-balancing backlog, split finalization might be delayed indefinitely, leaving the tables with large tablets. This PR fixes the issue by updating the load balancer to not schedule any migrations and to not make any repair plans when there is a resize finalization pending in any table. Also added a testcase to verify the fix. Fixes #21762 Improvement : No need to backport. Closes scylladb/scylladb#22148 * github.com:scylladb/scylladb: topology_coordinator: fix indentation in generate_migration_updates topology_coordinator: do not schedule migrations when there are pending resize finalizations load_balancer: make repair plans only when there is no pending resize finalization	2025-03-31 14:42:34 +02:00
Gleb Natapov	6999b474a1	gossiper: add ip to endpoint_state Store endpoint's IP in the endpoint state. Currently it is stored as a key in gossiper's endpoint map, but we are going to change that. The new field is not serialized when endpoint state is sent over rpc, so it is set by the rpc handler from the value in the map that is in the rpc message. This map will not be changed to be host id based to not break interoperability.	2025-03-31 15:42:08 +03:00
Gleb Natapov	9bb2edcae6	serialization: fix std::map de-serializer to not invoke value's default constructor	2025-03-31 15:42:07 +03:00
Gleb Natapov	e5cc3b75f8	gossiper: drop template from wait_alive_helper function Move ip to id translation to the caller.	2025-03-31 15:42:07 +03:00
Gleb Natapov	0dd86b4f1d	gossiper: move get_supported_features and its users to host id	2025-03-31 15:42:07 +03:00
Gleb Natapov	f97bb6922d	storage_service: make candidates_for_removal host id based	2025-03-31 15:42:07 +03:00
Gleb Natapov	82491cec19	gossiper: use peers table to detect address change This requires serializing entire handle_state_normal with a lock since it both reads and updates peers table now (it only updated it before the change). This is not a big deal since most of it is already serialized with token metadata lock. We cannot use it to serialize peers writes as well since the code that removes an endpoint from peers table also removes it from gossiper which causes on_remove notification to be called and it may take the metadata lock as well causing deadlock.	2025-03-31 15:41:44 +03:00
Tomasz Grabiec	6bff596fce	tablets: Make tablet allocation equalize per-shard load Before, it was equalizing per-node load (tablet count), which is wrong in heterogeneous clusters. Nodes with fewer shards will end up with overloaded shards. Refs #23378	2025-03-31 14:34:30 +02:00
Gleb Natapov	1c2a9257e9	storage_service: use std::views::keys instead of std::views::transform that returns a key	2025-03-31 15:25:39 +03:00
Gleb Natapov	a581a99dbf	gossiper: move _pending_mark_alive_endpoints to host id Index _pending_mark_alive_endpoints map by host id instead of ip	2025-03-31 15:25:39 +03:00
Gleb Natapov	555149c153	gossiper: do not allow to assassinate endpoint in raft topology mode It does nothing but harm in raft topology mode.	2025-03-31 15:25:39 +03:00
Gleb Natapov	4cc1c10035	gossiper: fix indentation after previous patch	2025-03-31 15:25:39 +03:00
Gleb Natapov	e8b7aaa0d4	gossiper: do not allow to assassinate non existing endpoint We assume that all endpoint states have HOST_ID set or the host id is available locally, but the assassinate code injects a state without HOST_ID for not existing endpoint violating this assumption.	2025-03-31 15:25:39 +03:00
Botond Dénes	90c20858ed	Merge 'test/database: Remove most of take_snapshot() helper overloads and re-use them more' from Pavel Emelyanov This helper facilitates snapshot creation by various test cases in database_test.cc. This PR generalizes all overloads into one that suits all callers and patches one more test case to use it as well. Closes scylladb/scylladb#23482 * github.com:scylladb/scylladb: test/database: Re-use take_snapshot() helper once more test/database: Remove most of take_snapshot() helper overloads	2025-03-31 15:20:51 +03:00
Benny Halevy	5f2ce0b022	loading_cache_test: test_loading_cache_reload_during_eviction: use manual_clock Rather than lowres_clock, as since `32b7cab917`, loading_cache_for_test uses manual_clock for timing and relying on lowres_clock to time the test might run out of memory on fast test machines. Fixes #23497 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23498	2025-03-31 14:53:06 +03:00
Robert Bindar	e3a3508960	Move `object_storage.yaml` endpoints to `scylla.yaml` This change also removes the `object_storage.yaml` file altogether and adds tests for fetching the endpoints via the `v2/config/object_storage_endpoints` REST api. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-03-31 13:39:39 +03:00
Pavel Emelyanov	ac582efb44	test/database: Re-use take_snapshot() helper once more There's a test case that can call the recently patched take_snapshot() helper as well. This changes nothing, but makes further patching a bit simpler (not in this branch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-31 13:18:06 +03:00
Pavel Emelyanov	7e6380b6bd	test/database: Remove most of take_snapshot() helper overloads There are 3 of those that help tests (re)shuffle cql_test_env/database, skip_flush == true/false options and keyspace/table/snapshot names. There's little sense in having that many of those, just one overload with default arguments suits most of the callers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-31 13:18:06 +03:00
Botond Dénes	ea55eed037	Merge 'Snapshot several tables at once in scrub API handler' from Pavel Emelyanov The scrub API handler may want to snapshot several tables. For that, it calls snapshot-ctl method to snapshot a single table for each table in the list. That's excessive, snapshot-ctl has a method to snapshot a bunch of tables at once, just what the scrub handler needs. It's an improvement, so no need to backport Closes scylladb/scylladb#23472 * github.com:scylladb/scylladb: snapshot-ctl: Remove unused snapshot-single-table method api: Snapshot all tables at once in scrub handler	2025-03-31 13:00:32 +03:00
Piotr Smaron	aff8cbc6f3	CODEOWNERS: remove expired owners Removing krzaq, who's no longer with the company. Removing core-frontend team members from Alternator areas, as it's no longer the domain of this team. Closes scylladb/scylladb#23500	2025-03-31 11:37:51 +03:00
Pavel Emelyanov	0077acd1bb	api: Properly validate table in tablet add\|del replica handlers The handlers in question just go and call database.find_column_family, in case the table in question doesn't exist, the no_such_column_family exception would be thrown, which is not nice. Proper behavior is to throw bad_param one and there's a helper that does it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23389	2025-03-31 10:03:17 +02:00
Andrzej Jackowski	c89d8c6566	cql3: prevent from empty option use in cf_statement::column_family() Implementation of cf_statement::column_family() dereferences _cf_name option without checking if the option is non-empty. On enterprise branch, there is a safeguard that prevents dereferencing such an empty option. Although the current code on master seems to not call column_family() when _cf_name is empty, it is safer to introduce the same workaround on master, to avoid any regression. This change: - Prevent from empty option use in cf_statement::column_family() Fixes: scylla-enterprise#5273 Closes scylladb/scylladb#23366	2025-03-31 09:43:22 +03:00
Michał Chojnowski	e23fdc0799	table: fix a race in table::take_storage_snapshot() `safe_foreach_sstable` doesn't do its job correctly. It iterates over an sstable set under the sstable deletion lock in an attempt to ensure that SSTables aren't deleted during the iteration. The thing is, it takes the deletion lock after the SSTable set is already obtained, so SSTables might get unlinked before we take the lock. Remove this function and fix its usages to obtain the set and iterate over it under the lock. Closes scylladb/scylladb#23397	2025-03-31 09:40:32 +03:00
Avi Kivity	2b9e1e61d0	docs: reader_concurrency_semaphore: document CPU concurrency limit Document the CPU concurrency implemented in `3d816b7c16` and adjusted in `3d12451d1f`. Closes scylladb/scylladb#23404	2025-03-31 09:39:55 +03:00
Dawid Mędrek	b0b0c5905e	test/cluster/test_multidc: Clean up RF-rack-valid keyspaces tests There are some minor things we should fix that are a remnant of the original changes (scylladb/scylladb@7646e14). Closes scylladb/scylladb#23429	2025-03-31 09:38:42 +03:00
David Garcia	1a7be07b8c	docs: renders os-support from json file docs: renders os-support from json file Closes scylladb/scylladb#23436	2025-03-31 09:36:49 +03:00
Marcin Maliszkiewicz	e3f2ebd4fb	cql3: remove not needed cmd copy in indexed_table_select_statement The variable is unused. There should be a tiny perf increase as it saves an allocation. Closes scylladb/scylladb#23473	2025-03-31 09:34:32 +03:00
Avi Kivity	73e4a3c581	sstables: store features early in write path sstable features indicate that an sstable has some extension, or that some bug was fixed. They allow us to know if we can rely on certain properties in a read sstables. Currently, sstable features are set early in the read path (when we read the scylla metadata file) and very late in the write path (when we write the scylla metadata file just before sealing the sstable). However, we happen to read features before we set them in the write path - when we resize the bloom filter for a newly written sstable we instantiate an index reader, and that depends on some features. As a result, we read a disengaged optional (for the scylla metadata component) as if it was engaged. This somehow worked so far, but fails with libstdc++ hash table implementation. Fix it by moving storage of the features to the sstable itself, and setting it early in the write path. Fixes #23484 Closes scylladb/scylladb#23485	2025-03-31 09:33:56 +03:00
Pavel Emelyanov	693387bda6	Merge 'test.py: topology: allow to run tests with bare pytest command' from Evgeniy Naydanov Add possibility to run topology tests using bare pytest command. To achieve this goal the following changes were made: - Add fixtures `testpy_testsuite` and `testpy_test` to `test/conftest.py`. - To build `TestSuite` object we need to discover a corresponding `suite.xml` file. Do this by walking up thru the fs tree starting from the current test file. - Run ScyllaClusterManager using pytest fixture if `--manager-api` option is not provided. And made some refactoring: - Add path constants to `test` module and use them in different test suites instead of own dups of the same code: - TOP_SRC_DIR : ScyllaDB's source code root directory - TEST_DIR : the directory with test.py tests and libs - BUILD_DIR : directory with ScyllaDB's build artifacts - Add TestSuite.log_dir attribute as a ScyllaDB's build mode subdir of a path provided using `--tmpdir` CLI argument. Don't use `tmpdir` name because it gets mixed up with pytest's built-in fixture and `--tmpdir` option itself. - Change default value for `--tmpdir` from `./testlog` to `TOP_SRC_DIR/testlog` - Refactor `ResourceGather` classes to use path from a `test` object instead of providing it separately. - Move modes constants (`all_modes`/`ALL_MODES` and `debug_modes`/`DEBUG_MODES`) to `test` module and remove duplication. - Move `prepare_dirs()` and `start_3rd_party_services()` from `pylib.util` to `pylib.suite.base` to avoid circular imports. - In some places refactor to use f-strings for formatting. Also minor changes related to running with pytest-xdist: - When running tests in parallel we need to ensure that filenames are unique by adding xdist worker ID to them. - Pass random seed across xdist workers using env variable. 
Closes scylladb/scylladb#22960 github.com:scylladb/scylladb: test.py: async_cql: remove unused event_loop fixture test.py: random_failures: make it play well with xdist test.py: add xdist worker ID to log filenames test.py: topology: run tests using bare pytest command test.py: add fixtures for current test suite and test test.py: refactor paths constants and options	2025-03-31 09:30:06 +03:00
Benny Halevy	a4aa4d74c1	test/pylib: servers_add: add auto_rack_dc parameter To quickly populate nodes in a single dc, each node in its own rack. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-30 19:23:40 +03:00
Benny Halevy	c4dbb11c87	test/pylib: servers_add: support list of property_files So that a multi-dc/multi-rack cluster can be populated in a single call. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-30 19:12:39 +03:00
Piotr Smaron	a2bbbc6904	auth: forbid modifying system ks by non-superusers Before this patch, granting a user MODIFY permissions on ALL KEYSPACES allowed the user to write to system tables, where the user could also set himself to "superuser" granting him all other permissions. After this patch, MODIFY permissions on ALL KEYSPACES is limited only to non-system keyspaces. Fixes: scylladb/scylladb#23218 Closes scylladb/scylladb#23219	2025-03-30 16:55:04 +03:00
Ferenc Szili	2c9b312b58	test: port of test and reproducer for resurrection during file based streaming This change ports test/cluster/test_resurrection.py from enterprise to master. Because the underlying issue deals with file based streaming, this test was a part of the enterprise repo. It contains the test and reproducer for the issue described below: When tablets are migrated with file-based streaming, we can have a situation where a tombstone is garbage collected before the data it shadows lands. For instance, if we have a tablet replica with 3 sstables: 1 sstable containing an expired tombstone 2 sstable with additional data 3 sstable containing data which is shadowed by the expired tombstone in sstable 1 If this tablet is migrated, and the sstables are streamed in the order listed above, the first two sstables can be compacted before the third sstable arrives. In that case, the expired tombstone will be garbage collected, and data in the third sstable will be resurrected after it arrives to the pending replica. The fix for the issue was merged in `b66479ea98` This patch only ports the missing test. Closes scylladb/scylladb#23466	2025-03-30 13:39:40 +03:00
Andrzej Jackowski	b8adbcbc84	audit: fix empty query string in BATCH query Function modification_statement::add_raw() is never called, which makes query string in audit_info of batch queries empty. In enterprise branch, add_raw is called in Cql.g and those changes were never merged to master. This changes: - Add missing call of add_raw() to Cql.g - Include other related changes (from PR#3228 in scylla-enterprise) Fixes scylladb#23311 Closes scylladb/scylladb#23315	2025-03-30 13:37:11 +03:00
Michał Chojnowski	79a477ecb6	cmake: add the `-dynamic-linker=...` form to the -dynamic-linker regex On my system (Nix), the compiler produces a `-dynamic-linker=/nix/store/...` in the linker call scanned by get_padded_dynamic_linker_option. But the regex can't deal with the `=` there, it requires a ` `. Fix that. We also do the same in configure.py, and remove the Nix-specific hack which used to disable the entire mechanism. Closes scylladb/scylladb#22308	2025-03-30 11:58:47 +03:00
Kefu Chai	7814f6d374	github: improve seastar bad include check for better developer experience: - add inline annotations using problem matchers, see https://github.com/actions/toolkit/blob/main/docs/problem-matchers.md - use a single step for uploading both output files, because the `path` setting is actually passed to [@actions/glob](https://github.com/actions/toolkit/tree/main/packages/glob), i removed the double quotes and the leading "./" from the paths. - use "::error" workflow command to signify the failure, see https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/workflow-commands-for-github-actions#example-creating-an-annotation-for-an-error Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23310	2025-03-30 11:56:18 +03:00
Evgeniy Naydanov	1a0c14aa50	test.py: async_cql: remove unused event_loop fixture Newer version of pytest-asyncio (0.24.0) allows to control the scope of async loop per fixture. Don't need this workaround anymore.	2025-03-30 03:19:30 +00:00
Evgeniy Naydanov	cac0257914	test.py: random_failures: make it play well with xdist Pass random seed across xdist workers using env variable.	2025-03-30 03:19:30 +00:00
Evgeniy Naydanov	9bba59631f	test.py: add xdist worker ID to log filenames When running tests in parallel we need to ensure that filenames are unique by adding xdist worker ID to them.	2025-03-30 03:19:30 +00:00
Evgeniy Naydanov	9cb0ec2b42	test.py: topology: run tests using bare pytest command Run ScyllaClusterManager using pytest fixture if `--manager-api` option is not provided. At this stage we're trying to stay as close to test.py as possible. test.py runs tests file-by-file, so, effectively, scopes `session`, `package`, and `module` are pretty much the same. Also, test.py starts ScyllaClusterManager for every test module and this is the reason why fixture `manager_api_sock_path` has scope=`module`. As a result, we need to change the scope for fixture `manager_internal` too.	2025-03-30 03:19:29 +00:00
Evgeniy Naydanov	42075170d1	test.py: add fixtures for current test suite and test Add fixtures `testpy_testsuite` and `testpy_test` to `test/conftest.py` To build TestSuite object we need to discover a corresponding `suite.xml` file. Do this by walking up thru the fs tree starting from the current test file.	2025-03-30 03:19:29 +00:00
Evgeniy Naydanov	c4ae4e247a	test.py: refactor paths constants and options Add path constants to `test` module and use them in different test suites instead of own dups of the same code: - TOP_SRC_DIR : ScyllaDB's source code root directory - TEST_DIR : the directory with test.py tests and libs - BUILD_DIR : directory with ScyllaDB's build artefacts Add TestSuite.log_dir attribute as a ScyllaDB's build mode subdir of a path provided using `--tmpdir` CLI argument. Don't use `tmpdir` name because it gets mixed up with pytest's built-in fixture and `--tmpdir` option itself. Change default value for `--tmpdir` from `./testlog` to `TOP_SRC_DIR/testlog` Refactor `ResourceGather*` classes to use path from a `test` object instead of providing it separately. Move modes constants to `test` module and remove duplications. Move `prepare_dirs()` and `start_3rd_party_services()` from `pylib.util` to `pylib.suite.base` to avoid circular imports (with little refactoring to use `pathlib.Path` instead of `str` as paths.) Also, in some places refactor to use f-strings for formatting.	2025-03-30 03:19:29 +00:00
Michał Jadwiszczak	0ee0696959	test/cqlpy/test_service_level_api: update to service levels on raft and remove flakiness Tests in `test_service_level_api` were written before scylladb/scylladb#16585 and they were doing 10s sleeps to wait for service level controller to update its configuration. Now performing a read barrier is sufficient to ensure SL configuration is up-to-date, which significantly reduces tests time (from ~60s to ~2-3s). Moreover, there was flakiness in the `test_switch_tenants` test. Until now, the test waited up to 60s for the connections to update their scheduling groups. However, it is difficult to determine how long the process might take because a connection may be blocked while waiting for the next request to be processed, and the scheduling group will be updated only after a request is processed (see `generic_server::connection::process_until_tenant_switch()`). To address this issue, 100 simple queries are executed so that connections on all shards process at least one request and update their scheduling groups. Fixes scylladb/scylladb#22768 Closes scylladb/scylladb#23381	2025-03-28 17:14:21 +03:00
Pavel Emelyanov	9aa986a49a	snapshot-ctl: Remove unused snapshot-single-table method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-28 10:45:31 +03:00
Pavel Emelyanov	5162f75d0b	api: Snapshot all tables at once in scrub handler The handler walks the list of tables and snapshots each one individually (if needed). That's not very optimal, each such call starts a "snapshot modification operation", which is switching to shard-0 for a lock, then calls the snapshot of multiple tables giving it vector of a single name. There's a method of snapshot-ctl that snapshots several tables at once, no need to open-code it here. One thing to care about -- the take_column_family_snapshot() throws when the vector of table names is empty, so need an explicit skipping check. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-28 10:44:47 +03:00
Avi Kivity	6d7cb68aab	test: ldap: avoid io_uring Seastar reactor backend It tends to fail sometimes with ENOMEM: ``` ERROR 2025-03-24 01:05:22,983 [shard 0:sl:d] ldap_role_manager - error in reconnect: std::system_error (error C-Ares:4, server.that.will.never.exist.scylladb.com: Not found) ERROR 2025-03-24 01:05:30,984 [shard 0:sl:d] ldap_role_manager - error in reconnect: std::system_error (error C-Ares:4, server.that.will.never.exist.scylladb.com: Not found) ERROR 2025-03-24 01:05:47,123 [shard 0:main] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:12, Cannot allocate memory) ERROR 2025-03-24 01:05:47,139 [shard 0:main] table - failed to write sstable /scylladir/testlog/x86_64/debug/scylla-33787f64/system_schema/view_virtual_columns-08843b6345dc3be29798a0418295cfaa/me-3got_1s5n_0lfls1y4z7vkkts07a-big-Data.db: storage_io_error (Storage I/O error: 12: Cannot allocate memory) ERROR 2025-03-24 01:05:47,140 [shard 0:main] table - Memtable flush failed due to: storage_io_error (Storage I/O error: 12: Cannot allocate memory). 
Aborting, at 0x30f5605 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x4514f14 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x4514b96 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x45165b1 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x4518dcf 0x3fde842 0x35dc5c6 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36c26ed /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36cdd0c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d2cd2 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d0e56 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327f47a /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327c8f0 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1cdd4 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c79c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c69c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c184 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x34b2674 0x314b8b6 /lib64/libc.so.6+0x70ba7 /lib64/libc.so.6+0xf4b8b -------- seastar::internal::coroutine_traits_base<void>::promise_type -------- seastar::internal::coroutine_traits_base<void>::promise_type -------- seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)> >(seastar::noncopyable_function<seastar::future<void> 
(seastar::future<void>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> -------- seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)> >(seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, seastar::noncopyable_function<seastar::future<void> (seastar::future<void>&&)>&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> -------- seastar::shared_future<>::shared_state Aborting on shard 0, in scheduling group main. Backtrace: 0x30f5605 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x384a0e4 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x3849db2 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x369bd84 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d42a2 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x37a5ed9 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x37a61d5 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x37a601f /lib64/libc.so.6+0x1a04f /lib64/libc.so.6+0x72b53 /lib64/libc.so.6+0x19f9d /lib64/libc.so.6+0x1941 0x3fde8b1 0x35dc5c6 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36c26ed /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36cdd0c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d2cd2 
/jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x36d0e56 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327f47a /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x327c8f0 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1cdd4 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c79c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c69c /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar_testing.so+0x1c184 /jenkins/workspace/scylla-master/next/scylla/build/debug/seastar/libseastar.so+0x34b2674 0x314b8b6 /lib64/libc.so.6+0x70ba7 /lib64/libc.so.6+0xf4b8b === TEST.PY SUMMARY START === Test exited with code -6 === TEST.PY SUMMARY END === === decoded === Backtrace: [Backtrace #0] __interceptor_backtrace at /mnt/clang_build/llvm-project-x86_64/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:4369 void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/debug/seastar/./seastar/include/seastar/util/backtrace.hh:70 seastar::backtrace_buffer::append_backtrace() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:805 seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:838 seastar::print_with_backtrace(char const, bool) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:850 seastar::sigabrt_action() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:4004 seastar::install_oneshot_signal_handler<6, (void ()())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t, void)#1}::operator()(int, siginfo_t, void) const at 
./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3981 seastar::install_oneshot_signal_handler<6, (void ()())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t, void)#1}::__invoke(int, siginfo_t, void) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3976 /lib64/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=c8c3fa52aaee3f5d73b6fd862e39e9d4c010b6ba, for GNU/Linux 3.2.0, not stripped ?? ??:0 printf_positional at ??:? ?? ??:0 ?? ??:0 replica::table::seal_active_memtable(replica::compaction_group&, replica::flush_permit&&)::$_0::operator()(std::function<seastar::future<void> ()>) const at ././replica/table.cc:1512 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:122 seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:2616 seastar::reactor::run_some_tasks() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3088 seastar::reactor::do_run() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3256 seastar::reactor::run() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3146 seastar::app_template::run_deprecated(int, char, std::function<void ()>&&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/app-template.cc:276 seastar::app_template::run(int, char, std::function<seastar::future<int> ()>&&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/app-template.cc:167 seastar::testing::test_runner::start_thread(int, char)::$_0::operator()() at 
./build/debug/seastar/./build/debug/seastar/./seastar/src/testing/test_runner.cc:77 void std::__invoke_impl<void, seastar::testing::test_runner::start_thread(int, char)::$_0&>(std::__invoke_other, seastar::testing::test_runner::start_thread(int, char)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61 std::enable_if<is_invocable_r_v<void, seastar::testing::test_runner::start_thread(int, char)::$_0&>, void>::type std::__invoke_r<void, seastar::testing::test_runner::start_thread(int, char)::$_0&>(seastar::testing::test_runner::start_thread(int, char)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:111 std::_Function_handler<void (), seastar::testing::test_runner::start_thread(int, char)::$_0>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290 seastar::posix_thread::start_routine(void) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/posix.cc:90 asan_thread_start(void*) at /mnt/clang_build/llvm-project-x86_64/compiler-rt/lib/asan/asan_interceptors.cpp:239 __vfscanf_internal at :? peek_token at ??:? ``` In `ce65164315`, we banned io_uring from tests, but missed the ldap tests. This extends coverage to ldap tests. I verified that the new options indeed reach the test. Refs #23411. Credit to Botond for recognizing the failure reason. Closes scylladb/scylladb#23422	2025-03-28 07:45:53 +02:00
Tomasz Grabiec	d6232a4f5f	tablets: load_balancer: Fix reporting of total load per node Load is now utilization, not count, so we should report average per-shard load, which is equivalent to node's utilization.	2025-03-27 23:28:20 +01:00
Botond Dénes	bd8973a025	tools/scylla-nodetool: s/GetInt()/GetInt64()/ GetInt() was observed to fail when the integer JSON value overflows the int32_t type, which `GetInt()` uses for storage. When this happens, rapidjson will assign a distinct 64 bit integer type to the value, and attempting to access it as 32 bit integer triggers the wrong-type error, resulting in assert failure. This was hit in the field, where invoking nodetool netstats resulted in nodetool crashing when the streamed bytes amounts were higher than maxint. To avoid such bugs in the future, replace all usage of GetInt() in nodetool with GetInt64(), just to be sure. A reproducer is added for the nodetool netstats crash. Fixes: scylladb/scylladb#23394 Closes scylladb/scylladb#23395	2025-03-27 14:05:39 +02:00
Botond Dénes	d57e71837f	Merge 'Improve scoped restore test' from Pavel Emelyanov This PR includes several fixes to the nowadays flaky test_restore_with_streaming_scopes test. 1. Check that backup and restore APIs don't fail. Currently, if either of them does, the test case fails anyway when checking that the data is not restored back, but it's better to know what exactly failed 2. For restore API the test collects the list of sstables to restore from. Currently collecting this list races with background compaction and sometimes causes the restore API to fail which, in turn, makes the whole test fail 3. Add a test case that validates that restore-from-missing-sstable fails nicely refs: #23189 No backport, as it's a relatively new test Closes scylladb/scylladb#23445 * github.com:scylladb/scylladb: test/backup: Validate that restoring from non-existing sstables fails test/backup: Collect sstables names after snapshot test/backup: Check that backup and restore succeed	2025-03-27 13:23:41 +02:00
Piotr Dulikowski	288216a89e	Merge 'Ignore wrapped exceptions `gate_closed_exception` and `rpc::closed_error` when node shuts down.' from Sergey Zolotukhin Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error` in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped in a `nested_exception`, an error message is printed, causing tests to fail. This commit adds handling for nested exceptions in this case to prevent unnecessary error messages. Fixes scylladb/scylladb#23325 Fixes scylladb/scylladb#23305 Fixes scylladb/scylladb#21815 Backport: looks like this is quite a frequent issue, therefore backport to 2025.1. Closes scylladb/scylladb#23336 * github.com:scylladb/scylladb: database: Pass schema_ptr as const ref in `wrap_commitlog_add_error` database: Unify exception handling in `do_apply` and `apply_with_commitlog` storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down. exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.	2025-03-27 11:39:42 +01:00
Pavel Emelyanov	9f036d957a	Merge 'test/clqpy/test_tool.py: get_sstables_for_table(): exclude non-sealed sstables' from Botond Dénes Filter out sstables which don't have a TOC or have a temporary TOC. Such sstables are incomplete and can disappear if the compaction which writes them is interrupted. Fixes: #23203 This PR fixes a flaky test which is only on master, no backports required. Closes scylladb/scylladb#23450 * github.com:scylladb/scylladb: test/cqlpy/test_tools.py: test_scylla_sstable_query: reduce scope of no-compaction context test/clqpy/test_tool.py: get_sstables_for_table(): exclude non-sealed sstables	2025-03-27 09:45:07 +03:00
Tomasz Grabiec	8e506c5a8f	test: tablets: Fix flakiness due to ungraceful shutdown The test fails sporadically with: cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1} That's because a server is stopped in the middle of the workload. The server is stopped ungracefully which will cause some requests to time out. We should stop it gracefully to allow in-flight requests to finish. Fixes #20492 Closes scylladb/scylladb#23451	2025-03-27 09:44:07 +03:00
Lakshmi Narayanan Sreethar	dccce670c1	topology_coordinator: fix indentation in generate_migration_updates Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-03-27 10:16:34 +05:30
Lakshmi Narayanan Sreethar	5b47d84399	topology_coordinator: do not schedule migrations when there are pending resize finalizations Resize finalization is executed in a separate topology transition state, `tablet_resize_finalization`, to ensure it does not overlap with tablet transitions. The topology transitions into the `tablet_resize_finalization` state only when no tablet migrations are scheduled or being executed. If there is a large load-balancing backlog, split finalization might be delayed indefinitely, leaving the tables with large tablets. To fix this, do not schedule tablet migrations on any tables when there are pending resize finalizations. This ensures that migrations from the same table and other unrelated tables do not block resize finalization. Also added a testcase to verify the fix. Fixes #21762 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-03-27 10:16:34 +05:30
Lakshmi Narayanan Sreethar	8cabc66f07	load_balancer: make repair plans only when there is no pending resize finalization Do not make repair plans if any table has pending resize finalization. This is to ensure that the finalization doesn't get delayed by repair tasks. Refs #21762 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-03-27 10:16:34 +05:30
Avi Kivity	b292b5800b	Merge 'test.py: move starting LDAP service to dedicate method' from Andrei Chekun Move starting LDAP to the method where the rest of the services are started. This will unify the way of starting the 3rd party services. Fix LDAP tests flakiness caused by failures to connect to the LDAP server. Add catching stdout and stderr of toxiproxy-cli in case of errors Related: https://github.com/scylladb/scylladb/pull/23333 This PR is based on https://github.com/scylladb/scylladb/pull/23221, so #23221 should be merged first. Closes scylladb/scylladb#23235 * github.com:scylladb/scylladb: test.py: Refactor nodetool/conftest test.py: Refactor test/pylib/cpp/ldap test.py: move starting LDAP service to dedicate method	2025-03-26 15:31:00 +02:00
Botond Dénes	801339bad9	test/cqlpy/test_tools.py: test_scylla_sstable_query: reduce scope of no-compaction context To just system.local, the table these tests operate on. No need to disable autocompaction for all of the system keyspace.	2025-03-26 09:19:38 -04:00
Botond Dénes	3ec863c4ce	test/clqpy/test_tool.py: get_sstables_for_table(): exclude non-sealed sstables Filter out sstables which don't have a TOC or have a temporary TOC. Such sstables are incomplete and can disappear if the compaction which writes them is interrupted.	2025-03-26 09:18:34 -04:00
Pavel Emelyanov	1da889f239	Merge 'Allow abort during join_cluster' from Benny Halevy Bootstrap or replace can take a long time, but since `feef7d3fa1`, the stop_signal is checked only in checkpoints, and in particular, abort isn't requested during join_cluster. Fixes #23222 * requires backport on top of https://github.com/scylladb/scylladb/pull/23184 Closes scylladb/scylladb#23306 * github.com:scylladb/scylladb: main: allow abort during join_cluster main: add checkpoint before joining cluster storage_service: add start_sys_dist_ks	2025-03-26 15:48:58 +03:00
Sergey Zolotukhin	d448f3de77	database: Pass schema_ptr as const ref in `wrap_commitlog_add_error`	2025-03-26 11:15:26 +01:00
Sergey Zolotukhin	0d9d0fe60e	database: Unify exception handling in `do_apply` and `apply_with_commitlog` Move exception wrapping logic from `do_apply` and `apply_with_commitlog` to `wrap_commitlog_add_error` to ensure consistent error handling.	2025-03-26 11:15:18 +01:00
Sergey Zolotukhin	b1e89246d4	storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down. Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error` in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped in a `nested_exception`, an error message is printed, causing tests to fail. This commit adds handling for nested exceptions in this case to prevent unnecessary error messages. Fixes scylladb/scylladb#23325	2025-03-26 11:15:16 +01:00
Sergey Zolotukhin	6abfed9817	exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.	2025-03-26 11:15:13 +01:00
Evgeniy Naydanov	574c81eac6	test.py: random_failures: deselect topology ops for some injections After recent changes, #18640 and #19151 started to reproduce for the stop_after_sending_join_node_request and stop_after_bootstrapping_initial_raft_configuration error injections too. The solution is the same: deselect the tests. Fixes #23302 Closes scylladb/scylladb#23405	2025-03-26 12:07:12 +03:00
Pavel Emelyanov	38f37763d6	test/backup: Validate that restoring from non-existing sstables fails When restore API is called and is given a non-existing sstable (object name) the task should complete with failed status and some meaningful message in the error text. refs: #23189 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-26 10:55:42 +03:00
Pavel Emelyanov	02610a9072	test/backup: Collect sstables names after snapshot The scoped restore test works like this - populate table - flush it - collect list of sstables - take snapshot - backup - restore (with the list of sstables as argument) - check the data is back Steps 2 and 3 are racy -- in case compaction comes in the middle, the list of collected sstables would differ from those snapshotted (and backed up) which will later lead to restore failure due to missing sstable. Fix by collecting the list of sstables after taking snapshot, and collect those not from the datadir, but from the snapshot dir. fixes: #23189 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-26 10:40:54 +03:00
Pavel Emelyanov	08004fe470	test/backup: Check that backup and restore succeed The scoped-restore test calls backup and restore APIs on several nodes, but doesn't check if any of the operations actually succeeds. Sometimes they indeed don't and test captures this, but in a weird manner -- the post-test check for data presence fails, because the expected data is not in fact in its place. It's more debugging-friendly if we know in advance if backup or restore fails, rather than see that some data is missing after (failed) restore. refs: #23189 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-25 19:45:56 +03:00
Gleb Natapov	0aa4a82c83	messaging_service: do not call uninitialized _address_to_host_id_mapper std::function During messaging_service object creation remove_rpc_client function may be called if prefer_local snitch setting is true. The caller does not provide host id, so _address_to_host_id_mapper is called to obtain it, but at this point the function is not initialized yet. The patch fixes the code to not call the function if not initialized. This is not a problem since during messaging_service creation there is no connection to drop. Fixes: #23353 Message-ID: <Z-J2KbBK8NoFNYZZ@scylladb.com>	2025-03-25 18:41:16 +02:00
Wojciech Mitros	88d3fc68b5	alter_table_statement: fix renaming multiple columns in tables with views When we rename columns in a table which has materialized views depending on it, we need to also rename them in the materialized views' WHERE clauses. Currently, we do that by creating a new WHERE clause after each rename, with the updated column. This is later converted to a mutation that overwrites the WHERE clause. After multiple renames, we have multiple mutations, each overwriting the WHERE clause with one column renamed. As a result, the final WHERE clause is one of the modified clauses with one column renamed. Instead, we should prepare one new WHERE clause which includes all the renamed columns. This patch accomplishes this by processing all the column renames first, and only preparing the new view schema with the new WHERE clause afterwards. This patch also includes a test reproducer for this scenario. Fixes scylladb/scylladb#22194 Closes scylladb/scylladb#23152	2025-03-25 09:58:58 +01:00
Benny Halevy	9fac0045d1	boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 15:39:53 +02:00
Benny Halevy	62aeba759b	tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option `tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces, like `tablets_mode_for_new_keyspaces=enabled`. However, it does not allow opting out when creating new keyspaces by setting `tablets = {'enabled': false}`. Refs scylladb/scylla-enterprise#4355 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 15:32:16 +02:00
Benny Halevy	c62865df90	db/config: add tablets_mode_for_new_keyspaces option The new option deprecates the existing `enable_tablets` option. It will be extended in the next patch with a 3rd value: "enforced" which will enable tablets by default for new keyspaces but without the possibility to opt out using the `tablets = {'enabled': false}` keyspace schema option. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 14:54:45 +02:00
Michael Litvak	49b8cf2d1d	storage_service: fix tablet split of materialized views This fixes an issue where materialized view tablets are not split because they are not registered as split candidates by the storage service. The code in storage_service::replicate_to_all_cores was changed in `4bfa3060d0` to handle normal tables and view tables separately, but with that change register_tablet_split_candidate is applied only to normal tables and not every table like before. We fix it by registering view tables as well. We add a test to verify that split of MV tables works. Closes scylladb/scylladb#23335	2025-03-24 08:23:58 +01:00
Pavel Emelyanov	79b9626d16	Merge 'service: do not include unused headers ' from Kefu Chai these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. also, updated the "iwyu.yaml" (short for include what you use) workflow to include "service" and "raft" subdirectories to prevent future regressions of including unused headers in them. --- it's a cleanup, hence no need to backport. Closes scylladb/scylladb#23373 * github.com:scylladb/scylladb: .github: add "raft" and "service" subdirectories to CLEANER_DIR service: do not include unused headers	2025-03-24 10:20:15 +03:00
Avi Kivity	cc5fe542ed	test: ignore unused fmt::to_string() result fmt 11.1 apparently marks to_string() as [[nodiscard]]. Here we aren't interested in the result, so explicitly ignore it to avoid an error. Closes scylladb/scylladb#23403	2025-03-24 10:19:09 +03:00
Avi Kivity	9d49c3254f	install-dependencies.sh: disambiguate python magic package There are in fact two python magic packages: file-magic (that binds to libmagic and comes from the file package), and magic, an independent one. The name we use in install-dependencies.sh, python3-magic, resolves to file-magic. In Fedora 42, the resolution from the name python3-magic to file-magic was removed [1], and so install-dependencies.sh now tries to install the wrong magic package, which turns out not to coexist with the one we want anyway. Fix by naming python3-file-magic directly instead. Since this is what's installed in the current frozen toolchain, there's no need to regenerate it; we're just making the package list work in Fedora 42. [1] `81910b7d88` Closes scylladb/scylladb#23402	2025-03-24 10:18:27 +03:00
Avi Kivity	cd04ab1a4e	test: avoid spaces when defining user-defined literal operator Clang 20 complains when it sees a user-defined literal operator defined with a space before the underscore. Assume it's adhering to the standard and comply. Closes scylladb/scylladb#23401	2025-03-24 10:17:12 +03:00
Pavel Emelyanov	d436fb8045	Merge 'Fix EAR not applied on write to S3 (but on read).' from Calle Wilund Fixes #23225 Fixes #23185 Adds a "wrap_sink" (with default implementation) to sstables::file_io_extension, and moves extension wrapping of file and sink objects to storage level. (Wrapping/handling on sstable level would be problematic, because for file storage we typically re-use the sstable file objects for sinks, whereas for S3 we do not). This ensures we apply encryption on both read and write, whereas we previously only did so on read -> fail. Adds io wrapper objects for adapting file/sink for default implementation, as well as a proper encrypted sink implementation for EAR. Unit tests for io objects and a macro test for S3 encrypted storage included. Closes scylladb/scylladb#23261 * github.com:scylladb/scylladb: encryption: Add "wrap_sink" to encryption sstable extension encrypted_file_impl: Add encrypted_data_sink sstables::storage: Move wrapping sstable components to storage provider sstables::file_io_extension: Add a "wrap_sink" method. sstables::file_io_extension: Make sstable argument to "wrap" const utils: Add "io-wrappers", useful IO helper types	2025-03-24 10:12:46 +03:00
Artsiom Mishuta	8bb6414037	test.py: reuse clusters in Python suite PR https://github.com/scylladb/scylladb/pull/22274 was introduced due to CI instability and was meant to mark the cluster dirty after each test for topology. But in fact, it affects only Python suites that are quite stable, and CI was stabilized by PR https://github.com/scylladb/scylladb/pull/22252 This PR brings back cluster reuse in Python test suites Closes scylladb/scylladb#23179	2025-03-23 20:08:36 +02:00
Kefu Chai	fdc5255eb8	build: disable DPDK for all release builds Previously, DPDK was enabled by default in standard release builds but disabled in "release-pgo" and "release-cs-pgo" builds. This inconsistency caused linking warnings during PGO phase 2, when trained profiles from non-DPDK builds were used with DPDK-enabled builds: ``` [1980/1983] LINK build/release/scylla ld.lld: warning: /home/avi/scylla-maint/build/release/seastar/libseastar.a(reactor.cc.o at 57829248): function control flow change detected (hash mismatch) _ZN7seastar7reactor14run_some_tasksEv Hash = 2095857468992035112 up to 0 count discarded ld.lld: warning: /home/avi/scylla-maint/build/release/seastar/libseastar.a(reactor.cc.o at 57829248): function control flow change detected (hash mismatch) _ZN7seastar7reactor6do_runEv Hash = 2184396189398169723 up to 50134372 count discarded ld.lld: warning: /home/avi/scylla-maint/build/release/seastar/libseastar.a(reactor.cc.o at 57829248): function control flow change detected (hash mismatch) _ZN7seastar18syscall_work_queue11submit_itemESt10unique_ptrINS0_9work_itemESt14default_deleteIS2_EE Hash = 1533150042646546219 up to 1979931 count discarded ``` Since DPDK is not used in production and increases build time, this change disables DPDK across all release build types. This both silences the warnings and improves build performance. Fixes #23323 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23391	2025-03-23 15:26:10 +02:00
Avi Kivity	9adfb91f46	Merge 'Introduce s3 data_source_impl for optimized object streaming' from Pavel Emelyanov Currently, to stream data from sstable component the sstables code uses file_data_source_impl. In case the component is on S3, the s3::readable_file is put into that data source. The data source is configured with 128k buffers and at most 4 read-ahead-s. With that configuration, downloading full object from S3 becomes too slow -- GET-ing file with 128k requests is not nice even with 4 parallel read-ahead-s. Better solution for S3 downloading is to request way larger chunk with one GET and then produce smaller, 128k or alike, buffers upon data arrival. This is what the newly introduced data source impl does -- it spawns a background GET and lets the upper input stream read buffers directly from the arriving body. This PR doesn't yet make sstable layer use the new sink, just introduces it and adds unit and perf tests. Testing \|Test\|Download speed, MB/s\| \|-\|-\| \|file_input_stream (), 1 socket \| 4.996\| \|file_input_stream (), 2 sockets \| 9.403\| \|s3_data_source (*) \| 93.164\| () The file_input_stream test renders 128k GETs and is configured to issue at most 4 read-ahead-s (*) The s3_data_source uses at most 1 socket regardless of what perf-test configures it to refs: #22458 Closes scylladb/scylladb#22907 github.com:scylladb/scylladb: test: Extend s3-perf test with stream download one test/perf: Tune-up s3 test options parsing test: Add unit test for newly introduced download source s3/client: Introduce data_source_impl for object downloading s3/client: Detach format_range_header() helper	2025-03-23 14:22:04 +02:00
Pavel Emelyanov	ca3b604afa	test: Extend s3-perf test with stream download one Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-21 12:01:07 +03:00
Pavel Emelyanov	283e8e0706	test/perf: Tune-up s3 test options parsing Rename the `--upload bool` into `--operation string` one, so that new tests can be added in the future. Also rename run_download() to run_contiguous_get() because this is what the internals of this method do -- just GET contiguous ranges sequentially. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-21 12:01:07 +03:00
Pavel Emelyanov	bd313c581f	test: Add unit test for newly introduced download source Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-21 12:01:06 +03:00
Pavel Emelyanov	1f301b1c5d	s3/client: Introduce data_source_impl for object downloading The new data source implementation runs a single GET for the whole range specified and lends the body input_stream for the upper input_stream's get()-s. Eventually, getting the data from the body stream EOFs or fails. In either case, the existing body is closed and a new GET is spawned with the updated Range header so as not to include the bytes read so far. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-21 12:01:06 +03:00
Pavel Emelyanov	d47719f70e	s3/client: Detach format_range_header() helper The get_object_contiguous() formats the 'bytes=X-Y' one for its GET request. The very same code will be needed by next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-21 12:01:06 +03:00
Avi Kivity	7646e1448a	Merge 'cql3: Introduce RF-rack-valid keyspaces' from Dawid Mędrek This PR is an introductory step towards enforcing RF-rack-valid keyspaces in Scylla. The scope of changes: * defining RF-rack-valid keyspaces, * introducing a configuration option enforcing RF-rack-valid keyspaces, * restricting the CREATE and ALTER KEYSPACE statements so that they never lead to RF-rack invalid keyspaces, * during the initialization of a node, it verifies that all existing keyspaces are RF-rack-valid. If not, the initialization fails. We provide tests verifying that the changes behave as intended. --- Note that there are a number of things that still need to be implemented. That includes, for instance, restricting topology operations too. --- Implementation strategy (going beyond the scope of this PR): 1. Introduce the new configuration option `rf_rack_valid_keyspaces`. 2. Start enforcing RF-rack-validity in keyspaces if the option is enabled. 3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests. 4. Once the tests have been adjusted, change the default value of the option to enabled. 5. Stop explicitly enabling the option in tests. 6. Get rid of the option. --- Fixes scylladb/scylladb#20356 Fixes scylladb/scylladb#23276 Fixes scylladb/scylladb#23300 --- Backport: this is part of the requirements for releasing 2025.1. Closes scylladb/scylladb#23138 * github.com:scylladb/scylladb: main: Refuse to start node when RF-rack-invalid keyspace exists cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces db/config: Introduce RF-rack-valid keyspaces	2025-03-20 19:10:36 +02:00
Paweł Zakrzewski	0d14177409	audit/syslog: escape quotes and add explicit section names Before this change we outputted CSV-like structure, that looked like the following: Feb 27 12:31:30 scylla-audit: "10.200.200.41:0", "AUTH", "", "", "", "", "10.200.200.41:0", "cassandra", "false" While this is passably readable for humans, the ordering of fields is not clear and can be confusing. Furthermore, the `"` character (double quote) was not escaped. This is not an issue for CQL, but will be a problem for auditing Alternator, which will require logging JSON payloads. The new format will consist of key=value pairs and will escape the quote character, making it easy to parse programmatically. Feb 28 02:21:56 scylla-audit: node="10.200.200.41:0", category="AUTH", cl="", error="false", keyspace="", query="", client_ip="10.200.200.41:0", table="", username="cassandra" This is required for the auditing alternator feature. Closes scylladb/scylladb#23099	2025-03-20 19:55:51 +03:00
Calle Wilund	5c6337b887	encryption: Add "wrap_sink" to encryption sstable extension Creates a more efficient data_sink wrapper for encrypted output stream (S3).	2025-03-20 14:54:24 +00:00
Calle Wilund	9ac9813c62	encrypted_file_impl: Add encrypted_data_sink Adds a sibling type to encrypted file, a data_sink, that will write a data stream in the same block format as a file object would. Including end padding. For making encrypted data sink writing less cumbersome.	2025-03-20 14:54:24 +00:00
Calle Wilund	e02be77af7	sstables::storage: Move wrapping sstable components to storage provider Fixes #23225 Fixes #23185 Moved wrapping component files/sinks to storage provider. Also ensures to wrap data_sinks as well as actual files. This ensures that we actually write encryption if active.	2025-03-20 14:54:24 +00:00
Calle Wilund	d46dcbb769	sstables::file_io_extension: Add a "wrap_sink" method. Similar to wrap file, should wrap a data_sink (used for sstable writers), in obvious write-only, simple stream mode. Default impl will detect if we wrap files for this component, and if so, generate a file wrapper for the input sink, wrap this, and then wrap it in a file_data_sink_impl. This is obviously not efficient, so extensions used in actual non-test code should implement the method.	2025-03-20 14:54:22 +00:00
Calle Wilund	e100af5280	sstables::file_io_extension: Make sstable argument to "wrap" const This matches the signature of call sites. Since the only "real" extension to actually make a marker in the sstable will do so in the scylla component, which is writable even in a const sstable, this is ok.	2025-03-20 14:54:09 +00:00
Calle Wilund	98a6d0f79c	utils: Add "io-wrappers", useful IO helper types Mainly to add a somewhat functional file-impl wrapping a data_sink. This can implement a rudimentary, write-only, file based on any output sink. For testing, and because they fit there, place memory sink and source types there as well.	2025-03-20 14:54:09 +00:00
David Garcia	209ea2ea27	docs: update issues label Closes scylladb/scylladb#23304	2025-03-20 17:46:58 +03:00
Kefu Chai	c37149d106	test: stop using seastar::at_exit() seastar::at_exit() was marked deprecated recently. so let's use the recommended approach to perform cleanups. The following tests were updated in this change - scylla perf-tablets: tested with scylla perf-tablets - scylla perf-row-cache-update: tested with scylla perf-row-cache-update - scylla perf-fast-forward: tested with scylla perf-fast-forward --populate --run-tests small-partition-skips \ --smp 1 scylla perf-fast-forward --run-tests small-partition-skips \ --smp 1 - scylla perf-load-balancing: tested with scylla perf-load-balancing --nodes 3 --tablets1 16 --tablets2 16 --rf1 3 --rf2 3 --shards 16 - unit/row_cache_stress_test: tested with row_cache_stress_test --seconds 10 - perf/perf_cache_eviction: tested with ./perf_cache_eviction --seconds 1 --smp 1 - perf/perf_row_cache_reads: tested with ./perf_row_cache_reads Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23356	2025-03-20 17:44:57 +03:00
Ernest Zaslavsky	2fb5c7402e	s3_client: Rearrange credentials providers chain As the IAM role is not configured to assume a role at this moment, it makes sense to move the instance metadata credentials provider up in the chain. This avoids unnecessary network calls and prevents log clutter caused by failure messages. Closes scylladb/scylladb#23360	2025-03-20 17:43:04 +03:00
Pavel Emelyanov	23089e1387	Merge 'Enhance S3 client robustness' from Ernest Zaslavsky This PR introduces several key improvements to bolster the reliability of our S3 client, particularly in handling intermittent authentication and TLS-related issues. The changes include: 1. Automatic Credential Renewal and Request Retry: When credentials expire, the new retry strategy now resets the credentials and sets the client to the retryable state, so the client will re-authenticate, and automatically retry the request. This change prevents transient authentication failures from propagating as fatal errors. 2. Enhanced Exception Unwrapping: The client now extracts the embedded std::system_error from std::nested_exception instances that may be raised by the Seastar HTTP client when using TLS. This allows for more precise error reporting and handling. 3. Expanded TLS Error Handling: We've added support for retryable TLS error codes within the std::system_error handler. This modification enables the client to detect and recover from transient TLS issues by retrying the affected operations. Together, these enhancements improve overall client robustness by ensuring smoother recovery from both credential and TLS-related errors. No backport needed since it is an enhancement Closes scylladb/scylladb#22150 * github.com:scylladb/scylladb: aws_error: Add GNU TLS codes s3_client: Handle nested std::system_error exceptions s3_client: Start using new retry strategy retry_strategy: Add custom retry strategy for S3 client retry_strategy: Make `should_retry` awaitable	2025-03-20 16:52:20 +03:00
Andrei Chekun	502b31d9c2	test.py: Refactor nodetool/conftest Remove using method for finding root dir of the project and start using the constant defined in package.	2025-03-20 11:41:30 +01:00
Andrei Chekun	1ea7b99385	test.py: Refactor test/pylib/cpp/ldap Rename and move prepare_instance from ldap tests directory to pylib/ldap_server.	2025-03-20 11:41:30 +01:00
Andrei Chekun	33e53565c4	test.py: move starting LDAP service to dedicate method Move starting LDAP to the method where the rest of the services are started. This will unify the way of starting the 3rd party services. Fix LDAP tests flakiness due not possible to connect to LDAP server Add catching stdout and stderr of toxiproxy-cli in case of errors	2025-03-20 11:37:04 +01:00
Pavel Emelyanov	339a849f13	transport: Remove connection::make_client_key() It's effectively unused, there's one place where connection initializes the client_data object using this helper, but that initialization looks better without it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23321	2025-03-20 10:22:05 +01:00
Calle Wilund	5cc3fc4f14	cluster/test_encryption: bring test from enterprise (and enable) Fixes scylladb/scylla-enterprise#5262 Part of the source-available code migration from scylla-enterprise.git to scylla.git. Original comment: topology_custom: add test_file_streaming_respects_encryption Reproducer for issue scylladb/scylla-enterprise#4246. Closes scylladb/scylladb#23320	2025-03-20 10:07:16 +02:00
Kefu Chai	ebf9125728	storage_proxy: Prevent integer overflow in abstract_read_executor::execute Fix UBSan abort caused by integer overflow when calculating time difference between read and write operations. The issue occurs when: 1. The queried partition on replicas is not purgeable (has no recorded modified time) 2. Digests don't match across replicas 3. The system attempts to calculate timespan using missing/negative last_modified timestamps This change skips cross-DC repair optimization when write timestamp is negative or missing, as this optimization is only relevant for reads occurring within write_timeout of a write. Error details: ``` service/storage_proxy.cc:5532:80: runtime error: signed integer overflow: -9223372036854775808 - 1741940132787203 cannot be represented in type 'int64_t' (aka 'long') SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior service/storage_proxy.cc:5532:80 Aborting on shard 1, in scheduling group sl:default ``` Related to previous fix `39325cf` which handled negative read_timestamp cases. Fixes #23314 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23359	2025-03-20 10:05:42 +02:00
Botond Dénes	d06bc27979	Merge 'Don't export string filenames from sstable' from Pavel Emelyanov There are several sstring-returning methods on class sstable that return paths to files. Mostly these are used to print them into logs, sometimes are used to be put into exception messages. And there are places that use these strings as file names. Since now sstables can also be stored on S3, generic code shouldn't consider those strings as on disk file names. Other than that, even when the methods are used to put component names into logs, in many cases these log messages come with debug or trace level, so generated strings are immediately dropped on the floor, but generating it is not extremely cheap. Code would benefit from using lazily-printed names. This change introduces the component_name struct that wraps sstable reference and component ID (which is a numerical enum of several items). When printed, the component_name formatter calls the aforementioned filename generation, thus implementing lazy printing. And since there's no automatic conversion of component_name-s into strings, all the code that treats them as file paths, becomes explicit. refs: #14122 (previous ugly attempt to achieve the same goal) Closes scylladb/scylladb#23194 * github.com:scylladb/scylladb: sstable: Remove unused malformed_sstable_exctpion(string filename) sstables: Make filename() return component_name sstables: Make file_writer keep component_name on board sstables: Make get_filename() return component_name sstables: Make toc_filename() return component_name sstables: Make sstable::index_filename() return component_name sstables: Introduce struct component_name sstables: Remove unused sstable::component_filenames() method sstables: Do not print component filenames on load-and-stream wrap-up sstables: Explicitly format prefix in S3 object name making sstables: Don't include directory name in exception sstables: Use fmt::format instead of string concatenation sstables: Rename filename($component) calls to ${component}_filename() sstables: Rename local filename variable to component_name	2025-03-20 09:51:03 +02:00
Kefu Chai	fd14a23aab	.github: add "raft" and "service" subdirectories to CLEANER_DIR in order to prevent future inclusion of unused headers, let's include "raft" and "service" subdirectories to CLEANER_DIR, so that this workflow can identify the regressions in future. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-03-20 11:18:16 +08:00
Kefu Chai	b3e2561ed8	service: do not include unused headers these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-03-20 11:18:16 +08:00
Avi Kivity	a62ab824e6	schema: deprecate schema_extension schema_extension allows making invisible changes to system_schema that evade upgrade rollback tests. They appear in system_schema as an encoded blob which reduces serviceability, as they cannot be read. Deprecate it and point users to adding explicit columns in scylla_tables. We could probably make use of the data structure, after we teach it to encode its payload into proper named and typed columns instead of using IDL. Closes scylladb/scylladb#23151	2025-03-19 20:36:16 +02:00
Kefu Chai	8fdaaf6491	service/storage_proxy: Improve digest comparison Previously, the code used a find_if to compare each digest to the first one to check for any mismatches. This was less readable. This change replaces that with `std::ranges::all_of`, which checks if all elements in the range are equal to the first digest, improving readability. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23332	2025-03-19 18:21:14 +03:00
Nadav Har'El	317de64281	test/alternator: enable debugging output during Python crashes For a long time now, we've been seeing (see #17564), once in a while, Alternator tests crashing with the Python process getting killed on SIGSEGV after the tests have already finished successfully and all pytest had to do is exit. We have not been able to figure out where the bug is. Unfortunately, we've never been able to reproduce this bug locally - and only rarely we see it in CI runs, and when it happens we don't have any information on why it happened. So the goal of this patch is to print more information that might hopefully help us next time we see this problem in CI (this patch does NOT fix the bug). This patch adds to test/alternator's conftest.py a call to faulthandler.enable(). This traps SIGSEGV and prints a stack trace (for each thread, if there are several) showing what Python was trying to do while it is crashing. Hopefully we'll see in this output some specific cleanup function belonging to boto3 or urllib or whatever, and be able to figure out where the bug is and how to avoid it. We could have added this faulthandler.enable() call to the top-level conftest.py or to test.py, but since we only ever had this Python crash in Alternator tests, I think it is more suitable that we limit this desperate debugging attempt only to Alternator tests. Refs #17564 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23340	2025-03-19 18:18:51 +03:00
Dawid Mędrek	0e04a6f3eb	main: Refuse to start node when RF-rack-invalid keyspace exists When a node is started with the option `rf_rack_valid_keyspaces` enabled, the initialization will fail if there is an RF-rack-invalid keyspace. We want to force the user to adjust their existing keyspaces when upgrading to 2025.* so that the invariant that every keyspace is RF-rack-valid is always satisfied. Fixes scylladb/scylladb#23300	2025-03-19 15:13:44 +01:00
Dawid Mędrek	41f862d7ba	cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces In this commit, we refuse to create or alter a keyspace when that operation would make it RF-rack-invalid if the option `rf_rack_valid_keyspaces` is enabled. We provide two tests verifying that the changes work as intended. Fixes scylladb/scylladb#23276	2025-03-19 14:51:47 +01:00
Dawid Mędrek	32879ec0d5	db/config: Introduce RF-rack-valid keyspaces We introduce a new term in the glossary: RF-rack-valid keyspace. We also highlight in our user documentation that all keyspaces must remain RF-rack-valid throughout their lifetime, and failing to guarantee that may result in data inconsistencies or other issues. We base that information on our experience with materialized views in keyspaces using tablets, even though they remain an experimental feature. Along with the new term, we introduce a new configuration option called `rf_rack_valid_keyspaces`, which, when enabled, will enforce preserving all keyspaces RF-rack-valid. That functionality will be implemented in upcoming commits. For now, we materialize the restriction in form of a named requirement: a function verifying that the passed keyspace is RF-rack-valid. The option is disabled by default. That will change once we adjust the existing tests to the new semantics. Once that is done, the option will first be enabled by default, and then it will be removed. Fixes scylladb/scylladb#20356	2025-03-19 14:46:35 +01:00
Pavel Emelyanov	6e7d6b06f0	api: Squash two parse_table_infos into one There are currently three of them: - one that works on query parameter value - one that works on query parameters map - one that works on the request itself The second one is no longer used by anyone but the third one, so squash them together. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 15:53:38 +03:00
Pavel Emelyanov	851bd38953	api: Generalize keyspaces:tables parsing a little bit more Continuation of the previous patch -- there's one caller that uses "non standard" name for the tables query parameter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 15:52:54 +03:00
Pavel Emelyanov	dc3455bc55	api: Provide general pair<keyspace, vector<table>> parsing Lots of API handlers get "keyspace" path parameter and parse the "cf" query one into a vector of table_infos. Generalize those places. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 15:51:57 +03:00
Pavel Emelyanov	722f282748	api: Remove ks_cf_func and related code The type in question is used by two endpoint handlers that are called with validated keyspace name and parsed vector of table_info-s. Both handlers can parse what they need on their own, all the more so since the next patches will make this parsing even simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 15:49:55 +03:00
Pavel Emelyanov	73187a2e19	Merge 'mutation/mutation_consumer_concepts: simplify consumer hierarchy' from Botond Dénes The reader consumer concept hierarchy is a sprawling confusing jungle of deeply nested concepts. Looking at `FlattenedConsumer[V2]` -- the subject of this PR: this consumer is defined in terms of the `StreamedMutationConsumer[V2]` which in turn is defined in terms of the `FragmentConsumer[V2]`. This amount of nesting makes it really hard to see what a concept actually comes down to: made even more difficult by the fact that the concepts are scattered across two header files. In theory, this nesting allows for greater flexibility: some code can use a lower level concept directly while it can also serve as the basis for the higher level concepts. But the fact of the matter is that none of the lower level concepts are used directly, so we pay the price in hard-to-follow code for no benefit. This PR cuts down the complexity by folding up the entire hierarchy into the top-level `FlattenedConsumer[V2]` and `FlattenedConsumerReturning[V2]` concepts. Doing this immediately reveals just how similar the two major consumer concepts (`FlattenedConsumer[V2]` and `MutationFragmentConsumer[V2]`) supported by `mutation_reader` are. In a follow-up PR, we will attempt to unify the two. Refactoring, no backport needed. Closes scylladb/scylladb#23344 * github.com:scylladb/scylladb: mutation: fold FragmentConsumer[V2] into FlattenedConsumer[V2] mutation: fold StreamedMutationConsumer[V2] into FlattenedConsumer[V2] test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/	2025-03-19 15:43:00 +03:00
Pavel Emelyanov	a408a7abe1	sstable: Remove unused malformed_sstable_exception(string filename) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	f06cc32812	sstables: Make filename() return component_name Similarly to toc_, index_ and data filenames, make the generic component name getter return back not string, but a wrapper object. Most of callers are log messages and exception generations. Other than that there are tests, filesystem storage driver and few more places in generic code who "know" that they work with real files, so make them use explicit fmt::to_string(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	68c41f0459	sstables: Make file_writer keep component_name on board The class in question is a wrapper around output_stream that writes, flushes and closes the stream in async context. For logging it also keeps the component filename on board, and now it's good time to patch it and keep the component_filename instead. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	1ba91e28cb	sstables: Make get_filename() return component_name Similarly to previous patches -- mostly the result is used as log argument. The remaining users include - scylla sstable tool that dumps component names to json output - API endpoint that returns component names to user - tests these are all good to explicitly convert component_names to strings. There are few more places that expect strings instead of component name objects. For now they also use fmt::to_string() explicitly, partially it will be fixed later, mostly -- as future follow-ups. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	0cdeed858c	sstables: Make toc_filename() return component_name Most of the callers use the returned value as log message parameter, some construct malformed_sstable_exception that was prepared by previous patch. The remaining callers explicitly use fmt::to_string(), these are - pending deletion log creation - filesystem storage code - tests - stream-blob code that re-loads sstable All but the last one are OK to use string toc name, the last one is not very correct in its usage of toc_filename string, but it needs more care to be fixed properly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:03:29 +03:00
Pavel Emelyanov	80e0030613	sstables: Make sstable::index_filename() return component_name Most of the method callers use it as log parameter. There are few more places that push it to malformed_sstable_exception, which immediately converts it to string, so this patch makes the exception be constructed with the component_name either. And there's one more place that passes this string to file_writer constructor. For now, convert it to string explicitly, but next patches will fix that place to use pure component_name too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 13:01:23 +03:00
Pavel Emelyanov	dbb9ee15c1	sstables: Introduce struct component_name The structure wraps const reference to sstable and component_name value (it's an enum of several elements). It also has a formatter so that it can be directly printed in logs (main usage) as well as converted to strings (auxiliary and discouraged usage). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	aba400f5d9	sstables: Remove unused sstable::component_filenames() method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	24e5c30cc8	sstables: Do not print component filenames on load-and-stream wrap-up When load-and-stream finishes it may call sstable::unlink() method to drop the loaded (and streamed) sstable. Before calling it, it prints a log message about its intention that includes component_filenames() vector. This log message is ugly in several ways. First, it prints only recognized components, while unlink() method unlinks all of them, so it's sort of misleading (it doesn't seem that anyone ever read this message IRL though). Next, that's the only place that is _that_ verbose about sstable unlinking. "Common" unlinking paths don't print that much info. Finally, the log message happens at debug level, so it hardly ever appears in any logs, but collecting several filenames takes time. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	fb2bd91009	sstables: Explicitly format prefix in S3 object name making Sometimes a component object name looks like s3://bucket/prefix/component. For that the path formatting code formats bucket name with the result of sstable->filename() invocation. This patch changes it to format bucket name, prefix itself and sstable->component_filename(). The change is idempotent, as sstable::filename() just concatenates prefix with sstable::component_filename(). This change will help to remove the former method from sstable soon. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	f212b5efa9	sstables: Don't include directory name in exception When filesystem storage throws an exception about failure to create components hardlinks, it includes three paths into it -- source file name, destination file name and the directory name. The directory name is excessive, source file name already has it. Also, this change will make it possible to remove one of malformed_sstable_exception constructors soon. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	a8bc81eb3c	sstables: Use fmt::format instead of string concatenation There are some places that concatenate filenames with something else to get different filename (tool does it) or message for exception (read_toc() helper). This patch uses fmt::format() instead to facilitate future patching. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	dcc9167734	sstables: Rename filename($component) calls to ${component}_filename() There's a generic sstable::filename(component_type) method that returns a file name for the given component. For "popular" components, namely TOC, Data and Index there are dedicated sstable methods to get their names. Fix existing callers of the generic method to use the former. It's shorter, nicer and makes further patching simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:21 +03:00
Pavel Emelyanov	e6898a8854	sstables: Rename local filename variable to component_name This is to be consistent with future changes and not to bloat them with extra renames Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-19 12:45:20 +03:00
Kefu Chai	1ab2b7e7a0	tree: fix misspellings these two misspellings were flagged by codespell. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23357	2025-03-19 09:13:20 +02:00
Botond Dénes	8f0d0daf53	Merge 'repair: allow concurrent repair and migration of two different tablets' from Aleksandra Martyniuk Do not hold erm during repair of a tablet that is started with tablet repair scheduler. This way two different tablets can be repaired and migrated concurrently. The same tablet won't be migrated while being repaired as it is provided by topology coordinator. Use topology_guard to maintain safety. Fixes: https://github.com/scylladb/scylladb/issues/22408. Needs backport to 2025.1 that introduces the tablet repair scheduler. Closes scylladb/scylladb#22842 * github.com:scylladb/scylladb: test: add test to check concurrent tablets migration and repair repair: do not hold erm for repair scheduled by scheduler repair: get total rf based on current erm repair: make shard_repair_task_impl::erm private repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary repair: pass session_id to repair_writer_impl::create_writer repair: keep materialized topology guard in shard_repair_task_impl repair: pass session_id to repair_meta	2025-03-19 08:55:24 +02:00
Kefu Chai	aca00118fb	service: fix misspellings these misspellings were flagged by codespell. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23334	2025-03-18 22:21:45 +02:00
Piotr Dulikowski	2ca1c0b6f9	Merge 'introduce the new Raft-based recovery procedure for group 0 majority loss' from Patryk Jędrzejczak This PR introduces the new Raft-based recovery procedure for group 0 majority loss. The Raft-based recovery procedure works with tablets. The old gossip-based recovery procedure does not because we have no code for tablet migrations after the gossip-based topology changes. The Raft-based procedure requires the Raft-based topology to be enabled in the cluster. If the Raft-based topology is not enabled, the gossip-based procedure must be used. We will be able to get rid of the gossip-based procedure when we make the Raft-based topology mandatory (we can do both in the same version, 2025.2 is the plan). Before we do it, we will have to keep both procedures and explain when each of them should be used. The idea behind the new procedure is to recreate group 0 without touching the topology structures. Once we create a new group 0, we can remove all dead nodes using the standard `removenode` and `replace` operations. For the procedure to be safe, we must ensure that each member of the new group 0 moves to the same initial group 0 state. Also, the only safe choice for the state is the latest persistent state available among the live nodes. The solution to the problem above is to ensure that the leader of the new group 0 (called the recovery leader) is one of the nodes with the latest state available. Other members will receive the snapshot from the recovery leader when they join the new group 0 and move to its state. Below is the shortened description of the new recovery procedure from the perspective of the administrator. For the full description, refer to the design document. 1. Find the set of live nodes. 2. Kill any live node that shouldn't be a member of the new group 0. 3. Ensure the full network connectivity between live nodes. 4. Rolling restart live nodes to ensure they are healthy and ready for recovery. 5. 
Check if some data could have been lost. If yes, restore it from backup after the recovery procedure. 6. Find the recovery leader (the node with the largest `group0_state_id`). 7. Remove `raft_group_id` from `system.scylla_local` and truncate `system.discovery` on each live node. 8. Set the new scylla.yaml parameter, `recovery_leader`, to Host ID of the recovery leader on each live node. 9. Rolling restart all live nodes, but the recovery leader must be restarted first. 10. Remove all dead nodes using `removenode` or `replace`. 11. Unset `recovery_leader` on all nodes. 12. Delete data of the old group 0 from `system.raft`, `system.raft_snapshots`, and `system.raft_snapshot_config`. In the future, we could automate some of these steps or even introduce a tool that will do all (or most) of them by itself. For now, we are fine with a procedure that is reliable and simple enough. This PR makes using 2025.1 with tablets much safer. We want to backport it to 2025.1. We will also want to backport a few follow-ups. Fixes scylladb/scylladb#20657 Closes scylladb/scylladb#22286 * github.com:scylladb/scylladb: test: mark tests with the gossip-based recovery procedure test: add tests for the Raft-based recovery procedure test: topology: util: fix the tokens consistency check for left nodes test: topology: util: extend start_writes gossip: allow group 0 ID mismatch in the Raft-based recovery procedure raft_group0: modify_raft_voter_status: do not add new members treewide: allow recreating group 0 in the Raft-based recovery procedure	2025-03-18 19:10:56 +01:00
Yaron Kaikov	b375222408	./github/scripts/auto-backport.py: don't remove backport label when backport process has an error Today, when the `Fixes` prefix is missing or the developer is not a collaborator with `scylladbbot` we remove the backport labels to prevent the process from starting and notifying the developers. Developers are worried that removing these backport labels will cause us to forget we need to do these backports. @nyh suggested to add a `scylladbbot/backport_error` label instead Applied those changes, so when a `Fixes` prefix is missing we will add a `scylladbbot/backport_error` label and stop the process When a user doesn't accept the invite we will still open the PR but he will not be assigned and will not be able to edit the branch when we have conflicts Fixes: https://github.com/scylladb/scylla-pkg/issues/4898 Fixes: https://github.com/scylladb/scylla-pkg/issues/4897 Closes scylladb/scylladb#23259	2025-03-18 16:19:09 +02:00
Pavel Emelyanov	420b5bee20	test/s3: Increase boost/s3_test log levels When something goes wrong, it's impossible to find anything out without s3 and http logs, so increase them for boost tests. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23245	2025-03-18 15:59:05 +02:00
Botond Dénes	a2d0d7b9a0	mutation: fold FragmentConsumer[V2] into FlattenedConsumer[V2] FragmentConsumer[V2] also has no direct users, so fold it into FlattenedConsumer[V2] as well. With this, FlattenedConsumer[V2] has a nice and simple definition, with a single nesting level required due to the return-type flexibility.	2025-03-18 09:24:49 -04:00
Botond Dénes	8768e2e08e	mutation: fold StreamedMutationConsumer[V2] into FlattenedConsumer[V2] No code uses StreamedMutationConsumer[V2] directly, so let's take this opportunity to reduce the jungle of consumer concepts.	2025-03-18 09:24:44 -04:00
Botond Dénes	969b07fdfd	test/lib/fragment_scatterer: s/StreamedMutationConsumer/FlattenedConsumer/ The class actually implements the FlattenedConsumer, so fix the comment. This eliminates the only reference to the StreamedMutationConsumer concept.	2025-03-18 07:57:04 -04:00
Avi Kivity	9867129c7b	Update seastar submodule * seastar 412d058cf9...2f13c461bb (2): > smp: prefaulter: don't leave zombie worker threads Fixes #23316 > demos/tcp_sctp_server_demo: Modernize with seastar::async and proper teardown Closes scylladb/scylladb#23317	2025-03-18 13:36:05 +02:00
Botond Dénes	2795d83b32	Merge 'commitlog: Serialize file deletion and distribute replayed segments' from Calle Wilund Fixes #23017 When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recover boot) we can cause two deletion passes to be running at the same time. This is because delete is triggered by either a.) replay release b.) timer check (explicit) c.) timer initiated flush callback where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed. Now, eventually, this should be released, once we do deletion again, but this can take a while. Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure. As noted in the issue above, when replaying a large commitlog from an unclean node, we can cause shard 0 db commitlog to reach footprint limit, and then remain there (because we never release segments lower than limit). This is wasteful with diskspace. But deleting segments early here is also wasteful; A better solution is to simply give the segments to all CL shards, thus distributing the available space. Closes scylladb/scylladb#23150 * github.com:scylladb/scylladb: main/commitlog: wait for file deletion and distribute recycled segments to shards commitlog: Serialize file deletion	2025-03-18 11:47:17 +02:00
Avi Kivity	176bb464a2	github: error if we see #include "seastar/..." Seastar is a system library from ScyllaDB's perspective and so should use angle brackets for #include statements. Closes scylladb/scylladb#23308	2025-03-17 21:56:48 +02:00
Ernest Zaslavsky	08b9e4d87b	aws_error: Add GNU TLS codes Add GNU TLS error codes to std::system_error handler since we can start getting these once they seep from seastar's http client	2025-03-17 16:38:14 +02:00
Ernest Zaslavsky	012f0e6d8c	s3_client: Handle nested std::system_error exceptions Enhance error handling by detecting and processing std::system_error exceptions nested within std::nested_exception. This improvement ensures that system-level errors wrapped in the exception chain are properly caught and managed, leading to more robust error reporting and recovery.	2025-03-17 16:38:14 +02:00
Ernest Zaslavsky	367140a9c5	s3_client: Start using new retry strategy * Previously, token expiration was considered a fatal error. With this change, the `s3_client` uses new retry strategy that is trying to renew expired creds * Added related test to the `s3_proxy`	2025-03-17 16:38:14 +02:00
Ernest Zaslavsky	ed09614c27	retry_strategy: Add custom retry strategy for S3 client Introduced a new retry strategy that extends the default implementation. The should_retry method is overridden to handle a specific case for expired credential tokens. When an expired token error is detected, the credentials are reset so it is expected that the client will re-authenticate, and the original request is retried.	2025-03-17 16:38:14 +02:00
Ernest Zaslavsky	26062c65e4	retry_strategy: Make `should_retry` awaitable	2025-03-17 16:36:26 +02:00
Avi Kivity	0e4b303339	tools: toolchain: regenerate for python3-pytest-asyncio 0.24 Fixes a bug related to load_scope="module". python-driver fixed to version 3.28.2, as it looks like 3.29.0 regressed TLS handling [1]. In any case tools/cqlsh fixes it to 3.28.2. Optimized clang from https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-19.1.7-Fedora-41-x86_64.tar.gz Ref #22960. Fixes #23213 [1] https://github.com/scylladb/python-driver/issues/456 Closes scylladb/scylladb#23236	2025-03-17 15:41:55 +02:00
Botond Dénes	fda3486770	Merge 'Remove some excessive ks:cf -> table_id conversions in API and schema_tables' from Pavel Emelyanov Actually, the main goal of this PR was to remove parse_tables() helpers from api/ in favor of more flexible (yet same complex) parse_table_infos(), but it turned out that it also saves some lookups in database maps. There are several places in API and schema_tables that have table_id at hand, but at some point drop it and carry keyspace and table names over to a place that maps ks:cf back to table_id and then uses it to find the table object. This PR keeps the table_id with the help of table_info struct in those places. This change allows removing the aforementioned parse_table() helpers from api/ and also saves few lookups in database maps. Removing the parse_tables() from api/ is the continuation of previous effort that reduces the set of helpers in api/ code that help handlers "parse" keyspaces and tables names see #22742 #21533 Closes scylladb/scylladb#23216 * github.com:scylladb/scylladb: api: Remove the remaining parse_tables() overload database: Sanitize flush_tables_on_all_shards() schema_tables: Remove all_table_names() database: Make tables flushing helper use table_info-s, not names api: Make keyspace flush endpoint use parse_table_infos() (and a bit more) schema_tables,client_state: Switch to using all_table_infos() schema_tables: Tune up some methods to benefit from table_infos schema_tables: Introduce all_table_infos()	2025-03-17 15:40:41 +02:00
Pavel Emelyanov	6217124d1d	s3/client: Make "expected" reply status truly optional Currently when a client::make_request() is called it can pass std::optional<status> argument indicating which status it expects from server. In case status doesn't match, the request body handler won't be called, the request will fail with unexpected status exception. However, disengaged expected implicitly means that the requestor expects the OK (200) status. This makes it impossible to make a query whose return status is not known in advance and it's up to the handler to check it. Lower level http client allows disengaged expected with the described semantics -- handler will check status on its own. This behavior for s3 client is needed for GET request. Server can respond with OK or partial content status depending on the Range header. If the header is absent or is large enough for the requested object to fit into it, the status would be OK, if the object is "trimmed" the status is partial content. At the end of the day, requestor cannot "guess" the returning status in advance and should check it upon response arrival. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23243	2025-03-17 15:34:58 +02:00
Botond Dénes	afa305ffb4	Merge 'perf/perf_sstable: stop using at_exit() ' from Kefu Chai `seastar::at_exit()` was marked deprecated recently. so let's use the recommended approach to perform cleanups. --- it's a cleanup, hence no need to backport. Closes scylladb/scylladb#23253 * github.com:scylladb/scylladb: perf/perf_sstable: fix the indent perf/perf_sstable: stop using at_exit()	2025-03-17 15:30:10 +02:00
Andrei Chekun	d68e54c26d	test.py: Remove reuse cluster in cluster tests Pool is not aware of the cluster configuration, so it can return cluster to the test that is not suitable for it. Removing reuse will remove such possibility, so there will be less flaky tests. Closes scylladb/scylladb#23277	2025-03-17 15:27:59 +02:00
Calle Wilund	1525cb2dba	main/commitlog: wait for file deletion and distribute recycled segments to shards Refs #23017 When replaying a large commitlog from an unclean node, we can cause shard 0 db commitlog to reach footprint limit, and then remain there (because we never release segments lower than limit). This is wasteful with diskspace. But deleting segments early here is also wasteful; A better solution is to simply give the segments to all CL shards, thus distributing the available space. v2: * Do segment distribution using ranges. go c++23	2025-03-17 12:09:00 +00:00
Calle Wilund	4ed81e05bf	commitlog: Serialize file deletion Fixes #23017 When deleting segments while our footprint is over the limit, mainly when recycling/deleting segments after replay (recover boot) we can cause two deletion passes to be running at the same time. This is because delete is triggered by either a.) replay release b.) timer check (explicit) c.) timer initiated flush callback where the last one is in fact not even waited for. If we are considering many files for delete/recycle, we can, due to task switch, end up considering segments ok to keep, in parallel, even though one of them should be deleted. The end result will be us keeping one more segment than should be allowed. Now, eventually, this should be released, once we do deletion again, but this can take a while. Solution is to simply ensure we serialize deletion. This might cause some delay in processing cycles for recycle, but in practice, this should never happen when we are in fact under pressure. Small unit test included.	2025-03-17 12:09:00 +00:00
Anna Stuchlik	cd61f60549	doc: fix product names in the 2025.1 upgrade guides This commit fixes the product names in the upgrade 2025.1 guides so that: - 6.2 is preceded with "ScyllaDB Open Source" - 2024.x is preceded with "ScyllaDB Enterprise" - 2025.1 is preceded with "ScyllaDB" Fixes https://github.com/scylladb/scylladb/issues/23154 Closes scylladb/scylladb#23223	2025-03-17 13:54:11 +03:00
Anna Stuchlik	dbbf9e19e4	doc: remove the outdated info on seeds-info This commit removes the outdated information about seed nodes. We no longer need it in the docs, as a) the documentation is versioned, and b) the ScyllaDB Open Source 4.3 and ScyllaDB Enterprise 2021.1 versions mentioned in the docs are no longer supported. In addition, some clarification has been added to the existing sections. Fixes https://github.com/scylladb/scylladb/issues/22400 Closes scylladb/scylladb#23282	2025-03-17 13:53:48 +03:00
Andrei Chekun	7423edb1f7	test.py: Increase verbosity of pytest Currently, pytest truncates long objects in assertions. This makes understanding the failure message difficult. This will increase verbosity and pytest will stop truncating messages. Closes scylladb/scylladb#23263	2025-03-17 12:51:41 +02:00
Aleksandra Martyniuk	20f9d7b6eb	test: add test to check concurrent tablets migration and repair Add a test to check whether a tablet can be migrated while another tablet is repaired.	2025-03-17 10:37:03 +01:00
Aleksandra Martyniuk	5b792bdc98	repair: do not hold erm for repair scheduled by scheduler Do not hold erm for tablet repair scheduled by scheduler. Thanks to that, one tablet repair won't exclude migration of other tablets. Concurrent repair and migration of the same tablet isn't possible, since a tablet can be in only one type of transition at a time. Hence the change is safe. Refs: https://github.com/scylladb/scylladb/issues/22408.	2025-03-17 10:37:02 +01:00
Aleksandra Martyniuk	a1375896df	repair: get total rf based on current erm Get total rf based on erm. Currently, it does not change anything because erm stays the same during the whole repair.	2025-03-17 10:36:18 +01:00
Aleksandra Martyniuk	34cd485553	repair: make shard_repair_task_impl::erm private Make shard_repair_task_impl::erm private. Access it with getter.	2025-03-17 10:36:14 +01:00
Andrei Chekun	a20d848c01	test.py: Refactor test/conftest.py Move functions responsible for preparation of the environment to the util file. This is extracted from https://github.com/scylladb/scylladb/pull/22894 to make it easier to work together. Closes scylladb/scylladb#23221	2025-03-17 11:31:00 +02:00
Avi Kivity	4416b0c732	treewide: use angle brackets for including seastar headers Seastar is an external library, so we use angle brackets to include its interfaces. Closes scylladb/scylladb#23301	2025-03-17 10:03:06 +02:00
Andrei Chekun	1e1d213592	test.py: Remove additional report generation for python tests Pytest is responsible for generating the report of the failed tests and there is no need to generate it one more time. Closes scylladb/scylladb#23237	2025-03-17 09:36:08 +02:00
Kefu Chai	f8800b3f19	ent/encryption: rename "padd" to "padding"/"pad" and use structured bindings Replace the abbreviated term "padd" with either "padding" or "pad" throughout the encryption module. While "padd" was originally chosen to align with other variable names ("type" and "mode"), using standard terminology improves code readability and resolves codespell warnings. Additionally, refactor relevant code to use C++ structured bindings for cleaner implementation. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23251	2025-03-17 09:23:42 +02:00
Raphael S. Carvalho	e9944f0b7c	service: Introduce rack-aware co-location migrations for tablet merge Merge co-location can emit migrations across racks even when RF=#racks, reducing availability and affecting consistency of base-view pairing. Given replica set of sibling tablets T0 and T1 below: [T0: (rack1,rack3,rack2)] [T1: (rack2,rack1,rack3)] Merge will co-locate T1:rack2 into T0:rack1, so T1 will temporarily be at only a subset of racks, reducing availability. This is the main problem fixed by this patch. It also lays the ground for consistent base-view replica pairing, which is rack-based. For tables on which views can be created we plan to enforce the constraint that replicas don't move across racks and that all tablets use the same set of racks (RF=#racks). This patch avoids moving replicas across racks unless it's necessary, so if the constraint is satisfied before merge, there will be no co-locating migrations across racks. This constraint of RF=#racks is not enforced yet, it requires more extensive changes. Fixes #22994. Refs #17265. This patch is based on Raphael's work done in PR #23081. The main differences are: 1) Instead of sorting replicas by rack, we try to find replicas in sibling tablets which belong to the same rack. This is similar to how we match replicas within the same host. It reduces number of across-rack migrations even if RF!=#racks, which the original patch didn't handle. Unlike the original patch, it also avoids rack overload in case RF!=#racks 2) We emit across-rack co-locating migrations if we have no other choice in order to finalize the merge This is ok, since views are not supported with tablets yet. Later, we will disallow this for tables which have views, and we will allow creating views in the first place only when no such migrations can happen (RF=#racks). 
3) Added boost unit test which checks that rack overload is avoided during merge in case RF<#racks 4) Moved logging of across-rack migration to debug level 5) Exposed metric for across-rack co-locating migrations Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com> Closes scylladb/scylladb#23247	2025-03-16 22:45:00 +02:00
Pavel Emelyanov	95809a3ed1	Update seastar submodule * seastar 5b95d1d7...412d058c (62): > fstream: Export functions for making file_data_source > build: Include DPDK dependency libraries in Seastar linkage > demos/tls_echo_server_demo: Modernize with seastar::async > http/client: Pass abort source by pointer > rpc: remove deprecated logging function support > github: Add Alpine Linux workflow to test builds with musl libc > exception_hacks: Make dl_iterate_phdr resolution manual > tests: relax test_file_system_space check for empty filesystems > demos/udp_server_demo: Modernize with seastar::async and proper teardown > future: remove deprecated functions/concepts > util: logger: remove deprecated set_stdout_enabled and logger_ostream_type::{stdout,stderr} > memory: guard __GLIBC_PREREQ usage with __GLIBC__ check > scheduling_specific: Add noexcept wrapper for free() > file: Replace __gid_t with standard POSIX gid_t > aio_storage_context: Use reactor::do_at_exit() > json2code: support chunked_fifo > json: remove unused headers > httpd: test cases for streaming > build: use find_dependency() instead find_package() in config file > build: stop using a loop for finding dependencies > dns: Fix event processing to work safely with recent c-ares > tutorial: add a section about initialization and cleanup > reactor: deprecate at_exit() > httpclient: Add exception handling to connection::close > file: document max_length-limits for dma_read/write funcs taking vector<iovec> > build: fix P2582R1 detection in GCC compatibility check > json2code: optimize string handling using std::string_view > tests/unit: fix typo in test output > doc: Update documentation after removing build.sh > test: Add direct exception passing for awaits for perf test > github: add Docker build verification workflow > docker: update LLVM debian repo for Ubuntu Orcular migration > tests/unit: Use http.HTTPStatus constants instead of raw status codes > tests/unit: Fix exception verification in 
json2code_test.py > httpd: handle streaming results in more handlers > json: stream_object now moves value > json: support for rvalue ranges > chunked_fifo: make copyable > reactor: deprecate at_destroy() > testing: prevent test scheduling after reactor exit > net: Add bytes sent/received metrics > net: switch rss_key_type to std::span instead of std::string_view > log: fixes for libc++ 19 > sstring: fixes for lib++ 19 > build: finalize numactl dependency removal > build: link DPDK against libnuma when detected during build > memory: remove libnuma dependency > treewide: replace assert with SEASTAR_ASSERT > future: fix typo in comment > http: Unwrap nested exceptions to handle retryable transport errors > net/ip, net: sed -i 's/to_ulong/to_uint/' > core: function_traits noexcept specializations > util/variant: seastar::visit forward value arg > net/tls: fix missing include > tls: Add a way to inspect peer certificate chain > websocket: Extract encode_base64() function > websocket: Rename wlogger to websocket_logger > websocket: Extract parts of server_connection usable for client > websocket: Rename connection to server_connection > websocket: Extract websocket parser to separate file > json2code_test: factor out query method > seastar-json2code: fix error handling Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23281	2025-03-16 21:57:43 +02:00
Benny Halevy	41f02c521d	main: allow abort during join_cluster Bootstrap or replace can take a long time, but since `feef7d3fa1`, the stop_signal is checked only in checkpoints, and in particular, abort isn't requested during join_cluster. Fixes #23222 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-16 12:21:15 +02:00
Benny Halevy	f269480f53	main: add checkpoint before joining cluster Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-16 12:08:04 +02:00
Benny Halevy	0fc196991a	storage_service: add start_sys_dist_ks Currently, there's a call to `supervisor::notify("starting system distributed keyspace")` which is misleading as it is identical to a similar message in main() when starting the sharded service. Change that to a storage_service log message and be more specific that the sys_dist_ks shards are started. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-16 12:05:23 +02:00
Jenkins Promoter	d84da3dc11	Update pgo profiles - x86_64	2025-03-15 04:57:28 +02:00
Jenkins Promoter	6e8e2ae333	Update pgo profiles - aarch64	2025-03-15 04:48:49 +02:00
Pavel Emelyanov	604fdd86e9	test: Count mutation fragments verbosely in scoped restore test Sometimes after scoped restore a key is not found in nodes' mutation fragments. This patch makes the counting more verbose to get a better understanding of what's going on in case of test failure. refs: #23189 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23296	2025-03-14 21:31:36 +02:00
Pavel Emelyanov	bfbe802632	streaming: Relax load_sstable_for_tablet() The method does several excessive things, that can be relaxed 1. In order to transfer a table-id to another shard, finds the table on source shard, gets schema and captures schema id on invoke_on()'s lambda. It can just capture the original table-id 2. In order to get sstable parameters (format, version, etc.) generates toc_filename(), then calls parse_path() to convert it into the entry_descriptor. The descriptor can be read from sstable directly. 3. Logging "success" includes target shard into the message, but happens on the source shard. The message can be just logged on target shard. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23197	2025-03-14 15:26:48 +02:00
Botond Dénes	39bcf99f8e	Merge 'Apply hard limit to partition range vectors in secondary index queries' from Nikos Dragazis Secondary index queries fetch partition keys from the index view and store them in an `std::vector`. The vector size is currently limited by the user's page size and the page memory limit (1MiB). These are not enough to prevent large contiguous allocations (which can lead to stalls). This series introduces a hard limit to the vector size to ensure it does not exceed the allocator's preferred max contiguous allocation size (128KiB). With the size of each element being 120 bytes, this allows for 1092 partition keys. The limit was set to 1000. Any partitions above this limit are discarded. Discarding partitions breaks the querier cache on the replicas, causing a performance regression, as can be seen from the following measurements: ``` * Cluster: 3 nodes (local Docker containers), 1 vCPU, 4GB memory, dev mode * Schema: CREATE KEYSPACE ks WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '3'} AND durable_writes = true AND tablets = {'enabled': false}; CREATE TABLE ks.t1 (pk1 int, pk2 int, ck int, value int, PRIMARY KEY ((pk1, pk2), ck)); CREATE INDEX t1_pk2_idx ON ks.t1(pk2); * Query: CONSISTENCY LOCAL_QUORUM; SELECT * FROM ks.t1 where pk2 = 1; +------------+-------------------+-------------------+ \| Page Size \| Master \| Vector Limit \| +============+===================+===================+ \| \| Latency (sec) \| Latency (sec) \| +------------+-------------------+-------------------+ \| 100 \| 5.80 ± 0.13 \| 5.64 ± 0.10 \| +------------+-------------------+-------------------+ \| 1000 \| 4.77 ± 0.07 \| 4.62 ± 0.06 \| +------------+-------------------+-------------------+ \| 2000 \| 4.67 ± 0.07 \| 5.13 ± 0.03 \| +------------+-------------------+-------------------+ \| 5000 \| 4.82 ± 0.09 \| 6.25 ± 0.06 \| +------------+-------------------+-------------------+ \| 10000 \| 4.89 ± 0.36 \| 7.52 ± 
0.13 \| +------------+-------------------+-------------------+ \| -1 \| 4.90 ± 0.67 \| 4.79 ± 0.33 \| +------------+-------------------+-------------------+ ``` We expect this to be fixed with adaptive paging in a future PR. Until then, users can avoid regressions by adjusting their page size. Additionally, this series changes the `untyped_result_set` to store rows in a `chunked_vector` instead of an `std::vector`, similarly to the `result_set`. Secondary index queries use an `untyped_result_set` to store the raw result from the index view before processing. With 1MiB results, the `std::vector` would cause a large allocation of this magnitude. Finally, a unit test is added to reproduce the bug. Fixes #18536. The PR fixes stalls of up to 100ms, but there is an easy workaround: adjust the page size. No need to backport. Closes scylladb/scylladb#22682 * github.com:scylladb/scylladb: cql3: secondary index: Limit page size for single-row partitions cql3: secondary index: Limit the size of partition range vectors cql3: untyped_result_set: Store rows in chunked_vector test: Reproduce bug with large allocations from secondary index	2025-03-14 15:06:07 +02:00
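The limit arithmetic in the merge description above can be checked with a few lines of Python. The constants come from the commit message (128 KiB preferred maximum contiguous allocation, 120-byte elements, limit of 1000); the names `MAX_CONTIGUOUS_ALLOC`, `ELEMENT_SIZE`, `HARD_LIMIT` and `clamp_partition_ranges` are hypothetical and only illustrate the calculation, not the actual cql3 code.

```python
MAX_CONTIGUOUS_ALLOC = 128 * 1024   # 128 KiB, in bytes
ELEMENT_SIZE = 120                  # bytes per vector element
HARD_LIMIT = 1000                   # the limit chosen in the series

# Largest number of elements that still fits in one contiguous allocation.
max_elements = MAX_CONTIGUOUS_ALLOC // ELEMENT_SIZE


def clamp_partition_ranges(partition_keys):
    """Keep at most HARD_LIMIT keys; the rest are discarded."""
    return partition_keys[:HARD_LIMIT]


clamped = clamp_partition_ranges(list(range(5000)))
```

Here `max_elements` evaluates to 1092, so the rounded-down limit of 1000 keeps a full vector at 120,000 bytes, below the 131,072-byte threshold.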
Botond Dénes	83ea1877ab	Merge 'scylla-sstable: add native S3 support' from Ernest Zaslavsky scylla-sstable: Enable support for S3-stored sstables Minimal implementation of what was mentioned in this [issue](https://github.com/scylladb/scylladb/issues/20532) This update allows Scylla to work with sstables stored on AWS S3. Users can specify the fully qualified location of the sstable using the format: `s3://bucket/prefix/sstable_name`. One should have `object_storage_config_file` referenced in the `scylla.yaml` as described in docs/operating-scylla/admin.rst ref: https://github.com/scylladb/scylladb/issues/20532 fixes: https://github.com/scylladb/scylladb/issues/20535 No backport needed since the S3 functionality was never released Closes scylladb/scylladb#22321 * github.com:scylladb/scylladb: tests: Add Tests for Scylla-SSTable S3 Functionality docs: Update Scylla Tools Documentation for S3 SSTable Support scylla-sstable: Enable Support for S3 SSTables s3: Implement S3 Fully Qualified Name Manipulation Functions object_storage: Refactor `object_storage.yaml` parsing logic	2025-03-14 15:05:52 +02:00
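As an illustration of the `s3://bucket/prefix/sstable_name` format mentioned above, the following Python sketch splits such a location into its parts. It is not the parsing code added to scylla-sstable; `split_s3_fqn` is a hypothetical helper built on the standard `urllib.parse.urlsplit`.

```python
from urllib.parse import urlsplit


def split_s3_fqn(location):
    """Split 's3://bucket/prefix/name' into (bucket, prefix, name).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(location)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an S3 location: {location!r}")
    # Everything before the last '/' is the prefix, the rest is the object name.
    prefix, _, name = parts.path.lstrip("/").rpartition("/")
    if not name:
        raise ValueError(f"missing object name in: {location!r}")
    return parts.netloc, prefix, name


bucket, prefix, name = split_s3_fqn("s3://bucket/prefix/sstable_name")
```

A location without a prefix, such as `s3://bucket/sstable_name`, yields an empty prefix string.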
Patryk Jędrzejczak	ca5c223505	test: mark tests with the gossip-based recovery procedure This patch makes it clear which Raft recovery procedure is used in each test. Tests with "This test uses the gossip-based recovery procedure." are the tests that use the gossip-based topology. These tests should be deleted once we make the Raft-based topology mandatory. Tests with the new FIXME are the tests that use the Raft-based topology. They should be changed to use the Raft-based recovery procedure or removed if they don't test anything important with the new procedure.	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	4fd0e93154	test: add tests for the Raft-based recovery procedure	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	4e055882c1	test: topology: util: fix the tokens consistency check for left nodes When we remove a node in the Raft-based topology (by remove/replace/decommission), we remove its tokens from `system.topology`, but we do not change `num_tokens`. Hence, the old check could fail for left nodes.	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	d0efc77d20	test: topology: util: extend start_writes We extend `start_writes` to allow: - providing `ks_name` from the test, - restarting it (by starting it again with the same `ks_name`), - running it in the presence of shutdowns. We use these features in a new test in one of the following patches.	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	9970c1fcc3	gossip: allow group 0 ID mismatch in the Raft-based recovery procedure This patch ensures that members of the new group 0 can gossip with members of the old group 0 during rolling restart in the Raft-based recovery procedure. Without this change, restarted nodes (members of the new group 0) wouldn't be marked as UP by other nodes (members of the old group 0), which would decrease availability.	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	3b9765dac8	raft_group0: modify_raft_voter_status: do not add new members In the new Raft-based recovery procedure, we create a new group 0. Dead nodes are not members of this group 0. Also, the removenode handler makes a node being removed a non-voter. So, with the previous implementation of `modify_raft_voter_status`, the node being removed would become a non-voting member of the new group 0, which is very weird. It should not cause problems, but we better avoid it and keep the procedure clean. This change also makes `modify_raft_voter_status` more intuitive in general.	2025-03-14 13:53:05 +01:00
Patryk Jędrzejczak	fd51d7e448	treewide: allow recreating group 0 in the Raft-based recovery procedure This patch adds support for recreating group 0 after losing majority. This is the only part of the new Raft-based recovery procedure that touches Scylla core. The following steps are necessary to recreate group 0: 1. Determine the new group 0 members. These are alive nodes that are normal or rebuilding. 2. Choose the recovery leader - the node which will become the new group 0 leader. This must be one of the nodes with the latest persistent group 0 state. 3. Remove `raft_group_id` from `system.scylla_local` and truncate `system.discovery` on each live node. 4. Set the new scylla.yaml parameter - `recovery_leader` - to Host ID of the recovery leader on each live node. 5. Rolling restart all live nodes, but the recovery leader must be restarted first. In the implementation, restarts in step 5 are very similar to normal restarts with the Raft-based topology enabled. The only differences are: 1. Steps 3-4 make the restarting node discover the new group 0 in `join_cluster`. 2. The group 0 server is started in `join_group0`, not `setup_group0_if_exists`. 3. The restarting node joins the new group 0 in `join_topology` using `legacy_handshaker`. There is no reason to contact the topology coordinator since the node has already joined the topology. Unfortunately, this patch creates another execution path for the starting logic. `join_cluster` becomes even messier. However, there is nothing we can do about it. Joining group 0 without joining topology is something completely new. Having a few small changes without touching other execution paths is the best we can do. We will start removing the old stuff soon, after making the Raft-based topology mandatory, and the situation will improve.	2025-03-14 13:52:57 +01:00
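Steps 1 and 5 of the procedure above (selecting the new group 0 members and ordering the rolling restart) can be sketched in Python. This is an operator-side illustration only, not code from the patch: the node dictionaries and the function names `new_group0_members` and `restart_order` are hypothetical. The sketch does not cover step 2, so the caller is assumed to have already picked a recovery leader with the latest persistent group 0 state.

```python
def new_group0_members(nodes):
    """Step 1: alive nodes that are normal or rebuilding."""
    return [n["host_id"] for n in nodes
            if n["alive"] and n["state"] in ("normal", "rebuilding")]


def restart_order(live_host_ids, recovery_leader):
    """Step 5: rolling restart order with the recovery leader first."""
    if recovery_leader not in live_host_ids:
        raise ValueError("the recovery leader must be a live node")
    rest = [h for h in live_host_ids if h != recovery_leader]
    return [recovery_leader] + rest


# Hypothetical cluster state.
nodes = [
    {"host_id": "h1", "alive": True,  "state": "normal"},
    {"host_id": "h2", "alive": False, "state": "normal"},
    {"host_id": "h3", "alive": True,  "state": "rebuilding"},
    {"host_id": "h4", "alive": True,  "state": "normal"},
]
members = new_group0_members(nodes)
order = restart_order(members, recovery_leader="h3")
```

With this input the dead node `h2` is excluded and `h3` is restarted before `h1` and `h4`.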
Nadav Har'El	de7c1d526a	test/cqlpy: test DESC doesn't list an index as a view Issue #6058 complained that "DESCRIBE TABLE" or "DESCRIBE KEYSPACE" list a secondary index as materialized view (the view used to back the index in Scylla's implementation of secondary indexes). This patch adds a test to verify that this issue no longer exists in server-side describe - so we can mark the issue as fixed. While preparing this test, I noticed that Scylla and Cassandra behave differently on whether DESC TABLE should list materialized views or not, so this patch also includes a test for that as well - and I opened issue #23014 on Scylla and CASSANDRA-20365 on Cassandra to further discuss that new issue. Fixes #6058 Refs #23014. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23015	2025-03-14 14:40:19 +03:00
Nadav Har'El	c0821842de	alternator: document the state of tablet support in Alternator In commit `c24bc3b` we decided that creating a new table in Alternator will by default use vnodes - not tablets - because of all the missing features in our tablets implementation that are important for Alternator, namely - LWT, CDC and Alternator TTL. We never documented this, or the fact that we support a tag `experimental:initial_tablets` which allows overriding this decision and creating an Alternator table using tablets. We also never documented what exactly doesn't work when Alternator uses tablets. This patch adds the missing documentation in docs/alternator/new-apis.md (which is a good place for describing the `experimental:initial_tablets` tag). The patch also adds a new test file, test_tablets.py, which includes tests for all the statements made in the document regarding how `experimental:initial_tablets` works and what works or doesn't work when tablets are enabled. Two existing tests - for TTL and Streams non-support with tablets - are moved to the new test file. When the tablets feature is finally completed, both the document and the tests will need to be modified (some of the tests should be outright deleted). But it seems this will not happen for at least several months, and that is too long to wait without accurate documentation. Fixes #21629 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#22462	2025-03-14 14:03:15 +03:00
Pavel Emelyanov	2bb455ec75	Merge 'Main: stop system_keyspace' from Benny Halevy This series adds an async guard to system_keyspace operations and adds a deferred action to stop the system_keyspace in main() before destroying the service. This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped. * Enhancement, no backport needed Closes scylladb/scylladb#23113 * github.com:scylladb/scylladb: main: stop system keyspace system_keyspace: call shutdown from stop system_keyspace: shutdown: allow calling more than once database, compaction_manager, large_data_handler: use pluggable<system_keyspace> utils: add class pluggable	2025-03-14 13:23:28 +03:00
Aleksandra Martyniuk	444c7eab90	repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary When small_table_optimization isn't enabled, put_row_diff_with_rpc_stream does not access erm. Pass small_table_optimization_params containing erm only when small_table_optimization is enabled. This is safe as erm is kept by shard_repair_task_impl.	2025-03-14 10:45:52 +01:00
Aleksandra Martyniuk	e56bb5b6e2	repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary When small_table_optimization isn't enabled, flush_rows_in_working_row_buf does not access erm. Add small_table_optimization_params containing erm and pass it only when small_table_optimization is enabled. This is safe as erm is kept by shard_repair_task_impl.	2025-03-14 10:45:52 +01:00
Aleksandra Martyniuk	09c74aa294	repair: pass session_id to repair_writer_impl::create_writer	2025-03-14 10:45:52 +01:00
Aleksandra Martyniuk	47bb9dcf78	repair: keep materialized topology guard in shard_repair_task_impl Keep materialized topology guard in shard_repair_task_impl and check it in check_in_abort_or_shutdown and before each range repair.	2025-03-14 10:41:10 +01:00
Aleksandra Martyniuk	928f92c780	repair: pass session_id to repair_meta Pass session_id of tablet repair down the stack from the repair request to repair_meta. The session_id will be utilized in the following patches.	2025-03-14 10:20:12 +01:00
Nadav Har'El	a72dde2ee6	test/cqlpy: add test for long table names Scylla inherited a 48-character limit on the length of table (and keyspace) names from Cassandra 3. It turns out that Cassandra 4 and 5 unintentionally dropped this limit (see history lesson in CASSANDRA-20425), and now Cassandra accepts longer table names. Some Cassandra users are using such longer names and are disappointed that Scylla doesn't allow them. This patch includes tests for this feature. One test tries a 48-character table name - it passes on Scylla and all versions of Cassandra. A second test tries a 100-character table name - this one passes on Cassandra version 4 and above (but not on 3), and fails on Scylla, so it is marked "xfail". A third test tries a 500-character table name. This one fails badly on Cassandra (see CASSANDRA-20389), but passes on Scylla today. This test is important because we need to be sure that it continues to pass on Scylla even after Scylla is fixed to allow the 100-character test. Refs #4480 - an issue we already have about supporting longer names. Note on the test implementation: Ideally, the test for a particular table-name length shouldn't just create the table - it should also make sure we can write data to it and flush it, i.e., that sstables can get written correctly. But in practice, these complications are not needed, because in modern Scylla it is the directory name which contains the table's name, and the individual sstable files do not contain the table's name. Just creating the table already creates the long directory name, so that is the part that needs to be tested. If we created this directory successfully, later creating the short-named sstables inside it can't fail. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23229	2025-03-14 11:15:07 +03:00
Kefu Chai	a82cfbecad	test: perf_sstable: close frag_stream before destroying it The underlying reader should be closed before being destroyed. Otherwise we'd have the following failure when testing the "full_scan_streaming": ``` $ scylla perf-sstable --parallelism 1 --iterations 20 --partitions 20 --testdir /tmp/sstable --mode full_scan_streaming ERROR 2025-03-13 15:04:26,321 [shard 0:main] mutation_reader - N8sstables2mx27mx_sstable_full_scan_readerE [0x60015a36b650]: permit .:test: was not closed before destruction, at: 0x235931e 0x2359470 0x239deb3 0x62a1ed3 0x89fd156 0x89c3fba 0x22a6ed3 0x22a8fea 0x22aae17 0x22a9928 0x26bb7d0 0x26bbe3e 0x89bca67 0x246bd8d /lib64/libc.so.6+0x3247 /lib64/libc.so.6+0x330a 0x1657774 ------ seastar::internal::coroutine_traits_base<double>::promise_type ``` Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23270	2025-03-14 11:12:44 +03:00
Piotr Smaron	d365d9b2ad	test/ldap: assign non-busy ports to ldap It may happen that the ports we randomly choose for LDAP are busy, and that'd fail the test suite, so once we randomly select ports, now we'll see if they're busy or not, and if they're busy, we'll select next ones, until we finally have some free ports for LDAP. Tested with: `./test.py ldap/ldap_connection_test --repeat 1000 -j 10`: before the fix, this command fails after ~112 runs, and of course it passes with the fix. Fixes: scylladb/scylla-enterprise#5120 Fixes: scylladb/scylladb#23149 Fixes: scylladb/scylladb#23242 Closes scylladb/scylladb#23275	2025-03-14 11:09:19 +03:00
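The "pick a random port, then move on while it is busy" approach described above can be sketched in Python with the standard `socket` module. This is not the code from the LDAP test suite; `is_port_free`, `pick_free_port` and the port range are assumptions made for illustration. Note that a port found free here can still be taken by another process before it is actually used, so the sketch narrows the race rather than eliminating it.

```python
import random
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


def pick_free_port(start=None, low=20000, high=60000, max_tries=100):
    """Start from a random (or given) port and advance until a free one is found."""
    port = start if start is not None else random.randint(low, high)
    for _ in range(max_tries):
        if is_port_free(port):
            return port
        # Busy: try the next port, wrapping around at the end of the range.
        port = port + 1 if low <= port < high else low
    raise RuntimeError("no free port found")


# Occupy a port, then check that the picker skips it.
busy = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
busy.bind(("127.0.0.1", 0))          # let the OS choose a port
busy.listen(1)
busy_port = busy.getsockname()[1]
busy_detected = not is_port_free(busy_port)
chosen = pick_free_port(start=busy_port)
busy.close()
```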
Botond Dénes	68b2ac541c	Merge 'streaming: fix the way a reason of streaming failure is determined' from Aleksandra Martyniuk During streaming, the receiving node gets and processes mutation fragments. If this operation fails, the receiver responds with a -1 status code, unless it failed due to no_such_column_family, in which case streaming of this table should be skipped. However, when the table was dropped, an exception handler on the receiver side may get not only data_dictionary::no_such_column_family, but also seastar::nested_exception of two no_such_column_family. Encountered example: ``` ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14)) ``` In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family> clause and gets handled the same as any other exception type. Replace the try_catch clause with table_sync_and_check, which synchronizes the schema and checks whether the table exists. Fixes: https://github.com/scylladb/scylladb/issues/22834. Needs backport to all live versions, as they all contain the bug Closes scylladb/scylladb#22868 * github.com:scylladb/scylladb: streaming: fix the way a reason of streaming failure is determined streaming: save a continuation lambda streaming: use streaming namespace in table_check.{cc,hh} repair: streaming: move table_check.{cc,hh} to streaming	2025-03-14 07:25:00 +02:00
Kefu Chai	31320399e8	test: sstable_test: use `auto` instead of `statistics` to avoid name collision Replace explicit `statistics` type with `auto` in sstable_test to resolve name collision. This addresses ambiguity introduced by commit 87c221cb which added `struct statistics` in `seastar/include/seastar/net/api.hh`, conflicting with the existing definition in `scylladb/sstables/types.hh` when the `seastar` namespace is opened. The `auto` keyword avoids the need to explicitly reference either type, cleanly resolving the collision while maintaining functionality. This change prepares for the upcoming change to bump up seastar submodule. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23249	2025-03-13 22:51:21 +02:00
Avi Kivity	696ce4c982	Merge "convert some parts of the gossiper to host ids" from Gleb " This series starts the conversion of the gossiper to use host ids to index nodes. It does not touch the main map yet, but converts a lot of internal code to host id. There are also some unrelated cleanups that were done while working on the series. One of which is dropping code related to old shadow round. We replaced shadow round with explicit GOSSIP_GET_ENDPOINT_STATES verb in `cd7d64f588` which is in scylla-4.3.0, so there should be no compatibility problem. We already dropped a lot of old shadow round code in previous patches anyway. I tested manually that old and new nodes can co-exist in the same cluster, " * 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits) gossiper: drop unneeded code gossiper: move _expire_time_endpoint_map to host_id gossiper: move _just_removed_endpoints to host id gossiper: drop unused get_msg_addr function messaging_service: change connection dropping notification to pass host id only messaging_service: pass host id to remove_rpc_client in down notification treewide: pass host id to endpoint_lifecycle_subscriber treewide: drop endpoint life cycle subscribers that do nothing load_meter: move to host id treewide: use host id directly in endpoint state change subscribers treewide: pass host id to endpoint state change subscribers gossiper: drop deprecated unsafe_assassinate_endpoint operation storage_service: drop unused code in handle_state_removed treewide: drop endpoint state change subscribers that do nothing gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory gossiper: start using host ids to send messages earlier messaging_service: add temporary address map entry on incoming connection topology_coordinator: notify about IP change from sync_raft_topology_nodes as well treewide: move everyone to use host id based gossiper::is_alive and drop ip based one storage_proxy: drop unused 
template ...	2025-03-13 13:36:31 +02:00
Kefu Chai	5eba29e376	ent/encryption: correct misspellings these misspellings were flagged by codespell. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23254	2025-03-13 13:07:34 +02:00
Kefu Chai	9f411f9962	tools/scylla-nodetool: refactor to use std::tie() for cleaner code Replace explicit pair member access with std::tie() throughout scylla-nodetool. This simplifies the code by eliminating repetitive pair.first/pair.second references and makes the codebase more maintainable and readable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23250	2025-03-13 11:56:07 +02:00
Dawid Mędrek	0a6137218a	db/hints: Cancel draining when stopping node Draining hints may occur in one of the two scenarios: * a node leaves the cluster and the local node drains all of the hints saved for that node, * the local node is being decommissioned. Draining may take some time and the hint manager won't stop until it finishes. It's not a problem when decommissioning a node, especially because we want the cluster to retain the data stored in the hints. However, it may become a problem when the local node started draining hints saved for another node and now it's being shut down. There are two reasons for that: * Generally, in situations like that, we'd like to be able to shut down nodes as fast as possible. The data stored in the hints won't disappear from the cluster yet since we can restart the local node. * Draining hints may introduce flakiness in tests. Replaying hints doesn't have the highest priority and it's reflected in the scheduling groups we use as well as the explicitly enforced throughput. If there are a large number of hints to be replayed, it might affect our tests. It's already happened, see: scylladb/scylladb#21949. To solve those problems, we change the semantics of draining. It will behave as before when the local node is being decommissioned. However, when the local node is only being stopped, we will immediately cancel all ongoing draining processes and stop the hint manager. To amend for that, when we start a node and it initializes a hint endpoint manager corresponding to a node that's already left the cluster, we will begin the draining process of that endpoint manager right away. That should ensure all data is retained, while possibly speeding up the shutdown process. There's a small trade-off to it, though. If we stop a node, we can then remove it. It won't have a chance to replay hints it might've before these changes, but that's an edge case. We expect this commit to bring more benefit than harm. 
We also provide tests verifying that the implementation works as intended. Fixes scylladb/scylladb#21949 Closes scylladb/scylladb#22811	2025-03-13 11:55:15 +02:00
Paweł Zakrzewski	d483051e44	cql3/select_statement: reject aggregate functions when PER PARTITION LIMIT is present Before this patch we silently allowed and ignored PER PARTITION LIMIT. While using aggregate functions in conjunction with PER PARTITION LIMIT can make sense, we want to disable it until we can offer proper implementation, see #9879 for discussion. We want to match Cassandra, and for queries with aggregate functions it behaves as follows: - it silently ignores PER PARTITION LIMIT if GROUP BY is present, which matches our previous implementation. - rejects PER PARTITION LIMIT when GROUP BY is not present. This patch adds rejection of the second group. Fixes #9879 Closes scylladb/scylladb#23086	2025-03-13 10:29:53 +02:00
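The acceptance rule described above can be written as a small decision function. This Python sketch is only an illustration of the rule as stated in the commit message, not the cql3 implementation; `per_partition_limit_outcome` and its boolean parameters are hypothetical names.

```python
def per_partition_limit_outcome(has_aggregates, has_group_by,
                                has_per_partition_limit):
    """Return how PER PARTITION LIMIT is treated for a SELECT.

    'absent'  - the clause is not present
    'applied' - no aggregate functions, the limit works as usual
    'ignored' - aggregates with GROUP BY: silently ignored
    Raises ValueError for aggregates without GROUP BY.
    """
    if not has_per_partition_limit:
        return "absent"
    if not has_aggregates:
        return "applied"
    if has_group_by:
        return "ignored"
    raise ValueError(
        "PER PARTITION LIMIT is not allowed with aggregate functions "
        "unless GROUP BY is present")


outcome_plain = per_partition_limit_outcome(False, False, True)
outcome_grouped = per_partition_limit_outcome(True, True, True)
```

Only the combination of aggregate functions, no GROUP BY and PER PARTITION LIMIT is rejected; the error text in the sketch is made up for the example.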
Pavel Emelyanov	f50bcbf4d0	test/perf/s3: Don't forget to stop sharded<tester> on error In case invoke_on_all(tester::start) throws, the sharded<tester> instance remains non-stopped and calltrace is reported on test stop. Not nice, fix it so that sharded<> thing is stopped in any case. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23244	2025-03-13 09:54:09 +02:00
Anna Stuchlik	562b5db5b8	doc: Remove "experimental" from ALTER KEYSPACE with Tablets Altering a keyspace with tablets is no longer experimental. This commit removes the "Experimental" label from the feature. Fixes https://github.com/scylladb/scylladb/issues/23166 Closes scylladb/scylladb#23183	2025-03-12 17:41:36 +02:00
Kefu Chai	68fc067106	perf/perf_sstable: fix the indent Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-03-12 19:00:50 +08:00
Kefu Chai	4f62f79622	perf/perf_sstable: stop using at_exit() seastar::at_exit() was marked deprecated recently. so let's use the recommended approach to perform cleanups. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-03-12 19:00:50 +08:00
Nadav Har'El	3ca2e6ddda	Merge 's3_client: Add retries to Security Token Service/EC2 instance metadata credentials providers' from Ernest Zaslavsky Several updates and improvements to the retryable HTTP client functionality, as well as enhancements to error handling and integration with AWS services, as part of this PR. Below is a summary of the changes: - Moved the retryable HTTP client functionality out of the S3 client to improve modularity and reusability across other services like AWS STS. - Isolated the retryable_http_client into its own file, improving clarity and maintainability. - Added a make_request method that introduces a response-skipping handler. - Introduced a custom error handler constructor, providing greater flexibility in handling errors. - Updated the STS and Instance Metadata Service credentials providers to utilize the new retryable HTTP client, enhancing their robustness and reliability. - Extended the AWS error list to handle errors specific to the STS service, ensuring more granular and accurate error management for STS operations. - Enhanced error handling for system errors returned by Seastar’s HTTP client, ensuring smoother operations. - Properly closed the HTTP client in instance_profile_credentials_provider and sts_assume_role_credentials_provider to prevent resource leaks. - Reduced the log severity in the retry strategy to avoid SCT test failures that occur when any log message is tagged as an ERROR. 
No backport needed since we dont have any s3 related activity on the scylla side been released Closes scylladb/scylladb#21933 * github.com:scylladb/scylladb: s3_client: Adjust Log Severity in Retry Strategy aws_error: Enhance error handling for AWS HTTP client aws_error: Add STS specific error handling credentials_providers: Close retryable clients in Credentials Providers credentials_providers: Integrate retryable_http_client with Credentials Providers s3_client: enhance `retryable_http_client` functionality s3_client: isolate `retryable_http_client` s3_client: Prepare for `retryable_http_client` relocation s3_client: Remove `is_redirect_status` function s3_client: Move retryable functionality out of s3 client	2025-03-12 10:19:15 +02:00
Avi Kivity	b1d9f80d85	Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec Before this patch, the load balancer was equalizing tablet count per shard, so it achieved balance assuming that: 1) tablets have the same size 2) shards have the same capacity That can cause imbalance of utilization if shards have different capacity, which can happen in heterogeneous clusters with different instance types. One of the causes for capacity difference is that larger instances run with fewer shards due to vCPUs being dedicated to IRQ handling. This makes those shards have more disk capacity, and more CPU power. After this patch, the load balancer equalizes shard's storage utilization, so it no longer assumes that shards have the same capacity. It still assumes that each tablet has equal size. So it's a middle step towards full size-aware balancing. One consequence is that to be able to balance, the load balancer need to know about every node's capacity, which is collected with the same RPC which collects load_stats for average tablet size. This is not a significant set back because migrations cannot proceed anyway if nodes are down due to barriers. We could make intra-node migration scheduling work without capacity information, but it's pointless due to above, so not implemented. Also, per-shard goal for tablet count is still the same for all nodes in the cluster, so nodes with less capacity will be below limit and nodes with more capacity will be slightly above limit. This shouldn't be a significant problem in practice, we could compensate for this by increasing the limit. 
Refs #23042 Closes scylladb/scylladb#23079 * github.com:scylladb/scylladb: tablets: Make load balancing capacity-aware topology_coordinator: Fix confusing log message topology_coordinator: Refresh load stats after adding a new node topology_coordinator: Allow capacity stats to be refreshed with some nodes down topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places test: boost: tablets_test: Always provide capacity in load_stats test: perf_load_balancing: Set node capacity test: perf_load_balancing: Convert to topology_builder config, disk_space_monitor: Allow overriding capacity via config storage_service, tablets: Collect per-node capacity in load_stats	2025-03-11 14:34:27 +02:00
Gleb Natapov	57f2b6d825	gossiper: drop unneeded code host_id is already available at this point.	2025-03-11 12:09:22 +02:00
Gleb Natapov	cca228265e	gossiper: move _expire_time_endpoint_map to host_id Index _expire_time_endpoint_map map by host id instead of ip	2025-03-11 12:09:22 +02:00
Gleb Natapov	c45b50bbe6	gossiper: move _just_removed_endpoints to host id Index _just_removed_endpoints map by host id instead of ip	2025-03-11 12:09:22 +02:00
Gleb Natapov	22739bb39a	gossiper: drop unused get_msg_addr function	2025-03-11 12:09:22 +02:00
Gleb Natapov	b3720b80b6	messaging_service: change connection dropping notification to pass host id only Only host id is needed in the callback anyway.	2025-03-11 12:09:22 +02:00
Gleb Natapov	24d30073f9	messaging_service: pass host id to remove_rpc_client in down notification Do not iterate over all clients indexed by host id to search for those with given IP. Look up by host id directly since now we know it in down notification. In cases host id is not known look it up by ip.	2025-03-11 12:09:22 +02:00
Gleb Natapov	4ca627b533	treewide: pass host id to endpoint_lifecycle_subscriber	2025-03-11 12:09:22 +02:00
Gleb Natapov	8a747fbc2a	treewide: drop endpoint life cycle subscribers that do nothing Provide default implementation for them instead. Will be easier to rework them later.	2025-03-11 12:09:22 +02:00
Gleb Natapov	525b88f877	load_meter: move to host id Use host id indexing in load_meter and only convert to ips on api level.	2025-03-11 12:09:22 +02:00
Gleb Natapov	48a1030c91	treewide: use host id directly in endpoint state change subscribers Now that we have host ids in endpoint state change subscribers some of them can be simplified by using the id directly instead of looking it up by ip.	2025-03-11 12:09:22 +02:00
Gleb Natapov	499eb4d17f	treewide: pass host id to endpoint state change subscribers	2025-03-11 12:09:22 +02:00
Gleb Natapov	eb59205caf	gossiper: drop deprecated unsafe_assassinate_endpoint operation It was always deprecated.	2025-03-11 12:09:21 +02:00
Gleb Natapov	c17a8b4a76	storage_service: drop unused code in handle_state_removed	2025-03-11 12:09:21 +02:00
Gleb Natapov	696aee3adc	treewide: drop endpoint state change subscribers that do nothing Provide default implementation for them instead. Will be easier to rework them later.	2025-03-11 12:09:21 +02:00
Gleb Natapov	7dcffda6bd	gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory	2025-03-11 12:09:21 +02:00
Gleb Natapov	8425c26462	gossiper: start using host ids to send messages earlier Send digest ack and ack2 by host ids as well now since the id->ip mapping is available after receiving digest syn. It allows to convert more code to host id here.	2025-03-11 12:09:21 +02:00
Gleb Natapov	f0af3f261e	messaging_service: add temporary address map entry on incoming connection We want to move to use host ids as soon as possible. Currently it is possible only after the full gossiper exchange (because only at this point gossiper state is added and with it address map entry). To make it possible to move to host ids earlier this patch adds address map entries on incoming communication during CLIENT_ID verb processing. The patch also adds generation to CLIENT_ID to use it when address map is updated. It is done so that older gossiper entries can be overwritten with newer mapping in case of IP change.	2025-03-11 12:09:21 +02:00
Gleb Natapov	c3035caeb5	topology_coordinator: notify about IP change from sync_raft_topology_nodes as well Currently sync_raft_topology_nodes() only sends join notification if a node is new in the topology, but sometimes a node changes IP and the join notification should be sent for the new IP as well. Usually it is done from ip_address_updater, but topology reload can run first and then the notification will be missed. The solution is to send notification during topology reload as well.	2025-03-11 12:09:21 +02:00
Gleb Natapov	0e3dcb7954	treewide: move everyone to use host id based gossiper::is_alive and drop ip based one	2025-03-11 12:09:21 +02:00
Gleb Natapov	56c6e04079	storage_proxy: drop unused template The storage_proxy::is_alive is called with host_id only.	2025-03-11 12:09:21 +02:00
Gleb Natapov	e47f251178	gossiper: move _live_endpoints and _unreachable_endpoints endpoint to host_id Index live and dead endpoints by host id. It also allows to simplify some code that does a translation.	2025-03-11 12:09:21 +02:00
Gleb Natapov	6f05608b5e	gossiper: chunk vector using std::views::chunk instead of explicitly coding it	2025-03-11 12:09:21 +02:00
Gleb Natapov	0437f558cd	idl: generate ip based version of a verb only for verbs that need it The patch adds new marker for a verb - [[ip]] that means that for this verb ip version of the verbs needs to be generated. Most of the verbs do not need it.	2025-03-11 12:09:21 +02:00
Gleb Natapov	3734afe8a5	gossiper: send shutdown notification by host id	2025-03-11 12:09:21 +02:00
Gleb Natapov	ee59baf6fc	gossiper: drop old shadow round code It is no longer used. It was replaced with explicit GOSSIP_GET_ENDPOINT_STATES verb in `cd7d64f588` which is in scylla-4.3.0	2025-03-11 12:09:20 +02:00
Gleb Natapov	f1a82c1d01	gossiper: drop unused get_endpoint_states function	2025-03-11 12:09:20 +02:00
Gleb Natapov	c4a0fbae16	gossiper: check id match inside force_remove_endpoint Before calling force_remove_endpoint (which works on ip) the code checks that the ip maps to the correct id (to not remove a new node that inherited this ip by mistake). Move the check to the function itself.	2025-03-11 12:09:20 +02:00
Gleb Natapov	52c9217f1b	migration_manager: drop unneeded id to ip translation	2025-03-11 12:09:20 +02:00
Gleb Natapov	4420ddaf86	gossiper: move is_gossip_only_member and its users to work on host id	2025-03-11 12:09:20 +02:00
Gleb Natapov	cb2b874942	table: use host id based get_endpoint_state_ptr and skip id->ip translation	2025-03-11 12:09:20 +02:00
Gleb Natapov	2746d391af	gossiper: do not ping outdated address A node may change its IP but some other node in the cluster may still try to ping it using an old IP because it may receive an outdated gossiper entry with the old IP. Do not send echo message to the old IP. It will cause a misusing UP message with old address to be printed.	2025-03-11 12:09:20 +02:00
Gleb Natapov	aaba55073d	storage_service: drop outdated code that checks whether raft topology should be used After raft_topology_change_enabled() was introduced the code does nothing useful. The function is responsible for the decision if raft topology is enabled or not.	2025-03-11 12:09:20 +02:00
Gleb Natapov	6952f62869	gossiper: drop unused field from loaded_endpoint_state	2025-03-11 12:09:20 +02:00
Nikos Dragazis	7a6a4f54a5	cql3: secondary index: Limit page size for single-row partitions The size of the partition range vector was constrained in the previous patch. Any rows beyond the vector's capacity are discarded. In the special case of single-row partitions, we know the size of each partition, so we can enforce this limit on the query itself via the page size. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-03-10 12:18:49 +02:00
Nikos Dragazis	76b31a3acc	cql3: secondary index: Limit the size of partition range vectors The partition range vector is an std::vector, which means it performs contiguous allocations. Large allocations are known to cause problems (e.g., reactor stalls). For paged queries, limit the vector size to 1000. If more partition keys are available in the query result, discard them. Ideally, we should not be fetching them at all, but this is not possible without knowing the size of each partition. Currently, each vector element is 120 bytes and the standard allocator's max preferred contiguous allocation is 128KiB. Therefore, the chosen value of 1000 satisfies the constraint (128 KiB / 120 = 1092 > 1000). This should be good enough for most cases. Since secondary index queries involve one base table query per partition key, these queries are slow. A higher limit would only make them slower and increase the probability of a timeout. For the same reason, saving a follow-up paged request from the client would not increase the efficiency much. For unpaged queries, do not apply any limit. This means they remain susceptible to stalls, but unpaged queries are considered unoptimized anyway. Finally, update the unit test reproducer since the bug is now fixed. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-03-10 12:18:42 +02:00
Pavel Emelyanov	db70c7bbf7	api: Remove the remaining parse_tables() overload There's only one caller of it left -- the scrub handler. It can use the parse_table_infos() one and get table names from it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:14:10 +03:00
Pavel Emelyanov	89f3c1a91e	database: Sanitize flush_tables_on_all_shards() Previous patch left this method with few uglinesses - the vector<table_id> argument is named table_names - the sstring keyspace argument is unused - the keyspace argument is captured for no use Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:13:10 +03:00
Pavel Emelyanov	0f9cc956f4	schema_tables: Remove all_table_names() Now it's unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:12:56 +03:00
Pavel Emelyanov	c2d23d7948	database: Make tables flushing helper use table_info-s, not names The database::flush_tables_on_all_shards() method accepts a keyspace name and a vector of table names. Then it converts ks:cf pair for each of the table name into a table-id and flushes the table with the ID. All the callers of that method already have or can easily get the vector of table_id-s, not just names, so make use of this. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:11:32 +03:00
Pavel Emelyanov	e94dce1725	api: Make keyspace flush endpoint use parse_table_infos() (and a bit more) Currently the handler in question calls parse_tables() which returns empty list of tables if the "cf" parameter is missing, or the table names if it's present. In the former case the handler will call flush_keyspace_on_all_shards() that just gets all table names from the keyspace and flushes them all. This change makes the handler use parse_table_infos() which is different -- when the "cf" parameter is missing, it gets all tables from the keyspace. So the handler no longer needs to call the keyspace flush, it can always call the "flush the list of tables" helper. With that change one of the parse_tables() helpers becomes unused, so remove it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:06:55 +03:00
Pavel Emelyanov	5a897d7368	schema_tables,client_state: Switch to using all_table_infos() There are few more places left that can use all_table_infos() as a replacement for all_table_names(), patch them. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:05:59 +03:00
Pavel Emelyanov	da05765746	schema_tables: Tune up some methods to benefit from table_infos There are convert_schema_to_mutations() and calculate_schema_digest() that collect table names and then use them to find schema and query mutations from the table. Both can use the newly introduced all_table_infos() and use the returned table_id-s to do the same, thus avoiding re-lookups (which are fast anyway, but still). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 13:01:50 +03:00
Pavel Emelyanov	d7bfa5a545	schema_tables: Introduce all_table_infos() This method is like all_table_names(), but returns a vector of table_info-s which is effectively a pair of string name and uuid id. To be used later, and the string-returning all_table_names() will be removed very soon too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-03-10 12:59:03 +03:00
Ernest Zaslavsky	c8de7619e5	s3_client: Adjust Log Severity in Retry Strategy * Reduced log severity in retry_strategy. * Rationale: SCT fails tests when any message is logged as ERROR.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	8e46929474	aws_error: Enhance error handling for AWS HTTP client - Seastar's HTTP client is known to throw exceptions for various reasons, including network errors, TLS errors and other transient issues. - Update error handling to correctly capture and process all exceptions from Seastar's HTTP client. - Previously, only aws_exception was handled, causing retryable errors to be missed and `should_retry` not invoked. - Now, all exceptions trigger the appropriate retry logic per the intended strategy. - Add tests for the S3 proxy to ensure robustness and reliability of these enhancements.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	92a12c96a2	aws_error: Add STS specific error handling Updated the AWS error list to include handling for errors specific to the STS service. This enhancement ensures more comprehensive error management for STS-related operations.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	a371d6cf62	credentials_providers: Close retryable clients in Credentials Providers Updated `instance_profile_credentials_provider` and `sts_assume_role_credentials_provider` to close the HTTP client appropriately.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	45a6e88954	credentials_providers: Integrate retryable_http_client with Credentials Providers * Updated STS and Instance Metadata Service credentials providers to utilize retryable_http_client.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	7c49ee4520	s3_client: enhance `retryable_http_client` functionality Enhanced `retryable_http_client` by allowing the injection of a custom error handler through its constructor.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	b589a882bb	s3_client: isolate `retryable_http_client` Relocated `retryable_http_client` into its own dedicated file for improved clarity and maintainability.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	5eff83af95	s3_client: Prepare for `retryable_http_client` relocation Expose `map_s3_client_exception` outside the S3 client class to facilitate moving `retryable_http_client` to a separate file.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	2b3abba10a	s3_client: Remove `is_redirect_status` function Eliminate the `is_redirect_status` function in favor of the equivalent functionality provided by Seastar's HTTP client.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	5b7d4a4136	s3_client: Move retryable functionality out of s3 client This commit moves the retryable HTTP client functionality out of the S3 client implementation. Since this functionality is also required for other services, such as AWS STS, it has been separated to ensure broader applicability.	2025-03-10 09:01:47 +02:00
Ernest Zaslavsky	050c3cdbc2	tests: Add Tests for Scylla-SSTable S3 Functionality Extended existing Scylla Tools tests to cover the new functionality of reading SSTables from S3. This ensures that the new S3 integration is thoroughly tested and performs as expected.	2025-03-09 10:17:48 +02:00
Ernest Zaslavsky	112b4c8764	docs: Update Scylla Tools Documentation for S3 SSTable Support Updated the Scylla Tools documentation to include changes related to the enhanced support for S3-stored SSTables. This update ensures that the documentation accurately reflects the latest functionality and improvements.	2025-03-09 09:50:37 +02:00
Ernest Zaslavsky	17e3c01f4e	scylla-sstable: Enable Support for S3 SSTables Configure the sstable manager to correctly handle storage options based on the input type (local or S3-stored sstables). This tweak allows for mixing both storage types within a single call, improving flexibility and functionality.	2025-03-09 09:50:36 +02:00
Ernest Zaslavsky	88c4fa6569	s3: Implement S3 Fully Qualified Name Manipulation Functions Added utility functions to handle S3 Fully Qualified Names (FQN). These functions enable parsing, splitting, and identification of S3 paths, enhancing our ability to work with S3 object storage more effectively.	2025-03-09 09:50:36 +02:00
Ernest Zaslavsky	38165fd285	object_storage: Refactor `object_storage.yaml` parsing logic Refactored the parsing of `object_storage.yaml` out of Scylla's `main` function. This change is made to facilitate reusability of the parsing logic in other parts of the codebase.	2025-03-09 09:50:36 +02:00
Vlad Zolotarov	f7e1695068	CQL Tracing: set common query parameters in a single function Each query-type (QUERY, EXECUTE, BATCH) CQL opcode has a number of parameters in their payload which we always want to record in the Tracing object. Today it's a Consistency Level, Serial Consistency Level and a Default Timestamp. Setting each of them individually can lead to a human error when one (or more) of them would not be set. Let's eliminate such a possibility by defining a single function that sets them all. This also allows an easy addition of such parameters to this function in the future.	2025-03-06 09:30:51 -05:00
Aleksandra Martyniuk	35bc1fe276	streaming: fix the way a reason of streaming failure is determined During streaming receiving node gets and processes mutation fragments. If this operation fails, receiver responds with -1 status code, unless it failed due to no_such_column_family in which case streaming of this table should be skipped. However, when the table was dropped, an exception handler on receiver side may get not only data_dictionary::no_such_column_family, but also seastar::nested_exception of two no_such_column_family. Encountered example: ``` ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14)) ``` In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family> clause and gets handled the same as any other exception type. Replace try_catch clause with table_sync_and_check that synchronizes the schema and check if the table exists. Fixes: https://github.com/scylladb/scylladb/issues/22834.	2025-03-06 15:07:14 +01:00
Aleksandra Martyniuk	44748d624d	streaming: save a continuation lambda In the following patches, an additional preemption point will be added to the coroutine lambda in register_stream_mutation_fragments. Assign a lambda to a variable to prolong the captures lifetime.	2025-03-06 15:07:09 +01:00
Tomasz Grabiec	c4714180cc	tablets: Make load balancing capacity-aware Before this patch the load balancer was equalizing tablet count per shard, so it achieved balance assuming that: 1) tablets have the same size 2) shards have the same capacity That can cause imbalance of utilization if shards have different capacity, which can happen in heterogeneous clusters with different instance types. One of the causes for capacity difference is that larger instances run with fewer shards due to vCPUs being dedicated to IRQ handling. This makes those shards have more disk capacity, and more CPU power. After this patch, the load balancer equalizes shard's storage utilization, so it no longer assumes that shards have the same capacity. It still assumes that each tablet has equal size. So it's a middle step towards full size-aware balancing. One consequence is that to be able to balance, the load balancer need to know about every node's capacity, which is collected with the same RPC which collects load_stats for average tablet size. This is not a significant set back because migrations cannot proceed anyway if nodes are down due to barriers. We could make intra-node migration scheduling work without capacity information, but it's pointless due to above, so not implemented.	2025-03-06 13:35:38 +01:00
Tomasz Grabiec	3c0b733943	topology_coordinator: Fix confusing log message There can be other reasons the plan is empty, tablets may not actually be balanced. For example, capacity for all the nodes may not be known, or nodes may be down.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	40414c4985	topology_coordinator: Refresh load stats after adding a new node Stats are refreshed every minute by default. Load balancing cannot happen without capacity information for all normal nodes. To avoid the delay, trigger refresh after adding a new node.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	d6f8810e66	topology_coordinator: Allow capacity stats to be refreshed with some nodes down With capacity-aware balancing, if we're missing capacity for a normal node, we won't be able to proceed with tablet drain. Consider the following scenario: 1. Nodes: A, B 2. refresh stats with A and B 3. Add node C 4. Node B goes down 5. removenode B starts 6. stats refreshing fails because B is down If we don't have capacity stats for node C, load balancer cannot make decisions and removenode is blocked indefinitely. A reproducer is added in this patch. To alleviate that, we allow capacity stats to be collected for nodes which are reachable, we just don't update the table size part. To keep table stats monotonic, we cache previous results per node, so even if it's unreachable now, we use its last reported sizes. It's still more accurate than not refreshing stats at all. A node can be down for a long period, and other replicas can grow in size. It's not perfect, because the stale node can skew the stats in its direction, but ignoring it completely has its pitfalls too. Better solution is left for later.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	af3dce4c8a	topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places Use serialized_action for serialization and batching.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	69c49fb1a7	test: boost: tablets_test: Always provide capacity in load_stats Move shared_load_stats to topology_builder.hh so that topology_builder can maintain it. It will set capacity for all created nodes. Needed after load balancer requires capacity to make decisions.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	dfc9101dfd	test: perf_load_balancing: Set node capacity Otherwise, load balancer will not make any plan once it becomes capacity-aware.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	6169401dbc	test: perf_load_balancing: Convert to topology_builder The test no longer worked because load balancer requires proper schema in the database now. Convert to topology_builder which builds topology in the database and creates schema in the database (which needs proper topology).	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	d01cc16d1e	config, disk_space_monitor: Allow overriding capacity via config Intended for testing, or hot-fixing out-of-space issues in production. Tablet load balancer uses this information for determining per-shard load so reducing capacity will cause tablets to be migrated away from the node.	2025-03-06 13:35:37 +01:00
Tomasz Grabiec	7e7f1e6f91	storage_service, tablets: Collect per-node capacity in load_stats New RPC is introduced because load_stats was marked "final" in the IDL. Will be needed by capacity-aware load balancing.	2025-03-06 12:17:32 +01:00
Vlad Zolotarov	ca6bddef35	transport/server.cc: set default timestamp info in EXECUTE and BATCH tracing A default timestamp (not to confuse with the timestamp passed via 'USING TIMESTAMP' query clause) can be set using 0x20 flag and the <timestamp> field in the binary CQL frame payload of QUERY, EXECUTE and BATCH ops. It also happens to be a default of a Java CQL Driver. However, we were only setting the corresponding info in the CQL Tracing context of a QUERY operation. For an unknown reason we were not setting this for an EXECUTE and for a BATCH traces (I guess I simply forgot to set it back then). This patch fixes this. Fixes #23173	2025-03-05 20:37:37 -05:00
Aleksandra Martyniuk	faf3aa13db	streaming: use streaming namespace in table_check.{cc,hh}	2025-03-05 11:00:03 +01:00
Aleksandra Martyniuk	876cf32e9d	repair: streaming: move table_check.{cc,hh} to streaming	2025-03-05 11:00:03 +01:00
Benny Halevy	8ae8275f17	main: stop system keyspace To prevent internal queries coming from system_keyspace (like updating compaction history, for example) Refs scylladb/scylla-dtest#5581 Refs #22886 Refs #8995 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:30:23 +02:00
Benny Halevy	7a624e3df8	system_keyspace: call shutdown from stop and use that to replace the explicit shutdown when stopped in cql_test_env. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:30:23 +02:00
Benny Halevy	102aec64d5	system_keyspace: shutdown: allow calling more than once Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:30:22 +02:00
Benny Halevy	fba88bdd62	database, compaction_manager, large_data_handler: use pluggable<system_keyspace> To allow safe plug and unplug of the system_keyspace. This patch follows-up on `917fdb9e53` (more specifically - `f9b57df471`) Since just keeping a shared_ptr<system_keyspace> doesn't prevent stopping the system_keyspace shards, while using the `pluggable` interface allows safe draining of outstanding async calls on shutdown, before stopping the system_keyspace. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:27:23 +02:00
Benny Halevy	13a22cb6fd	utils: add class pluggable A wrapper around a shared service allowing safe plug and unplug of the service from its user using a phased-barrier operation permit guarding the service while in use. Also add a unit test for this class. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:25:50 +02:00
Nikos Dragazis	03902e5f17	cql3: untyped_result_set: Store rows in chunked_vector The `untyped_result_set` stores rows in std::vector. Switch to `chunked_vector` to prevent large allocations and data copies. One such case is in secondary index queries, where we convert the result of the internal index view query into an `untyped_result_set` for processing. The result is bound by the page size memory limit (1MiB by default), so it can cause large allocations of this magnitude. This patch aligns `untyped_result_set` with `result_set`, which also uses a `chunked_vector`. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-03-04 18:39:32 +02:00
Nikos Dragazis	892690b953	test: Reproduce bug with large allocations from secondary index Secondary index queries which fetch partitions from the base table can cause large allocations that can lead to reactor stalls. Reproduce this with a unit test that runs an indexed query on a table with thousands of single-row partitions, and checks the memory stats for any large contiguous allocations. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-03-04 18:39:28 +02:00

1196 changed files with 54096 additions and 16024 deletions

14

.github/CODEOWNERS vendored

@@ -1,5 +1,5 @@
 # AUTH
 auth/* @nuivall @ptrsmrn @KrzaQ
 auth/* @nuivall @ptrsmrn
 # CACHE
 row_cache* @tgrabiec
@@ -25,15 +25,15 @@ compaction/* @raphaelsc
 transport/*
 # CQL QUERY LANGUAGE
 cql3/* @tgrabiec @nuivall @ptrsmrn @KrzaQ
 cql3/* @tgrabiec @nuivall @ptrsmrn
 # COUNTERS
 counters* @nuivall @ptrsmrn @KrzaQ
 tests/counter_test* @nuivall @ptrsmrn @KrzaQ
 counters* @nuivall @ptrsmrn
 tests/counter_test* @nuivall @ptrsmrn
 # DOCS
 docs/* @annastuchlik @tzach
 docs/alternator @annastuchlik @tzach @nyh @nuivall @ptrsmrn @KrzaQ
 docs/alternator @annastuchlik @tzach @nyh
 # GOSSIP
 gms/* @tgrabiec @asias @kbr-scylla
@@ -74,8 +74,8 @@ streaming/* @tgrabiec @asias
 service/storage_service.* @tgrabiec @asias
 # ALTERNATOR
 alternator/* @nyh @nuivall @ptrsmrn @KrzaQ
 test/alternator/* @nyh @nuivall @ptrsmrn @KrzaQ
 alternator/* @nyh
 test/alternator/* @nyh
 # HINTED HANDOFF
 db/hints/* @piodul @vladzcloudius @eliransin

									
97

.github/ISSUE_TEMPLATE/bug_report.yml vendored

@@ -1,15 +1,86 @@
 This is Scylla's bug tracker, to be used for reporting bugs only.
 name: "Report a bug"
 description: "File a bug report."
 title: "[Bug]: "
 type: "bug"
 labels: bug
 body:
   - type: checkboxes
     id: terms
     attributes:
       label: Code of Conduct
       description: "This is Scylla's bug tracker, to be used for reporting bugs only.
 If you have a question about Scylla, and not a bug, please ask it in
 our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.
 our forum at https://forum.scylladb.com/ or in our slack channel https://slack.scylladb.com/ "
       options:
         - label: I have read the disclaimer above and am reporting a suspected malfunction in Scylla.
           required: true
 - [] I have read the disclaimer above, and I am reporting a suspected malfunction in Scylla.
 *Installation details*
 Scylla version (or git commit hash):
 Cluster size:
 OS (RHEL/CentOS/Ubuntu/AWS AMI):
 *Hardware details (for performance issues)*          Delete if unneeded
 Platform (physical/VM/cloud instance type/docker):
 Hardware: sockets= cores= hyperthreading= memory=
 Disks: (SSD/HDD, count)
   - type: input
     id: product-version
     attributes:
       label: product version
       description: Scylla version (or git commit hash)
       placeholder: ex. scylla-6.1.1
     validations:
       required: true
   - type: input
     id: cluster-size
     attributes:
       label: Cluster Size
     validations:
       required: true  
   - type: input
     id: os
     attributes:
       label: OS
       placeholder: RHEL/CentOS/Ubuntu/AWS AMI
     validations:
       required: true
   - type: textarea
     id: additional-data
     attributes:
       label: Additional Environmental Data
       #description: 
       placeholder: Add additional data
       value: "Platform (physical/VM/cloud instance type/docker):\n
 Hardware: sockets=   cores=   hyperthreading=   memory=\n
 Disks: (SSD/HDD, count)"
     validations:
       required: false
   - type: textarea
     id: reproducer-steps
     attributes:
       label: Reproduction Steps
       placeholder: Describe how to reproduce the problem
       value: "The steps to reproduce the problem are:"
     validations:
       required: true
   - type: textarea
     id: the-problem
     attributes:
       label: What is the problem?
       placeholder: Describe the problem you found
       value: "The problem is that"
     validations:
       required: true
   - type: textarea
     id: what-happened
     attributes:
       label: Expected behavior?
       placeholder: Describe what should have happened

				      value: "I expected that "

				    validations:

				      required: true

				  - type: textarea

				    id: logs

				    attributes:

				      label: Relevant log output

				      description: Please copy and paste any relevant log output. This will be automatically formatted into code, so no need for backticks.

				      render: shell

									
										50

.github/scripts/auto-backport.py
									
				@@ -52,7 +52,7 @@ def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr

				        if is_draft:

				            backport_pr.add_to_labels("conflicts")

				            pr_comment = f"@{pr.user.login} - This PR was marked as draft because it has conflicts\n"

				            pr_comment += "Please resolve them and mark this PR as ready for review"

				            pr_comment += "Please resolve them and remove the 'conflicts' label. The PR will be made ready for review automatically."

				            backport_pr.create_issue_comment(pr_comment)

				        logging.info(f"Assigned PR to original author: {pr.user}")

				        return backport_pr

				@@ -112,29 +112,45 @@ def backport(repo, pr, version, commits, backport_base_branch, is_collaborator):

				                    is_draft = True

				                    repo_local.git.add(A=True)

				                    repo_local.git.cherry_pick('--continue')

				            repo_local.git.push(fork_repo, new_branch_name, force=True)

				            create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,

				                                is_draft, is_collaborator)

				            # Check if the branch already exists in the remote fork

				            remote_refs = repo_local.git.ls_remote('--heads', fork_repo, new_branch_name)

				            if not remote_refs:

				                # Branch does not exist, create it with a regular push

				                repo_local.git.push(fork_repo, new_branch_name)

				                create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,

				                                    is_draft, is_collaborator)

				            else:

				                logging.info(f"Remote branch {new_branch_name} already exists in fork. Skipping push.")

				        except GitCommandError as e:

				            logging.warning(f"GitCommandError: {e}")

				def with_github_keyword_prefix(repo, pr):

				    pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"

				    match = re.findall(pattern, pr.body, re.IGNORECASE)

				    if not match:

				        for commit in pr.get_commits():

				            match = re.findall(pattern, commit.commit.message, re.IGNORECASE)

				            if match:

				                print(f'{pr.number} has a valid close reference in commit message {commit.sha}')

				                break

				    if not match:

				        print(f'No valid close reference for {pr.number}')

				        return False

				    else:

				    # GitHub issue pattern: #123, scylladb/scylladb#123, or full GitHub URLs

				    github_pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"

				    # JIRA issue pattern: PKG-92 or https://scylladb.atlassian.net/browse/PKG-92

				    jira_pattern = r"(?:fix(?:|es|ed))\s*:?\s*(?:(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"

				    # Check PR body for GitHub issues

				    github_match = re.findall(github_pattern, pr.body, re.IGNORECASE)

				    # Check PR body for JIRA issues

				    jira_match = re.findall(jira_pattern, pr.body, re.IGNORECASE)

				    match = github_match or jira_match

				    if match:

				        return True

				    for commit in pr.get_commits():

				        github_match = re.findall(github_pattern, commit.commit.message, re.IGNORECASE)

				        jira_match = re.findall(jira_pattern, commit.commit.message, re.IGNORECASE)

				        if github_match or jira_match:

				            print(f'{pr.number} has a valid close reference in commit message {commit.sha}')

				            return True

				    print(f'No valid close reference for {pr.number}')

				    return False

				def main():

				    args = parse_args()
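The close-reference check in `with_github_keyword_prefix` can be tried in isolation. The following is a minimal sketch, not the script itself: it copies the two patterns from the diff, hard-codes the repository name (an assumption; the script reads it from `repo.full_name`), and wraps them in a hypothetical helper `find_close_references`.

```python
import re

# Assumption: fixed repository name; the real script uses repo.full_name.
REPO_FULL_NAME = "scylladb/scylladb"

# GitHub issue pattern: #123, scylladb/scylladb#123, or a full issue URL.
GITHUB_PATTERN = (
    rf"(?:fix(?:|es|ed))\s*:?\s*"
    rf"(?:(?:(?:{REPO_FULL_NAME})?#)|https://github\.com/{REPO_FULL_NAME}/issues/)(\d+)"
)
# JIRA issue pattern: PKG-92 or https://scylladb.atlassian.net/browse/PKG-92.
JIRA_PATTERN = (
    r"(?:fix(?:|es|ed))\s*:?\s*"
    r"(?:(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"
)


def find_close_references(text):
    """Hypothetical helper: return (github_issue_numbers, jira_keys) found in text."""
    text = text or ""  # a PR body may be empty
    github = re.findall(GITHUB_PATTERN, text, re.IGNORECASE)
    jira = re.findall(JIRA_PATTERN, text, re.IGNORECASE)
    return github, jira
```

Both patterns are matched case-insensitively, so `fixes`, `Fixes` and `FIXED` all count, while a `Closes ...` line does not.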

									
										16

.github/seastar-bad-include.json
									
				@@ -0,0 +1,16 @@

				{

				    "problemMatcher": [

				        {

				            "owner": "seastar-bad-include",

				            "severity": "error",

				            "pattern": [

				                {

				                    "regexp": "^(.+):(\\d+):(.+)$",

				                    "file": 1,

				                    "line": 2,

				                    "message": 3

				                }

				            ]

				        }

				    ]

				}

									
										2

.github/workflows/backport-pr-fixes-validation.yaml
									
				@@ -18,7 +18,7 @@ jobs:

				            // Regular expression pattern to check for "Fixes" prefix

				            // Adjusted to dynamically insert the repository full name

				            const pattern = `Fixes:? (?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)`;

				            const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|(?:https://scylladb\\.atlassian\\.net/browse/)?([A-Z]+-\\d+))`;

				            const regex = new RegExp(pattern);

				            if (!regex.test(body)) {
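For comparison with the Python backport script, here is a rough Python rendering of the updated workflow pattern. It is only a sketch: `build_fixes_regex` is a hypothetical name, and `re.escape` stands in for the workflow's `repo.replace('/', '\\/')`. Unlike the script's patterns, this one is case-sensitive and expects exactly one space after `Fixes` or `Fixes:`.

```python
import re


def build_fixes_regex(repo):
    """Hypothetical helper mirroring the workflow's 'Fixes' pattern for a given repo."""
    escaped = re.escape(repo)  # swapped in for the JavaScript replace() call
    return re.compile(
        rf"Fixes:? ((?:#|{escaped}#|https://github\.com/{escaped}/issues/)(\d+)"
        rf"|(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"
    )


regex = build_fixes_regex("scylladb/scylladb")
```

`regex.search(body)` plays the role of `regex.test(body)` in the workflow.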

									
										53

.github/workflows/call_backport_with_jira.yaml
									
				@@ -0,0 +1,53 @@

				name: Backport with Jira Integration

				on:

				  push:

				    branches:

				      - master

				      - next-*.*

				      - branch-*.*

				  pull_request_target:

				    types: [labeled, closed]

				    branches: 

				      - master

				      - next

				      - next-*.*

				      - branch-*.*

				jobs:

				  backport-on-push:

				    if: github.event_name == 'push'

				    uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main

				    with:

				      event_type: 'push'

				      base_branch: ${{ github.ref }}

				      commits: ${{ github.event.before }}..${{ github.sha }}

				    secrets:

				      gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}

				      jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

				  backport-on-label:

				    if: github.event_name == 'pull_request_target' && github.event.action == 'labeled'

				    uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main

				    with:

				      event_type: 'labeled'

				      base_branch: refs/heads/${{ github.event.pull_request.base.ref }}

				      pull_request_number: ${{ github.event.pull_request.number }}

				      head_commit: ${{ github.event.pull_request.base.sha }}

				      label_name: ${{ github.event.label.name }}

				      pr_state: ${{ github.event.pull_request.state }}

				    secrets:

				      gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}

				      jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

				  backport-chain:

				    if: github.event_name == 'pull_request_target' && github.event.action == 'closed' && github.event.pull_request.merged == true

				    uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main

				    with:

				      event_type: 'chain'

				      base_branch: refs/heads/${{ github.event.pull_request.base.ref }}

				      pull_request_number: ${{ github.event.pull_request.number }}

				      pr_body: ${{ github.event.pull_request.body }}

				    secrets:

				      gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}

				      jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

									
										133

.github/workflows/conflict_reminder.yaml
									
				@@ -1,9 +1,16 @@

				name: Notify PR Authors of Conflicts

				permissions:

				  issues: write

				  pull-requests: write

				on:

				  push:

				    branches:

				      - 'master'

				      - 'branch-*'

				  schedule:

				    - cron: '0 10 * * 1,4'  # Runs every Monday and Thursday at 10:00am

				  workflow_dispatch:      # Manual trigger for testing

				jobs:

				  notify_conflict_prs:

				@@ -14,32 +21,126 @@ jobs:

				        uses: actions/github-script@v7

				        with:

				          script: |

				            console.log("Starting conflict reminder script...");

				            // Print trigger event

				            if (process.env.GITHUB_EVENT_NAME) {

				              console.log(`Workflow triggered by: ${process.env.GITHUB_EVENT_NAME}`);

				            } else {

				              console.log("Could not determine workflow trigger event.");

				            }

				            const isPushEvent = process.env.GITHUB_EVENT_NAME === 'push';

				            console.log(`isPushEvent: ${isPushEvent}`);

				            const twoMonthsAgo = new Date();

				            twoMonthsAgo.setMonth(twoMonthsAgo.getMonth() - 2);

				            const prs = await github.paginate(github.rest.pulls.list, {

				              owner: context.repo.owner,

				              repo: context.repo.repo,

				              state: 'open',

				              per_page: 100

				            });

				            console.log(`Fetched ${prs.length} open PRs`);

				            const recentPrs = prs.filter(pr => new Date(pr.created_at) >= twoMonthsAgo);

				            const validBaseBranches = ['master'];

				            const branchPrefix = 'branch-';

				            const threeDaysAgo = new Date();

				            const conflictLabel = 'conflicts';          

				            const conflictLabel = 'conflicts';

				            threeDaysAgo.setDate(threeDaysAgo.getDate() - 3);

				            for (const pr of prs) {

				              if (!pr.base.ref.startsWith(branchPrefix)) continue;

				              const hasConflictLabel = pr.labels.some(label => label.name === conflictLabel);

				              if (!hasConflictLabel) continue;

				            console.log(`Three days ago: ${threeDaysAgo.toISOString()}`);

				            for (const pr of recentPrs) {

				              console.log(`Checking PR #${pr.number} on base branch '${pr.base.ref}'`);

				              const isBranchX = pr.base.ref.startsWith(branchPrefix);

				              const isMaster = validBaseBranches.includes(pr.base.ref);

				              if (!(isBranchX || isMaster)) {

				                console.log(`PR #${pr.number} skipped: base branch is not 'master' or does not start with '${branchPrefix}'`);

				                continue;

				              }

				              const updatedDate = new Date(pr.updated_at);

				              if (updatedDate >= threeDaysAgo) continue;

				              if (pr.assignee === null) continue;

				              const assignee = pr.assignee.login;

				              if (assignee) {

				                await github.rest.issues.createComment({

				              console.log(`PR #${pr.number} last updated at: ${updatedDate.toISOString()}`);

				              if (!isPushEvent && updatedDate >= threeDaysAgo) {

				                console.log(`PR #${pr.number} skipped: updated within last 3 days`);

				                continue;

				              }

				              if (pr.assignee === null) {

				                console.log(`PR #${pr.number} skipped: no assignee`);

				                continue;

				              }

				              // Fetch PR details to check mergeability

				              let { data: prDetails } = await github.rest.pulls.get({

				                owner: context.repo.owner,

				                repo: context.repo.repo,

				                pull_number: pr.number,

				              });

				              console.log(`PR #${pr.number} mergeable: ${prDetails.mergeable}`);

				              // Wait and re-fetch if mergeable is null

				              if (prDetails.mergeable === null) {

				                console.log(`PR #${pr.number} mergeable is null, waiting 2 seconds and retrying...`);

				                await new Promise(resolve => setTimeout(resolve, 2000)); // wait 2 seconds

				                prDetails = (await github.rest.pulls.get({

				                  owner: context.repo.owner,

				                  repo: context.repo.repo,

				                  issue_number: pr.number,

				                  body: `@${assignee}, this PR has been open with conflicts. Please resolve the conflicts so we can merge it.`,

				                });

				                console.log(`Notified @${assignee} for PR #${pr.number}`);

				              } 

				                  pull_number: pr.number,

				                })).data;

				                console.log(`PR #${pr.number} mergeable after retry: ${prDetails.mergeable}`);

				              }

				              if (prDetails.mergeable === false) {

				                const hasConflictLabel = pr.labels.some(label => label.name === conflictLabel);

				                console.log(`PR #${pr.number} has conflict label: ${hasConflictLabel}`);

				                if (

				                  isPushEvent &&

				                  pr.draft === true &&

				                  hasConflictLabel

				                ) {

				                  // Fetch comments to find last bot notification

				                  const comments = await github.paginate(github.rest.issues.listComments, {

				                    owner: context.repo.owner,

				                    repo: context.repo.repo,

				                    issue_number: pr.number,

				                    per_page: 100,

				                  });

				                  // Find last notification comment from the bot (by body and user)

				                  const botLogin = context.actor;

				                  const notificationPrefix = `@${pr.assignee.login}, this PR has merge conflicts with the base branch.`;

				                  const lastNotification = comments

				                    .filter(c =>

				                      c.user.type === "Bot" &&

				                      c.body.startsWith(notificationPrefix)

				                    )

				                    .sort((a, b) => new Date(b.created_at) - new Date(a.created_at))[0];

				                  if (lastNotification) {

				                    const lastNotified = new Date(lastNotification.created_at);

				                    if (lastNotified >= threeDaysAgo) {

				                      console.log(`PR #${pr.number} skipped: last notification was less than 3 days ago`);

				                      continue;

				                    }

				                  }

				                }

				                if (!hasConflictLabel) {

				                  await github.rest.issues.addLabels({

				                    owner: context.repo.owner,

				                    repo: context.repo.repo,

				                    issue_number: pr.number,

				                    labels: [conflictLabel],

				                  });

				                  console.log(`Added 'conflicts' label to PR #${pr.number}`);

				                }

				                const assignee = pr.assignee.login;

				                if (assignee) {

				                  await github.rest.issues.createComment({

				                    owner: context.repo.owner,

				                    repo: context.repo.repo,

				                    issue_number: pr.number,

				                    body: `@${assignee}, this PR has merge conflicts with the base branch. Please resolve the conflicts so we can merge it.`,

				                  });

				                  console.log(`Notified @${assignee} for PR #${pr.number}`);

				                }

				              } else {

				                console.log(`PR #${pr.number} is mergeable, no action needed.`);

				              }

				            }

				            console.log(`Total PRs checked: ${prs.length}`);
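The decision flow of the new reminder script can be summarized as two pure functions. This is a sketch under stated assumptions: the function names and parameters are hypothetical, the GitHub API calls are left out, and the two-month `created_at` filter is omitted for brevity.

```python
from datetime import datetime, timedelta, timezone

VALID_BASE_BRANCHES = ("master",)
BRANCH_PREFIX = "branch-"


def passes_prefilter(base_ref, updated_at, has_assignee, is_push_event, now):
    """Checks applied before mergeability is fetched (the early 'continue' statements)."""
    is_tracked = base_ref in VALID_BASE_BRANCHES or base_ref.startswith(BRANCH_PREFIX)
    if not is_tracked:
        return False
    three_days_ago = now - timedelta(days=3)
    # Scheduled and manual runs skip PRs updated in the last 3 days; push runs do not.
    if not is_push_event and updated_at >= three_days_ago:
        return False
    return has_assignee


def needs_notification(mergeable, is_push_event, is_draft, has_conflict_label,
                       last_notified, now):
    """Decide whether a PR gets a conflict comment; mergeable may be True, False or None."""
    if mergeable is not False:
        return False
    # On push events, draft PRs already labeled 'conflicts' are throttled to one
    # notification per 3 days.
    throttled = is_push_event and is_draft and has_conflict_label
    if throttled and last_notified is not None and last_notified >= now - timedelta(days=3):
        return False
    return True
```

Keeping the rules free of API calls makes them easy to exercise with fixed timestamps.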

									
										24

.github/workflows/iwyu.yaml
									
				@@ -11,7 +11,8 @@ env:

				  CLEANER_OUTPUT_PATH: build/clang-include-cleaner.log

				  # the "idl" subdirectory does not contain C++ source code. the .hh files in it are

				  # supposed to be processed by idl-compiler.py, so we don't check them using the cleaner

				  CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops redis replica

				  CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service

				  SEASTAR_BAD_INCLUDE_OUTPUT_PATH: build/seastar-bad-include.log

				permissions: {}

				@@ -80,7 +81,24 @@ jobs:

				          done

				      - run: |

				          echo "::remove-matcher owner=clang-include-cleaner::"

				      - run: |

				          echo "::add-matcher::.github/seastar-bad-include.json"

				      - name: check for seastar includes

				        run: |

				          git -c safe.directory="$PWD"    \

				            grep -nE '#include +"seastar/' \

				            | tee "$SEASTAR_BAD_INCLUDE_OUTPUT_PATH"

				      - run: |

				          echo "::remove-matcher owner=seastar-bad-include::"

				      - uses: actions/upload-artifact@v4

				        with:

				          name: Logs (clang-include-cleaner)

				          path: "./${{ env.CLEANER_OUTPUT_PATH }}"

				          name: Logs

				          path: |

				            ${{ env.CLEANER_OUTPUT_PATH }}

				            ${{ env.SEASTAR_BAD_INCLUDE_OUTPUT_PATH }}

				      - name: fail if seastar headers are included as an internal library

				        run: |

				          if [ -s "$SEASTAR_BAD_INCLUDE_OUTPUT_PATH" ]; then

				            echo "::error::Found #include \"seastar/ in the source code. Use angle brackets instead."

				            exit 1

				          fi
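The new check combines a `git grep -nE` call with the problem matcher from `.github/seastar-bad-include.json`. The sketch below reproduces both regular expressions in Python to show what a matching line looks like; `parse_grep_line` is a hypothetical helper, and the sample paths in the usage are illustrative only.

```python
import re

# Same expression the workflow passes to git grep -nE.
BAD_INCLUDE = re.compile(r'#include +"seastar/')
# Same expression as the problem matcher: file, line number, message.
MATCHER_LINE = re.compile(r"^(.+):(\d+):(.+)$")


def parse_grep_line(line):
    """Split one 'file:line:text' record the way the problem matcher would."""
    m = MATCHER_LINE.match(line)
    if m is None:
        return None
    return {"file": m.group(1), "line": int(m.group(2)), "message": m.group(3)}
```

A quoted include such as `#include "seastar/core/future.hh"` is flagged, while the angle-bracket form `#include <seastar/core/future.hh>` passes.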

									
										7

.github/workflows/make-pr-ready-for-review.yaml
									
				@@ -16,6 +16,13 @@ jobs:

				      pull-requests: write

				    steps:

				      - name: Checkout repository

				        uses: actions/checkout@v4

				        with:

				          repository: ${{ github.repository }}

				          ref: ${{ env.DEFAULT_BRANCH }}

				          token: ${{ secrets.AUTO_BACKPORT_TOKEN }}

				          fetch-depth: 1

				      - name: Mark pull request as ready for review

				        run:  gh pr ready "${{ github.event.pull_request.number }}"

				        env:

									
										2

.github/workflows/pr-require-backport-label.yaml
									
				@@ -13,6 +13,8 @@ jobs:

				      issues: write

				      pull-requests: write

				    steps:

				      - name: Wait for label to be added

				        run: sleep 1m

				      - uses: mheap/github-action-required-labels@v5

				        with:

				          mode: minimum

5

.gitmodules

@@ -1,6 +1,6 @@
 [submodule "seastar"]
 	path = seastar
-	url = ../seastar
+	url = ../scylla-seastar
 	ignore = dirty
 [submodule "swagger-ui"]
 	path = swagger-ui
@@ -9,9 +9,6 @@
 [submodule "abseil"]
 	path = abseil
 	url = ../abseil-cpp
-[submodule "scylla-tools"]
-	path = tools/java
-	url = ../scylla-tools-java
 [submodule "scylla-python3"]
 	path = tools/python3
 	url = ../scylla-python3

									
										15

CMakeLists.txt
									
				@@ -163,14 +163,6 @@ file(MAKE_DIRECTORY "${scylla_gen_build_dir}")

				include(add_version_library)

				generate_scylla_version()

				add_library(scylla-zstd STATIC

				    zstd.cc)

				target_link_libraries(scylla-zstd

				  PRIVATE

				    db

				    Seastar::seastar

				    zstd::libzstd)

				add_library(scylla-main STATIC)

				target_sources(scylla-main

				  PRIVATE

				@@ -182,7 +174,7 @@ target_sources(scylla-main

				    compress.cc

				    converting_mutation_partition_applier.cc

				    counters.cc

				    direct_failure_detector/failure_detector.cc

				    sstable_dict_autotrainer.cc

				    duration.cc

				    exceptions/exceptions.cc

				    frozen_schema.cc

				@@ -204,6 +196,7 @@ target_sources(scylla-main

				    reader_concurrency_semaphore_group.cc

				    schema_mutations.cc

				    serializer.cc

				    service/direct_failure_detector/failure_detector.cc

				    sstables_loader.cc

				    table_helper.cc

				    tasks/task_handler.cc

				@@ -214,7 +207,6 @@ target_sources(scylla-main

				    vint-serialization.cc)

				target_link_libraries(scylla-main

				  PRIVATE

				    "$<LINK_LIBRARY:WHOLE_ARCHIVE,scylla-zstd>"

				    db

				    absl::headers

				    absl::btree

				@@ -371,3 +363,6 @@ endif()

				if(Scylla_BUILD_INSTRUMENTED)

				  add_subdirectory(pgo)

				endif()

				add_executable(patchelf

				  tools/patchelf.cc)

									
										25

HACKING.md
									
				@@ -220,28 +220,9 @@ On a development machine, one might run Scylla as

				$ SCYLLA_HOME=$HOME/scylla build/release/scylla --overprovisioned --developer-mode=yes

				```

				To interact with scylla it is recommended to build our versions of

				cqlsh and nodetool. They are available at

				https://github.com/scylladb/scylla-tools-java and can be built with

				```bash

				$ sudo ./install-dependencies.sh

				$ ant jar

				```

				cqlsh should work out of the box, but nodetool depends on a running

				scylla-jmx (https://github.com/scylladb/scylla-jmx). It can be build

				with

				```bash

				$ mvn package

				```

				and must be started with

				```bash

				$ ./scripts/scylla-jmx

				```

				To interact with scylla it is recommended to build our version of

				cqlsh. It is available at

				https://github.com/scylladb/scylla-cqlsh and is available as a submodule.

				### Branches and tags

2

SCYLLA-VERSION-GEN

@@ -78,7 +78,7 @@ fi
 # Default scylla product/version tags
 PRODUCT=scylla
-VERSION=2025.2.0-dev
+VERSION=2025.3.9
 if test -f version
 then

									
										11

alternator/consumed_capacity.cc
									
				@@ -24,7 +24,7 @@ static constexpr uint64_t KB = 1024ULL;

				static constexpr uint64_t RCU_BLOCK_SIZE_LENGTH = 4*KB;

				static constexpr uint64_t WCU_BLOCK_SIZE_LENGTH = 1*KB;

				static bool should_add_capacity(const rjson::value& request) {

				bool consumed_capacity_counter::should_add_capacity(const rjson::value& request) {

				    const rjson::value* return_consumed = rjson::find(request, "ReturnConsumedCapacity");

				    if (!return_consumed) {

				        return false;

				@@ -62,15 +62,22 @@ static uint64_t calculate_half_units(uint64_t unit_block_size, uint64_t total_by

				rcu_consumed_capacity_counter::rcu_consumed_capacity_counter(const rjson::value& request, bool is_quorum) :

				        consumed_capacity_counter(should_add_capacity(request)),_is_quorum(is_quorum) {

				}

				uint64_t rcu_consumed_capacity_counter::get_half_units(uint64_t total_bytes, bool is_quorum) noexcept {

				    return calculate_half_units(RCU_BLOCK_SIZE_LENGTH, total_bytes, is_quorum);

				}

				uint64_t rcu_consumed_capacity_counter::get_half_units() const noexcept {

				    return calculate_half_units(RCU_BLOCK_SIZE_LENGTH, _total_bytes, _is_quorum);

				    return get_half_units(_total_bytes, _is_quorum);

				}

				uint64_t wcu_consumed_capacity_counter::get_half_units() const noexcept {

				    return calculate_half_units(WCU_BLOCK_SIZE_LENGTH, _total_bytes, true);

				}

				uint64_t wcu_consumed_capacity_counter::get_units(uint64_t total_bytes) noexcept {

				    return calculate_half_units(WCU_BLOCK_SIZE_LENGTH, total_bytes, true) * HALF_UNIT_MULTIPLIER;

				}

				wcu_consumed_capacity_counter::wcu_consumed_capacity_counter(const rjson::value& request) :

				        consumed_capacity_counter(should_add_capacity(request)) {

				}
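The arithmetic behind the new static helpers can be sketched as follows. The body of `calculate_half_units` and the value of `HALF_UNIT_MULTIPLIER` are not part of this diff, so both are assumptions here, modeled on DynamoDB's documented rules (4 KB per read unit, 1 KB per write unit, an eventually consistent read costing half a unit). The Python function names are hypothetical.

```python
KB = 1024
RCU_BLOCK_SIZE_LENGTH = 4 * KB
WCU_BLOCK_SIZE_LENGTH = 1 * KB
HALF_UNIT_MULTIPLIER = 0.5  # Assumption: the value is not shown in the diff.


def calculate_half_units(unit_block_size, total_bytes, is_quorum):
    """Assumed behavior: round up to whole blocks; quorum costs two half units per block."""
    blocks = (total_bytes + unit_block_size - 1) // unit_block_size
    return blocks * 2 if is_quorum else blocks


def rcu_half_units(total_bytes, is_quorum):
    """Counterpart of the static rcu_consumed_capacity_counter::get_half_units."""
    return calculate_half_units(RCU_BLOCK_SIZE_LENGTH, total_bytes, is_quorum)


def wcu_units(total_bytes):
    """Counterpart of wcu_consumed_capacity_counter::get_units (writes count as quorum)."""
    return calculate_half_units(WCU_BLOCK_SIZE_LENGTH, total_bytes, True) * HALF_UNIT_MULTIPLIER
```

Under these assumptions a 4097-byte eventually consistent read costs 2 half units (one full RCU), and a 1025-byte write costs 2 WCU.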

									
										6

alternator/consumed_capacity.hh
									
				@@ -42,21 +42,25 @@ public:

				     */

				    virtual uint64_t get_half_units() const noexcept = 0;

				    uint64_t _total_bytes = 0;

				    static bool should_add_capacity(const rjson::value& request);

				protected:

				    bool _should_add_to_reponse = false;

				};

				class rcu_consumed_capacity_counter : public consumed_capacity_counter {

				    virtual uint64_t get_half_units() const noexcept;

				    bool _is_quorum = false;

				public:

				    rcu_consumed_capacity_counter(const rjson::value& request, bool is_quorum);

				    rcu_consumed_capacity_counter(): consumed_capacity_counter(false), _is_quorum(false){}

				    virtual uint64_t get_half_units() const noexcept;

				    static uint64_t get_half_units(uint64_t total_bytes, bool is_quorum) noexcept;

				};

				class wcu_consumed_capacity_counter : public consumed_capacity_counter {

				    virtual uint64_t get_half_units() const noexcept;

				public:

				    wcu_consumed_capacity_counter(const rjson::value& request);

				    static uint64_t get_units(uint64_t total_bytes) noexcept;

				};

				}

1124

alternator/executor.cc

File diff suppressed because it is too large

									
										57

alternator/executor.hh
									
				@@ -10,8 +10,8 @@

				#include <seastar/core/future.hh>

				#include "seastarx.hh"

				#include <seastar/json/json_elements.hh>

				#include <seastar/core/sharded.hh>

				#include <seastar/util/noncopyable_function.hh>

				#include "service/migration_manager.hh"

				#include "service/client_state.hh"

				@@ -58,29 +58,6 @@ namespace alternator {

				class rmw_operation;

				struct make_jsonable : public json::jsonable {

				    rjson::value _value;

				public:

				    explicit make_jsonable(rjson::value&& value);

				    std::string to_json() const override;

				};

				/**

				 * Make return type for serializing the object "streamed",

				 * i.e. direct to HTTP output stream. Note: only useful for

				 * (very) large objects as there are overhead issues with this

				 * as well, but for massive lists of return objects this can

				 * help avoid large allocations/many re-allocs

				 */

				json::json_return_type make_streamed(rjson::value&&);

				struct json_string : public json::jsonable {

				    std::string _value;

				public:

				    explicit json_string(std::string&& value);

				    std::string to_json() const override;

				};

				namespace parsed {

				class path;

				};

				@@ -169,8 +146,23 @@ class executor : public peering_sharded_service<executor> {

				public:

				    using client_state = service::client_state;

				    using request_return_type = std::variant<json::json_return_type, api_error>;

				    // request_return_type is the return type of the executor methods, which

				    // can be one of:

				    // 1. A string, which is the response body for the request.

				    // 2. A body_writer, an asynchronous function (returning future<>) that

				    //    takes an output_stream and writes the response body into it.

				    // 3. An api_error, which is an error response that should be returned to

				    //    the client.

				    // The body_writer is used for streaming responses, where the response body

				    // is written in chunks to the output_stream. This allows for efficient

				    // handling of large responses without needing to allocate a large buffer

				    // in memory.

				    using body_writer = noncopyable_function<future<>(output_stream<char>&&)>;

				    using request_return_type = std::variant<std::string, body_writer, api_error>;

				    stats _stats;

				    // The metric_groups object holds this stat object's metrics registered

				    // as long as the stats object is alive.

				    seastar::metrics::metric_groups _metrics;

				    static constexpr auto ATTRS_COLUMN_NAME = ":attrs";

				    static constexpr auto KEYSPACE_NAME_PREFIX = "alternator_";

				    static constexpr std::string_view INTERNAL_TABLE_PREFIX = ".scylla.alternator.";

				@@ -237,11 +229,15 @@ public:

				        const std::optional<attrs_to_get>&,

				        uint64_t* = nullptr);

				    // Converts a multi-row selection result to JSON compatible with DynamoDB.

				    // For each row, this method calls item_callback, which takes the size of

				    // the item as the parameter.

				    static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,

				        const query::partition_slice&& slice,

				        shared_ptr<cql3::selection::selection> selection,

				        foreign_ptr<lw_shared_ptr<query::result>> query_result,

				        shared_ptr<const std::optional<attrs_to_get>> attrs_to_get);

				        shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,

				        noncopyable_function<void(uint64_t)> item_callback = {});

				    static void describe_single_item(const cql3::selection::selection&,

				        const std::vector<managed_bytes_opt>&,

				@@ -271,4 +267,13 @@ bool is_big(const rjson::value& val, int big_size = 100'000);

				// appropriate user-readable api_error::access_denied is thrown.

				future<> verify_permission(bool enforce_authorization, const service::client_state&, const schema_ptr&, auth::permission);

				/**

				 * Make return type for serializing the object "streamed",

				 * i.e. direct to HTTP output stream. Note: only useful for

				 * (very) large objects as there are overhead issues with this

				 * as well, but for massive lists of return objects this can

				 * help avoid large allocations/many re-allocs

				 */

				executor::body_writer make_streamed(rjson::value&&);

				}

									
										24

alternator/expressions.cc
									
												
				@@ -165,7 +165,9 @@ static std::optional<std::string> resolve_path_component(const std::string& colu

				                    fmt::format("ExpressionAttributeNames missing entry '{}' required by expression", column_name));

				        }

				        used_attribute_names.emplace(column_name);

				        return std::string(rjson::to_string_view(*value));

				        auto result = std::string(rjson::to_string_view(*value));

				        validate_attr_name_length("", result.size(), false, "ExpressionAttributeNames contains invalid value: ");

				        return result;

				    }

				    return std::nullopt;

				}

				@@ -737,6 +739,26 @@ rjson::value calculate_value(const parsed::set_rhs& rhs,

				    return rjson::null_value();

				}

				void validate_attr_name_length(std::string_view supplementary_context, size_t attr_name_length, bool is_key, std::string_view error_msg_prefix) {

				    constexpr const size_t DYNAMODB_KEY_ATTR_NAME_SIZE_MAX = 255;

				    constexpr const size_t DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX = 65535;

				    const size_t max_length = is_key ? DYNAMODB_KEY_ATTR_NAME_SIZE_MAX : DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX;

				    if (attr_name_length > max_length) {

				        std::string error_msg;

				        if (!error_msg_prefix.empty()) {

				            error_msg += error_msg_prefix;

				        }

				        if (!supplementary_context.empty()) {

				            error_msg += "in ";

				            error_msg += supplementary_context;

				            error_msg += " - ";

				        }

				        error_msg += fmt::format("Attribute name is too large, must be less than {} bytes", std::to_string(max_length + 1));

				        throw api_error::validation(error_msg);

				    }

				}

				} // namespace alternator

				auto fmt::formatter<alternator::parsed::path>::format(const alternator::parsed::path& p, fmt::format_context& ctx) const
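The length check added in `validate_attr_name_length` above can be tried in isolation. The following is a minimal standalone sketch, not the Alternator code itself: `validation_error` and `check_attr_name_length` are hypothetical names standing in for `api_error::validation` and the real function, and plain string concatenation replaces `fmt::format`. The limits (255 bytes for key attributes, 65535 bytes otherwise) and the message layout are copied from the hunk.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical stand-in for api_error::validation.
struct validation_error : std::invalid_argument {
    using std::invalid_argument::invalid_argument;
};

// Throws when an attribute name exceeds the limit taken from the diff:
// 255 bytes for key attributes, 65535 bytes for non-key attributes.
inline void check_attr_name_length(std::string_view context, std::size_t length,
                                   bool is_key, std::string_view prefix = {}) {
    constexpr std::size_t key_max = 255;
    constexpr std::size_t nonkey_max = 65535;
    const std::size_t max_length = is_key ? key_max : nonkey_max;
    if (length <= max_length) {
        return;
    }
    std::string msg(prefix);  // optional caller-supplied prefix comes first
    if (!context.empty()) {
        msg += "in ";
        msg += context;
        msg += " - ";
    }
    msg += "Attribute name is too large, must be less than "
        + std::to_string(max_length + 1) + " bytes";
    throw validation_error(msg);
}

// Small self-check: a name at the limit passes, one byte over throws.
inline bool attr_name_self_check() {
    check_attr_name_length("", 255, true);
    try {
        check_attr_name_length("", 256, true);
    } catch (const validation_error&) {
        return true;
    }
    return false;
}
```

Reporting the bound as `max_length + 1` keeps the "must be less than" wording exact: a key attribute name of exactly 255 bytes is accepted, and the message names 256 as the exclusive limit.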

12

alternator/expressions.g


@@ -91,6 +91,18 @@ options {
         throw expressions_syntax_error(format("{} at char {}", err,
             ex->get_charPositionInLine()));
     }
     // ANTLR3 tries to recover missing tokens - it tries to finish parsing
     // and create valid objects, as if the missing token was there.
     // But it has a bug and leaks these tokens.
     // We override offending method and handle abandoned pointers.
     std::vector<std::unique_ptr<TokenType>> _missing_tokens;
     TokenType* getMissingSymbol(IntStreamType* istream, ExceptionBaseType* e,
                                 ANTLR_UINT32 expectedTokenType, BitsetListType* follow) {
         auto token = BaseType::getMissingSymbol(istream, e, expectedTokenType, follow);
         _missing_tokens.emplace_back(token);
         return token;
     }
 }
 @lexer::context {
     void displayRecognitionError(ANTLR_UINT8** token_names, ExceptionBaseType* ex) {

									
										2

alternator/expressions.hh
									
												
				@@ -91,5 +91,7 @@ rjson::value calculate_value(const parsed::value& v,

				rjson::value calculate_value(const parsed::set_rhs& rhs,

				        const rjson::value* previous_item);

				void validate_attr_name_length(std::string_view supplementary_context, size_t attr_name_length, bool is_key, std::string_view error_msg_prefix = {});

				} /* namespace alternator */

									
										3

alternator/rmw_operation.hh
									
												
				@@ -118,7 +118,8 @@ public:

				            tracing::trace_state_ptr trace_state,

				            service_permit permit,

				            bool needs_read_before_write,

				            stats& stats,

				            stats& global_stats,

				            stats& per_table_stats,

				            uint64_t& wcu_total);

				    std::optional<shard_id> shard_for_execute(bool needs_read_before_write);

				};

									
										45

alternator/server.cc
									
												
				@@ -13,7 +13,6 @@

				#include <seastar/http/function_handlers.hh>

				#include <seastar/http/short_streams.hh>

				#include <seastar/core/coroutine.hh>

				#include <seastar/json/json_elements.hh>

				#include <seastar/util/defer.hh>

				#include <seastar/util/short_streams.hh>

				#include "seastarx.hh"

				@@ -124,22 +123,22 @@ public:

				             }

				             auto res = resf.get();

				             std::visit(overloaded_functor {

				                 [&] (const json::json_return_type& json_return_value) {

				                     slogger.trace("api_handler success case");

				                     if (json_return_value._body_writer) {

				                         // Unfortunately, write_body() forces us to choose

				                         // from a fixed and irrelevant list of "mime-types"

				                         // at this point. But we'll override it with the

				                         // one (application/x-amz-json-1.0) below.

				                         rep->write_body("json", std::move(json_return_value._body_writer));

				                     } else {

				                         rep->_content += json_return_value._res;

				                     }

				                 },

				                 [&] (const api_error& err) {

				                     generate_error_reply(*rep, err);

				                 }

				             }, res);

				                [&] (std::string&& str) {

				                    // Note that despite the move, there is a copy here -

				                    // as str is std::string and rep->_content is sstring.

				                    rep->_content = std::move(str);

				                },

				                [&] (executor::body_writer&& body_writer) {

				                    // Unfortunately, write_body() forces us to choose

				                    // from a fixed and irrelevant list of "mime-types"

				                    // at this point. But we'll override it with the

				                    // correct one (application/x-amz-json-1.0) below.

				                    rep->write_body("json", std::move(body_writer));

				                },

				                [&] (const api_error& err) {

				                    generate_error_reply(*rep, err);

				                }

				             }, std::move(res));

				             return make_ready_future<std::unique_ptr<reply>>(std::move(rep));

				         });

				@@ -228,9 +227,8 @@ protected:

				        // If the rack does not exist, we return an empty list - not an error.

				        sstring query_rack = req->get_query_param("rack");

				        for (auto& id : local_dc_nodes) {

				            auto ip = _gossiper.get_address_map().get(id);

				            if (!query_rack.empty()) {

				                auto rack = _gossiper.get_application_state_value(ip, gms::application_state::RACK);

				                auto rack = _gossiper.get_application_state_value(id, gms::application_state::RACK);

				                if (rack != query_rack) {

				                    continue;

				                }

				@@ -238,10 +236,10 @@ protected:

				            // Note that it's not enough for the node to be is_alive() - a

				            // node joining the cluster is also "alive" but not responsive to

				            // requests. We need alive *and* normal. See #19694, #21538.

				            if (_gossiper.is_alive(ip) && _gossiper.is_normal(ip)) {

				            if (_gossiper.is_alive(id) && _gossiper.is_normal(id)) {

				                // Use the gossiped broadcast_rpc_address if available instead

				                // of the internal IP address "ip". See discussion in #18711.

				                rjson::push_back(results, rjson::from_string(_gossiper.get_rpc_address(ip)));

				                rjson::push_back(results, rjson::from_string(_gossiper.get_rpc_address(id)));

				            }

				        }

				        rep->set_status(reply::status_type::ok);

				@@ -463,6 +461,9 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr

				            client_state = std::move(client_state), trace_state = std::move(trace_state),

				            units = std::move(units), req = std::move(req)] () mutable -> future<executor::request_return_type> {

				                rjson::value json_request = co_await _json_parser.parse(std::move(content));

				                if (!json_request.IsObject()) {

				                    co_return api_error::validation("Request content must be an object");

				                }

				                co_return co_await callback(_executor, client_state, trace_state,

				                    make_service_permit(std::move(units)), std::move(json_request), std::move(req));

				    };

				@@ -505,7 +506,7 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos

				        , _key_cache(1024, 1min, slogger)

				        , _enforce_authorization(false)

				        , _enabled_servers{}

				        , _pending_requests{}

				        , _pending_requests("alternator::server::pending_requests")

				        , _timeout_config(_proxy.data_dictionary().get_config())

				      , _callbacks{

				        {"CreateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
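The `server.cc` hunk above replaces the `json_return_type` handler with one handler per alternative of `executor::request_return_type`, a `std::variant<std::string, body_writer, api_error>`. Below is a rough standalone sketch of that dispatch pattern. It is an illustration only: `overloaded`, `error`, `request_result`, and `render` are hypothetical names, and `std::function<void(std::string&)>` stands in for the Seastar-based `body_writer`, which really writes to an `output_stream<char>` and returns a future.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <variant>

// Combines several lambdas into one visitor with an overload per alternative.
template <class... Fs> struct overloaded : Fs... { using Fs::operator()...; };
template <class... Fs> overloaded(Fs...) -> overloaded<Fs...>;

// Hypothetical stand-ins for api_error and executor::body_writer.
struct error { int status; std::string message; };
using body_writer = std::function<void(std::string&)>;
using request_result = std::variant<std::string, body_writer, error>;

// Renders a result into a reply body, one handler per variant alternative.
inline std::string render(request_result&& res) {
    std::string out;
    std::visit(overloaded{
        [&](std::string&& body) { out = std::move(body); },  // ready-made body
        [&](body_writer&& writer) { writer(out); },          // streamed body
        [&](const error& e) { out = std::to_string(e.status) + " " + e.message; }
    }, std::move(res));
    return out;
}

// Small self-check covering the plain-string path.
inline bool render_self_check() {
    return render(std::string("{}")) == "{}";
}
```

Passing the variant as an rvalue, as the hunk does with `std::move(res)`, lets the string and writer handlers take their alternative by rvalue reference and move out of it.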

									
										2

alternator/server.hh
									
												
				@@ -41,7 +41,7 @@ class server : public peering_sharded_service<server> {

				    key_cache _key_cache;

				    utils::updateable_value<bool> _enforce_authorization;

				    utils::small_vector<std::reference_wrapper<seastar::httpd::http_server>, 2> _enabled_servers;

				    gate _pending_requests;

				    named_gate _pending_requests;

				    // In some places we will need a CQL updateable_timeout_config object even

				    // though it isn't really relevant for Alternator which defines its own

				    // timeouts separately. We can create this object only once.

									
										153

alternator/stats.cc
									
												
				@@ -14,28 +14,58 @@

				namespace alternator {

				const char* ALTERNATOR_METRICS = "alternator";

				static seastar::metrics::histogram estimated_histogram_to_metrics(const utils::estimated_histogram& histogram) {

				    seastar::metrics::histogram res;

				    res.buckets.resize(histogram.bucket_offsets.size());

				    uint64_t cumulative_count = 0;

				    res.sample_count = histogram._count;

				    res.sample_sum = histogram._sample_sum;

				    for (size_t i = 0; i < res.buckets.size(); i++) {

				        auto& v = res.buckets[i];

				        v.upper_bound = histogram.bucket_offsets[i];

				        cumulative_count += histogram.buckets[i];

				        v.count = cumulative_count;

				    }

				    return res;

				}

				static seastar::metrics::label column_family_label("cf");

				static seastar::metrics::label keyspace_label("ks");

				static void register_metrics_with_optional_table(seastar::metrics::metric_groups& metrics, const stats& stats, const sstring& ks, const sstring& table) {

				stats::stats() : api_operations{} {

				    // Register the

				    seastar::metrics::label op("op");

				    _metrics.add_group("alternator", {

				    bool has_table = table.length();

				    std::vector<seastar::metrics::label> aggregate_labels;

				    std::vector<seastar::metrics::label_instance> labels = {alternator_label};

				    sstring group_name = (has_table)? "alternator_table" : "alternator";

				    if (has_table) {

				        labels.push_back(column_family_label(table));

				        labels.push_back(keyspace_label(ks));

				        aggregate_labels.push_back(seastar::metrics::shard_label);

				    }

				    metrics.add_group(group_name, {

				#define OPERATION(name, CamelCaseName) \

				                seastar::metrics::make_total_operations("operation", api_operations.name, \

				                        seastar::metrics::description("number of operations via Alternator API"), {op(CamelCaseName), alternator_label, basic_level}).set_skip_when_empty(),

				                seastar::metrics::make_total_operations("operation", stats.api_operations.name, \

				                        seastar::metrics::description("number of operations via Alternator API"), labels)(basic_level)(op(CamelCaseName)).aggregate(aggregate_labels).set_skip_when_empty(),

				#define OPERATION_LATENCY(name, CamelCaseName) \

						metrics.add_group(group_name, { \

				                seastar::metrics::make_histogram("op_latency", \

				                        seastar::metrics::description("Latency histogram of an operation via Alternator API"), {op(CamelCaseName), alternator_label, basic_level}, [this]{return to_metrics_histogram(api_operations.name.histogram());}).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(), \

				                        seastar::metrics::description("Latency histogram of an operation via Alternator API"), labels, [&stats]{return to_metrics_histogram(stats.api_operations.name.histogram());})(op(CamelCaseName))(basic_level).aggregate({seastar::metrics::shard_label}).set_skip_when_empty()}); \

				            if (!has_table) {\

				            	metrics.add_group("alternator", { \

								seastar::metrics::make_summary("op_latency_summary", \

										                        seastar::metrics::description("Latency summary of an operation via Alternator API"), [this]{return to_metrics_summary(api_operations.name.summary());})(op(CamelCaseName))(basic_level)(alternator_label).set_skip_when_empty(),

										                        seastar::metrics::description("Latency summary of an operation via Alternator API"), [&stats]{return to_metrics_summary(stats.api_operations.name.summary());})(op(CamelCaseName))(basic_level)(alternator_label).set_skip_when_empty()}); \

				            }

				            OPERATION(batch_get_item, "BatchGetItem")

				            OPERATION(batch_write_item, "BatchWriteItem")

				            OPERATION(create_backup, "CreateBackup")

				            OPERATION(create_global_table, "CreateGlobalTable")

				            OPERATION(create_table, "CreateTable")

				            OPERATION(delete_backup, "DeleteBackup")

				            OPERATION(delete_item, "DeleteItem")

				            OPERATION(delete_table, "DeleteTable")

				            OPERATION(describe_backup, "DescribeBackup")

				            OPERATION(describe_continuous_backups, "DescribeContinuousBackups")

				            OPERATION(describe_endpoints, "DescribeEndpoints")

				@@ -64,55 +94,74 @@ stats::stats() : api_operations{} {

				            OPERATION(update_item, "UpdateItem")

				            OPERATION(update_table, "UpdateTable")

				            OPERATION(update_time_to_live, "UpdateTimeToLive")

				            OPERATION_LATENCY(put_item_latency, "PutItem")

				            OPERATION_LATENCY(get_item_latency, "GetItem")

				            OPERATION_LATENCY(delete_item_latency, "DeleteItem")

				            OPERATION_LATENCY(update_item_latency, "UpdateItem")

				            OPERATION_LATENCY(batch_write_item_latency, "BatchWriteItem")

				            OPERATION_LATENCY(batch_get_item_latency, "BatchGetItem")

				            OPERATION(list_streams, "ListStreams")

				            OPERATION(describe_stream, "DescribeStream")

				            OPERATION(get_shard_iterator, "GetShardIterator")

				            OPERATION(get_records, "GetRecords")

				            OPERATION_LATENCY(get_records_latency, "GetRecords")

				    });

				    _metrics.add_group("alternator", {

				            seastar::metrics::make_total_operations("unsupported_operations", unsupported_operations,

				                    seastar::metrics::description("number of unsupported operations via Alternator API"))(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("total_operations", total_operations,

				                    seastar::metrics::description("number of total operations via Alternator API"))(basic_level)(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("reads_before_write", reads_before_write,

				                    seastar::metrics::description("number of performed read-before-write operations"))(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("write_using_lwt", write_using_lwt,

				                    seastar::metrics::description("number of writes that used LWT"))(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("shard_bounce_for_lwt", shard_bounce_for_lwt,

				                    seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements"))(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("requests_blocked_memory", requests_blocked_memory,

				                    seastar::metrics::description("Counts a number of requests blocked due to memory pressure."))(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("requests_shed", requests_shed,

				                    seastar::metrics::description("Counts a number of requests shed due to overload."))(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("filtered_rows_read_total", cql_stats.filtered_rows_read_total,

				                    seastar::metrics::description("number of rows read during filtering operations"))(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("filtered_rows_matched_total", cql_stats.filtered_rows_matched_total,

				                    seastar::metrics::description("number of rows read and matched during filtering operations")),

				            seastar::metrics::make_counter("rcu_total", rcu_total,

				                    seastar::metrics::description("total number of consumed read units, counted as half units"))(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::PUT_ITEM],

				                    seastar::metrics::description("total number of consumed write units, counted as half units"),{op("PutItem")})(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::DELETE_ITEM],

				                    seastar::metrics::description("total number of consumed write units, counted as half units"),{op("DeleteItem")})(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::UPDATE_ITEM],

				                    seastar::metrics::description("total number of consumed write units, counted as half units"),{op("UpdateItem")})(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", wcu_total[wcu_types::INDEX],

				                    seastar::metrics::description("total number of consumed write units, counted as half units"),{op("Index")})(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("filtered_rows_dropped_total", [this] { return cql_stats.filtered_rows_read_total - cql_stats.filtered_rows_matched_total; },

				                    seastar::metrics::description("number of rows read and dropped during filtering operations"))(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchWriteItem")},

				                    api_operations.batch_write_item_batch_total)(alternator_label).set_skip_when_empty(),

				            seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"),{op("BatchGetItem")},

				                    api_operations.batch_get_item_batch_total)(alternator_label).set_skip_when_empty(),

				    OPERATION_LATENCY(put_item_latency, "PutItem")

				    OPERATION_LATENCY(get_item_latency, "GetItem")

				    OPERATION_LATENCY(delete_item_latency, "DeleteItem")

				    OPERATION_LATENCY(update_item_latency, "UpdateItem")

				    OPERATION_LATENCY(batch_write_item_latency, "BatchWriteItem")

				    OPERATION_LATENCY(batch_get_item_latency, "BatchGetItem")

				    OPERATION_LATENCY(get_records_latency, "GetRecords")

				    if (!has_table) {

				        // Create and delete operations are not applicable to a per-table metrics

				        // only register it for the global metrics

				        metrics.add_group("alternator", {

				            OPERATION(create_table, "CreateTable")

				            OPERATION(delete_table, "DeleteTable")

				        });

				    }

				    metrics.add_group(group_name, {

				            seastar::metrics::make_total_operations("unsupported_operations", stats.unsupported_operations,

				                    seastar::metrics::description("number of unsupported operations via Alternator API"), labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("total_operations", stats.total_operations,

				                    seastar::metrics::description("number of total operations via Alternator API"), labels)(basic_level).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("reads_before_write", stats.reads_before_write,

				                    seastar::metrics::description("number of performed read-before-write operations"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("write_using_lwt", stats.write_using_lwt,

				                    seastar::metrics::description("number of writes that used LWT"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("shard_bounce_for_lwt", stats.shard_bounce_for_lwt,

				                    seastar::metrics::description("number writes that had to be bounced from this shard because of LWT requirements"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("requests_blocked_memory", stats.requests_blocked_memory,

				                    seastar::metrics::description("Counts a number of requests blocked due to memory pressure."), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("requests_shed", stats.requests_shed,

				                    seastar::metrics::description("Counts a number of requests shed due to overload."), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("filtered_rows_read_total", stats.cql_stats.filtered_rows_read_total,

				                    seastar::metrics::description("number of rows read during filtering operations"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("filtered_rows_matched_total", stats.cql_stats.filtered_rows_matched_total,

				                    seastar::metrics::description("number of rows read and matched during filtering operations"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("rcu_total", [&stats]{return 0.5 * stats.rcu_half_units_total;},

				                    seastar::metrics::description("total number of consumed read units"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", stats.wcu_total[stats::wcu_types::PUT_ITEM],

				                    seastar::metrics::description("total number of consumed write units"), labels)(op("PutItem")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", stats.wcu_total[stats::wcu_types::DELETE_ITEM],

				                    seastar::metrics::description("total number of consumed write units"), labels)(op("DeleteItem")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", stats.wcu_total[stats::wcu_types::UPDATE_ITEM],

				                    seastar::metrics::description("total number of consumed write units"), labels)(op("UpdateItem")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("wcu_total", stats.wcu_total[stats::wcu_types::INDEX],

				                    seastar::metrics::description("total number of consumed write units"), labels)(op("Index")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("filtered_rows_dropped_total", [&stats] { return stats.cql_stats.filtered_rows_read_total - stats.cql_stats.filtered_rows_matched_total; },

				                    seastar::metrics::description("number of rows read and dropped during filtering operations"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"), labels,

				                    stats.api_operations.batch_write_item_batch_total)(op("BatchWriteItem")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"), labels,

				                    stats.api_operations.batch_get_item_batch_total)(op("BatchGetItem")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,

				                    [&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_get_item_histogram);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,

				                    [&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_write_item_histogram);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				    });

				}

				void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats) {

				    register_metrics_with_optional_table(metrics, stats, "", "");

				}

				table_stats::table_stats(const sstring& ks, const sstring& table) {

				    _stats = make_lw_shared<stats>();

				    register_metrics_with_optional_table(_metrics, *_stats, ks, table);

				}

				}
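The `estimated_histogram_to_metrics` helper in the `stats.cc` diff above turns per-bucket counts into the cumulative counts that a metrics histogram reports. The sketch below shows only that accumulation step on plain vectors. `bucket` and `to_cumulative` are hypothetical names, and the bounds check against `counts.size()` is an added precaution for this sketch, not something taken from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a metrics histogram bucket.
struct bucket {
    double upper_bound = 0;
    std::uint64_t count = 0;  // cumulative: all samples <= upper_bound
};

// Builds one cumulative bucket per offset; counts beyond the last offset
// (for example an overflow bucket) are ignored, as in the diff's loop.
inline std::vector<bucket> to_cumulative(const std::vector<std::int64_t>& offsets,
                                         const std::vector<std::int64_t>& counts) {
    std::vector<bucket> res(offsets.size());
    std::uint64_t running = 0;
    for (std::size_t i = 0; i < res.size() && i < counts.size(); ++i) {
        running += static_cast<std::uint64_t>(counts[i]);
        res[i].upper_bound = static_cast<double>(offsets[i]);
        res[i].count = running;
    }
    return res;
}

// Small self-check: per-bucket counts {3, 0, 5} become cumulative {3, 3, 8}.
inline bool cumulative_self_check() {
    const auto b = to_cumulative({1, 2, 4}, {3, 0, 5, 7});
    return b.size() == 3 && b[0].count == 3 && b[1].count == 3 && b[2].count == 8;
}
```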

									
										18

alternator/stats.hh
									
												
				@@ -12,6 +12,7 @@

				#include <seastar/core/metrics_registration.hh>

				#include "utils/histogram.hh"

				#include "utils/estimated_histogram.hh"

				#include "cql3/stats.hh"

				namespace alternator {

				@@ -21,7 +22,6 @@ namespace alternator {

				// visible by the metrics REST API, with the "alternator" prefix.

				class stats {

				public:

				    stats();

				    // Count of DynamoDB API operations by types

				    struct {

				        uint64_t batch_get_item = 0;

				@@ -75,6 +75,9 @@ public:

				        utils::timed_rate_moving_average_summary_and_histogram batch_write_item_latency;

				        utils::timed_rate_moving_average_summary_and_histogram batch_get_item_latency;

				        utils::timed_rate_moving_average_summary_and_histogram get_records_latency;

				        utils::estimated_histogram batch_get_item_histogram{22}; // a histogram that covers the range 1 - 100

				        utils::estimated_histogram batch_write_item_histogram{22}; // a histogram that covers the range 1 - 100

				    } api_operations;

				    // Miscellaneous event counters

				    uint64_t total_operations = 0;

				@@ -84,7 +87,7 @@ public:

				    uint64_t shard_bounce_for_lwt = 0;

				    uint64_t requests_blocked_memory = 0;

				    uint64_t requests_shed = 0;

				    uint64_t rcu_total = 0;

				    uint64_t rcu_half_units_total = 0;

				    // wcu can result from put, update, delete and index

				    // Index related will be done on top of the operation it comes with

				    enum wcu_types {

				@@ -98,10 +101,13 @@ public:

				    uint64_t wcu_total[NUM_TYPES] = {0};

				    // CQL-derived stats

				    cql3::cql_stats cql_stats;

				private:

				    // The metric_groups object holds this stat object's metrics registered

				    // as long as the stats object is alive.

				    seastar::metrics::metric_groups _metrics;

				};

				struct table_stats {

				    table_stats(const sstring& ks, const sstring& table);

				    seastar::metrics::metric_groups _metrics;

				    lw_shared_ptr<stats> _stats;

				};

				void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats);

				}

									
										15

alternator/streams.cc
									
												
				@@ -217,7 +217,7 @@ future<alternator::executor::request_return_type> alternator::executor::list_str

				        rjson::add(ret, "LastEvaluatedStreamArn", *last);

				    }

				    return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				    return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				}

				struct shard_id {

				@@ -491,7 +491,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				    if (!opts.enabled()) {

				        rjson::add(ret, "StreamDescription", std::move(stream_desc));

				        return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				        return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				    }

				    // TODO: label

				@@ -617,7 +617,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				        rjson::add(stream_desc, "Shards", std::move(shards));

				        rjson::add(ret, "StreamDescription", std::move(stream_desc));

				        return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				        return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				    });

				}

				@@ -770,7 +770,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&

				    auto ret = rjson::empty_object();

				    rjson::add(ret, "ShardIterator", iter);

				    return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				    return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				}

				struct event_id {

				@@ -808,6 +808,9 @@ future<executor::request_return_type> executor::get_records(client_state& client

				    if (limit < 1) {

				        throw api_error::validation("Limit must be 1 or more");

				    }

				    if (limit > 1000) {

				        throw api_error::validation("Limit must be less than or equal to 1000");

				    }

				    auto db = _proxy.data_dictionary();

				    schema_ptr schema, base;

				@@ -1018,7 +1021,7 @@ future<executor::request_return_type> executor::get_records(client_state& client

				            // will notice the end of shard and not return NextShardIterator.

				            rjson::add(ret, "NextShardIterator", next_iter);

				            _stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);

				            return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				            return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				        }

				        // ugh. figure out if we are and end-of-shard

				@@ -1044,7 +1047,7 @@ future<executor::request_return_type> executor::get_records(client_state& client

				            if (is_big(ret)) {

				                return make_ready_future<executor::request_return_type>(make_streamed(std::move(ret)));

				            }

				            return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));

				            return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				        });

				    });

				}
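
The get_records() hunk above adds an upper bound to the existing lower-bound check on the limit. A minimal standalone sketch of that validation follows; the function name `check_get_records_limit` and the use of `std::optional<std::string>` in place of `api_error::validation` are assumptions made for illustration, not part of the patch.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the bounds check in get_records().
// Returns the validation message when the limit is out of range,
// or std::nullopt when the limit is acceptable (1..1000 inclusive).
inline std::optional<std::string> check_get_records_limit(long limit) {
    if (limit < 1) {
        return std::string("Limit must be 1 or more");
    }
    if (limit > 1000) {
        return std::string("Limit must be less than or equal to 1000");
    }
    return std::nullopt;
}
```

In the real code the failure path throws `api_error::validation` rather than returning a value; the sketch only mirrors the accepted range.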

									
										111

alternator/ttl.cc
									
												
				@@ -81,11 +81,6 @@ future<executor::request_return_type> executor::update_time_to_live(client_state

				        co_return api_error::validation("UpdateTimeToLive requires boolean Enabled");

				    }

				    bool enabled = v->GetBool();

				    // Alternator TTL doesn't yet work when the table uses tablets (#16567)

				    if (enabled && _proxy.local_db().find_keyspace(schema->ks_name()).get_replication_strategy().uses_tablets()) {

				        co_return api_error::validation("TTL not yet supported on a table using tablets (issue #16567). "

				            "Create a table with the tag 'experimental:initial_tablets' set to 'none' to use vnodes.");

				    }

				    v = rjson::find(*spec, "AttributeName");

				    if (!v || !v->IsString()) {

				        co_return api_error::validation("UpdateTimeToLive requires string AttributeName");

				@@ -123,7 +118,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state

				    // basically identical to the request's

				    rjson::value response = rjson::empty_object();

				    rjson::add(response, "TimeToLiveSpecification", std::move(*spec));

				    co_return make_jsonable(std::move(response));

				    co_return rjson::print(std::move(response));

				}

				future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {

				@@ -140,7 +135,7 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta

				    }

				    rjson::value response = rjson::empty_object();

				    rjson::add(response, "TimeToLiveDescription", std::move(desc));

				    co_return make_jsonable(std::move(response));

				    co_return rjson::print(std::move(response));

				}

				// expiration_service is a sharded service responsible for cleaning up expired

				@@ -315,6 +310,8 @@ static size_t random_offset(size_t min, size_t max) {

				// this range's primary node is down. For this we need to return not just

				// a list of this node's secondary ranges - but also the primary owner of

				// each of those ranges.

				//

				// The function is to be used with vnodes only

				static future<std::vector<std::pair<dht::token_range, locator::host_id>>> get_secondary_ranges(

				        const locator::effective_replication_map_ptr& erm,

				        locator::host_id ep) {

				@@ -429,6 +426,8 @@ public:

				    }

				};

				// The token_ranges_owned_by_this_shard class is only used for vnodes, where the vnodes give a partition range for the entire node

				// and such range still needs to be divided between the shards.

				template<class primary_or_secondary_t>

				class token_ranges_owned_by_this_shard {

				    schema_ptr _s;

				@@ -655,6 +654,17 @@ static future<> scan_table_ranges(

				    }

				}

				static future<> scan_tablet(locator::tablet_id tablet, service::storage_proxy& proxy, abort_source& abort_source, named_semaphore& page_sem,

				            expiration_service::stats& expiration_stats, const scan_ranges_context& scan_ctx, const locator::tablet_map& tablet_map) {

				    auto tablet_token_range = tablet_map.get_token_range(tablet);

				    dht::ring_position tablet_start(tablet_token_range.start()->value(), dht::ring_position::token_bound::start),

				                       tablet_end(tablet_token_range.end()->value(), dht::ring_position::token_bound::end);

				    auto partition_range = dht::partition_range::make(std::move(tablet_start), std::move(tablet_end));

				    // Note that because of issue #9167 we need to run a separate query on each partition range, and can't pass

				    // several of them into one partition_range_vector that is passed to scan_table_ranges().

				    return scan_table_ranges(proxy, scan_ctx, {partition_range}, abort_source, page_sem, expiration_stats);

				}

				// scan_table() scans, in one table, data "owned" by this shard, looking for

				// expired items and deleting them.

				// We consider each node to "own" its primary token ranges, i.e., the tokens

				@@ -730,34 +740,65 @@ static future<bool> scan_table(

				    expiration_stats.scan_table++;

				    // FIXME: need to pace the scan, not do it all at once.

				    scan_ranges_context scan_ctx{s, proxy, std::move(column_name), std::move(member)};

				    auto erm = db.real_database().find_keyspace(s->ks_name()).get_vnode_effective_replication_map();

				    auto my_host_id = erm->get_topology().my_host_id();

				    token_ranges_owned_by_this_shard my_ranges(s, co_await ranges_holder_primary::make(erm, my_host_id));

				    while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {

				        // Note that because of issue #9167 we need to run a separate

				        // query on each partition range, and can't pass several of

				        // them into one partition_range_vector.

				        dht::partition_range_vector partition_ranges;

				        partition_ranges.push_back(std::move(*range));

				        // FIXME: if scanning a single range fails, including network errors,

				        // we fail the entire scan (and rescan from the beginning). Need to

				        // reconsider this. Saving the scan position might be a good enough

				        // solution for this problem.

				        co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);

				    }

				    // If each node only scans its own primary ranges, then when any node is

				    // down part of the token range will not get scanned. This can be viewed

				    // as acceptable (when the comes back online, it will resume its scan),

				    // but as noted in issue #9787, we can allow more prompt expiration

				    // by tasking another node to take over scanning of the dead node's primary

				    // ranges. What we do here is that this node will also check expiration

				    // on its *secondary* ranges - but only those whose primary owner is down.

				    token_ranges_owned_by_this_shard my_secondary_ranges(s, co_await ranges_holder_secondary::make(erm, my_host_id, gossiper));

				    while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {

				        expiration_stats.secondary_ranges_scanned++;

				        dht::partition_range_vector partition_ranges;

				        partition_ranges.push_back(std::move(*range));

				        co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);

				    if (s->table().uses_tablets()) {

				        locator::effective_replication_map_ptr erm = s->table().get_effective_replication_map();

				        auto my_host_id = erm->get_topology().my_host_id();

				        const auto &tablet_map = erm->get_token_metadata().tablets().get_tablet_map(s->id());

				        for (std::optional tablet = tablet_map.first_tablet(); tablet; tablet = tablet_map.next_tablet(*tablet)) {

				            auto tablet_primary_replica = tablet_map.get_primary_replica(*tablet);

				            // check if this is the primary replica for the current tablet

				            if (tablet_primary_replica.host == my_host_id && tablet_primary_replica.shard == this_shard_id()) {

				                co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);

				            } else if(erm->get_replication_factor() > 1) {

				                // Check if this is the secondary replica for the current tablet

				                // and if the primary replica is down which means we will take over this work.

				                // If each node only scans its own primary ranges, then when any node is

				                // down part of the token range will not get scanned. This can be viewed

				                // as acceptable (when the comes back online, it will resume its scan),

				                // but as noted in issue #9787, we can allow more prompt expiration

				                // by tasking another node to take over scanning of the dead node's primary

				                // ranges. What we do here is that this node will also check expiration

				                // on its *secondary* ranges - but only those whose primary owner is down.

				                auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet); // throws if no secondary replica

				                if (tablet_secondary_replica.host == my_host_id && tablet_secondary_replica.shard == this_shard_id()) {

				                    if (!gossiper.is_alive(tablet_primary_replica.host)) {

				                        co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);

				                    }

				                }

				            }

				        }

				    } else {  // VNodes

				        locator::vnode_effective_replication_map_ptr erm =

				                db.real_database().find_keyspace(s->ks_name()).get_vnode_effective_replication_map();

				        auto my_host_id = erm->get_topology().my_host_id();

				        token_ranges_owned_by_this_shard my_ranges(s, co_await ranges_holder_primary::make(erm, my_host_id));

				        while (std::optional<dht::partition_range> range = my_ranges.next_partition_range()) {

				            // Note that because of issue #9167 we need to run a separate

				            // query on each partition range, and can't pass several of

				            // them into one partition_range_vector.

				            dht::partition_range_vector partition_ranges;

				            partition_ranges.push_back(std::move(*range));

				            // FIXME: if scanning a single range fails, including network errors,

				            // we fail the entire scan (and rescan from the beginning). Need to

				            // reconsider this. Saving the scan position might be a good enough

				            // solution for this problem.

				            co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);

				        }

				        // If each node only scans its own primary ranges, then when any node is

				        // down part of the token range will not get scanned. This can be viewed

				        // as acceptable (when the comes back online, it will resume its scan),

				        // but as noted in issue #9787, we can allow more prompt expiration

				        // by tasking another node to take over scanning of the dead node's primary

				        // ranges. What we do here is that this node will also check expiration

				        // on its *secondary* ranges - but only those whose primary owner is down.

				        token_ranges_owned_by_this_shard my_secondary_ranges(s, co_await ranges_holder_secondary::make(erm, my_host_id, gossiper));

				        while (std::optional<dht::partition_range> range = my_secondary_ranges.next_partition_range()) {

				            expiration_stats.secondary_ranges_scanned++;

				            dht::partition_range_vector partition_ranges;

				            partition_ranges.push_back(std::move(*range));

				            co_await scan_table_ranges(proxy, scan_ctx, std::move(partition_ranges), abort_source, page_sem, expiration_stats);

				        }

				    }

				    co_return true;

				}
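
The tablet branch of scan_table() above decides, per tablet, whether the current shard scans it: always when it is the primary replica, and as the secondary replica only while the primary's host is down. The sketch below isolates that decision under simplified assumptions. The `replica` struct, `should_scan_tablet`, and the `is_alive` callback are hypothetical stand-ins for `locator::tablet_replica`, the inline checks, and `gossiper.is_alive()`. A missing secondary is modeled with `std::optional`, whereas the patch guards the lookup with a replication factor check.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical, simplified stand-in for a tablet replica (host + shard).
struct replica {
    std::uint64_t host;
    unsigned shard;
};

inline bool same_replica(const replica& a, const replica& b) {
    return a.host == b.host && a.shard == b.shard;
}

// Returns true when the shard described by `me` should scan the tablet:
//  - it is the tablet's primary replica, or
//  - it is the secondary replica and the primary's host is not alive.
inline bool should_scan_tablet(const replica& me,
                               const replica& primary,
                               const std::optional<replica>& secondary,
                               const std::function<bool(std::uint64_t)>& is_alive) {
    if (same_replica(primary, me)) {
        return true;
    }
    if (secondary && same_replica(*secondary, me)) {
        return !is_alive(primary.host);
    }
    return false;
}
```

Keeping the decision in a pure function like this makes the takeover rule easy to test in isolation; the patch itself performs the same checks inline inside the coroutine.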

									
										58

api/api-doc/compaction_manager.json
									
												
				@@ -246,6 +246,24 @@

				            }

				         }

				      },

				      "sstableinfo":{

				         "id":"sstableinfo",

				         "description":"Compacted sstable information",

				         "properties":{

				            "generation":{

				               "type": "string",

				               "description":"Generation of the sstable"

				            },

				            "origin":{

				               "type":"string",

				               "description":"Origin of the sstable"

				            },

				            "size":{

				               "type":"long",

				               "description":"Size of the sstable"

				            }

				         }

				      },

				      "compaction_info" :{

				          "id": "compaction_info",

				          "description":"A key value mapping",

				@@ -327,6 +345,10 @@

				               "type":"string",

				               "description":"The UUID"

				            },

				            "shard_id":{

				               "type":"int",

				               "description":"The shard id the compaction was executed on"

				            },

				            "cf":{

				               "type":"string",

				               "description":"The column family name"

				@@ -335,9 +357,17 @@

				               "type":"string",

				               "description":"The keyspace name"

				            },

				            "compaction_type":{

				               "type":"string",

				               "description":"Type of compaction"

				            },

				            "started_at":{

				               "type":"long",

				               "description":"The time compaction started"

				            },

				            "compacted_at":{

				               "type":"long",

				               "description":"The time of compaction"

				               "description":"The time compaction completed"

				            },

				            "bytes_in":{

				               "type":"long",

				@@ -353,6 +383,32 @@

				                  "type":"row_merged"

				               },

				               "description":"The merged rows"

				            },

				            "sstables_in": {

				               "type":"array",

				               "items":{

				                  "type":"sstableinfo"

				               },

				               "description":"List of input sstables for compaction"

				            },

				            "sstables_out": {

				               "type":"array",

				               "items":{

				                  "type":"sstableinfo"

				               },

				               "description":"List of output sstables from compaction"

				            },

				            "total_tombstone_purge_attempt":{

				               "type":"long",

				               "description":"Total number of tombstone purge attempts"

				            },

				            "total_tombstone_purge_failure_due_to_overlapping_with_memtable":{

				               "type":"long",

				               "description":"Number of tombstone purge failures due to data overlapping with memtables"

				            },

				            "total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable":{

				               "type":"long",

				               "description":"Number of tombstone purge failures due to data overlapping with non-compacting sstables"

				            }

				        }

				      }

									
										8

api/api-doc/gossiper.json
									
												
				@@ -136,14 +136,6 @@

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"path"

				                  },

				                  {

				                     "name":"unsafe",

				                     "description":"Set to True to perform an unsafe assassination",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

									
										164

api/api-doc/storage_service.json
									
												
				@@ -984,7 +984,7 @@

				         ]

				      },

				      {

				         "path":"/storage_service/cleanup_all",

				         "path":"/storage_service/cleanup_all/",

				         "operations":[

				            {

				               "method":"POST",

				@@ -994,6 +994,30 @@

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                    {

				                     "name":"global",

				                     "description":"true if cleanup of entire cluster is requested",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/mark_node_as_clean",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Mark the node as clean. After that the node will not be considered as needing cleanup during automatic cleanup which is triggered by some topology operations",

				               "type":"void",

				               "nickname":"reset_cleanup_needed",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[]

				            }

				         ]

				@@ -2144,6 +2168,31 @@

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"skip_cleanup",

				                     "description":"Don't cleanup keys from loaded sstables. Invalid if load_and_stream is true",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"skip_reshape",

				                     "description":"Don't reshape the loaded sstables. Invalid if load_and_stream is true",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"scope",

				                     "description":"Defines the set of nodes to which mutations can be streamed",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query",

				                     "enum": ["all", "dc", "rack", "node"]

				                  }

				               ]

				            }

				@@ -3027,6 +3076,73 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/retrain_dict",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Retrain the SSTable compression dictionary for the target table.",

				               "type":"void",

				               "nickname":"retrain_dict",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"keyspace",

				                     "description":"Name of the keyspace containing the target table.",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Name of the target table.",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/estimate_compression_ratios",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Compute an estimated compression ratio for SSTables of the given table, for various compression configurations.",

				               "type":"array",

				               "items":{

				                  "type":"compression_config_result"

				               },

				               "nickname":"estimate_compression_ratios",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"keyspace",

				                     "description":"Name of the keyspace containing the target table.",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"cf",

				                     "description":"Name of the target table.",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/raft_topology/reload",

				         "operations":[

				@@ -3069,6 +3185,22 @@

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/raft_topology/cmd_rpc_status",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Get information about currently running topology cmd rpc",

				               "type":"string",

				               "nickname":"raft_topology_get_cmd_status",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				               ]

				            }

				         ]

				      }

				   ],

				   "models":{

				@@ -3205,11 +3337,11 @@

				         "properties":{

				            "start_token":{

				               "type":"string",

				               "description":"The range start token"

				               "description":"The range start token (exclusive)"

				            },

				            "end_token":{

				               "type":"string",

				               "description":"The range start token"

				               "description":"The range end token (inclusive)"

				            },

				            "endpoints":{

				               "type":"array",

				@@ -3328,6 +3460,32 @@

				                "type":"string"

				            }

				        }

				      },

				      "compression_config_result":{

				         "id":"compression_config_result",

				         "description":"Compression ratio estimation result for one config",

				         "properties":{

				            "level":{

				               "type":"long",

				               "description":"The used value of `compression_level`"

				            },

				            "chunk_length_in_kb":{

				               "type":"long",

				               "description":"The used value of `chunk_length_in_kb`"

				            },

				            "dict":{

				               "type":"string",

				               "description":"The used dictionary: `none`, `past` (== current), or `future`"

				            },

				            "sstable_compression":{

				               "type":"string",

				               "description":"The used compressor name (aka `sstable_compression`)"

				            },

				            "ratio":{

				               "type":"float",

				               "description":"The resulting compression ratio (estimated on a random sample of files)"

				            }

				         }

				      }

				   }

				}
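
The range model in the storage_service.json hunk above now documents `start_token` as exclusive and `end_token` as inclusive. A small sketch of what that convention means for a membership test follows. The `token_range` struct and `contains` function are hypothetical, tokens are modeled as 64-bit integers although the API reports them as strings, and ranges that wrap around the ring are deliberately not handled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one range entry: (start_token, end_token].
struct token_range {
    std::int64_t start_token; // exclusive
    std::int64_t end_token;   // inclusive
};

// Membership test for a non-wrapping range only (an assumption of this
// sketch): the start is excluded and the end is included.
inline bool contains(const token_range& r, std::int64_t token) {
    return token > r.start_token && token <= r.end_token;
}
```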

									
										8

api/api-doc/tasks.json
									
												
				@@ -42,6 +42,14 @@

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"consider_only_existing_data",

				                     "description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

									
										27

api/api.cc
									
												
				@@ -391,32 +391,5 @@ future<> unset_server_raft(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_raft(ctx, r); });

				}

				void req_params::process(const request& req) {

				    // Process mandatory parameters

				    for (auto& [name, ent] : params) {

				        if (!ent.is_mandatory) {

				            continue;

				        }

				        try {

				            ent.value = req.get_path_param(name);

				        } catch (std::out_of_range&) {

				            throw httpd::bad_param_exception(fmt::format("Mandatory parameter '{}' was not provided", name));

				        }

				    }

				    // Process optional parameters

				    for (auto& [name, value] : req.query_parameters) {

				        try {

				            auto& ent = params.at(name);

				            if (ent.is_mandatory) {

				                throw httpd::bad_param_exception(fmt::format("Parameter '{}' is expected to be provided as part of the request url", name));

				            }

				            ent.value = value;

				        } catch (std::out_of_range&) {

				            throw httpd::bad_param_exception(fmt::format("Unsupported optional parameter '{}'", name));

				        }

				    }

				}

				}

									
										83

api/api.hh
									
												
				@@ -23,17 +23,6 @@

				namespace api {

				template<class T>

				std::vector<sstring> container_to_vec(const T& container) {

				    std::vector<sstring> res;

				    res.reserve(std::size(container));

				    for (const auto& i : container) {

				        res.push_back(fmt::to_string(i));

				    }

				    return res;

				}

				template<class T>

				std::vector<T> map_to_key_value(const std::map<sstring, sstring>& map) {

				    std::vector<T> res;

				@@ -67,17 +56,6 @@ T map_sum(T&& dest, const S& src) {

				    return std::move(dest);

				}

				template <typename MAP>

				std::vector<sstring> map_keys(const MAP& map) {

				    std::vector<sstring> res;

				    res.reserve(std::size(map));

				    for (const auto& i : map) {

				        res.push_back(fmt::to_string(i.first));

				    }

				    return res;

				}

				/**

				 * General sstring splitting function

				 */

				@@ -252,67 +230,6 @@ public:

				    operator T() const { return value; }

				};

				using mandatory = bool_class<struct mandatory_tag>;

				class req_params {

				public:

				    struct def {

				        std::optional<sstring> value;

				        mandatory is_mandatory = mandatory::no;

				        def(std::optional<sstring> value_ = std::nullopt, mandatory is_mandatory_ = mandatory::no)

				            : value(std::move(value_))

				            , is_mandatory(is_mandatory_)

				        { }

				        def(mandatory is_mandatory_)

				            : is_mandatory(is_mandatory_)

				        { }

				    };

				private:

				    std::unordered_map<sstring, def> params;

				public:

				    req_params(std::initializer_list<std::pair<sstring, def>> l) {

				        for (const auto& [name, ent] : l) {

				            add(std::move(name), std::move(ent));

				        }

				    }

				    void add(sstring name, def ent) {

				        params.emplace(std::move(name), std::move(ent));

				    }

				    void process(const request& req);

				    const std::optional<sstring>& get(const char* name) const {

				        return params.at(name).value;

				    }

				    template <typename T = sstring>

				    const std::optional<T> get_as(const char* name) const {

				        return get(name);

				    }

				    template <typename T = sstring>

				    requires std::same_as<T, bool>

				    const std::optional<bool> get_as(const char* name) const {

				        auto value = get(name);

				        if (!value) {

				            return std::nullopt;

				        }

				        std::transform(value->begin(), value->end(), value->begin(), ::tolower);

				        if (value == "true" || value == "yes" || value == "1") {

				            return true;

				        }

				        if (value == "false" || value == "no" || value == "0") {

				            return false;

				        }

				        throw boost::bad_lexical_cast{};

				    }

				};

				httpd::utils_json::estimated_histogram time_to_json_histogram(const utils::time_estimated_histogram& val);

				}
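
The `get_as<bool>` specialization shown in the api.hh hunk above lowercases the parameter value and accepts "true", "yes", "1" and "false", "no", "0". The standalone sketch below reproduces that parsing rule. The name `parse_bool_param` is hypothetical, `std::string` replaces `sstring`, and `std::invalid_argument` is swapped in for `boost::bad_lexical_cast` so that the example depends only on the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical standalone version of the boolean parameter parsing.
// An absent value stays absent; a present value is matched
// case-insensitively against the accepted spellings.
inline std::optional<bool> parse_bool_param(std::optional<std::string> value) {
    if (!value) {
        return std::nullopt;
    }
    std::transform(value->begin(), value->end(), value->begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (*value == "true" || *value == "yes" || *value == "1") {
        return true;
    }
    if (*value == "false" || *value == "no" || *value == "0") {
        return false;
    }
    // The code in the diff throws boost::bad_lexical_cast here; this sketch
    // uses a standard exception type instead.
    throw std::invalid_argument("not a boolean: " + *value);
}
```

The parameter is taken by value so the lowercase transform works on a private copy and never modifies the caller's string.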

									
										47

api/column_family.cc
									
												
				@@ -360,13 +360,7 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace

				        });

				    cf::get_column_family_name_keyspace.set(r, [&ctx] (const_req req){

				        std::vector<sstring> res;

				        const flat_hash_map<sstring, replica::keyspace>& keyspaces = ctx.db.local().get_keyspaces();

				        res.reserve(keyspaces.size());

				        for (const auto& i : keyspaces) {

				            res.push_back(i.first);

				        }

				        return res;

				        return ctx.db.local().get_all_keyspaces();

				    });

				    cf::get_memtable_columns_count.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				@@ -902,17 +896,13 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace

				    });

				    ss::enable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {

				        auto keyspace = validate_keyspace(ctx, req);

				        auto tables = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");

				        auto [keyspace, tables] = parse_table_infos(ctx, *req);

				        apilog.info("enable_auto_compaction: keyspace={} tables={}", keyspace, tables);

				        return set_tables_autocompaction(ctx, std::move(tables), true);

				    });

				    ss::disable_auto_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) {

				        auto keyspace = validate_keyspace(ctx, req);

				        auto tables = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");

				        auto [keyspace, tables] = parse_table_infos(ctx, *req);

				        apilog.info("disable_auto_compaction: keyspace={} tables={}", keyspace, tables);

				        return set_tables_autocompaction(ctx, std::move(tables), false);

				    });

				@@ -936,25 +926,19 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace

				    });

				    ss::enable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {

				        auto keyspace = validate_keyspace(ctx, req);

				        auto tables = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");

				        auto [keyspace, tables] = parse_table_infos(ctx, *req);

				        apilog.info("enable_tombstone_gc: keyspace={} tables={}", keyspace, tables);

				        return set_tables_tombstone_gc(ctx, std::move(tables), true);

				    });

				    ss::disable_tombstone_gc.set(r, [&ctx](std::unique_ptr<http::request> req) {

				        auto keyspace = validate_keyspace(ctx, req);

				        auto tables = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");

				        auto [keyspace, tables] = parse_table_infos(ctx, *req);

				        apilog.info("disable_tombstone_gc: keyspace={} tables={}", keyspace, tables);

				        return set_tables_tombstone_gc(ctx, std::move(tables), false);

				    });

				    cf::get_built_indexes.set(r, [&ctx, &sys_ks](std::unique_ptr<http::request> req) {

				        auto ks_cf = parse_fully_qualified_cf_name(req->get_path_param("name"));

				        auto&& ks = std::get<0>(ks_cf);

				        auto&& cf_name = std::get<1>(ks_cf);

				        auto [ks, cf_name] = parse_fully_qualified_cf_name(req->get_path_param("name"));

				        // Use of load_built_views() as filtering table should be in sync with

				        // built_indexes_virtual_reader filtering with BUILT_VIEWS table

				        return sys_ks.local().load_built_views().then([ks, cf_name, &ctx](const std::vector<db::system_keyspace::view_name>& vb) mutable {

				@@ -1054,13 +1038,13 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace

				        return ctx.db.map_reduce0([key, uuid] (replica::database& db) -> future<std::unordered_set<sstring>> {

				            auto sstables = co_await db.find_column_family(uuid).get_sstables_by_partition_key(key);

				            co_return sstables | std::views::transform([] (auto s) { return s->get_filename(); }) | std::ranges::to<std::unordered_set>();

				            co_return sstables | std::views::transform([] (auto s) -> sstring { return fmt::to_string(s->get_filename()); }) | std::ranges::to<std::unordered_set>();

				        }, std::unordered_set<sstring>(),

				        [](std::unordered_set<sstring> a, std::unordered_set<sstring>&& b) mutable {

				            a.merge(b);

				            return a;

				        }).then([](const std::unordered_set<sstring>& res) {

				            return make_ready_future<json::json_return_type>(container_to_vec(res));

				            return make_ready_future<json::json_return_type>(res | std::ranges::to<std::vector>());

				        });

				    });

				@@ -1082,19 +1066,12 @@ void set_column_family(http_context& ctx, routes& r, sharded<db::system_keyspace

				    });

				    cf::force_major_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto params = req_params({

				            std::pair("name", mandatory::yes),

				            std::pair("flush_memtables", mandatory::no),

				            std::pair("consider_only_existing_data", mandatory::no),

				            std::pair("split_output", mandatory::no),

				        });

				        params.process(*req);

				        if (params.get("split_output")) {

				        if (req->query_parameters.contains("split_output")) {

				            fail(unimplemented::cause::API);

				        }

				        auto [ks, cf] = parse_fully_qualified_cf_name(*params.get("name"));

				        auto flush = params.get_as<bool>("flush_memtables").value_or(true);

				        auto consider_only_existing_data = params.get_as<bool>("consider_only_existing_data").value_or(false);

				        auto [ks, cf] = parse_fully_qualified_cf_name(req->get_path_param("name"));

				        auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);

				        auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);

				        apilog.info("column_family/force_major_compaction: name={} flush={} consider_only_existing_data={}", req->get_path_param("name"), flush, consider_only_existing_data);

				        auto keyspace = validate_keyspace(ctx, ks);

									
										25

api/compaction_manager.cc
									
												
				@@ -14,6 +14,7 @@

				#include "api/api.hh"

				#include "api/api-doc/compaction_manager.json.hh"

				#include "api/api-doc/storage_service.json.hh"

				#include "db/compaction_history_entry.hh"

				#include "db/system_keyspace.hh"

				#include "column_family.hh"

				#include "unimplemented.hh"

				@@ -111,8 +112,7 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				    });

				    cm::stop_keyspace_compaction.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto ks_name = validate_keyspace(ctx, req);

				        auto tables = parse_table_infos(ks_name, ctx, req->query_parameters, "tables");

				        auto [ks_name, tables] = parse_table_infos(ctx, *req, "tables");

				        auto type = req->get_query_param("type");

				        co_await ctx.db.invoke_on_all([&] (replica::database& db) {

				            auto& cm = db.get_compaction_manager();

				@@ -160,8 +160,11 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				                co_await cm.local().get_compaction_history([&s, &first](const db::compaction_history_entry& entry) mutable -> future<> {

				                        cm::history h;

				                        h.id = fmt::to_string(entry.id);

				                        h.shard_id = entry.shard_id;

				                        h.ks = std::move(entry.ks);

				                        h.cf = std::move(entry.cf);

				                        h.compaction_type = entry.compaction_type;

				                        h.started_at = entry.started_at;

				                        h.compacted_at = entry.compacted_at;

				                        h.bytes_in = entry.bytes_in;

				                        h.bytes_out =  entry.bytes_out;

				@@ -173,6 +176,24 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				                            e.value = it.second;

				                            h.rows_merged.push(std::move(e));

				                        }

				                        for (const auto& data : entry.sstables_in) {

				                            httpd::compaction_manager_json::sstableinfo sstable;

				                            sstable.generation = fmt::to_string(data.generation),

				                            sstable.origin = data.origin,

				                            sstable.size = data.size,

				                            h.sstables_in.push(std::move(sstable));

				                        }

				                        for (const auto& data : entry.sstables_out) {

				                            httpd::compaction_manager_json::sstableinfo sstable;

				                            sstable.generation = fmt::to_string(data.generation),

				                            sstable.origin = data.origin,

				                            sstable.size = data.size,

				                            h.sstables_out.push(std::move(sstable));

				                        }

				                        h.total_tombstone_purge_attempt = entry.total_tombstone_purge_attempt;

				                        h.total_tombstone_purge_failure_due_to_overlapping_with_memtable = entry.total_tombstone_purge_failure_due_to_overlapping_with_memtable;

				                        h.total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable = entry.total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable;

				                        if (!first) {

				                            co_await s.write(", ");

				                        }

									
										2

api/config.cc
									
												
				@@ -187,7 +187,7 @@ void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx

				    });

				    ss::get_all_data_file_locations.set(r, [&cfg](const_req req) {

				        return container_to_vec(cfg.data_file_directories());

				        return cfg.data_file_directories();

				    });

				    ss::get_saved_caches_location.set(r, [&cfg](const_req req) {

									
										24

api/failure_detector.cc
									
												
				@@ -22,10 +22,10 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {

				        return g.container().invoke_on(0, [] (gms::gossiper& g) {

				            std::vector<fd::endpoint_state> res;

				            res.reserve(g.num_endpoints());

				            g.for_each_endpoint_state([&] (const gms::inet_address& addr, const gms::endpoint_state& eps) {

				            g.for_each_endpoint_state([&] (const gms::endpoint_state& eps) {

				                fd::endpoint_state val;

				                val.addrs = fmt::to_string(addr);

				                val.is_alive = g.is_alive(addr);

				                val.addrs = fmt::to_string(eps.get_ip());

				                val.is_alive = g.is_alive(eps.get_host_id());

				                val.generation = eps.get_heart_beat_state().get_generation().value();

				                val.version = eps.get_heart_beat_state().get_heart_beat_version().value();

				                val.update_time = eps.get_update_timestamp().time_since_epoch().count();

				@@ -40,7 +40,9 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {

				                }

				                res.emplace_back(std::move(val));

				            });

				            return make_ready_future<json::json_return_type>(res);

				            return make_ready_future<json::json_return_type>(json::stream_range_as_array(res, [](const fd::endpoint_state& i){

				                return i;

				            }));

				        });

				    });

				@@ -64,11 +66,15 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {

				    fd::get_simple_states.set(r, [&g] (std::unique_ptr<request> req) {

				        return g.container().invoke_on(0, [] (gms::gossiper& g) {

				            std::map<sstring, sstring> nodes_status;

				            g.for_each_endpoint_state([&] (const gms::inet_address& node, const gms::endpoint_state&) {

				                nodes_status.emplace(fmt::to_string(node), g.is_alive(node) ? "UP" : "DOWN");

				            std::vector<fd::mapper> nodes_status;

				            nodes_status.reserve(g.num_endpoints());

				            g.for_each_endpoint_state([&] (const gms::endpoint_state& es) {

				                fd::mapper val;

				                val.key = fmt::to_string(es.get_ip());

				                val.value = g.is_alive(es.get_host_id()) ? "UP" : "DOWN";

				                nodes_status.emplace_back(std::move(val));

				            });

				            return make_ready_future<json::json_return_type>(map_to_key_value<fd::mapper>(nodes_status));

				            return make_ready_future<json::json_return_type>(std::move(nodes_status));

				        });

				    });

				@@ -81,7 +87,7 @@ void set_failure_detector(http_context& ctx, routes& r, gms::gossiper& g) {

				    fd::get_endpoint_state.set(r, [&g] (std::unique_ptr<request> req) {

				        return g.container().invoke_on(0, [req = std::move(req)] (gms::gossiper& g) {

				            auto state = g.get_endpoint_state_ptr(gms::inet_address(req->get_path_param("addr")));

				            auto state = g.get_endpoint_state_ptr(g.get_host_id(gms::inet_address(req->get_path_param("addr"))));

				            if (!state) {

				                return make_ready_future<json::json_return_type>(format("unknown endpoint {}", req->get_path_param("addr")));

				            }

									
										24

api/gossiper.cc
									
												
				@@ -21,51 +21,45 @@ using namespace json;

				void set_gossiper(http_context& ctx, routes& r, gms::gossiper& g) {

				    httpd::gossiper_json::get_down_endpoint.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto res = co_await g.get_unreachable_members_synchronized();

				        co_return json::json_return_type(container_to_vec(res));

				        co_return json::json_return_type(res | std::views::transform([] (auto& ep) { return fmt::to_string(ep); }) | std::ranges::to<std::vector>());

				    });

				    httpd::gossiper_json::get_live_endpoint.set(r, [&g] (std::unique_ptr<request> req) {

				        return g.get_live_members_synchronized().then([] (auto res) {

				            return make_ready_future<json::json_return_type>(container_to_vec(res));

				        });

				    httpd::gossiper_json::get_live_endpoint.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto res = co_await g.get_live_members_synchronized();

				        co_return json::json_return_type(res | std::views::transform([] (auto& ep) { return fmt::to_string(ep); }) | std::ranges::to<std::vector>());

				    });

				    httpd::gossiper_json::get_endpoint_downtime.set(r, [&g] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        gms::inet_address ep(req->get_path_param("addr"));

				        // synchronize unreachable_members on all shards

				        co_await g.get_unreachable_members_synchronized();

				        co_return g.get_endpoint_downtime(ep);

				        co_return g.get_endpoint_downtime(g.get_host_id(ep));

				    });

				    httpd::gossiper_json::get_current_generation_number.set(r, [&g] (std::unique_ptr<http::request> req) {

				        gms::inet_address ep(req->get_path_param("addr"));

				        return g.get_current_generation_number(ep).then([] (gms::generation_type res) {

				        return g.get_current_generation_number(g.get_host_id(ep)).then([] (gms::generation_type res) {

				            return make_ready_future<json::json_return_type>(res.value());

				        });

				    });

				    httpd::gossiper_json::get_current_heart_beat_version.set(r, [&g] (std::unique_ptr<http::request> req) {

				        gms::inet_address ep(req->get_path_param("addr"));

				        return g.get_current_heart_beat_version(ep).then([] (gms::version_type res) {

				        return g.get_current_heart_beat_version(g.get_host_id(ep)).then([] (gms::version_type res) {

				            return make_ready_future<json::json_return_type>(res.value());

				        });

				    });

				    httpd::gossiper_json::assassinate_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {

				        if (req->get_query_param("unsafe") != "True") {

				            return g.assassinate_endpoint(req->get_path_param("addr")).then([] {

				                return make_ready_future<json::json_return_type>(json_void());

				            });

				        }

				        return g.unsafe_assassinate_endpoint(req->get_path_param("addr")).then([] {

				        return g.assassinate_endpoint(req->get_path_param("addr")).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    httpd::gossiper_json::force_remove_endpoint.set(r, [&g](std::unique_ptr<http::request> req) {

				        gms::inet_address ep(req->get_path_param("addr"));

				        return g.force_remove_endpoint(ep, gms::null_permit_id).then([] {

				        return g.force_remove_endpoint(g.get_host_id(ep), gms::null_permit_id).then([] () {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

									
										2

api/messaging_service.cc
									
												
				@@ -148,7 +148,7 @@ void set_messaging_service(http_context& ctx, routes& r, sharded<netw::messaging

				    hf::inject_disconnect.set(r, [&ms] (std::unique_ptr<request> req) -> future<json::json_return_type> {

				        auto ip = msg_addr(req->get_path_param("ip"));

				        co_await ms.invoke_on_all([ip] (netw::messaging_service& ms) {

				            ms.remove_rpc_client(ip);

				            ms.remove_rpc_client(ip, std::nullopt);

				        });

				        co_return json::json_void();

				    });

									
										2

api/service_levels.cc
									
												
				@@ -11,7 +11,7 @@

				#include "cql3/query_processor.hh"

				#include "cql3/untyped_result_set.hh"

				#include "db/consistency_level_type.hh"

				#include "seastar/json/json_elements.hh"

				#include <seastar/json/json_elements.hh>

				#include "transport/controller.hh"

				#include <unordered_map>

									
										399

api/storage_service.cc
									
												
				@@ -12,11 +12,16 @@

				#include "api/api-doc/storage_service.json.hh"

				#include "api/api-doc/storage_proxy.json.hh"

				#include "api/scrub_status.hh"

				#include "api/tasks.hh"

				#include "db/config.hh"

				#include "db/schema_tables.hh"

				#include "gms/feature_service.hh"

				#include "schema/schema_builder.hh"

				#include "sstables/sstables_manager.hh"

				#include "utils/hash.hh"

				#include <optional>

				#include <sstream>

				#include <stdexcept>

				#include <time.h>

				#include <algorithm>

				#include <functional>

				@@ -29,6 +34,7 @@

				#include "service/raft/raft_group0_client.hh"

				#include "service/storage_service.hh"

				#include "service/load_meter.hh"

				#include "gms/feature_service.hh"

				#include "gms/gossiper.hh"

				#include "db/system_keyspace.hh"

				#include <seastar/http/exception.hh>

				@@ -55,6 +61,7 @@

				#include "db/view/view_builder.hh"

				#include "utils/rjson.hh"

				#include "utils/user_provided_param.hh"

				#include "sstable_dict_autotrainer.hh"

				using namespace seastar::httpd;

				using namespace std::chrono_literals;

				@@ -122,37 +129,26 @@ bool validate_bool(const sstring& param) {

				    }

				}

				bool validate_bool_x(const sstring& param, bool default_value) {

				    if (param.empty()) {

				        return default_value;

				    }

				    if (strcasecmp(param.c_str(), "true") == 0 || strcasecmp(param.c_str(), "yes") == 0 || param == "1") {

				        return true;

				    }

				    if (strcasecmp(param.c_str(), "false") == 0 || strcasecmp(param.c_str(), "no") == 0 || param == "0") {

				        return false;

				    }

				    throw std::runtime_error("Invalid boolean parameter value");

				}

				static

				int64_t validate_int(const sstring& param) {

				    return std::atoll(param.c_str());

				}

				// splits a request parameter assumed to hold a comma-separated list of table names

				// verify that the tables are found, otherwise a bad_param_exception exception is thrown

				// containing the description of the respective no_such_column_family error.

				static std::vector<sstring> parse_tables(const sstring& ks_name, const http_context& ctx, sstring value) {

				    if (value.empty()) {

				        return map_keys(ctx.db.local().find_keyspace(ks_name).metadata().get()->cf_meta_data());

				    }

				    std::vector<sstring> names = split(value, ",");

				    try {

				        for (const auto& table_name : names) {

				            ctx.db.local().find_column_family(ks_name, table_name);

				        }

				    } catch (const replica::no_such_column_family& e) {

				        throw bad_param_exception(e.what());

				    }

				    return names;

				}

				static std::vector<sstring> parse_tables(const sstring& ks_name, const http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name) {

				    auto it = query_params.find(param_name);

				    if (it == query_params.end()) {

				        return {};

				    }

				    return parse_tables(ks_name, ctx, it->second);

				}

				std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, sstring value) {

				    std::vector<table_info> res;

				    try {

				@@ -178,9 +174,12 @@ std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_con

				    return res;

				}

				std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name) {

				    auto it = query_params.find(param_name);

				    return parse_table_infos(ks_name, ctx, it != query_params.end() ? it->second : "");

				std::pair<sstring, std::vector<table_info>> parse_table_infos(const http_context& ctx, const http::request& req, sstring cf_param_name) {

				    auto keyspace = validate_keyspace(ctx, req);

				    const auto& query_params = req.query_parameters;

				    auto it = query_params.find(cf_param_name);

				    auto tis = parse_table_infos(keyspace, ctx, it != query_params.end() ? it->second : "");

				    return std::make_pair(std::move(keyspace), std::move(tis));

				}

				static ss::token_range token_range_endpoints_to_json(const dht::token_range_endpoints& d) {

				@@ -201,16 +200,6 @@ static ss::token_range token_range_endpoints_to_json(const dht::token_range_endp

				    return r;

				}

				using ks_cf_func = std::function<future<json::json_return_type>(http_context&, std::unique_ptr<http::request>, sstring, std::vector<table_info>)>;

				static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {

				    return [&ctx, f = std::move(f)](std::unique_ptr<http::request> req) {

				        auto keyspace = validate_keyspace(ctx, req);

				        auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");

				        return f(ctx, std::move(req), std::move(keyspace), std::move(table_infos));

				    };

				}

				seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request) {

				    return q.scatter().then([&q, legacy_request] {

				        return sleep(q.duration()).then([&q, legacy_request] {

				@@ -243,28 +232,19 @@ seastar::future<json::json_return_type> run_toppartitions_query(db::toppartition

				future<scrub_info> parse_scrub_options(const http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl, std::unique_ptr<http::request> req) {

				    scrub_info info;

				    auto rp = req_params({

				        {"keyspace", {mandatory::yes}},

				        {"cf", {""}},

				        {"scrub_mode", {}},

				        {"skip_corrupted", {}},

				        {"disable_snapshot", {}},

				        {"quarantine_mode", {}},

				    });

				    rp.process(*req);

				    info.keyspace = validate_keyspace(ctx, *rp.get("keyspace"));

				    info.column_families = parse_tables(info.keyspace, ctx, *rp.get("cf"));

				    auto scrub_mode_opt = rp.get("scrub_mode");

				    auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");

				    info.keyspace = std::move(keyspace);

				    info.column_families = table_infos | std::views::transform([] (auto ti) { return ti.name; }) | std::ranges::to<std::vector>();

				    auto scrub_mode_str = req->get_query_param("scrub_mode");

				    auto scrub_mode = sstables::compaction_type_options::scrub::mode::abort;

				    if (!scrub_mode_opt) {

				        const auto skip_corrupted = rp.get_as<bool>("skip_corrupted").value_or(false);

				    if (scrub_mode_str.empty()) {

				        const auto skip_corrupted = validate_bool_x(req->get_query_param("skip_corrupted"), false);

				        if (skip_corrupted) {

				            scrub_mode = sstables::compaction_type_options::scrub::mode::skip;

				        }

				    } else {

				        auto scrub_mode_str = *scrub_mode_opt;

				        if (scrub_mode_str == "ABORT") {

				            scrub_mode = sstables::compaction_type_options::scrub::mode::abort;

				        } else if (scrub_mode_str == "SKIP") {

				@@ -278,11 +258,9 @@ future<scrub_info> parse_scrub_options(const http_context& ctx, sharded<db::snap

				        }

				    }

				    if (!req_param<bool>(*req, "disable_snapshot", false)) {

				    if (!req_param<bool>(*req, "disable_snapshot", false) && !info.column_families.empty()) {

				        auto tag = format("pre-scrub-{:d}", db_clock::now().time_since_epoch().count());

				        co_await coroutine::parallel_for_each(info.column_families, [&snap_ctl, keyspace = info.keyspace, tag](sstring cf) {

				            return snap_ctl.local().take_column_family_snapshot(keyspace, cf, tag, db::snapshot_ctl::skip_flush::no);

				        });

				        co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, tag, db::snapshot_ctl::skip_flush::no);

				    }

				    info.opts = {

				@@ -483,17 +461,27 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&

				        auto cf = req->get_query_param("cf");

				        auto stream = req->get_query_param("load_and_stream");

				        auto primary_replica = req->get_query_param("primary_replica_only");

				        auto skip_cleanup_p = req->get_query_param("skip_cleanup");

				        boost::algorithm::to_lower(stream);

				        boost::algorithm::to_lower(primary_replica);

				        bool load_and_stream = stream == "true" || stream == "1";

				        bool primary_replica_only = primary_replica == "true" || primary_replica == "1";

				        bool skip_cleanup = skip_cleanup_p == "true" || skip_cleanup_p == "1";

				        auto scope = parse_stream_scope(req->get_query_param("scope"));

				        auto skip_reshape_p = req->get_query_param("skip_reshape");

				        auto skip_reshape = skip_reshape_p == "true" || skip_reshape_p == "1";

				        if (scope != sstables_loader::stream_scope::all && !load_and_stream) {

				            throw httpd::bad_param_exception("scope takes no effect without load-and-stream");

				        }

				        // No need to add the keyspace, since all we want is to avoid always sending this to the same

				        // CPU. Even then I am being overzealous here. This is not something that happens all the time.

				        auto coordinator = std::hash<sstring>()(cf) % smp::count;

				        return sst_loader.invoke_on(coordinator,

				                [ks = std::move(ks), cf = std::move(cf),

				                load_and_stream, primary_replica_only] (sstables_loader& loader) {

				            return loader.load_new_sstables(ks, cf, load_and_stream, primary_replica_only, sstables_loader::stream_scope::all);

				                load_and_stream, primary_replica_only, skip_cleanup, skip_reshape, scope] (sstables_loader& loader) {

				            return loader.load_new_sstables(ks, cf, load_and_stream, primary_replica_only, skip_cleanup, skip_reshape, scope);

				        }).then_wrapped([] (auto&& f) {

				            if (f.failed()) {

				                auto msg = fmt::format("Failed to load new sstables: {}", f.get_exception());

				@@ -725,7 +713,7 @@ rest_get_load(http_context& ctx, std::unique_ptr<http::request> req) {

				static

				future<json::json_return_type>

				rest_get_current_generation_number(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        auto ep = ss.local().get_token_metadata().get_topology().my_address();

				        auto ep = ss.local().get_token_metadata().get_topology().my_host_id();

				        return ss.local().gossiper().get_current_generation_number(ep).then([](gms::generation_type res) {

				            return make_ready_future<json::json_return_type>(res.value());

				        });

				@@ -735,8 +723,8 @@ static

				json::json_return_type

				rest_get_natural_endpoints(http_context& ctx, sharded<service::storage_service>& ss, const_req req) {

				        auto keyspace = validate_keyspace(ctx, req);

				        return container_to_vec(ss.local().get_natural_endpoints(keyspace, req.get_query_param("cf"),

				                req.get_query_param("key")));

				        auto res = ss.local().get_natural_endpoints(keyspace, req.get_query_param("cf"), req.get_query_param("key"));

				        return res | std::views::transform([] (auto& ep) { return fmt::to_string(ep); }) | std::ranges::to<std::vector>();

				}

				static

				@@ -753,13 +741,8 @@ static

				future<json::json_return_type>

				rest_force_compaction(http_context& ctx, std::unique_ptr<http::request> req) {

				        auto& db = ctx.db;

				        auto params = req_params({

				            std::pair("flush_memtables", mandatory::no),

				            std::pair("consider_only_existing_data", mandatory::no),

				        });

				        params.process(*req);

				        auto flush = params.get_as<bool>("flush_memtables").value_or(true);

				        auto consider_only_existing_data = params.get_as<bool>("consider_only_existing_data").value_or(false);

				        auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);

				        auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);

				        apilog.info("force_compaction: flush={} consider_only_existing_data={}", flush, consider_only_existing_data);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				@@ -768,86 +751,39 @@ rest_force_compaction(http_context& ctx, std::unique_ptr<http::request> req) {

				            fmopt = flush_mode::skip;

				        }

				        auto task = co_await compaction_module.make_and_start_task<global_major_compaction_task_impl>({}, db, fmopt, consider_only_existing_data);

				        try {

				            co_await task->done();

				        } catch (...) {

				            apilog.error("force_compaction failed: {}", std::current_exception());

				            throw;

				        }

				        co_await task->done();

				        co_return json_void();

				}

				static

				future<json::json_return_type>

				rest_force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request> req) {

				        auto& db = ctx.db;

				        auto params = req_params({

				            std::pair("keyspace", mandatory::yes),

				            std::pair("cf", mandatory::no),

				            std::pair("flush_memtables", mandatory::no),

				            std::pair("consider_only_existing_data", mandatory::no),

				        });

				        params.process(*req);

				        auto keyspace = validate_keyspace(ctx, *params.get("keyspace"));

				        auto table_infos = parse_table_infos(keyspace, ctx, params.get("cf").value_or(""));

				        auto flush = params.get_as<bool>("flush_memtables").value_or(true);

				        auto consider_only_existing_data = params.get_as<bool>("consider_only_existing_data").value_or(false);

				        apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        std::optional<flush_mode> fmopt;

				        if (!flush && !consider_only_existing_data) {

				            fmopt = flush_mode::skip;

				        }

				        auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);

				        try {

				            co_await task->done();

				        } catch (...) {

				            apilog.error("force_keyspace_compaction: keyspace={} tables={} failed: {}", task->get_status().keyspace, table_infos, std::current_exception());

				            throw;

				        }

				        auto task = co_await force_keyspace_compaction(ctx, std::move(req));

				        co_await task->done();

				        co_return json_void();

				}

				static

				future<json::json_return_type>

				rest_force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        auto& db = ctx.db;

				        auto keyspace = validate_keyspace(ctx, req);

				        auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");

				        const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();

				        if (rs.get_type() == locator::replication_strategy_type::local || !rs.is_vnode_based()) {

				            auto reason = rs.get_type() == locator::replication_strategy_type::local ? "require" : "support";

				            apilog.info("Keyspace {} does not {} cleanup", keyspace, reason);

				            co_return json::json_return_type(0);

				        }

				        apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);

				        if (!co_await ss.local().is_cleanup_allowed(keyspace)) {

				            auto msg = "Can not perform cleanup operation when topology changes";

				            apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);

				            co_await coroutine::return_exception(std::runtime_error(msg));

				        }

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<cleanup_keyspace_compaction_task_impl>(

				            {}, std::move(keyspace), db, table_infos, flush_mode::all_tables, tasks::is_user_task::yes);

				        try {

				        auto task = co_await force_keyspace_cleanup(ctx, ss, std::move(req));

				        if (task) {

				            co_await task->done();

				        } catch (...) {

				            apilog.error("force_keyspace_cleanup: keyspace={} tables={} failed: {}", task->get_status().keyspace, table_infos, std::current_exception());

				            throw;

				        }

				        co_return json::json_return_type(0);

				}

				static

				future<json::json_return_type>

				rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        apilog.info("cleanup_all");

				        auto done = co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<bool> {

				        bool global = true;

				        if (auto global_param = req->get_query_param("global"); !global_param.empty()) {

				            global = validate_bool(global_param);

				        }

				        apilog.info("cleanup_all global={}", global);

				        auto done = !global ? false : co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<bool> {

				            if (!ss.is_topology_coordinator_enabled()) {

				                co_return false;

				            }

				@@ -857,53 +793,53 @@ rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::

				        if (done) {

				            co_return json::json_return_type(0);

				        }

				        // fall back to the local global cleanup if topology coordinator is not enabled

				        // fall back to the local cleanup if topology coordinator is not enabled or local cleanup is requested

				        auto& db = ctx.db;

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<global_cleanup_compaction_task_impl>({}, db);

				        try {

				            co_await task->done();

				        } catch (...) {

				            apilog.error("cleanup_all failed: {}", std::current_exception());

				            throw;

				        }

				        co_await task->done();

				        // Mark this node as clean

				        co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<> {

				            if (ss.is_topology_coordinator_enabled()) {

				                co_await ss.reset_cleanup_needed();

				            }

				        });

				        co_return json::json_return_type(0);

				}

				static

				future<json::json_return_type>

				rest_perform_keyspace_offstrategy_compaction(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) {

				rest_reset_cleanup_needed(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        apilog.info("reset_cleanup_needed");

				        co_await ss.invoke_on(0, [] (service::storage_service& ss) {

				            if (!ss.is_topology_coordinator_enabled()) {

				                throw std::runtime_error("mark_node_as_clean is only supported when topology over raft is enabled");

				            }

				            return ss.reset_cleanup_needed();

				        });

				        co_return json_void();

				}

				static

				future<json::json_return_type>

				rest_perform_keyspace_offstrategy_compaction(http_context& ctx, std::unique_ptr<http::request> req) {

				        auto [keyspace, table_infos] = parse_table_infos(ctx, *req);

				        apilog.info("perform_keyspace_offstrategy_compaction: keyspace={} tables={}", keyspace, table_infos);

				        bool res = false;

				        auto& compaction_module = ctx.db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, table_infos, &res);

				        try {

				            co_await task->done();

				        } catch (...) {

				            apilog.error("perform_keyspace_offstrategy_compaction: keyspace={} tables={} failed: {}", task->get_status().keyspace, table_infos, std::current_exception());

				            throw;

				        }

				        co_await task->done();

				        co_return json::json_return_type(res);

				}

				static

				future<json::json_return_type>

				rest_upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) {

				        auto& db = ctx.db;

				        bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);

				        apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);

				        try {

				            co_await task->done();

				        } catch (...) {

				            apilog.error("upgrade_sstables: keyspace={} tables={} failed: {}", keyspace, table_infos, std::current_exception());

				            throw;

				        }

				rest_upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req) {

				        auto [keyspace, table_infos] = parse_table_infos(ctx, *req);

				        auto task = co_await upgrade_sstables(ctx, std::move(req), std::move(keyspace), std::move(table_infos));

				        co_await task->done();

				        co_return json::json_return_type(0);

				}

				@@ -920,15 +856,10 @@ rest_force_flush(http_context& ctx, std::unique_ptr<http::request> req) {

				static

				future<json::json_return_type>

				rest_force_keyspace_flush(http_context& ctx, std::unique_ptr<http::request> req) {

				        auto keyspace = validate_keyspace(ctx, req);

				        auto column_families = parse_tables(keyspace, ctx, req->query_parameters, "cf");

				        apilog.info("perform_keyspace_flush: keyspace={} tables={}", keyspace, column_families);

				        auto [keyspace, table_infos] = parse_table_infos(ctx, *req);

				        apilog.info("perform_keyspace_flush: keyspace={} tables={}", keyspace, table_infos);

				        auto& db = ctx.db;

				        if (column_families.empty()) {

				            co_await replica::database::flush_keyspace_on_all_shards(db, keyspace);

				        } else {

				            co_await replica::database::flush_tables_on_all_shards(db, keyspace, std::move(column_families));

				        }

				        co_await replica::database::flush_tables_on_all_shards(db, std::move(table_infos));

				        co_return json_void();

				}

				@@ -1060,7 +991,7 @@ rest_get_keyspaces(http_context& ctx, const_req req) {

				        } else if (type == "non_local_strategy") {

				            keyspaces = ctx.db.local().get_non_local_strategy_keyspaces();

				        } else {

				            keyspaces = map_keys(ctx.db.local().get_keyspaces());

				            keyspaces = ctx.db.local().get_all_keyspaces();

				        }

				        if (replication.empty() || replication == "all") {

				            return keyspaces;

				@@ -1448,6 +1379,95 @@ rest_get_effective_ownership(http_context& ctx, sharded<service::storage_service

				        });

				}

				static

				future<json::json_return_type>

				rest_estimate_compression_ratios(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				    if (!ss.local().get_feature_service().sstable_compression_dicts) {

				        apilog.warn("estimate_compression_ratios: called before the cluster feature was enabled");

				        throw std::runtime_error("estimate_compression_ratios requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");

				    }

				    auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);

				    auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;

				    auto cf = api::req_param<sstring>(*req, "cf", {}).value;

				    apilog.debug("estimate_compression_ratios: called with ks={} cf={}", ks, cf);

				    auto s = ctx.db.local().find_column_family(ks, cf).schema();

				    auto training_sample = co_await ss.local().do_sample_sstables(s->id(), 4096, 4096);

				    auto validation_sample = co_await ss.local().do_sample_sstables(s->id(), 16*1024, 1024);

				    apilog.debug("estimate_compression_ratios: got training sample with {} blocks and validation sample with {}", training_sample.size(), validation_sample.size());

				    auto dict = co_await ss.local().train_dict(std::move(training_sample));

				    apilog.debug("estimate_compression_ratios: got dict of size {}", dict.size());

				    std::vector<ss::compression_config_result> res;

				    auto make_result = [](std::string_view name, int chunk_length_kb, std::string_view dict, int level, float ratio) -> ss::compression_config_result {

				        ss::compression_config_result x;

				        x.sstable_compression = sstring(name);

				        x.chunk_length_in_kb = chunk_length_kb;

				        x.dict = sstring(dict);

				        x.level = level;

				        x.ratio = ratio;

				        return x;

				    };

				    using algorithm = compression_parameters::algorithm;

				    for (const auto& algo : {algorithm::lz4_with_dicts, algorithm::zstd_with_dicts}) {

				        for (const auto& chunk_size_kb : {1, 4, 16}) {

				            std::vector<int> levels;

				            if (algo == compressor::algorithm::zstd_with_dicts) {

				                for (int i = 1; i <= 5; ++i) {

				                    levels.push_back(i);

				                }

				            } else {

				                levels.push_back(1);

				            }

				            for (auto level : levels) {

				                auto algo_name = compression_parameters::algorithm_to_name(algo);

				                auto m = std::map<sstring, sstring>{

				                    {compression_parameters::CHUNK_LENGTH_KB, std::to_string(chunk_size_kb)},

				                    {compression_parameters::SSTABLE_COMPRESSION, sstring(algo_name)},

				                };

				                if (algo == compressor::algorithm::zstd_with_dicts) {

				                    m.insert(decltype(m)::value_type{sstring("compression_level"), sstring(std::to_string(level))});

				                }

				                auto params = compression_parameters(std::move(m));

				                auto ratio_with_no_dict = co_await try_one_compression_config({}, s, params, validation_sample);

				                auto ratio_with_past_dict = co_await try_one_compression_config(ctx.db.local().get_user_sstables_manager().get_compressor_factory(), s, params, validation_sample);

				                auto ratio_with_future_dict = co_await try_one_compression_config(dict, s, params, validation_sample);

				                res.push_back(make_result(algo_name, chunk_size_kb, "none", level, ratio_with_no_dict));

				                res.push_back(make_result(algo_name, chunk_size_kb, "past", level, ratio_with_past_dict));

				                res.push_back(make_result(algo_name, chunk_size_kb, "future", level, ratio_with_future_dict));

				            }

				        }

				    }

				    co_return res;

				}

				static

				future<json::json_return_type>

				rest_retrain_dict(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client, std::unique_ptr<http::request> req) {

				    if (!ss.local().get_feature_service().sstable_compression_dicts) {

				        apilog.warn("retrain_dict: called before the cluster feature was enabled");

				        throw std::runtime_error("retrain_dict requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");

				    }

				    auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);

				    auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;

				    auto cf = api::req_param<sstring>(*req, "cf", {}).value;

				    apilog.debug("retrain_dict: called with ks={} cf={}", ks, cf);

				    const auto t_id = ctx.db.local().find_column_family(ks, cf).schema()->id();

				    constexpr uint64_t chunk_size = 4096;

				    constexpr uint64_t n_chunks = 4096;

				    auto sample = co_await ss.local().do_sample_sstables(t_id, chunk_size, n_chunks);

				    apilog.debug("retrain_dict: got sample with {} blocks", sample.size());

				    auto dict = co_await ss.local().train_dict(std::move(sample));

				    apilog.debug("retrain_dict: got dict of size {}", dict.size());

				    co_await ss.local().publish_new_sstable_dict(t_id, dict, group0_client);

				    apilog.debug("retrain_dict: published new dict");

				    co_return json_void();

				}

				static

				future<json::json_return_type>

				rest_sstable_info(http_context& ctx, std::unique_ptr<http::request> req) {

				@@ -1509,21 +1529,23 @@ rest_sstable_info(http_context& ctx, std::unique_ptr<http::request> req) {

				                            info.version = sstable->get_version();

				                            if (sstable->has_component(sstables::component_type::CompressionInfo)) {

				                                auto& c = sstable->get_compression();

				                                auto cp = sstables::get_sstable_compressor(c);

				                                const auto& cp = sstable->get_compression().get_compressor();

				                                ss::named_maps nm;

				                                nm.group = "compression_parameters";

				                                for (auto& p : cp->options()) {

				                                for (auto& p : cp.options()) {

				                                    if (compressor::is_hidden_option_name(p.first)) {

				                                        continue;

				                                    }

				                                    ss::mapper e;

				                                    e.key = p.first;

				                                    e.value = p.second;

				                                    nm.attributes.push(std::move(e));

				                                }

				                                if (!cp->options().contains(compression_parameters::SSTABLE_COMPRESSION)) {

				                                if (!cp.options().contains(compression_parameters::SSTABLE_COMPRESSION)) {

				                                    ss::mapper e;

				                                    e.key = compression_parameters::SSTABLE_COMPRESSION;

				                                    e.value = cp->name();

				                                    e.value = sstring(cp.name());

				                                    nm.attributes.push(std::move(e));

				                                }

				                                info.extended_properties.push(std::move(nm));

				@@ -1610,6 +1632,18 @@ rest_raft_topology_upgrade_status(sharded<service::storage_service>& ss, std::un

				        co_return sstring(format("{}", ustate));

				}

				static

				future<json::json_return_type>

				rest_raft_topology_get_cmd_status(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        const auto status = co_await ss.invoke_on(0, [] (auto& ss) {

				            return ss.get_topology_cmd_status();

				        });

				        if (status.active_dst.empty()) {

				            co_return sstring("none");

				        }

				        co_return sstring(fmt::format("{}[{}]: {}", status.current, status.index, fmt::join(status.active_dst, ",")));

				}

				static

				future<json::json_return_type>

				rest_move_tablet(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				@@ -1640,7 +1674,7 @@ rest_add_tablet_replica(http_context& ctx, sharded<service::storage_service>& ss

				        auto token = dht::token::from_int64(validate_int(req->get_query_param("token")));

				        auto ks = req->get_query_param("ks");

				        auto table = req->get_query_param("table");

				        auto table_id = ctx.db.local().find_column_family(ks, table).schema()->id();

				        auto table_id = validate_table(ctx.db.local(), ks, table);

				        auto force_str = req->get_query_param("force");

				        auto force = service::loosen_constraints(force_str == "" ? false : validate_bool(force_str));

				@@ -1659,7 +1693,7 @@ rest_del_tablet_replica(http_context& ctx, sharded<service::storage_service>& ss

				        auto token = dht::token::from_int64(validate_int(req->get_query_param("token")));

				        auto ks = req->get_query_param("ks");

				        auto table = req->get_query_param("table");

				        auto table_id = ctx.db.local().find_column_family(ks, table).schema()->id();

				        auto table_id = validate_table(ctx.db.local(), ks, table);

				        auto force_str = req->get_query_param("force");

				        auto force = service::loosen_constraints(force_str == "" ? false : validate_bool(force_str));

				@@ -1736,6 +1770,7 @@ future<json::json_return_type>

				rest_get_schema_versions(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        return ss.local().describe_schema_versions().then([] (auto result) {

				            std::vector<sp::mapper_list> res;

				            res.reserve(result.size());

				            for (auto e : result) {

				                sp::mapper_list entry;

				                entry.key = std::move(e.first);

				@@ -1767,12 +1802,6 @@ rest_bind(FuncType func, BindArgs&... args) {

				    return std::bind_front(func, std::ref(args)...);

				}

				static

				seastar::httpd::future_json_function

				rest_bind(ks_cf_func func, http_context& ctx) {

				    return wrap_ks_cf(ctx, func);

				}

				void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {

				    ss::get_token_endpoint.set(r, rest_bind(rest_get_token_endpoint, ctx, ss));

				    ss::toppartitions_generic.set(r, rest_bind(rest_toppartitions_generic, ctx));

				@@ -1790,6 +1819,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_

				    ss::force_keyspace_compaction.set(r, rest_bind(rest_force_keyspace_compaction, ctx));

				    ss::force_keyspace_cleanup.set(r, rest_bind(rest_force_keyspace_cleanup, ctx, ss));

				    ss::cleanup_all.set(r, rest_bind(rest_cleanup_all, ctx, ss));

				    ss::reset_cleanup_needed.set(r, rest_bind(rest_reset_cleanup_needed, ctx, ss));

				    ss::perform_keyspace_offstrategy_compaction.set(r, rest_bind(rest_perform_keyspace_offstrategy_compaction, ctx));

				    ss::upgrade_sstables.set(r, rest_bind(rest_upgrade_sstables, ctx));

				    ss::force_flush.set(r, rest_bind(rest_force_flush, ctx));

				@@ -1841,10 +1871,13 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_

				    ss::get_total_hints.set(r, rest_bind(rest_get_total_hints));

				    ss::get_ownership.set(r, rest_bind(rest_get_ownership, ctx, ss));

				    ss::get_effective_ownership.set(r, rest_bind(rest_get_effective_ownership, ctx, ss));

				    ss::retrain_dict.set(r, rest_bind(rest_retrain_dict, ctx, ss, group0_client));

				    ss::estimate_compression_ratios.set(r, rest_bind(rest_estimate_compression_ratios, ctx, ss));

				    ss::sstable_info.set(r, rest_bind(rest_sstable_info, ctx));

				    ss::reload_raft_topology_state.set(r, rest_bind(rest_reload_raft_topology_state, ss, group0_client));

				    ss::upgrade_to_raft_topology.set(r, rest_bind(rest_upgrade_to_raft_topology, ss));

				    ss::raft_topology_upgrade_status.set(r, rest_bind(rest_raft_topology_upgrade_status, ss));

				    ss::raft_topology_get_cmd_status.set(r, rest_bind(rest_raft_topology_get_cmd_status, ss));

				    ss::move_tablet.set(r, rest_bind(rest_move_tablet, ctx, ss));

				    ss::add_tablet_replica.set(r, rest_bind(rest_add_tablet_replica, ctx, ss));

				    ss::del_tablet_replica.set(r, rest_bind(rest_del_tablet_replica, ctx, ss));

				@@ -1871,6 +1904,7 @@ void unset_storage_service(http_context& ctx, routes& r) {

				    ss::force_keyspace_compaction.unset(r);

				    ss::force_keyspace_cleanup.unset(r);

				    ss::cleanup_all.unset(r);

				    ss::reset_cleanup_needed.unset(r);

				    ss::perform_keyspace_offstrategy_compaction.unset(r);

				    ss::upgrade_sstables.unset(r);

				    ss::force_flush.unset(r);

				@@ -1926,6 +1960,7 @@ void unset_storage_service(http_context& ctx, routes& r) {

				    ss::reload_raft_topology_state.unset(r);

				    ss::upgrade_to_raft_topology.unset(r);

				    ss::raft_topology_upgrade_status.unset(r);

				    ss::raft_topology_get_cmd_status.unset(r);

				    ss::move_tablet.unset(r);

				    ss::add_tablet_replica.unset(r);

				    ss::del_tablet_replica.unset(r);

									
										10

api/storage_service.hh
									
				@@ -52,10 +52,11 @@ table_id validate_table(const replica::database& db, sstring ks_name, sstring ta

				// containing the description of the respective no_such_column_family error.

				// Returns a vector of all table infos given by the parameter, or

				// if the parameter is not found or is empty, returns a list of all table infos in the keyspace.

				std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, const std::unordered_map<sstring, sstring>& query_params, sstring param_name);

				std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_context& ctx, sstring value);

				std::pair<sstring, std::vector<table_info>> parse_table_infos(const http_context& ctx, const http::request& req, sstring cf_param_name = "cf");

				struct scrub_info {

				    sstables::compaction_type_options::scrub opts;

				    sstring keyspace;

				@@ -82,4 +83,11 @@ void set_load_meter(http_context& ctx, httpd::routes& r, service::load_meter& lm

				void unset_load_meter(http_context& ctx, httpd::routes& r);

				seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);

				// converts string value of boolean parameter into bool

				// maps (case insensitively)

				//     "true", "yes" and "1" into true

				//     "false", "no" and "0" into false

				// otherwise throws runtime_error

				bool validate_bool_x(const sstring& param, bool default_value);

				} // namespace api
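
The comment above validate_bool_x describes a small parsing contract. The following is a minimal sketch of that contract, not ScyllaDB's implementation. The function name parse_bool_or_default is hypothetical, and returning default_value for an empty parameter is an assumption inferred from the call sites that pass req->get_query_param(...) directly; the diff does not show the real function body.

```cpp
#include <algorithm>
#include <cctype>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for validate_bool_x; std::string replaces sstring
// so the sketch compiles without Seastar.
// Assumption: an empty parameter (not supplied in the request) yields default_value.
bool parse_bool_or_default(const std::string& param, bool default_value) {
    if (param.empty()) {
        return default_value;
    }
    // Lower-case a copy so the comparison is case-insensitive.
    std::string lowered(param.size(), '\0');
    std::transform(param.begin(), param.end(), lowered.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (lowered == "true" || lowered == "yes" || lowered == "1") {
        return true;
    }
    if (lowered == "false" || lowered == "no" || lowered == "0") {
        return false;
    }
    // Anything else is rejected, matching "otherwise throws runtime_error".
    throw std::runtime_error("Invalid boolean parameter value: " + param);
}
```

Under these assumptions, a missing flush_memtables parameter would fall back to true and a missing consider_only_existing_data parameter to false, which matches the defaults passed in force_keyspace_compaction.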

									
										96

api/tasks.cc
									
				@@ -31,51 +31,70 @@ using ks_cf_func = std::function<future<json::json_return_type>(http_context&, s

				static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {

				    return [&ctx, f = std::move(f)](std::unique_ptr<http::request> req) {

				        auto keyspace = validate_keyspace(ctx, req);

				        auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");

				        auto [keyspace, table_infos] = parse_table_infos(ctx, *req);

				        return f(ctx, std::move(req), std::move(keyspace), std::move(table_infos));

				    };

				}

				future<tasks::task_manager::task_ptr> force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request> req) {

				    auto& db = ctx.db;

				    auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");

				    auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);

				    auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);

				    apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);

				    auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				    std::optional<compaction::flush_mode> fmopt;

				    if (!flush && !consider_only_existing_data) {

				        fmopt = compaction::flush_mode::skip;

				    }

				    return compaction_module.make_and_start_task<compaction::major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);

				}

				future<tasks::task_manager::task_ptr> upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) {

				    auto& db = ctx.db;

				    bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);

				    apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);

				    auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				    return compaction_module.make_and_start_task<compaction::upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);

				}

				future<tasks::task_manager::task_ptr> force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				    auto& db = ctx.db;

				    auto [keyspace, table_infos] = parse_table_infos(ctx, *req);

				    const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();

				    if (rs.get_type() == locator::replication_strategy_type::local || !rs.is_vnode_based()) {

				        auto reason = rs.get_type() == locator::replication_strategy_type::local ? "require" : "support";

				        apilog.info("Keyspace {} does not {} cleanup", keyspace, reason);

				        co_return nullptr;

				    }

				    apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);

				    if (!co_await ss.local().is_cleanup_allowed(keyspace)) {

				        auto msg = "Can not perform cleanup operation when topology changes";

				        apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);

				        co_await coroutine::return_exception(std::runtime_error(msg));

				    }

				    auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				    co_return co_await compaction_module.make_and_start_task<compaction::cleanup_keyspace_compaction_task_impl>(

				        {}, std::move(keyspace), db, table_infos, compaction::flush_mode::all_tables, tasks::is_user_task::yes);

				}

				void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& snap_ctl) {

				    t::force_keyspace_compaction_async.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        auto params = req_params({

				            std::pair("keyspace", mandatory::yes),

				            std::pair("cf", mandatory::no),

				            std::pair("flush_memtables", mandatory::no),

				        });

				        params.process(*req);

				        auto keyspace = validate_keyspace(ctx, *params.get("keyspace"));

				        auto table_infos = parse_table_infos(keyspace, ctx, params.get("cf").value_or(""));

				        auto flush = params.get_as<bool>("flush_memtables").value_or(true);

				        apilog.debug("force_keyspace_compaction_async: keyspace={} tables={}, flush={}", keyspace, table_infos, flush);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        std::optional<flush_mode> fmopt;

				        if (!flush) {

				            fmopt = flush_mode::skip;

				        }

				        auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt);

				        auto task = co_await force_keyspace_compaction(ctx, std::move(req));

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				    });

				    t::force_keyspace_cleanup_async.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        auto keyspace = validate_keyspace(ctx, req);

				        auto table_infos = parse_table_infos(keyspace, ctx, req->query_parameters, "cf");

				        apilog.info("force_keyspace_cleanup_async: keyspace={} tables={}", keyspace, table_infos);

				        if (!co_await ss.local().is_cleanup_allowed(keyspace)) {

				            auto msg = "Can not perform cleanup operation when topology changes";

				            apilog.warn("force_keyspace_cleanup_async: keyspace={} tables={}: {}", keyspace, table_infos, msg);

				            co_await coroutine::return_exception(std::runtime_error(msg));

				        tasks::task_id id = tasks::task_id::create_null_id();

				        auto task = co_await force_keyspace_cleanup(ctx, ss, std::move(req));

				        if (task) {

				            id = task->get_status().id;

				        }

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<cleanup_keyspace_compaction_task_impl>({}, std::move(keyspace), db, table_infos, flush_mode::all_tables, tasks::is_user_task::yes);

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				        co_return json::json_return_type(id.to_sstring());

				    });

				    t::perform_keyspace_offstrategy_compaction_async.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {

				@@ -87,14 +106,7 @@ void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::

				    }));

				    t::upgrade_sstables_async.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);

				        apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);

				        auto task = co_await upgrade_sstables(ctx, std::move(req), std::move(keyspace), std::move(table_infos));

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				    }));

									
										8

api/tasks.hh
									
												
				@@ -15,6 +15,10 @@ namespace seastar::httpd {

				class routes;

				}

				namespace seastar::http {

				struct request;

				}

				namespace service {

				class storage_service;

				}

				@@ -25,4 +29,8 @@ struct http_context;

				void set_tasks_compaction_module(http_context& ctx, httpd::routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& snap_ctl);

				void unset_tasks_compaction_module(http_context& ctx, httpd::routes& r);

				future<tasks::task_manager::task_ptr> force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request> req);

				future<tasks::task_manager::task_ptr> force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req);

				future<tasks::task_manager::task_ptr> upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos);

				}

									
										22

api/token_metadata.cc
									
												
				@@ -54,12 +54,12 @@ void set_token_metadata(http_context& ctx, routes& r, sharded<locator::shared_to

				        for (const auto host_id: leaving_host_ids) {

				            eps.insert(g.local().get_address_map().get(host_id));

				        }

				        return container_to_vec(eps);

				        return eps | std::views::transform([] (auto& i) { return fmt::to_string(i); }) | std::ranges::to<std::vector>();

				    });

				    ss::get_moving_nodes.set(r, [](const_req req) {

				        std::unordered_set<sstring> addr;

				        return container_to_vec(addr);

				        return addr | std::ranges::to<std::vector>();

				    });

				    ss::get_joining_nodes.set(r, [&tm, &g](const_req req) {

				@@ -70,15 +70,21 @@ void set_token_metadata(http_context& ctx, routes& r, sharded<locator::shared_to

				        for (const auto& [token, host_id]: points) {

				            eps.insert(g.local().get_address_map().get(host_id));

				        }

				        return container_to_vec(eps);

				        return eps | std::views::transform([] (auto& i) { return fmt::to_string(i); }) | std::ranges::to<std::vector>();

				    });

				    ss::get_host_id_map.set(r, [&tm, &g](const_req req) {

				        std::vector<ss::mapper> res;

				        auto map = tm.local().get()->get_host_ids() |

				            std::views::transform([&g] (locator::host_id id) { return std::make_pair(g.local().get_address_map().get(id), id); }) |

				            std::ranges::to<std::unordered_map>();

				        return map_to_key_value(std::move(map), res);

				        if (!g.local().is_enabled()) {

				            throw std::runtime_error("The gossiper is not ready yet");

				        }

				        return tm.local().get()->get_host_ids()

				            | std::views::transform([&g] (locator::host_id id) {

				                ss::mapper m;

				                m.key = fmt::to_string(g.local().get_address_map().get(id));

				                m.value = fmt::to_string(id);

				                return m;

				            })

				            | std::ranges::to<std::vector<ss::mapper>>();

				    });

				    static auto host_or_broadcast = [&tm](const_req req) {

									
										9

audit/audit.cc
									
												
				@@ -209,6 +209,11 @@ future<> audit::log(const audit_info* audit_info, service::query_state& query_st

				    static const sstring anonymous_username("anonymous");

				    const sstring& username = client_state.user() ? client_state.user()->name.value_or(anonymous_username) : no_username;

				    socket_address client_ip = client_state.get_client_address().addr();

				    if (logger.is_enabled(logging::log_level::debug)) {

				        logger.debug("Log written: node_ip {} category {} cl {} error {} keyspace {} query '{}' client_ip {} table {} username {}",

				            node_ip, audit_info->category_string(), cl, error, audit_info->keyspace(),

				            audit_info->query(), client_ip, audit_info->table(), username);

				    }

				    return futurize_invoke(std::mem_fn(&storage_helper::write), _storage_helper_ptr, audit_info, node_ip, client_ip, cl, username, error)

				        .handle_exception([audit_info, node_ip, client_ip, cl, username, error] (auto ep) {

				            logger.error("Unexpected exception when writing log with: node_ip {} category {} cl {} error {} keyspace {} query '{}' client_ip {} table {} username {} exception {}",

				@@ -219,6 +224,10 @@ future<> audit::log(const audit_info* audit_info, service::query_state& query_st

				future<> audit::log_login(const sstring& username, socket_address client_ip, bool error) noexcept {

				    socket_address node_ip = _token_metadata.get()->get_topology().my_address().addr();

				    if (logger.is_enabled(logging::log_level::debug)) {

				        logger.debug("Login log written: node_ip {}, client_ip {}, username {}, error {}",

				            node_ip, client_ip, username, error ? "true" : "false");

				    }

				    return futurize_invoke(std::mem_fn(&storage_helper::write_login), _storage_helper_ptr, username, node_ip, client_ip, error)

				        .handle_exception([username, node_ip, client_ip, error] (auto ep) {

				            logger.error("Unexpected exception when writing login log with: node_ip {} client_ip {} username {} error {} exception {}",

									
										63

audit/audit_syslog_storage_helper.cc
									
												
				@@ -33,20 +33,6 @@ namespace audit {

				namespace {

				future<> syslog_send_helper(net::datagram_channel& sender,

				                            const socket_address& address,

				                            const sstring& msg) {

				    return sender.send(address, net::packet{msg.data(), msg.size()}).handle_exception([address](auto&& exception_ptr) {

				        auto error_msg = seastar::format(

				            "Syslog audit backend failed (sending a message to {} resulted in {}).",

				            address,

				            exception_ptr

				        );

				        logger.error("{}", error_msg);

				        throw audit_exception(std::move(error_msg));

				    });

				}

				static auto syslog_address_helper(const db::config& cfg)

				{

				    return cfg.audit_unix_socket_path.is_set()

				@@ -54,11 +40,40 @@ static auto syslog_address_helper(const db::config& cfg)

				        : unix_domain_addr(_PATH_LOG);

				}

				static std::string json_escape(std::string_view str) {

				    std::string result;

				    result.reserve(str.size() * 1.2);

				    for (auto c : str) {

				        if (c == '"' || c == '\\') {

				            result.push_back('\\');

				        }

				        result.push_back(c);

				    }

				    return result;

				}

				}

				future<> audit_syslog_storage_helper::syslog_send_helper(const sstring& msg) {

				    try {

				        auto lock = co_await get_units(_semaphore, 1, std::chrono::hours(1));

				        co_await _sender.send(_syslog_address, net::packet{msg.data(), msg.size()});

				    }

				    catch (const std::exception& e) {

				        auto error_msg = seastar::format(

				            "Syslog audit backend failed (sending a message to {} resulted in {}).",

				            _syslog_address,

				            e

				        );

				        logger.error("{}", error_msg);

				        throw audit_exception(std::move(error_msg));

				    }

				}

				audit_syslog_storage_helper::audit_syslog_storage_helper(cql3::query_processor& qp, service::migration_manager&) :

				    _syslog_address(syslog_address_helper(qp.db().get_config())),

				    _sender(make_unbound_datagram_channel(AF_UNIX)) {

				    _sender(make_unbound_datagram_channel(AF_UNIX)),

				    _semaphore(1) {

				}

				audit_syslog_storage_helper::~audit_syslog_storage_helper() {

				@@ -73,10 +88,10 @@ audit_syslog_storage_helper::~audit_syslog_storage_helper() {

				 */

				future<> audit_syslog_storage_helper::start(const db::config& cfg) {

				    if (this_shard_id() != 0) {

				        return make_ready_future();

				        co_return;

				    }

				    return syslog_send_helper(_sender, _syslog_address, "Initializing syslog audit backend.");

				    co_await syslog_send_helper("Initializing syslog audit backend.");

				}

				future<> audit_syslog_storage_helper::stop() {

				@@ -93,7 +108,7 @@ future<> audit_syslog_storage_helper::write(const audit_info* audit_info,

				    auto now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());

				    tm time;

				    localtime_r(&now, &time);

				    sstring msg = seastar::format("<{}>{:%h %e %T} scylla-audit: \"{}\", \"{}\", \"{}\", \"{}\", \"{}\", \"{}\", \"{}\", \"{}\", \"{}\"",

				    sstring msg = seastar::format(R"(<{}>{:%h %e %T} scylla-audit: node="{}" category="{}" cl="{}" error="{}" keyspace="{}" query="{}" client_ip="{}" table="{}" username="{}")",

				                                    LOG_NOTICE | LOG_USER,

				                                    time,

				                                    node_ip,

				@@ -101,12 +116,12 @@ future<> audit_syslog_storage_helper::write(const audit_info* audit_info,

				                                    cl,

				                                    (error ? "true" : "false"),

				                                    audit_info->keyspace(),

				                                    audit_info->query(),

				                                    json_escape(audit_info->query()),

				                                    client_ip,

				                                    audit_info->table(),

				                                    username);

				    return syslog_send_helper(_sender, _syslog_address, msg);

				    co_await syslog_send_helper(msg);

				}

				future<> audit_syslog_storage_helper::write_login(const sstring& username,

				@@ -117,15 +132,15 @@ future<> audit_syslog_storage_helper::write_login(const sstring& username,

				    auto now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());

				    tm time;

				    localtime_r(&now, &time);

				    sstring msg = seastar::format("<{}>{:%h %e %T} scylla-audit: \"{}\", \"AUTH\", \"\", \"\", \"\", \"\", \"{}\", \"{}\", \"{}\"",

				    sstring msg = seastar::format(R"(<{}>{:%h %e %T} scylla-audit: node="{}", category="AUTH", cl="", error="{}", keyspace="", query="", client_ip="{}", table="", username="{}")",

				                                    LOG_NOTICE | LOG_USER,

				                                    time,

				                                    node_ip,

				                                    (error ? "true" : "false"),

				                                    client_ip,

				                                    username,

				                                    (error ? "true" : "false"));

				                                    username);

				    co_await syslog_send_helper(_sender, _syslog_address, msg.c_str());

				    co_await syslog_send_helper(msg.c_str());

				}

				using registry = class_registrator<storage_helper, audit_syslog_storage_helper, cql3::query_processor&, service::migration_manager&>;
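
Review note on `json_escape` in this file: the helper added by the patch escapes only `"` and `\`, so a query containing a newline or another control character would still reach syslog unescaped. Below is a minimal sketch of a stricter variant, assuming that keeping each audit record on one line is the goal. The name `escape_for_syslog` is hypothetical and not part of the patch; it uses only the standard library.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper (not part of the patch): escapes double quotes,
// backslashes and ASCII control characters so the value can be embedded
// in a key="value" field on a single line.
std::string escape_for_syslog(std::string_view str) {
    std::string result;
    result.reserve(str.size() + str.size() / 4);
    for (unsigned char c : str) {
        switch (c) {
        case '"':  result += "\\\""; break;
        case '\\': result += "\\\\"; break;
        case '\n': result += "\\n";  break;
        case '\r': result += "\\r";  break;
        case '\t': result += "\\t";  break;
        default:
            if (c < 0x20) {
                // Remaining control characters become \u00XX.
                char buf[7];
                std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
                result += buf;
            } else {
                result.push_back(static_cast<char>(c));
            }
        }
    }
    return result;
}
```

Bytes at or above 0x20, including multi-byte UTF-8 sequences, are passed through unchanged, as in the patch's helper.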

									
										3

audit/audit_syslog_storage_helper.hh
									
												
				@@ -24,6 +24,9 @@ namespace audit {

				class audit_syslog_storage_helper : public storage_helper {

				    socket_address _syslog_address;

				    net::datagram_channel _sender;

				    seastar::semaphore _semaphore;

				    future<> syslog_send_helper(const sstring& msg);

				public:

				    explicit audit_syslog_storage_helper(cql3::query_processor&, service::migration_manager&);

				    virtual ~audit_syslog_storage_helper();

									
										4

auth/allow_all_authenticator.cc
									
												
				@@ -9,6 +9,7 @@

				#include "auth/allow_all_authenticator.hh"

				#include "service/migration_manager.hh"

				#include "utils/alien_worker.hh"

				#include "utils/class_registrator.hh"

				namespace auth {

				@@ -21,6 +22,7 @@ static const class_registrator<

				        allow_all_authenticator,

				        cql3::query_processor&,

				        ::service::raft_group0_client&,

				        ::service::migration_manager&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");

				        ::service::migration_manager&,

				        utils::alien_worker&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");

				}

									
										3

auth/allow_all_authenticator.hh
									
												
				@@ -13,6 +13,7 @@

				#include "auth/authenticated_user.hh"

				#include "auth/authenticator.hh"

				#include "auth/common.hh"

				#include "utils/alien_worker.hh"

				namespace cql3 {

				class query_processor;

				@@ -28,7 +29,7 @@ extern const std::string_view allow_all_authenticator_name;

				class allow_all_authenticator final : public authenticator {

				public:

				    allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&) {

				    allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&) {

				    }

				    virtual future<> start() override {

									
										5

auth/certificate_authenticator.cc
									
												
				@@ -33,13 +33,14 @@ static const class_registrator<auth::authenticator

				    , auth::certificate_authenticator

				    , cql3::query_processor&

				    , ::service::raft_group0_client&

				    , ::service::migration_manager&> cert_auth_reg(CERT_AUTH_NAME);

				    , ::service::migration_manager&

				    , utils::alien_worker&> cert_auth_reg(CERT_AUTH_NAME);

				enum class auth::certificate_authenticator::query_source {

				    subject, altname

				};

				auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&)

				auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&)

				    : _queries([&] {

				        auto& conf = qp.db().get_config();

				        auto queries = conf.auth_certificate_role_queries();

									
										3

auth/certificate_authenticator.hh
									
												
				@@ -10,6 +10,7 @@

				#pragma once

				#include "auth/authenticator.hh"

				#include "utils/alien_worker.hh"

				#include <boost/regex_fwd.hpp>  // IWYU pragma: keep

				namespace cql3 {

				@@ -31,7 +32,7 @@ class certificate_authenticator : public authenticator {

				    enum class query_source;

				    std::vector<std::pair<query_source, boost::regex>> _queries;

				public:

				    certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);

				    certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);

				    ~certificate_authenticator();

				    future<> start() override;

									
										5

auth/common.cc
									
												
				@@ -119,6 +119,11 @@ future<> create_legacy_metadata_table_if_missing(

				    return qs;

				}

				::service::raft_timeout get_raft_timeout() noexcept {

				    auto dur = internal_distributed_query_state().get_client_state().get_timeout_config().other_timeout;

				    return ::service::raft_timeout{.value = lowres_clock::now() + dur};

				}

				static future<> announce_mutations_with_guard(

				        ::service::raft_group0_client& group0_client,

				        std::vector<canonical_mutation> muts,

									
										3

auth/common.hh
									
												
				@@ -17,6 +17,7 @@

				#include "types/types.hh"

				#include "service/raft/raft_group0_client.hh"

				#include "timeout_config.hh"

				using namespace std::chrono_literals;

				@@ -77,6 +78,8 @@ future<> create_legacy_metadata_table_if_missing(

				///

				::service::query_state& internal_distributed_query_state() noexcept;

				::service::raft_timeout get_raft_timeout() noexcept;

				// Execute update query via group0 mechanism, mutations will be applied on all nodes.

				// Use this function when need to perform read before write on a single guard or if

				// you have more than one mutation and potentially exceed single command size limit.

									
										19

auth/ldap_role_manager.cc
									
												
				@@ -233,9 +233,9 @@ future<role_set> ldap_role_manager::query_granted(std::string_view grantee_name,

				}

				future<role_to_directly_granted_map>

				ldap_role_manager::query_all_directly_granted() {

				ldap_role_manager::query_all_directly_granted(::service::query_state& qs) {

				    role_to_directly_granted_map result;

				    auto roles = co_await query_all();

				    auto roles = co_await query_all(qs);

				    for (auto& role: roles) {

				        auto granted_set = co_await query_granted(role, recursive_role_query::no);

				        for (auto& granted: granted_set) {

				@@ -247,8 +247,8 @@ ldap_role_manager::query_all_directly_granted() {

				    co_return result;

				}

				future<role_set> ldap_role_manager::query_all() {

				    return _std_mgr.query_all();

				future<role_set> ldap_role_manager::query_all(::service::query_state& qs) {

				    return _std_mgr.query_all(qs);

				}

				future<> ldap_role_manager::create_role(std::string_view role_name) {

				@@ -311,12 +311,12 @@ future<bool> ldap_role_manager::can_login(std::string_view role_name) {

				}

				future<std::optional<sstring>> ldap_role_manager::get_attribute(

				        std::string_view role_name, std::string_view attribute_name) {

				    return _std_mgr.get_attribute(role_name, attribute_name);

				        std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {

				    return _std_mgr.get_attribute(role_name, attribute_name, qs);

				}

				future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name) {

				    return _std_mgr.query_attribute_for_all(attribute_name);

				future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) {

				    return _std_mgr.query_attribute_for_all(attribute_name, qs);

				}

				future<> ldap_role_manager::set_attribute(

				@@ -338,8 +338,7 @@ future<std::vector<cql3::description>> ldap_role_manager::describe_role_grants()

				}

				future<> ldap_role_manager::ensure_superuser_is_created() {

				    // ldap is responsible for users

				    co_return;

				    return _std_mgr.ensure_superuser_is_created();

				}

				} // namespace auth

									
										8

auth/ldap_role_manager.hh
									
												
				@@ -75,9 +75,9 @@ class ldap_role_manager : public role_manager {

				    future<role_set> query_granted(std::string_view, recursive_role_query) override;

				    future<role_to_directly_granted_map> query_all_directly_granted() override;

				    future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;

				    future<role_set> query_all() override;

				    future<role_set> query_all(::service::query_state&) override;

				    future<bool> exists(std::string_view) override;

				@@ -85,9 +85,9 @@ class ldap_role_manager : public role_manager {

				    future<bool> can_login(std::string_view) override;

				    future<std::optional<sstring>> get_attribute(std::string_view, std::string_view) override;

				    future<std::optional<sstring>> get_attribute(std::string_view, std::string_view, ::service::query_state&) override;

				    future<role_manager::attribute_vals> query_attribute_for_all(std::string_view) override;

				    future<role_manager::attribute_vals> query_attribute_for_all(std::string_view, ::service::query_state&) override;

				    future<> set_attribute(std::string_view, std::string_view, std::string_view, ::service::group0_batch& mc) override;

									
										8

auth/maintenance_socket_role_manager.cc
									
												
				@@ -78,11 +78,11 @@ future<role_set> maintenance_socket_role_manager::query_granted(std::string_view

				    return operation_not_supported_exception<role_set>("QUERY GRANTED");

				}

				future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted() {

				future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted(::service::query_state&) {

				    return operation_not_supported_exception<role_to_directly_granted_map>("QUERY ALL DIRECTLY GRANTED");

				}

				future<role_set> maintenance_socket_role_manager::query_all() {

				future<role_set> maintenance_socket_role_manager::query_all(::service::query_state&) {

				    return operation_not_supported_exception<role_set>("QUERY ALL");

				}

				@@ -98,11 +98,11 @@ future<bool> maintenance_socket_role_manager::can_login(std::string_view role_na

				    return make_ready_future<bool>(true);

				}

				future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {

				future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) {

				    return operation_not_supported_exception<std::optional<sstring>>("GET ATTRIBUTE");

				}

				future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name) {

				future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) {

				    return operation_not_supported_exception<role_manager::attribute_vals>("QUERY ATTRIBUTE");

				}

									
										8

auth/maintenance_socket_role_manager.hh
									
												
				@@ -53,9 +53,9 @@ public:

				    virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;

				    virtual future<role_to_directly_granted_map> query_all_directly_granted() override;

				    virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;

				    virtual future<role_set> query_all() override;

				    virtual future<role_set> query_all(::service::query_state&) override;

				    virtual future<bool> exists(std::string_view role_name) override;

				@@ -63,9 +63,9 @@ public:

				    virtual future<bool> can_login(std::string_view role_name) override;

				    virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;

				    virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;

				    virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;

				    virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;

				    virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;

									
										108

auth/password_authenticator.cc
									
												
				@@ -48,14 +48,14 @@ static const class_registrator<

				        password_authenticator,

				        cql3::query_processor&,

				        ::service::raft_group0_client&,

				        ::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");

				        ::service::migration_manager&,

				        utils::alien_worker&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");

				static thread_local auto rng_for_salt = std::default_random_engine(std::random_device{}());

				static std::string_view get_config_value(std::string_view value, std::string_view def) {

				    return value.empty() ? def : value;

				}

				std::string password_authenticator::default_superuser(const db::config& cfg) {

				    return std::string(get_config_value(cfg.auth_superuser_name(), DEFAULT_USER_NAME));

				}

				@@ -63,12 +63,13 @@ std::string password_authenticator::default_superuser(const db::config& cfg) {

				password_authenticator::~password_authenticator() {

				}

				password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)

				password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, utils::alien_worker& hashing_worker)

				    : _qp(qp)

				    , _group0_client(g0)

				    , _migration_manager(mm)

				    , _stopped(make_ready_future<>()) 

				    , _superuser(default_superuser(qp.db().get_config()))

				    , _hashing_worker(hashing_worker)

				{}

				static bool has_salted_hash(const cql3::untyped_result_set_row& row) {

				@@ -117,33 +118,95 @@ future<> password_authenticator::migrate_legacy_metadata() const {

				    });

				}

				future<> password_authenticator::create_default_if_missing() {

				future<> password_authenticator::legacy_create_default_if_missing() {

				    SCYLLA_ASSERT(legacy_mode(_qp));

				    const auto exists = co_await default_role_row_satisfies(_qp, &has_salted_hash, _superuser);

				    if (exists) {

				        co_return;

				    }

				    std::string salted_pwd(get_config_value(_qp.db().get_config().auth_superuser_salted_password(), ""));

				    if (salted_pwd.empty()) {

				        salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt);

				        salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt, _scheme);

				    }

				    const auto query = update_row_query();

				    if (legacy_mode(_qp)) {

				        co_await _qp.execute_internal(

				    co_await _qp.execute_internal(

				            query,

				            db::consistency_level::QUORUM,

				            internal_distributed_query_state(),

				            {salted_pwd, _superuser},

				            cql3::query_processor::cache_internal::no);

				        plogger.info("Created default superuser authentication record.");

				    } else {

				        co_await announce_mutations(_qp, _group0_client, query,

				            {salted_pwd, _superuser}, _as, ::service::raft_timeout{});

				        plogger.info("Created default superuser authentication record.");

				    plogger.info("Created default superuser authentication record.");

				}

				future<> password_authenticator::maybe_create_default_password() {

				    auto needs_password = [this] () -> future<bool> {

				        const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", get_auth_ks_name(_qp), meta::roles_table::name);

				        auto results = co_await _qp.execute_internal(query,

				                db::consistency_level::LOCAL_ONE,

				                internal_distributed_query_state(), cql3::query_processor::cache_internal::yes);

				        // Don't add default password if

				        // - there is no default superuser

				        // - there is a superuser with a password.

				        bool has_default = false;

				        bool has_superuser_with_password = false;

				        for (auto& result : *results) {

				            if (result.get_as<sstring>(meta::roles_table::role_col_name) == _superuser) {

				                has_default = true;

				            }

				            if (has_salted_hash(result)) {

				                has_superuser_with_password = true;

				            }

				        }

				        co_return has_default && !has_superuser_with_password;

				    };

				    if (!co_await needs_password()) {

				        co_return;

				    }

				    // We don't want to start operation earlier to avoid quorum requirement in

				    // a common case.

				    ::service::group0_batch batch(

				            co_await _group0_client.start_operation(_as, get_raft_timeout()));

				    // Check again as the state may have changed before we took the guard (batch).

				    if (!co_await needs_password()) {

				        co_return;

				    }

				    // Set default superuser's password.

				    std::string salted_pwd(get_config_value(_qp.db().get_config().auth_superuser_salted_password(), ""));

				    if (salted_pwd.empty()) {

				        salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt, _scheme);

				    }

				    const auto update_query = update_row_query();

				    co_await collect_mutations(_qp, batch, update_query, {salted_pwd, _superuser});

				    co_await std::move(batch).commit(_group0_client, _as, get_raft_timeout());

				    plogger.info("Created default superuser authentication record.");

				}

				future<> password_authenticator::maybe_create_default_password_with_retries() {

				    size_t retries = _migration_manager.get_concurrent_ddl_retries();

				    while (true)  {

				        try {

				            co_return co_await maybe_create_default_password();

				        } catch (const ::service::group0_concurrent_modification& ex) {

				            plogger.warn("Failed to execute maybe_create_default_password due to guard conflict.{}.", retries ? " Retrying" : " Number of retries exceeded, giving up");

				            if (retries--) {

				                continue;

				            }

				            // Log error but don't crash the whole node startup sequence.

				            plogger.error("Failed to create default superuser password due to guard conflict.");

				            co_return;

				        } catch (const ::service::raft_operation_timeout_error& ex) {

				            plogger.error("Failed to create default superuser password due to exception: {}", ex.what());

				            co_return;

				        }

				    }

				}
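
Review note on `maybe_create_default_password_with_retries`: the control flow is a bounded retry on a conflict exception that gives up quietly rather than failing startup. A minimal standalone sketch of that pattern follows, using plain synchronous C++ instead of Seastar coroutines. `concurrent_modification` and `run_with_retries` are hypothetical stand-ins for illustration, not ScyllaDB APIs.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for a "guard conflict" exception type.
struct concurrent_modification : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Runs `op`; on a conflict, retries up to `retries` more times.
// Returns true on success and false once the retry budget is exhausted,
// so the caller can log the failure and continue instead of aborting.
bool run_with_retries(const std::function<void()>& op, std::size_t retries) {
    while (true) {
        try {
            op();
            return true;
        } catch (const concurrent_modification&) {
            if (retries-- > 0) {
                continue;
            }
            return false;
        }
    }
}
```

With a budget of `retries`, `op` is attempted at most `retries + 1` times. Any other exception type propagates to the caller unchanged.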

				future<> password_authenticator::start() {

				    return once_among_shards([this] {

				        // Verify that at least one hashing scheme is supported.

				        passwords::detail::verify_scheme(_scheme);

				        plogger.info("Using password hashing scheme: {}", passwords::detail::prefix_for_scheme(_scheme));

				        _stopped = do_after_system_ready(_as, [this] {

				            return async([this] {

				                if (legacy_mode(_qp)) {

				@@ -164,11 +227,14 @@ future<> password_authenticator::start() {

				                        migrate_legacy_metadata().get();

				                        return;

				                    }

				                    legacy_create_default_if_missing().get();

				                }

				                utils::get_local_injector().inject("password_authenticator_start_pause", utils::wait_for_message(5min)).get();

				                create_default_if_missing().get();

				                if (!legacy_mode(_qp)) {

				                    _superuser_created_promise.set_value();

				                    maybe_create_default_password_with_retries().get();

				                    if (!_superuser_created_promise.available()) {

				                        _superuser_created_promise.set_value();

				                    }

				                }

				            });

				        });

				@@ -228,7 +294,13 @@ future<authenticated_user> password_authenticator::authenticate(

				    try {

				        const std::optional<sstring> salted_hash = co_await get_password_hash(username);

				        if (!salted_hash || !passwords::check(password, *salted_hash)) {

				        if (!salted_hash) {

				            throw exceptions::authentication_exception("Username and/or password are incorrect");

				        }

				        const bool password_match = co_await _hashing_worker.submit<bool>([password = std::move(password), salted_hash = std::move(salted_hash)]{

				            return passwords::check(password, *salted_hash);

				        });

				        if (!password_match) {

				            throw exceptions::authentication_exception("Username and/or password are incorrect");

				        }

				        co_return username;

				@@ -252,7 +324,7 @@ future<> password_authenticator::create(std::string_view role_name, const authen

				    auto maybe_hash = options.credentials.transform([&] (const auto& creds) -> sstring {

				        return std::visit(make_visitor(

				                [&] (const password_option& opt) {

				                    return passwords::hash(opt.password, rng_for_salt);

				                    return passwords::hash(opt.password, rng_for_salt, _scheme);

				                },

				                [] (const hashed_password_option& opt) {

				                    return opt.hashed_password;

				@@ -295,11 +367,11 @@ future<> password_authenticator::alter(std::string_view role_name, const authent

				                query,

				                consistency_for_user(role_name),

				                internal_distributed_query_state(),

				                {passwords::hash(password, rng_for_salt), sstring(role_name)},

				                {passwords::hash(password, rng_for_salt, _scheme), sstring(role_name)},

				                cql3::query_processor::cache_internal::no).discard_result();

				    } else {

				        co_await collect_mutations(_qp, mc, query,

				                {passwords::hash(password, rng_for_salt), sstring(role_name)});

				                {passwords::hash(password, rng_for_salt, _scheme), sstring(role_name)});

				    }

				}
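
The bounded-retry loop in `maybe_create_default_password_with_retries()` above can be sketched outside Seastar roughly as follows. This is a minimal illustration and not the ScyllaDB implementation: `concurrent_modification`, `outcome`, and `run_with_retries` are hypothetical names, and plain exceptions stand in for coroutines and `group0_concurrent_modification`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for ::service::group0_concurrent_modification.
struct concurrent_modification : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical result type: the sketch reports failure instead of logging it.
enum class outcome { done, gave_up };

// Runs `op`, retrying only when it throws concurrent_modification.
// At most `retries` additional attempts are made after the first one.
outcome run_with_retries(const std::function<void()>& op, std::size_t retries) {
    while (true) {
        try {
            op();
            return outcome::done;
        } catch (const concurrent_modification&) {
            if (retries == 0) {
                // Retry budget exhausted: give up without propagating,
                // so the caller's startup sequence is not aborted.
                return outcome::gave_up;
            }
            --retries;
        }
    }
}
```

Any other exception type propagates to the caller unchanged. The diff additionally catches `raft_operation_timeout_error` and returns without retrying.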

									
										14

auth/password_authenticator.hh
									
												
				@@ -15,7 +15,9 @@

				#include "db/consistency_level_type.hh"

				#include "auth/authenticator.hh"

				#include "auth/passwords.hh"

				#include "service/raft/raft_group0_client.hh"

				#include "utils/alien_worker.hh"

				namespace db {

				    class config;

				@@ -41,14 +43,17 @@ class password_authenticator : public authenticator {

				    ::service::migration_manager& _migration_manager;

				    future<> _stopped;

				    abort_source _as;

				    std::string _superuser;

				    std::string _superuser; // default superuser name from the config (may or may not be present in roles table)

				    shared_promise<> _superuser_created_promise;

				    // We used to also support bcrypt, SHA-256, and MD5 (ref. scylladb#24524).

				    constexpr static auth::passwords::scheme _scheme = passwords::scheme::sha_512;

				    utils::alien_worker& _hashing_worker;

				public:

				    static db::consistency_level consistency_for_user(std::string_view role_name);

				    static std::string default_superuser(const db::config&);

				    password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);

				    password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);

				    ~password_authenticator();

				@@ -89,7 +94,10 @@ private:

				    future<> migrate_legacy_metadata() const;

				    future<> create_default_if_missing();

				    future<> legacy_create_default_if_missing();

				    future<> maybe_create_default_password();

				    future<> maybe_create_default_password_with_retries();

				    sstring update_row_query() const;

				};

									
										14

auth/passwords.cc
									
												
				@@ -21,18 +21,14 @@ static thread_local crypt_data tlcrypt = {};

				namespace detail {

				scheme identify_best_supported_scheme() {

				    const auto all_schemes = { scheme::bcrypt_y, scheme::bcrypt_a, scheme::sha_512, scheme::sha_256, scheme::md5 };

				    // "Random", for testing schemes.

				void verify_scheme(scheme scheme) {

				    const sstring random_part_of_salt = "aaaabbbbccccdddd";

				    for (scheme c : all_schemes) {

				        const sstring salt = sstring(prefix_for_scheme(c)) + random_part_of_salt;

				        const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);

				    const sstring salt = sstring(prefix_for_scheme(scheme)) + random_part_of_salt;

				    const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);

				        if (e && (e[0] != '*')) {

				            return c;

				        }

				    if (e && (e[0] != '*')) {

				        return;

				    }

				    throw no_supported_schemes();

									
										20

auth/passwords.hh
									
												
				@@ -21,10 +21,11 @@ class no_supported_schemes : public std::runtime_error {

				public:

				    no_supported_schemes();

				};

				///

				/// Apache Cassandra uses a library to provide the bcrypt scheme. Many Linux implementations do not support bcrypt, so

				/// we support alternatives. The cost is loss of direct compatibility with Apache Cassandra system tables.

				/// Apache Cassandra uses a library to provide the bcrypt scheme. In ScyllaDB, we use SHA-512

				/// instead of bcrypt for performance and for historical reasons (see scylladb#24524).

				/// Currently, SHA-512 is always chosen as the hashing scheme for new passwords, but the other

				/// algorithms remain supported for CREATE ROLE WITH HASHED PASSWORD and backward compatibility.

				///

				enum class scheme {

				    bcrypt_y,

				@@ -51,11 +52,11 @@ sstring generate_random_salt_bytes(RandomNumberEngine& g) {

				}

				///

				/// Test each allowed hashing scheme and report the best supported one on the current system.

				/// Test given hashing scheme on the current system.

				///

				/// \throws \ref no_supported_schemes when none of the known schemes is supported.

				/// \throws \ref no_supported_schemes when scheme is unsupported.

				///

				scheme identify_best_supported_scheme();

				void verify_scheme(scheme scheme);

				std::string_view prefix_for_scheme(scheme) noexcept;

				@@ -67,8 +68,7 @@ std::string_view prefix_for_scheme(scheme) noexcept;

				/// \throws \ref no_supported_schemes when no known hashing schemes are supported on the system.

				///

				template <typename RandomNumberEngine>

				sstring generate_salt(RandomNumberEngine& g) {

				    static const scheme scheme = identify_best_supported_scheme();

				sstring generate_salt(RandomNumberEngine& g, scheme scheme) {

				    static const sstring prefix = sstring(prefix_for_scheme(scheme));

				    return prefix + generate_random_salt_bytes(g);

				}

				@@ -93,8 +93,8 @@ sstring hash_with_salt(const sstring& pass, const sstring& salt);

				/// \throws \ref std::system_error when the implementation-specific implementation fails to hash the cleartext.

				///

				template <typename RandomNumberEngine>

				sstring hash(const sstring& pass, RandomNumberEngine& g) {

				    return detail::hash_with_salt(pass, detail::generate_salt(g));

				sstring hash(const sstring& pass, RandomNumberEngine& g, scheme scheme) {

				    return detail::hash_with_salt(pass, detail::generate_salt(g, scheme));

				}

				///
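
Passing the scheme explicitly to `generate_salt()` and `hash()` can be illustrated with a standalone sketch like the one below. It is only an approximation under stated assumptions: `prefix_for` and `make_salt` are hypothetical names, the prefix table follows the usual crypt(3) conventions and is not copied from `auth/passwords.cc`, and `std::string` replaces `sstring`.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <string_view>

enum class scheme { bcrypt_y, bcrypt_a, sha_512, sha_256, md5 };

// crypt(3)-style salt prefixes (assumed mapping, for illustration only).
constexpr std::string_view prefix_for(scheme s) noexcept {
    switch (s) {
    case scheme::bcrypt_y: return "$2y$";
    case scheme::bcrypt_a: return "$2a$";
    case scheme::sha_512:  return "$6$";
    case scheme::sha_256:  return "$5$";
    case scheme::md5:      return "$1$";
    }
    return "";
}

// Builds "<prefix><random_len random characters>" for the requested scheme.
template <typename RandomNumberEngine>
std::string make_salt(RandomNumberEngine& g, scheme s, std::size_t random_len = 16) {
    // Characters that are valid in a crypt(3) salt.
    static constexpr std::string_view alphabet =
        "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);
    std::string salt(prefix_for(s));
    for (std::size_t i = 0; i < random_len; ++i) {
        salt.push_back(alphabet[pick(g)]);
    }
    return salt;
}
```

The sketch looks up the prefix on every call, so callers that pass different schemes get different prefixes. A function-local `static` would instead be initialized once, on the first call.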

									
										13

auth/role_manager.hh
									
												
				@@ -17,12 +17,17 @@

				#include <seastar/core/format.hh>

				#include <seastar/core/sstring.hh>

				#include "auth/common.hh"

				#include "auth/resource.hh"

				#include "cql3/description.hh"

				#include "seastarx.hh"

				#include "exceptions/exceptions.hh"

				#include "service/raft/raft_group0_client.hh"

				namespace service {

				class query_state;

				};

				namespace auth {

				struct role_config final {

				@@ -167,9 +172,9 @@ public:

				    ///   (role2, role3)

				    /// }

				    ///  

				    virtual future<role_to_directly_granted_map> query_all_directly_granted() = 0;

				    virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state& = internal_distributed_query_state()) = 0;

				    virtual future<role_set> query_all() = 0;

				    virtual future<role_set> query_all(::service::query_state& = internal_distributed_query_state()) = 0;

				    virtual future<bool> exists(std::string_view role_name) = 0;

				@@ -186,12 +191,12 @@ public:

				    ///

				    /// \returns the value of the named attribute, if one is set.

				    ///

				    virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) = 0;

				    virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& = internal_distributed_query_state()) = 0;

				    ///

				    /// \returns a mapping of each role's value for the named attribute, if one is set for the role.

				    ///

				    virtual future<attribute_vals> query_attribute_for_all(std::string_view attribute_name) = 0;

				    virtual future<attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state& = internal_distributed_query_state()) = 0;

				    /// Sets `attribute_name` with `attribute_value` for `role_name`.

				    /// \returns an exceptional future with nonexistant_role if the role does not exist.

									
										5

auth/saslauthd_authenticator.cc
									
												
				@@ -34,9 +34,10 @@ static const class_registrator<

				        saslauthd_authenticator,

				        cql3::query_processor&,

				        ::service::raft_group0_client&,

				        ::service::migration_manager&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");

				        ::service::migration_manager&,

				        utils::alien_worker&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");

				saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&)

				saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&)

				    : _socket_path(qp.db().get_config().saslauthd_socket_path())

				{}

									
										3

auth/saslauthd_authenticator.hh
									
												
				@@ -11,6 +11,7 @@

				#pragma once

				#include "auth/authenticator.hh"

				#include "utils/alien_worker.hh"

				namespace cql3 {

				class query_processor;

				@@ -28,7 +29,7 @@ namespace auth {

				class saslauthd_authenticator : public authenticator {

				    sstring _socket_path; ///< Path to the domain socket on which saslauthd is listening.

				public:

				    saslauthd_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);

				    saslauthd_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, utils::alien_worker&);

				    future<> start() override;

									
										14

auth/service.cc
									
												
				@@ -187,14 +187,15 @@ service::service(

				        ::service::migration_notifier& mn,

				        ::service::migration_manager& mm,

				        const service_config& sc,

				        maintenance_socket_enabled used_by_maintenance_socket)

				        maintenance_socket_enabled used_by_maintenance_socket,

				        utils::alien_worker& hashing_worker)

				            : service(

				                      std::move(c),

				                      qp,

				                      g0,

				                      mn,

				                      create_object<authorizer>(sc.authorizer_java_name, qp, g0, mm),

				                      create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm),

				                      create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm, hashing_worker),

				                      create_object<role_manager>(sc.role_manager_java_name, qp, g0, mm),

				                      used_by_maintenance_socket) {

				}

				@@ -240,6 +241,13 @@ future<> service::start(::service::migration_manager& mm, db::system_keyspace& s

				        });

				    }

				    co_await _role_manager->start();

				    if (this_shard_id() == 0) {

				        // Role manager and password authenticator have this odd startup

				        // mechanism where they asynchronously create the superuser role

				        // in the background. Correct password creation depends on role

				        // creation therefore we need to wait here.

				        co_await _role_manager->ensure_superuser_is_created();

				    }

				    co_await when_all_succeed(_authorizer->start(), _authenticator->start()).discard_result();

				    _permissions_cache = std::make_unique<permissions_cache>(_loading_cache_config, *this, log);

				    co_await once_among_shards([this] {

				@@ -885,7 +893,7 @@ future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_

				                for (const auto& col : schema->all_columns()) {

				                    if (row.has(col.name_as_text())) {

				                        values.push_back(

				                                col.type->deserialize(row.get_blob(col.name_as_text())));

				                                col.type->deserialize(row.get_blob_unfragmented(col.name_as_text())));

				                    } else {

				                        values.push_back(unset_value{});

				                    }

									
										4

auth/service.hh
									
												
				@@ -26,6 +26,7 @@

				#include "cql3/description.hh"

				#include "seastarx.hh"

				#include "service/raft/raft_group0_client.hh"

				#include "utils/alien_worker.hh"

				#include "utils/observable.hh"

				#include "utils/serialized_action.hh"

				#include "service/maintenance_mode.hh"

				@@ -126,7 +127,8 @@ public:

				            ::service::migration_notifier&,

				            ::service::migration_manager&,

				            const service_config&,

				            maintenance_socket_enabled);

				            maintenance_socket_enabled,

				            utils::alien_worker&);

				    future<> start(::service::migration_manager&, db::system_keyspace&);

									
										129

auth/standard_role_manager.cc
									
												
				@@ -9,6 +9,7 @@

				#include "auth/standard_role_manager.hh"

				#include <optional>

				#include <stdexcept>

				#include <unordered_set>

				#include <vector>

				@@ -28,6 +29,7 @@

				#include "cql3/util.hh"

				#include "db/consistency_level_type.hh"

				#include "exceptions/exceptions.hh"

				#include "utils/error_injection.hh"

				#include "utils/log.hh"

				#include <seastar/core/loop.hh>

				#include <seastar/coroutine/maybe_yield.hh>

				@@ -126,7 +128,7 @@ static future<record> require_record(cql3::query_processor& qp, std::string_view

				}

				static bool has_can_login(const cql3::untyped_result_set_row& row) {

				    return row.has("can_login") && !(boolean_type->deserialize(row.get_blob("can_login")).is_null());

				    return row.has("can_login") && !(boolean_type->deserialize(row.get_blob_unfragmented("can_login")).is_null());

				}

				standard_role_manager::standard_role_manager(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)

				@@ -178,7 +180,8 @@ future<> standard_role_manager::create_legacy_metadata_tables_if_missing() const

				                    _migration_manager)).discard_result();

				}

				future<> standard_role_manager::create_default_role_if_missing() {

				future<> standard_role_manager::legacy_create_default_role_if_missing() {

				    SCYLLA_ASSERT(legacy_mode(_qp));

				    try {

				        const auto exists = co_await default_role_row_satisfies(_qp, &has_can_login, _superuser);

				        if (exists) {

				@@ -188,16 +191,12 @@ future<> standard_role_manager::create_default_role_if_missing() {

				                get_auth_ks_name(_qp),

				                meta::roles_table::name,

				                meta::roles_table::role_col_name);

				        if (legacy_mode(_qp)) {

				            co_await _qp.execute_internal(

				                    query,

				                    db::consistency_level::QUORUM,

				                    internal_distributed_query_state(),

				                    {_superuser},

				                    cql3::query_processor::cache_internal::no).discard_result();

				        } else {

				            co_await announce_mutations(_qp, _group0_client, query, {_superuser}, _as, ::service::raft_timeout{});

				        }

				        co_await _qp.execute_internal(

				                query,

				                db::consistency_level::QUORUM,

				                internal_distributed_query_state(),

				                {_superuser},

				                cql3::query_processor::cache_internal::no).discard_result();

				        log.info("Created default superuser role '{}'.", _superuser);

				    } catch(const exceptions::unavailable_exception& e) {

				        log.warn("Skipped default role setup: some nodes were not ready; will retry");

				@@ -205,6 +204,60 @@ future<> standard_role_manager::create_default_role_if_missing() {

				    }

				}

				future<> standard_role_manager::maybe_create_default_role() {

				    auto has_superuser = [this] () -> future<bool> {

				        const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", get_auth_ks_name(_qp), meta::roles_table::name);

				        auto results = co_await _qp.execute_internal(query, db::consistency_level::LOCAL_ONE,

				                internal_distributed_query_state(), cql3::query_processor::cache_internal::yes);

				        for (const auto& result : *results) {

				            if (has_can_login(result)) {

				                co_return true;

				            }

				        }

				        co_return false;

				    };

				    if (co_await has_superuser()) {

				        co_return;

				    }

				    // We don't want to start operation earlier to avoid quorum requirement in

				    // a common case.

				    ::service::group0_batch batch(

				            co_await _group0_client.start_operation(_as, get_raft_timeout()));

				    // Check again as the state may have changed before we took the guard (batch).

				    if (co_await has_superuser()) {

				        co_return;

				    }

				    // There is no superuser which has can_login field - create default role.

				    // Note that we don't check if can_login is set to true.

				    const sstring insert_query = seastar::format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, true, true)",

				            get_auth_ks_name(_qp),

				            meta::roles_table::name,

				            meta::roles_table::role_col_name);

				    co_await collect_mutations(_qp, batch, insert_query, {_superuser});

				    co_await std::move(batch).commit(_group0_client, _as, get_raft_timeout());

				    log.info("Created default superuser role '{}'.", _superuser);

				}

				future<> standard_role_manager::maybe_create_default_role_with_retries() {

				    size_t retries = _migration_manager.get_concurrent_ddl_retries();

				    while (true)  {

				        try {

				            co_return co_await maybe_create_default_role();

				        } catch (const ::service::group0_concurrent_modification& ex) {

				            log.warn("Failed to execute maybe_create_default_role due to guard conflict.{}.", retries ? " Retrying" : " Number of retries exceeded, giving up");

				            if (retries--) {

				                continue;

				            }

				            // Log error but don't crash the whole node startup sequence.

				            log.error("Failed to create default superuser role due to guard conflict.");

				            co_return;

				        } catch (const ::service::raft_operation_timeout_error& ex) {

				            log.error("Failed to create default superuser role due to exception: {}", ex.what());

				            co_return;

				        }

				    }

				}

				static const sstring legacy_table_name{"users"};

				bool standard_role_manager::legacy_metadata_exists() {

				@@ -266,10 +319,13 @@ future<> standard_role_manager::start() {

				                    co_await migrate_legacy_metadata();

				                    co_return;

				                }

				                co_await legacy_create_default_role_if_missing();

				            }

				            co_await create_default_role_if_missing();

				            if (!legacy) {

				                _superuser_created_promise.set_value();

				                co_await maybe_create_default_role_with_retries();

				                if (!_superuser_created_promise.available()) {

				                    _superuser_created_promise.set_value();

				                }

				            }

				        };

				@@ -596,21 +652,30 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n

				    });

				}

				future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted() {

				future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted(::service::query_state& qs) {

				    const sstring query = seastar::format("SELECT * FROM {}.{}",

				            get_auth_ks_name(_qp),

				            meta::role_members_table::name);

				    const auto results = co_await _qp.execute_internal(

				            query,

				            db::consistency_level::ONE,

				            qs,

				            cql3::query_processor::cache_internal::yes);

				    role_to_directly_granted_map roles_map;

				    co_await _qp.query_internal(query, [&roles_map] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {

				        roles_map.insert({row.get_as<sstring>("member"), row.get_as<sstring>("role")});

				        co_return stop_iteration::no;

				    });

				    std::transform(

				            results->begin(),

				            results->end(),

				            std::inserter(roles_map, roles_map.begin()),

				            [] (const cql3::untyped_result_set_row& row) {

				                return std::make_pair(row.get_as<sstring>("member"), row.get_as<sstring>("role")); }

				    );

				    co_return roles_map;

				}

				future<role_set> standard_role_manager::query_all() {

				future<role_set> standard_role_manager::query_all(::service::query_state& qs) {

				    const sstring query = seastar::format("SELECT {} FROM {}.{}",

				            meta::roles_table::role_col_name,

				            get_auth_ks_name(_qp),

				@@ -619,10 +684,16 @@ future<role_set> standard_role_manager::query_all() {

				    // To avoid many copies of a view.

				    static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);

				    if (utils::get_local_injector().enter("standard_role_manager_fail_legacy_query")) {

				        if (legacy_mode(_qp)) {

				            throw std::runtime_error("standard_role_manager::query_all: failed due to error injection");

				        }

				    }

				    const auto results = co_await _qp.execute_internal(

				            query,

				            db::consistency_level::QUORUM,

				            internal_distributed_query_state(),

				            qs,

				            cql3::query_processor::cache_internal::yes);

				    role_set roles;

				@@ -654,11 +725,11 @@ future<bool> standard_role_manager::can_login(std::string_view role_name) {

				    });

				}

				future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {

				future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {

				    const sstring query = seastar::format("SELECT name, value FROM {}.{} WHERE role = ? AND name = ?",

				            get_auth_ks_name(_qp),

				            meta::role_attributes_table::name);

				    const auto result_set = co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);

				    const auto result_set = co_await _qp.execute_internal(query, db::consistency_level::ONE, qs, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);

				    if (!result_set->empty()) {

				        const cql3::untyped_result_set_row &row = result_set->one();

				        co_return std::optional<sstring>(row.get_as<sstring>("value"));

				@@ -666,11 +737,11 @@ future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_

				    co_return std::optional<sstring>{};

				}

				future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name) {

				    return query_all().then([this, attribute_name] (role_set roles) {

				        return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles)] (attribute_vals &role_to_att_val) {

				            return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name] (sstring role) {

				                return get_attribute(role, attribute_name).then([&role_to_att_val, role] (std::optional<sstring> att_val) {

				future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name, ::service::query_state& qs) {

				    return query_all(qs).then([this, attribute_name, &qs] (role_set roles) {

				        return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles), &qs] (attribute_vals &role_to_att_val) {

				            return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name, &qs] (sstring role) {

				                return get_attribute(role, attribute_name, qs).then([&role_to_att_val, role] (std::optional<sstring> att_val) {

				                    if (att_val) {

				                        role_to_att_val.emplace(std::move(role), std::move(*att_val));

				                    }

				@@ -715,7 +786,7 @@ future<> standard_role_manager::remove_attribute(std::string_view role_name, std

				future<std::vector<cql3::description>> standard_role_manager::describe_role_grants() {

				    std::vector<cql3::description> result{};

				    const auto grants = co_await query_all_directly_granted();

				    const auto grants = co_await query_all_directly_granted(internal_distributed_query_state());

				    result.reserve(grants.size());

				    for (const auto& [grantee_role, granted_role] : grants) {
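
`maybe_create_default_role()` earlier in this file checks for a superuser, takes the group0 guard only if none is found, and then checks again before writing. A minimal single-process sketch of that check, guard, re-check sequence might look like the code below. Everything in it is an assumption for illustration: `role_store` is a hypothetical in-memory stand-in, a `std::mutex` plays the part of the group0 guard, and an atomic flag replaces the `SELECT ... ALLOW FILTERING` query.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical in-memory stand-in for the roles table plus the group0 guard.
class role_store {
    std::mutex _guard;                        // plays the role of the guard
    std::atomic<bool> _has_superuser{false};  // readable without the guard
    std::atomic<int> _guard_acquisitions{0};  // bookkeeping for the example
    std::string _superuser;                   // written only under _guard
public:
    bool has_superuser() const { return _has_superuser.load(); }
    int guard_acquisitions() const { return _guard_acquisitions.load(); }

    // Returns true if this call created the default role.
    bool maybe_create_default(const std::string& name) {
        // Cheap check first: in the common case the guard is never taken.
        if (has_superuser()) {
            return false;
        }
        std::lock_guard<std::mutex> lock(_guard);
        ++_guard_acquisitions;
        // Check again: state may have changed before the guard was taken.
        if (has_superuser()) {
            return false;
        }
        _superuser = name;
        _has_superuser.store(true);
        return true;
    }
};
```

In the common case, where a superuser already exists, the second and later calls return on the first check and never touch the guard. This mirrors the comment in the diff about avoiding the quorum requirement.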

									
										13

auth/standard_role_manager.hh
									
												
				@@ -66,9 +66,9 @@ public:

				    virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;

				    virtual future<role_to_directly_granted_map> query_all_directly_granted() override;

				    virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;

				    virtual future<role_set> query_all() override;

				    virtual future<role_set> query_all(::service::query_state&) override;

				    virtual future<bool> exists(std::string_view role_name) override;

				@@ -76,9 +76,9 @@ public:

				    virtual future<bool> can_login(std::string_view role_name) override;

				    virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;

				    virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;

				    virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;

				    virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;

				    virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;

				@@ -95,7 +95,10 @@ private:

				    future<> migrate_legacy_metadata();

				    future<> create_default_role_if_missing();

				    future<> legacy_create_default_role_if_missing();

				    future<> maybe_create_default_role();

				    future<> maybe_create_default_role_with_retries();

				    future<> create_or_replace(std::string_view role_name, const role_config&, ::service::group0_batch&);

									
										7

auth/transitional.cc
									
												
				@@ -37,8 +37,8 @@ class transitional_authenticator : public authenticator {

				public:

				    static const sstring PASSWORD_AUTHENTICATOR_NAME;

				    transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)

				            : transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm)) {

				    transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, utils::alien_worker& hashing_worker)

				            : transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm, hashing_worker)) {

				    }

				    transitional_authenticator(std::unique_ptr<authenticator> a)

				            : _authenticator(std::move(a)) {

				@@ -239,7 +239,8 @@ static const class_registrator<

				        auth::transitional_authenticator,

				        cql3::query_processor&,

				        ::service::raft_group0_client&,

				        ::service::migration_manager&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");

				        ::service::migration_manager&,

				        utils::alien_worker&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");

				static const class_registrator<

				        auth::authorizer,

									
										5

bytes.hh
									
												
				@@ -35,8 +35,9 @@ inline bytes_view to_bytes_view(std::string_view view) {

				}

				struct fmt_hex {

				    const bytes_view& v;

				    fmt_hex(const bytes_view& v) noexcept : v(v) {}

				    std::span<const std::byte> v;

				    fmt_hex(const bytes_view& v) noexcept : v(std::as_bytes(std::span(v))) {}

				    fmt_hex(std::span<const std::byte> v) noexcept : v(v) {}

				};

				bytes from_hex(std::string_view s);

									
										5

cdc/cdc_extension.hh
									
												
				@@ -23,6 +23,10 @@ class cdc_extension : public schema_extension {

				public:

				    static constexpr auto NAME = "cdc";

				    // cdc_extension was written before schema_extension was deprecated, so support it

				    // without warnings

				#pragma clang diagnostic push

				#pragma clang diagnostic ignored "-Wdeprecated-declarations"

				    cdc_extension() = default;

				    cdc_extension(const options& opts) : _cdc_options(opts) {}

				    explicit cdc_extension(std::map<sstring, sstring> tags) : _cdc_options(std::move(tags)) {}

				@@ -30,6 +34,7 @@ public:

				    explicit cdc_extension(const sstring& s) {

				        throw std::logic_error("Cannot create cdc info from string");

				    }

				#pragma clang diagnostic pop

				    bytes serialize() const override {

				        return ser::serialize_to_buffer<bytes>(_cdc_options.to_map());

				    }

									
cdc/generation.cc

				@@ -39,12 +39,12 @@

				extern logging::logger cdc_log;

				static int get_shard_count(const gms::inet_address& endpoint, const gms::gossiper& g) {

				static int get_shard_count(const locator::host_id& endpoint, const gms::gossiper& g) {

				    auto ep_state = g.get_application_state_ptr(endpoint, gms::application_state::SHARD_COUNT);

				    return ep_state ? std::stoi(ep_state->value()) : -1;

				}

				static unsigned get_sharding_ignore_msb(const gms::inet_address& endpoint, const gms::gossiper& g) {

				static unsigned get_sharding_ignore_msb(const locator::host_id& endpoint, const gms::gossiper& g) {

				    auto ep_state = g.get_application_state_ptr(endpoint, gms::application_state::IGNORE_MSB_BITS);

				    return ep_state ? std::stoi(ep_state->value()) : 0;

				}

				@@ -198,7 +198,7 @@ static std::vector<stream_id> create_stream_ids(

				}

				bool should_propose_first_generation(const locator::host_id& my_host_id, const gms::gossiper& g) {

				    return g.for_each_endpoint_state_until([&] (const gms::inet_address&, const gms::endpoint_state& eps) {

				    return g.for_each_endpoint_state_until([&] (const gms::endpoint_state& eps) {

				        return stop_iteration(my_host_id < eps.get_host_id());

				    }) == stop_iteration::no;

				}
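As read from the hunk above, `should_propose_first_generation` returns true only when no known endpoint has a host ID greater than the local one. A simplified sketch of that rule, assuming a plain string as a hypothetical stand-in for `locator::host_id` and a vector in place of the gossiper's endpoint state map:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for locator::host_id; any totally ordered type works.
using host_id = std::string;

// True when no known host ID is greater than ours, i.e. the iteration
// would never have been stopped early.
inline bool should_propose_first_generation(const host_id& mine,
                                            const std::vector<host_id>& known) {
    return std::none_of(known.begin(), known.end(),
                        [&](const host_id& other) { return mine < other; });
}
```

With this rule at most one node in a stable membership view (the one with the largest ID) proposes the first generation.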

				@@ -365,6 +365,9 @@ cdc::topology_description make_new_generation_description(

				        const noncopyable_function<std::pair<size_t, uint8_t>(dht::token)>& get_sharding_info,

				        const locator::token_metadata_ptr tmptr) {

				    const auto tokens = get_tokens(bootstrap_tokens, tmptr);

				    if (tokens.empty()) {

				        on_internal_error(cdc_log, "Attempted to create a CDC generation from an empty list of tokens");

				    }

				    utils::chunked_vector<token_range_description> vnode_descriptions;

				    vnode_descriptions.reserve(tokens.size());

				@@ -402,9 +405,8 @@ future<cdc::generation_id> generation_service::legacy_make_new_generation(const

				                throw std::runtime_error(

				                        format("Can't find endpoint for token {}", end));

				            }

				            const auto ep = _gossiper.get_address_map().get(*endpoint);

				            auto sc = get_shard_count(ep, _gossiper);

				            return {sc > 0 ? sc : 1, get_sharding_ignore_msb(ep, _gossiper)};

				            auto sc = get_shard_count(*endpoint, _gossiper);

				            return {sc > 0 ? sc : 1, get_sharding_ignore_msb(*endpoint, _gossiper)};

				        }

				    };

				@@ -463,7 +465,7 @@ future<cdc::generation_id> generation_service::legacy_make_new_generation(const

				 * but if the cluster already supports CDC, then every newly joining node will propose a new CDC generation,

				 * which means it will gossip the generation's timestamp.

				 */

				static std::optional<cdc::generation_id> get_generation_id_for(const gms::inet_address& endpoint, const gms::endpoint_state& eps) {

				static std::optional<cdc::generation_id> get_generation_id_for(const locator::host_id& endpoint, const gms::endpoint_state& eps) {

				    const auto* gen_id_ptr = eps.get_application_state_ptr(gms::application_state::CDC_GENERATION_ID);

				    if (!gen_id_ptr) {

				        return std::nullopt;

				@@ -841,18 +843,18 @@ future<> generation_service::leave_ring() {

				    co_await _gossiper.unregister_(shared_from_this());

				}

				future<> generation_service::on_join(gms::inet_address ep, gms::endpoint_state_ptr ep_state, gms::permit_id pid) {

				    return on_change(ep, ep_state->get_application_state_map(), pid);

				future<> generation_service::on_join(gms::inet_address ep, locator::host_id id, gms::endpoint_state_ptr ep_state, gms::permit_id pid) {

				    return on_change(ep, id, ep_state->get_application_state_map(), pid);

				}

				future<> generation_service::on_change(gms::inet_address ep, const gms::application_state_map& states, gms::permit_id pid) {

				future<> generation_service::on_change(gms::inet_address ep, locator::host_id id, const gms::application_state_map& states, gms::permit_id pid) {

				    assert_shard_zero(__PRETTY_FUNCTION__);

				    if (_raft_topology_change_enabled()) {

				        return make_ready_future<>();

				    }

				    return on_application_state_change(ep, states, gms::application_state::CDC_GENERATION_ID, pid, [this] (gms::inet_address ep, const gms::versioned_value& v, gms::permit_id) {

				    return on_application_state_change(ep, id, states, gms::application_state::CDC_GENERATION_ID, pid, [this] (gms::inet_address ep, locator::host_id id, const gms::versioned_value& v, gms::permit_id) {

				        auto gen_id = gms::versioned_value::cdc_generation_id_from_string(v.value());

				        cdc_log.debug("Endpoint: {}, CDC generation ID change: {}", ep, gen_id);

				@@ -867,7 +869,8 @@ future<> generation_service::check_and_repair_cdc_streams() {

				    }

				    std::optional<cdc::generation_id> latest = _gen_id;

				    _gossiper.for_each_endpoint_state([&] (const gms::inet_address& addr, const gms::endpoint_state& state) {

				    _gossiper.for_each_endpoint_state([&] (const gms::endpoint_state& state) {

				        auto addr = state.get_host_id();

				        if (_gossiper.is_left(addr)) {

				            cdc_log.info("check_and_repair_cdc_streams ignored node {} because it is in LEFT state", addr);

				            return;

				@@ -1066,8 +1069,8 @@ future<> generation_service::legacy_scan_cdc_generations() {

				    assert_shard_zero(__PRETTY_FUNCTION__);

				    std::optional<cdc::generation_id> latest;

				    _gossiper.for_each_endpoint_state([&] (const gms::inet_address& node, const gms::endpoint_state& eps) {

				        auto gen_id = get_generation_id_for(node, eps);

				    _gossiper.for_each_endpoint_state([&] (const gms::endpoint_state& eps) {

				        auto gen_id = get_generation_id_for(eps.get_host_id(), eps);

				        if (!latest || (gen_id && get_ts(*gen_id) > get_ts(*latest))) {

				            latest = gen_id;

				        }

									
cdc/generation_service.hh

				@@ -110,13 +110,8 @@ public:

				        return _cdc_metadata;

				    }

				    virtual future<> on_alive(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override { return make_ready_future(); }

				    virtual future<> on_dead(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override { return make_ready_future(); }

				    virtual future<> on_remove(gms::inet_address, gms::permit_id) override { return make_ready_future(); }

				    virtual future<> on_restart(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override { return make_ready_future(); }

				    virtual future<> on_join(gms::inet_address, gms::endpoint_state_ptr, gms::permit_id) override;

				    virtual future<> on_change(gms::inet_address, const gms::application_state_map&, gms::permit_id) override;

				    virtual future<> on_join(gms::inet_address, locator::host_id id, gms::endpoint_state_ptr, gms::permit_id) override;

				    virtual future<> on_change(gms::inet_address, locator::host_id id, const gms::application_state_map&, gms::permit_id) override;

				    future<> check_and_repair_cdc_streams();

									
cdc/log.cc

				@@ -56,8 +56,17 @@ using namespace std::chrono_literals;

				logging::logger cdc_log("cdc");

				namespace {

				// When dropping a column from a CDC log table, we set the drop timestamp

				// `column_drop_leeway` seconds into the future to ensure that for writes concurrent

				// with column drop, the write timestamp is before the column drop timestamp.

				constexpr auto column_drop_leeway = std::chrono::seconds(5);

				} // anonymous namespace

				namespace cdc {

				static schema_ptr create_log_schema(const schema&, std::optional<table_id> = {}, schema_ptr = nullptr);

				static schema_ptr create_log_schema(const schema&, api::timestamp_type, std::optional<table_id> = {}, schema_ptr = nullptr);

				}

				static constexpr auto cdc_group_name = "cdc";

				@@ -167,7 +176,7 @@ public:

				            ensure_that_table_uses_vnodes(ksm, schema);

				            // in seastar thread

				            auto log_schema = create_log_schema(schema);

				            auto log_schema = create_log_schema(schema, timestamp);

				            auto log_mut = db::schema_tables::make_create_table_mutations(log_schema, timestamp);

				@@ -205,7 +214,7 @@ public:

				            ensure_that_table_has_no_counter_columns(new_schema);

				            ensure_that_table_uses_vnodes(*keyspace.metadata(), new_schema);

				            auto new_log_schema = create_log_schema(new_schema, log_schema ? std::make_optional(log_schema->id()) : std::nullopt, log_schema);

				            auto new_log_schema = create_log_schema(new_schema, timestamp, log_schema ? std::make_optional(log_schema->id()) : std::nullopt, log_schema);

				            auto log_mut = log_schema 

				                ? db::schema_tables::make_update_table_mutations(db, keyspace.metadata(), log_schema, new_log_schema, timestamp)

				@@ -496,7 +505,7 @@ bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name) {

				    return to_bytes(cdc_deleted_elements_column_prefix) + column_name;

				}

				static schema_ptr create_log_schema(const schema& s, std::optional<table_id> uuid, schema_ptr old) {

				static schema_ptr create_log_schema(const schema& s, api::timestamp_type timestamp, std::optional<table_id> uuid, schema_ptr old) {

				    schema_builder b(s.ks_name(), log_name(s.cf_name()));

				    b.with_partitioner(cdc::cdc_partitioner::classname);

				    b.set_compaction_strategy(sstables::compaction_strategy_type::time_window);

				@@ -531,6 +540,28 @@ static schema_ptr create_log_schema(const schema& s, std::optional<table_id> uui

				    b.with_column(log_meta_column_name_bytes("ttl"), long_type);

				    b.with_column(log_meta_column_name_bytes("end_of_batch"), boolean_type);

				    b.set_caching_options(caching_options::get_disabled_caching_options());

				    auto validate_new_column = [&] (const sstring& name) {

				        // When dropping a column from a CDC log table, we set the drop timestamp to be

				        // `column_drop_leeway` seconds into the future (see `create_log_schema`).

				        // Therefore, when recreating a column with the same name, we need to validate

				        // that it's not recreated too soon and that the drop timestamp has passed.

				        if (old && old->dropped_columns().contains(name)) {

				            const auto& drop_info = old->dropped_columns().at(name);

				            auto create_time = api::timestamp_clock::time_point(api::timestamp_clock::duration(timestamp));

				            auto drop_time = api::timestamp_clock::time_point(api::timestamp_clock::duration(drop_info.timestamp));

				            if (drop_time > create_time) {

				                throw exceptions::invalid_request_exception(format("Cannot add column {} because a column with the same name was dropped too recently. Please retry after {} seconds",

				                        name, std::chrono::duration_cast<std::chrono::seconds>(drop_time - create_time).count() + 1));

				            }

				        }

				    };

				    auto add_column = [&] (sstring name, data_type type) {

				        validate_new_column(name);

				        b.with_column(to_bytes(name), type);

				    };

				    auto add_columns = [&] (const schema::const_iterator_range_type& columns, bool is_data_col = false) {

				        for (const auto& column : columns) {

				            auto type = column.type;

				@@ -552,9 +583,9 @@ static schema_ptr create_log_schema(const schema& s, std::optional<table_id> uui

				                    }

				                ));

				            }

				            b.with_column(log_data_column_name_bytes(column.name()), type);

				            add_column(log_data_column_name(column.name_as_text()), type);

				            if (is_data_col) {

				                b.with_column(log_data_column_deleted_name_bytes(column.name()), boolean_type);

				                add_column(log_data_column_deleted_name(column.name_as_text()), boolean_type);

				            }

				            if (column.type->is_multi_cell()) {

				                auto dtype = visit(*type, make_visitor(

				@@ -570,7 +601,7 @@ static schema_ptr create_log_schema(const schema& s, std::optional<table_id> uui

				                        throw std::invalid_argument("Should not reach");

				                    }

				                ));

				                b.with_column(log_data_column_deleted_elements_name_bytes(column.name()), dtype);

				                add_column(log_data_column_deleted_elements_name(column.name_as_text()), dtype);

				            }

				        }

				    };

				@@ -592,7 +623,8 @@ static schema_ptr create_log_schema(const schema& s, std::optional<table_id> uui

				        // not super efficient, but we don't do this often.

				        for (auto& col : old->all_columns()) {

				            if (!b.has_column({col.name(), col.name_as_text() })) {

				                b.without_column(col.name_as_text(), col.type, api::new_timestamp());

				                auto drop_ts = api::timestamp_clock::now() + column_drop_leeway;

				                b.without_column(col.name_as_text(), col.type, drop_ts.time_since_epoch().count());

				            }

				        }

				    }

				@@ -960,8 +992,12 @@ public:

				    // Given a reference to such a column from the base schema, this function sets the corresponding column

				    // in the log to the given value for the given row.

				    void set_value(const clustering_key& log_ck, const column_definition& base_cdef, const managed_bytes_view& value) {

				        auto& log_cdef = *_log_schema.get_column_definition(log_data_column_name_bytes(base_cdef.name()));

				        _log_mut.set_cell(log_ck, log_cdef, atomic_cell::make_live(*base_cdef.type, _ts, value, _ttl));

				        auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_name_bytes(base_cdef.name()));

				        if (!log_cdef_ptr) {

				            throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",

				                _log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));

				        }

				        _log_mut.set_cell(log_ck, *log_cdef_ptr, atomic_cell::make_live(*base_cdef.type, _ts, value, _ttl));

				    }

				    // Each regular and static column in the base schema has a corresponding column in the log schema

				@@ -969,7 +1005,13 @@ public:

				    // Given a reference to such a column from the base schema, this function sets the corresponding column

				    // in the log to `true` for the given row. If not called, the column will be `null`.

				    void set_deleted(const clustering_key& log_ck, const column_definition& base_cdef) {

				        _log_mut.set_cell(log_ck, log_data_column_deleted_name_bytes(base_cdef.name()), data_value(true), _ts, _ttl);

				        auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_deleted_name_bytes(base_cdef.name()));

				        if (!log_cdef_ptr) {

				            throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",

				                _log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));

				        }

				        auto& log_cdef = *log_cdef_ptr;

				        _log_mut.set_cell(log_ck, *log_cdef_ptr, atomic_cell::make_live(*log_cdef.type, _ts, log_cdef.type->decompose(true), _ttl));

				    }

				    // Each regular and static non-atomic column in the base schema has a corresponding column in the log schema

				@@ -978,7 +1020,12 @@ public:

				    // Given a reference to such a column from the base schema, this function sets the corresponding column

				    // in the log to the given set of keys for the given row.

				    void set_deleted_elements(const clustering_key& log_ck, const column_definition& base_cdef, const managed_bytes& deleted_elements) {

				        auto& log_cdef = *_log_schema.get_column_definition(log_data_column_deleted_elements_name_bytes(base_cdef.name()));

				        auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_deleted_elements_name_bytes(base_cdef.name()));

				        if (!log_cdef_ptr) {

				            throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",

				                _log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));

				        }

				        auto& log_cdef = *log_cdef_ptr;

				        _log_mut.set_cell(log_ck, log_cdef, atomic_cell::make_live(*log_cdef.type, _ts, deleted_elements, _ttl));

				    }

				@@ -1865,5 +1912,10 @@ bool cdc::cdc_service::needs_cdc_augmentation(const std::vector<mutation>& mutat

				future<std::tuple<std::vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>>

				cdc::cdc_service::augment_mutation_call(lowres_clock::time_point timeout, std::vector<mutation>&& mutations, tracing::trace_state_ptr tr_state, db::consistency_level write_cl) {

				    if (utils::get_local_injector().enter("sleep_before_cdc_augmentation")) {

				        return seastar::sleep(std::chrono::milliseconds(100)).then([this, timeout, mutations = std::move(mutations), tr_state = std::move(tr_state), write_cl] () mutable {

				            return _impl->augment_mutation_call(timeout, std::move(mutations), std::move(tr_state), write_cl);

				        });

				    }

				    return _impl->augment_mutation_call(timeout, std::move(mutations), std::move(tr_state), write_cl);

				}
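The `column_drop_leeway` logic in the cdc/log.cc hunks above can be illustrated in isolation: the drop timestamp is placed a few seconds in the future, and re-adding a column with the same name is rejected until that moment has passed. The sketch below assumes timestamps are microseconds since the epoch; `timestamp_us`, `drop_timestamp` and `seconds_until_readd_allowed` are hypothetical names, not ScyllaDB functions.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>

// Assumption: timestamps are microseconds since the epoch.
using timestamp_us = std::int64_t;

constexpr auto column_drop_leeway = std::chrono::seconds(5);

// Drop timestamp recorded for a column removed at `now_us`.
constexpr timestamp_us drop_timestamp(timestamp_us now_us) {
    return now_us
        + std::chrono::duration_cast<std::chrono::microseconds>(column_drop_leeway).count();
}

// Returns std::nullopt when the column may be re-added at `create_us`;
// otherwise the number of seconds to suggest in the retry message.
constexpr std::optional<std::int64_t> seconds_until_readd_allowed(timestamp_us create_us,
                                                                  timestamp_us drop_us) {
    if (drop_us <= create_us) {
        return std::nullopt;
    }
    const auto remaining = std::chrono::microseconds(drop_us - create_us);
    // Round up by one second, matching the "+ 1" in the error message above.
    return std::chrono::duration_cast<std::chrono::seconds>(remaining).count() + 1;
}
```

The extra second in the suggested wait covers the truncation done by `duration_cast`, so a client that waits the reported time should not hit the same error again.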

									
cmake/mode.Coverage.cmake

				@@ -3,7 +3,7 @@ set(CMAKE_CXX_FLAGS_COVERAGE

				  CACHE

				  INTERNAL

				  "")

				update_cxx_flags(CMAKE_CXX_FLAGS_COVERAGE

				update_build_flags(Coverage

				  WITH_DEBUG_INFO

				  OPTIMIZATION_LEVEL "g")

									
cmake/mode.Debug.cmake

				@@ -1,6 +1,6 @@

				set(OptimizationLevel "g")

				update_cxx_flags(CMAKE_CXX_FLAGS_DEBUG

				update_build_flags(Debug

				  WITH_DEBUG_INFO

				  OPTIMIZATION_LEVEL ${OptimizationLevel})

									
cmake/mode.Dev.cmake

				@@ -3,7 +3,7 @@ set(CMAKE_CXX_FLAGS_DEV

				  CACHE

				  INTERNAL

				  "")

				update_cxx_flags(CMAKE_CXX_FLAGS_DEV

				update_build_flags(Dev

				  OPTIMIZATION_LEVEL "2")

				set(scylla_build_mode_Dev "dev")

									
cmake/mode.RelWithDebInfo.cmake

				@@ -8,7 +8,7 @@ set(CMAKE_CXX_FLAGS_RELWITHDEBINFO

				  CACHE

				  INTERNAL

				  "")

				update_cxx_flags(CMAKE_CXX_FLAGS_RELWITHDEBINFO

				update_build_flags(RelWithDebInfo

				  WITH_DEBUG_INFO

				  OPTIMIZATION_LEVEL "3")

									
cmake/mode.Sanitize.cmake

				@@ -3,7 +3,7 @@ set(CMAKE_CXX_FLAGS_SANITIZE

				  CACHE

				  INTERNAL

				  "")

				update_cxx_flags(CMAKE_CXX_FLAGS_SANITIZE

				update_build_flags(Sanitize

				  WITH_DEBUG_INFO

				  OPTIMIZATION_LEVEL "s")

									
cmake/mode.common.cmake

				@@ -72,7 +72,7 @@ function(get_padded_dynamic_linker_option output length)

				    ERROR_VARIABLE driver_command_line

				    ERROR_STRIP_TRAILING_WHITESPACE)

				  # extract the argument for the "-dynamic-linker" option

				  if(driver_command_line MATCHES ".*\"?${dynamic_linker_option}\"? \"?([^ \"]*)\"? .*")

				  if(driver_command_line MATCHES ".*\"?${dynamic_linker_option}\"?[ =]\"?([^ \"]*)\"?[ \n].*")

				    set(dynamic_linker ${CMAKE_MATCH_1})

				  else()

				    message(FATAL_ERROR "Unable to find ${dynamic_linker_option} in driver-generated command: "

				@@ -80,7 +80,7 @@ function(get_padded_dynamic_linker_option output length)

				  endif()

				  # prefixing a path with "/"s does not actually change it means

				  pad_at_begin(padded_dynamic_linker "/" "${dynamic_linker}" ${length})

				  set(${output} "${dynamic_linker_option}=${padded_dynamic_linker}" PARENT_SCOPE)

				  set(${output} "--dynamic-linker=${padded_dynamic_linker}" PARENT_SCOPE)

				endfunction()

				# We want to strip the absolute build paths from the binary,

				@@ -135,7 +135,7 @@ function(maybe_limit_stack_usage_in_KB stack_usage_threshold_in_KB config)

				  endif()

				endfunction()

				macro(update_cxx_flags flags)

				macro(update_build_flags config)

				  cmake_parse_arguments (

				    parsed_args

				    "WITH_DEBUG_INFO"

				@@ -145,11 +145,22 @@ macro(update_cxx_flags flags)

				  if(NOT DEFINED parsed_args_OPTIMIZATION_LEVEL)

				    message(FATAL_ERROR "OPTIMIZATION_LEVEL is missing")

				  endif()

				  string(APPEND ${flags}

				  string(TOUPPER ${config} CONFIG)

				  set(cxx_flags "CMAKE_CXX_FLAGS_${CONFIG}")

				  set(linker_flags "CMAKE_EXE_LINKER_FLAGS_${CONFIG}")

				  string(APPEND ${cxx_flags}

				    " -O${parsed_args_OPTIMIZATION_LEVEL}")

				  if(parsed_args_WITH_DEBUG_INFO)

				    string(APPEND ${flags} " -g -gz")

				    string(APPEND ${cxx_flags} " -g -gz")

				  else()

				    # If Scylla is compiled without debug info, strip the debug symbols from

				    # the result in case one of the linked static libraries happens to have

				    # some debug symbols. See issue #23834.

				    string(APPEND ${linker_flags} " -Wl,--strip-debug")

				  endif()

				  unset(CONFIG)

				  unset(cxx_flags)

				  unset(linker_flags)

				endmacro()

				set(pgo_opts "")

				@@ -283,7 +294,7 @@ else()

				  # that. The 512 includes the null at the end, hence the 511 below.

				  get_padded_dynamic_linker_option(dynamic_linker_option 511)

				endif()

				add_link_options("${dynamic_linker_option}")

				add_link_options("LINKER:${dynamic_linker_option}")

				if(Scylla_ENABLE_LTO)

				  include(CheckIPOSupported)
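The updated pattern in `get_padded_dynamic_linker_option` accepts both the `-dynamic-linker <path>` and `--dynamic-linker=<path>` spellings, and the extracted path is then left-padded with `/` characters. The C++ sketch below mirrors those two steps for illustration only. It assumes the option text is `-dynamic-linker`, uses `std::regex_search` (first occurrence) rather than CMake's whole-string `MATCHES`, and its `pad_at_begin` is a guess at what the CMake helper of the same name does.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <regex>
#include <string>

// Extract the dynamic linker path from a compiler-driver command line.
// Accepts a space or '=' between the option and its argument.
inline std::optional<std::string> find_dynamic_linker(const std::string& driver_command_line) {
    static const std::regex re(R"re("?-dynamic-linker"?[ =]"?([^ "]*)"?[ \n])re");
    std::smatch m;
    if (!std::regex_search(driver_command_line, m, re)) {
        return std::nullopt;
    }
    return m[1].str();
}

// Left-pad `s` with `pad` up to `length` characters (assumed behavior).
// Extra leading '/' characters do not change which file a path refers to.
inline std::string pad_at_begin(const std::string& s, char pad, std::size_t length) {
    if (s.size() >= length) {
        return s;
    }
    std::string out(length - s.size(), pad);
    out += s;
    assert(out.size() == length);
    return out;
}
```

Padding to a fixed length keeps the interpreter string the same size regardless of where the dynamic linker lives, which is why the build asks for 511 characters plus the terminating null.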

									
compaction/compaction.cc

				@@ -135,20 +135,21 @@ std::string_view to_string(compaction_type_options::scrub::quarantine_mode quara

				    return "(invalid)";

				}

				static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_s, sstable_set::incremental_selector& selector,

				static max_purgeable get_max_purgeable_timestamp(const table_state& table_s, sstable_set::incremental_selector& selector,

				        const std::unordered_set<shared_sstable>& compacting_set, const dht::decorated_key& dk, uint64_t& bloom_filter_checks,

				        const api::timestamp_type compacting_max_timestamp, const bool gc_check_only_compacting_sstables, const is_shadowable is_shadowable) {

				    if (!table_s.tombstone_gc_enabled()) [[unlikely]] {

				        return api::min_timestamp;

				        return { .timestamp = api::min_timestamp };

				    }

				    auto timestamp = api::max_timestamp;

				    if (gc_check_only_compacting_sstables) {

				        // If gc_check_only_compacting_sstables is enabled, do not

				        // check memtables and other sstables not being compacted.

				        return timestamp;

				        return { .timestamp = timestamp };

				    }

				    auto source = max_purgeable::timestamp_source::none;

				    api::timestamp_type memtable_min_timestamp;

				    if (is_shadowable) {

				        // For shadowable tombstones, check the minimum live row_marker timestamp

				@@ -174,6 +175,7 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_

				    // newer data.

				    if (memtable_min_timestamp <= compacting_max_timestamp && table_s.memtable_has_key(dk)) {

				        timestamp = memtable_min_timestamp;

				        source = max_purgeable::timestamp_source::memtable_possibly_shadowing_data;

				    }

				    std::optional<utils::hashed_key> hk;

				    for (auto&& sst : boost::range::join(selector.select(dk).sstables, table_s.compacted_undeleted_sstables())) {

				@@ -217,9 +219,10 @@ static api::timestamp_type get_max_purgeable_timestamp(const table_state& table_

				        if (sst->filter_has_key(*hk)) {

				            bloom_filter_checks++;

				            timestamp = min_timestamp;

				            source = max_purgeable::timestamp_source::other_sstables_possibly_shadowing_data;

				        }

				    }

				    return timestamp;

				    return { .timestamp = timestamp, .source = source };

				}

				static std::vector<shared_sstable> get_uncompacting_sstables(const table_state& table_s, std::vector<shared_sstable> sstables) {
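The hunks above change `get_max_purgeable_timestamp` to return a timestamp together with the source that limited it. A reduced sketch of that idea follows, under loud assumptions: the `max_purgeable` struct here is a hypothetical aggregate modeled on the designated initializers in the diff, `candidate` and `pick_max_purgeable` are invented for illustration, and the per-sstable checks elided from the diff are replaced by a simple "lowest timestamp wins" rule.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

using timestamp_type = std::int64_t;
constexpr timestamp_type max_timestamp = std::numeric_limits<timestamp_type>::max();

// Hypothetical aggregate: the purge limit plus what imposed it.
struct max_purgeable {
    enum class timestamp_source {
        none,
        memtable_possibly_shadowing_data,
        other_sstables_possibly_shadowing_data,
    };
    timestamp_type timestamp = max_timestamp;
    timestamp_source source = timestamp_source::none;
};

// Hypothetical summary of a memtable or an sstable outside the compaction.
struct candidate {
    timestamp_type min_timestamp;
    bool may_contain_key;
};

inline max_purgeable pick_max_purgeable(const candidate& memtable,
                                        const std::vector<candidate>& other_sstables,
                                        timestamp_type compacting_max_timestamp) {
    max_purgeable result;
    // A memtable only matters if it may hold the key and has data old enough
    // to overlap with what is being compacted.
    if (memtable.min_timestamp <= compacting_max_timestamp && memtable.may_contain_key) {
        result = max_purgeable{
            .timestamp = memtable.min_timestamp,
            .source = max_purgeable::timestamp_source::memtable_possibly_shadowing_data};
    }
    // Simplification: any other sstable that may hold the key lowers the limit.
    for (const auto& sst : other_sstables) {
        if (sst.may_contain_key && sst.min_timestamp < result.timestamp) {
            result = max_purgeable{
                .timestamp = sst.min_timestamp,
                .source = max_purgeable::timestamp_source::other_sstables_possibly_shadowing_data};
        }
    }
    return result;
}
```

Carrying the source alongside the timestamp lets a caller record why a tombstone could not be purged, which is presumably what the new `tombstone_purge_stats` member is for.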

				@@ -235,6 +238,12 @@ static std::vector<shared_sstable> get_uncompacting_sstables(const table_state&

				    return not_compacted_sstables;

				}

				static std::vector<basic_info> extract_basic_info_from_sstables(const std::vector<shared_sstable>& sstables) {

				    return sstables | std::views::transform([] (auto&& sst) {

				        return sstables::basic_info{.generation = sst->generation(), .origin = sst->get_origin(), .size = sst->bytes_on_disk()};

				    }) | std::ranges::to<std::vector<basic_info>>();

				}

				class compaction;

				class compaction_write_monitor final : public sstables::write_monitor, public backlog_write_progress_manager {

				@@ -483,6 +492,7 @@ protected:

				    const reader_permit _permit;

				    std::vector<shared_sstable> _sstables;

				    std::vector<generation_type> _input_sstable_generations;

				    std::vector<basic_info> _input_sstables_basic_info;

				    // Unused sstables are tracked because if compaction is interrupted we can only delete them.

				    // Deleting used sstables could potentially result in data loss.

				    std::unordered_set<shared_sstable> _new_partial_sstables;

				@@ -501,6 +511,7 @@ protected:

				    double _estimated_droppable_tombstone_ratio = 0;

				    uint64_t _bloom_filter_checks = 0;

				    combined_reader_statistics _reader_statistics;

				    tombstone_purge_stats _tombstone_purge_stats;

				    db::replay_position _rp;

				    encoding_stats_collector _stats_collector;

				    const bool _can_split_large_partition = false;

				@@ -762,14 +773,14 @@ private:

				            return dht::to_partition_range(*r);

				        };

				        return make_flat_multi_range_reader(_schema, _permit, std::move(source),

				        return make_multi_range_reader(_schema, _permit, std::move(source),

				                                            std::move(owned_range_generator),

				                                            _schema->full_slice(),

				                                            tracing::trace_state_ptr());

				    }

				    virtual sstables::sstable_set make_sstable_set_for_input() const {

				        return _table_s.get_compaction_strategy().make_sstable_set(_schema);

				        return _table_s.get_compaction_strategy().make_sstable_set(_table_s);

				    }

				    const tombstone_gc_state& get_tombstone_gc_state() const {

				@@ -783,12 +794,15 @@ private:

				        double sum_of_estimated_droppable_tombstone_ratio = 0;

				        _input_sstable_generations.reserve(_sstables.size());

				        _input_sstables_basic_info.reserve(_sstables.size());

				        for (auto& sst : _sstables) {

				            co_await coroutine::maybe_yield();

				            auto& sst_stats = sst->get_stats_metadata();

				            timestamp_tracker.update(sst_stats.min_timestamp);

				            timestamp_tracker.update(sst_stats.max_timestamp);

				            _input_sstables_basic_info.emplace_back(sst->generation(), sst->get_origin(), sst->bytes_on_disk());

				            // Compacted sstable keeps track of its ancestors.

				            _input_sstable_generations.push_back(sst->generation());

				            _start_size += sst->bytes_on_disk();

				@@ -842,7 +856,8 @@ private:

				            });

				        });

				        const auto& gc_state = get_tombstone_gc_state();

				        return consumer(make_compacting_reader(setup_sstable_reader(), compaction_time, max_purgeable_func(), gc_state));

				        return consumer(make_compacting_reader(setup_sstable_reader(), compaction_time, max_purgeable_func(), gc_state,

				                                               streamed_mutation::forwarding::no, &_tombstone_purge_stats));

				    }

				    future<> consume() {

				@@ -859,22 +874,24 @@ private:

				                auto close_reader = deferred_close(reader);

				                if (enable_garbage_collected_sstable_writer()) {

				                    using compact_mutations = compact_for_compaction_v2<compacted_fragments_writer, compacted_fragments_writer>;

				                    using compact_mutations = compact_for_compaction<compacted_fragments_writer, compacted_fragments_writer>;

				                    auto cfc = compact_mutations(*schema(), now,

				                        max_purgeable_func(),

				                        get_tombstone_gc_state(),

				                        get_compacted_fragments_writer(),

				                        get_gc_compacted_fragments_writer());

				                        get_gc_compacted_fragments_writer(),

				                        &_tombstone_purge_stats);

				                    reader.consume_in_thread(std::move(cfc));

				                    return;

				                }

				                using compact_mutations = compact_for_compaction_v2<compacted_fragments_writer, noop_compacted_fragments_consumer>;

				                using compact_mutations = compact_for_compaction<compacted_fragments_writer, noop_compacted_fragments_consumer>;

				                auto cfc = compact_mutations(*schema(), now,

				                    max_purgeable_func(),

				                    get_tombstone_gc_state(),

				                    get_compacted_fragments_writer(),

				                    noop_compacted_fragments_consumer());

				                    noop_compacted_fragments_consumer(),

				                    &_tombstone_purge_stats);

				                reader.consume_in_thread(std::move(cfc));

				            });

				        });

				@@ -897,7 +914,7 @@ private:

				    // if the derived compaction wants to opt in for this behavior, in addition

				    // to overriding `make_interposer_consumer()`, it would have to override

				    // `use_interposer_consumer()` so it returns true.

				    virtual reader_consumer_v2 make_interposer_consumer(reader_consumer_v2 end_consumer) {

				    virtual mutation_reader_consumer make_interposer_consumer(mutation_reader_consumer end_consumer) {

				        return _table_s.get_compaction_strategy().make_interposer_consumer(_ms_metadata, std::move(end_consumer));

				    }

				@@ -907,13 +924,19 @@ private:

				protected:

				    virtual compaction_result finish(std::chrono::time_point<db_clock> started_at, std::chrono::time_point<db_clock> ended_at) {

				        compaction_result ret {

				            .shard_id = this_shard_id(),

				            .type = _type,

				            .sstables_in = std::move(_input_sstables_basic_info),

				            .sstables_out = extract_basic_info_from_sstables(_all_new_sstables),

				            .new_sstables = std::move(_all_new_sstables),

				            .stats {

				                .started_at = started_at,

				                .ended_at = ended_at,

				                .start_size = _start_size,

				                .end_size = _end_size,

				                .bloom_filter_checks = _bloom_filter_checks,

				                .reader_statistics = std::move(_reader_statistics),

				                .tombstone_purge_stats = std::move(_tombstone_purge_stats),

				            },

				        };

				@@ -1301,7 +1324,7 @@ public:

				    }

				    virtual sstables::sstable_set make_sstable_set_for_input() const override {

				        return sstables::make_partitioned_sstable_set(_schema, false);

				        return sstables::make_partitioned_sstable_set(_schema, _table_s.token_range());

				    }

				    // Unconditionally enable incremental compaction if the strategy specifies a max output size, e.g. LCS.

				@@ -1388,7 +1411,7 @@ public:

				    {

				    }

				    reader_consumer_v2 make_interposer_consumer(reader_consumer_v2 end_consumer) override {

				    mutation_reader_consumer make_interposer_consumer(mutation_reader_consumer end_consumer) override {

				        return [this, end_consumer = std::move(end_consumer)] (mutation_reader reader) mutable -> future<> {

				            return mutation_writer::segregate_by_token_group(std::move(reader),

				                    _options.classifier,

				@@ -1682,7 +1705,7 @@ public:

				        }

				    }

				    reader_consumer_v2 make_interposer_consumer(reader_consumer_v2 end_consumer) override {

				    mutation_reader_consumer make_interposer_consumer(mutation_reader_consumer end_consumer) override {

				        if (!use_interposer_consumer()) {

				            return end_consumer;

				        }

				@@ -1778,7 +1801,7 @@ public:

				    }

				    reader_consumer_v2 make_interposer_consumer(reader_consumer_v2 end_consumer) override {

				    mutation_reader_consumer make_interposer_consumer(mutation_reader_consumer end_consumer) override {

				        return [end_consumer = std::move(end_consumer)] (mutation_reader reader) mutable -> future<> {

				            return mutation_writer::segregate_by_shard(std::move(reader), std::move(end_consumer));

				        };

				@@ -1910,7 +1933,11 @@ static future<compaction_result> scrub_sstables_validate_mode(sstables::compacti

				    using scrub = sstables::compaction_type_options::scrub;

				    if (validation_errors != 0 && descriptor.options.as<scrub>().quarantine_sstables == scrub::quarantine_invalid_sstables::yes) {

				        for (auto& sst : descriptor.sstables) {

				            co_await sst->change_state(sstables::sstable_state::quarantine);

				            try {

				                co_await sst->change_state(sstables::sstable_state::quarantine);

				            } catch (...) {

				                clogger.error("Moving {} to quarantine failed due to {}, continuing.", sst->get_filename(), std::current_exception());

				            }

				        }

				    }
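The last hunk above wraps each quarantine move in a try/catch so that one failing sstable no longer aborts the loop. A minimal standalone sketch of that "log and continue" pattern follows. `fake_sstable`, `quarantine_all` and `example_usage` are hypothetical names, the state change is synchronous rather than a coroutine, and the sketch catches `std::exception` where the diff catches `...`.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an sstable; only the name and the state change matter here.
struct fake_sstable {
    std::string filename;
    bool fail_on_quarantine = false;
    bool quarantined = false;

    void change_state_to_quarantine() {
        if (fail_on_quarantine) {
            throw std::runtime_error("rename failed");
        }
        quarantined = true;
    }
};

// Try to quarantine every sstable. A failure is logged and counted,
// and the loop moves on to the next sstable instead of propagating the error.
inline std::size_t quarantine_all(std::vector<fake_sstable>& ssts) {
    std::size_t failures = 0;
    for (auto& sst : ssts) {
        try {
            sst.change_state_to_quarantine();
        } catch (const std::exception& e) {
            std::cerr << "Moving " << sst.filename << " to quarantine failed due to "
                      << e.what() << ", continuing.\n";
            ++failures;
        }
    }
    return failures;
}

// Usage: the second sstable fails, the first and third are still quarantined.
inline void example_usage() {
    std::vector<fake_sstable> ssts{{"a", false}, {"b", true}, {"c", false}};
    std::size_t failed = quarantine_all(ssts);
    assert(failed == 1);
    assert(ssts[0].quarantined && ssts[2].quarantined);
}
```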

									
compaction/compaction.hh
												
				@@ -11,11 +11,14 @@

				#include "readers/combined_reader_stats.hh"

				#include "sstables/shared_sstable.hh"

				#include "sstables/generation_type.hh"

				#include "compaction/compaction_descriptor.hh"

				#include "mutation/mutation_tombstone_stats.hh"

				#include "gc_clock.hh"

				#include "utils/UUID.hh"

				#include "table_state.hh"

				#include <seastar/core/abort_source.hh>

				#include "sstables/basic_info.hh"

				using namespace compaction;

				@@ -72,6 +75,7 @@ struct compaction_data {

				};

				struct compaction_stats {

				    std::chrono::time_point<db_clock> started_at;

				    std::chrono::time_point<db_clock> ended_at;

				    uint64_t start_size = 0;

				    uint64_t end_size = 0;

				@@ -79,13 +83,16 @@ struct compaction_stats {

				    // Bloom filter checks during max purgeable calculation

				    uint64_t bloom_filter_checks = 0;

				    combined_reader_statistics reader_statistics;

				    tombstone_purge_stats tombstone_purge_stats;

				    compaction_stats& operator+=(const compaction_stats& r) {

				        started_at = std::max(started_at, r.started_at);

				        ended_at = std::max(ended_at, r.ended_at);

				        start_size += r.start_size;

				        end_size += r.end_size;

				        validation_errors += r.validation_errors;

				        bloom_filter_checks += r.bloom_filter_checks;

				        tombstone_purge_stats += r.tombstone_purge_stats;

				        return *this;

				    }

				    friend compaction_stats operator+(const compaction_stats& l, const compaction_stats& r) {

				@@ -96,6 +103,10 @@ struct compaction_stats {

				};

				struct compaction_result {

				    shard_id shard_id;

				    compaction_type type;

				    std::vector<sstables::basic_info> sstables_in;

				    std::vector<sstables::basic_info> sstables_out;

				    std::vector<sstables::shared_sstable> new_sstables;

				    compaction_stats stats;

				};

									
compaction/compaction_garbage_collector.hh
												
				@@ -22,7 +22,20 @@ using can_gc_fn = std::function<bool(tombstone, is_shadowable)>;

				extern can_gc_fn always_gc;

				extern can_gc_fn never_gc;

				using max_purgeable_fn = std::function<api::timestamp_type(const dht::decorated_key&, is_shadowable)>;

				struct max_purgeable {

				    enum class timestamp_source {

				        none,

				        memtable_possibly_shadowing_data,

				        other_sstables_possibly_shadowing_data

				    };

				    operator bool() const { return timestamp != api::missing_timestamp; }

				    api::timestamp_type timestamp { api::missing_timestamp };

				    timestamp_source source { timestamp_source::none };

				};

				using max_purgeable_fn = std::function<max_purgeable(const dht::decorated_key&, is_shadowable)>;

				extern max_purgeable_fn can_always_purge;

				extern max_purgeable_fn can_never_purge;
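The hunk above changes `max_purgeable_fn` to return a `max_purgeable` struct that carries both a timestamp bound and the source of that bound. The sketch below shows one way such a result could feed per-source failure counters like the `tombstone_purge_stats` fields referenced in `update_history()`. It rests on several assumptions: `timestamp_type` and `missing_timestamp` are local stand-ins for the `api::` definitions, `try_purge` is a hypothetical helper, and the rule "purgeable only when the tombstone timestamp is strictly below the bound" is assumed for illustration, not taken from the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Local stand-ins for api::timestamp_type and api::missing_timestamp (assumed values).
using timestamp_type = std::int64_t;
constexpr timestamp_type missing_timestamp = std::numeric_limits<timestamp_type>::min();

// Same shape as the struct added in the diff.
struct max_purgeable {
    enum class timestamp_source {
        none,
        memtable_possibly_shadowing_data,
        other_sstables_possibly_shadowing_data
    };
    operator bool() const { return timestamp != missing_timestamp; }
    timestamp_type timestamp { missing_timestamp };
    timestamp_source source { timestamp_source::none };
};

// Counter names follow the fields read in compaction_manager.cc.
struct tombstone_purge_stats {
    std::uint64_t attempts = 0;
    std::uint64_t failures_due_to_overlapping_with_memtable = 0;
    std::uint64_t failures_due_to_overlapping_with_uncompacting_sstable = 0;
};

// Hypothetical helper: record an attempt, and on failure attribute it to the
// source that produced the bound.
inline bool try_purge(timestamp_type tombstone_ts, const max_purgeable& mp, tombstone_purge_stats& stats) {
    ++stats.attempts;
    if (tombstone_ts < mp.timestamp) {
        return true;
    }
    switch (mp.source) {
    case max_purgeable::timestamp_source::memtable_possibly_shadowing_data:
        ++stats.failures_due_to_overlapping_with_memtable;
        break;
    case max_purgeable::timestamp_source::other_sstables_possibly_shadowing_data:
        ++stats.failures_due_to_overlapping_with_uncompacting_sstable;
        break;
    case max_purgeable::timestamp_source::none:
        break;
    }
    return false;
}
```

Carrying the source next to the timestamp lets the caller explain why a purge failed without a second lookup.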

									
compaction/compaction_manager.cc
												
				@@ -26,6 +26,7 @@

				#include "utils/assert.hh"

				#include "utils/error_injection.hh"

				#include "utils/UUID_gen.hh"

				#include "db/compaction_history_entry.hh"

				#include "db/system_keyspace.hh"

				#include "tombstone_gc-internals.hh"

				#include <cmath>

				@@ -385,7 +386,7 @@ future<sstables::compaction_result> compaction_task_executor::compact_sstables_a

				    sstables::compaction_result res = co_await compact_sstables(std::move(descriptor), cdata, on_replace, std::move(can_purge));

				    if (should_update_history) {

				        co_await update_history(*_compacting_table, res, cdata);

				        co_await update_history(*_compacting_table, sstables::compaction_result(res), cdata);

				    }

				    co_return res;

				@@ -395,7 +396,7 @@ future<sstables::sstable_set> compaction_task_executor::sstable_set_for_tombston

				    auto compound_set = t.sstable_set_for_tombstone_gc();

				    // Compound set will be linearized into a single set, since compaction might add or remove sstables

				    // to it for incremental compaction to work.

				    auto new_set = sstables::make_partitioned_sstable_set(t.schema(), false);

				    auto new_set = sstables::make_partitioned_sstable_set(t.schema(), t.token_range());

				    co_await compound_set->for_each_sstable_gently([&] (const sstables::shared_sstable& sst) {

				        auto inserted = new_set.insert(sst);

				        if (!inserted) {

				@@ -455,12 +456,11 @@ future<sstables::compaction_result> compaction_task_executor::compact_sstables(s

				    co_return co_await sstables::compact_sstables(std::move(descriptor), cdata, t, _progress_monitor);

				}

				future<> compaction_task_executor::update_history(table_state& t, const sstables::compaction_result& res, const sstables::compaction_data& cdata) {

				future<> compaction_task_executor::update_history(table_state& t, sstables::compaction_result&& res, const sstables::compaction_data& cdata) {

				    auto started_at = std::chrono::duration_cast<std::chrono::milliseconds>(res.stats.started_at.time_since_epoch());

				    auto ended_at = std::chrono::duration_cast<std::chrono::milliseconds>(res.stats.ended_at.time_since_epoch());

				    if (_cm._sys_ks) {

				        auto sys_ks = _cm._sys_ks; // hold pointer on sys_ks

				    if (auto sys_ks = _cm._sys_ks.get_permit()) {

				        co_await utils::get_local_injector().inject("update_history_wait", utils::wait_for_message(120s));

				        std::unordered_map<int32_t, int64_t> rows_merged;

				        for (size_t id=0; id<res.stats.reader_statistics.rows_merged_histogram.size(); ++id) {

				@@ -469,17 +469,32 @@ future<> compaction_task_executor::update_history(table_state& t, const sstables

				            }

				            rows_merged[id] = res.stats.reader_statistics.rows_merged_histogram[id];

				        }

				        co_await sys_ks->update_compaction_history(cdata.compaction_uuid, t.schema()->ks_name(), t.schema()->cf_name(),

				                ended_at.count(), res.stats.start_size, res.stats.end_size, std::move(rows_merged));

				        db::compaction_history_entry entry {

				            .id = cdata.compaction_uuid,

				            .shard_id = res.shard_id,

				            .ks = t.schema()->ks_name(),

				            .cf = t.schema()->cf_name(),

				            .compaction_type = fmt::to_string(res.type),

				            .started_at = started_at.count(),

				            .compacted_at = ended_at.count(),

				            .bytes_in = res.stats.start_size,

				            .bytes_out = res.stats.end_size,

				            .rows_merged = std::move(rows_merged),

				            .sstables_in = std::move(res.sstables_in),

				            .sstables_out = std::move(res.sstables_out),

				            .total_tombstone_purge_attempt = res.stats.tombstone_purge_stats.attempts,

				            .total_tombstone_purge_failure_due_to_overlapping_with_memtable = res.stats.tombstone_purge_stats.failures_due_to_overlapping_with_memtable,

				            .total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable = res.stats.tombstone_purge_stats.failures_due_to_overlapping_with_uncompacting_sstable,

				        };

				        co_await sys_ks->update_compaction_history(std::move(entry));

				    }

				}

				future<> compaction_manager::get_compaction_history(compaction_history_consumer&& f) {

				    if (!_sys_ks) {

				        return make_ready_future<>();

				    if (auto sys_ks = _sys_ks.get_permit()) {

				        co_await sys_ks->get_compaction_history(std::move(f));

				    }

				    return _sys_ks->get_compaction_history(std::move(f)).finally([s = _sys_ks] {});

				}

				template<std::derived_from<compaction::compaction_task_executor> Executor>

				@@ -924,6 +939,7 @@ public:

				compaction_manager::compaction_manager(config cfg, abort_source& as, tasks::task_manager& tm)

				    : _task_manager_module(make_shared<task_manager_module>(tm))

				    , _sys_ks("compaction_manager::system_keyspace")

				    , _cfg(std::move(cfg))

				    , _compaction_submission_timer(compaction_sg(), compaction_submission_callback())

				    , _compaction_controller(make_compaction_controller(compaction_sg(), static_shares(), [this] () -> float {

				@@ -960,6 +976,7 @@ compaction_manager::compaction_manager(config cfg, abort_source& as, tasks::task

				compaction_manager::compaction_manager(tasks::task_manager& tm)

				    : _task_manager_module(make_shared<task_manager_module>(tm))

				    , _sys_ks("compaction_manager::system_keyspace")

				    , _cfg(config{ .available_memory = 1 })

				    , _compaction_submission_timer(compaction_sg(), compaction_submission_callback())

				    , _compaction_controller(make_compaction_controller(compaction_sg(), 1, [] () -> float { return 1.0; }))

				@@ -1128,16 +1145,16 @@ future<> compaction_manager::drain() {

				        // Disable the state so that it can be enabled later if requested.

				        _state = state::disabled;

				    }

				    _compaction_submission_timer.cancel();

				    // Stop ongoing compactions, if the request has not been sent already and wait for them to stop.

				    co_await stop_ongoing_compactions("drain");

				    // Trigger a signal to properly exit from postponed_compactions_reevaluation() fiber

				    reevaluate_postponed_compactions();

				    cmlog.info("Drained");

				}

				future<> compaction_manager::stop() {

				    do_stop();

				    if (auto cm = std::exchange(_task_manager_module, nullptr)) {

				        co_await cm->stop();

				    }

				    if (_stop_future) {

				        co_await std::exchange(*_stop_future, make_ready_future());

				    }

				@@ -1148,16 +1165,18 @@ future<> compaction_manager::really_do_stop() noexcept {

				    // Reset the metrics registry

				    _metrics.clear();

				    co_await stop_ongoing_compactions("shutdown");

				    if (!_tasks.empty()) {

				        on_fatal_internal_error(cmlog, format("{} tasks still exist after being stopped", _tasks.size()));

				    }

				    co_await _task_manager_module->stop();

				    co_await coroutine::parallel_for_each(_compaction_state | std::views::values, [] (compaction_state& cs) -> future<> {

				        if (!cs.gate.is_closed()) {

				            co_await cs.gate.close();

				        }

				    });

				    if (!_tasks.empty()) {

				        on_fatal_internal_error(cmlog, format("{} tasks still exist after being stopped", _tasks.size()));

				    }

				    reevaluate_postponed_compactions();

				    co_await std::move(_waiting_reevalution);

				    co_await _sys_ks.close();

				    _weight_tracker.clear();

				    _compaction_submission_timer.cancel();

				    co_await _compaction_controller.shutdown();

				@@ -1318,7 +1337,7 @@ protected:

				                    // the weight earlier to remove unnecessary

				                    // serialization.

				                    weight_r.deregister();

				                    co_await update_history(*_compacting_table, res, _compaction_data);

				                    co_await update_history(*_compacting_table, std::move(res), _compaction_data);

				                }

				                _cm.reevaluate_postponed_compactions();

				                continue;

				@@ -1818,8 +1837,21 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sst

				    if (!gh) {

				        co_return compaction_stats_opt{};

				    }

				    // All sstables must be included, even the ones being compacted, such that everything in table is validated.

				    auto all_sstables = get_all_sstables(t);

				    // Collect and register all sstables as compacting while compaction is disabled, to avoid a race condition where

				    // regular compaction runs in between and picks the same files.

				    std::vector<sstables::shared_sstable> all_sstables;

				    compacting_sstable_registration compacting(*this, get_compaction_state(&t));

				    co_await run_with_compaction_disabled(t, [&all_sstables, &compacting, &t] () -> future<> {

				        // All sstables must be included.

				        all_sstables = get_all_sstables(t);

				        compacting.register_compacting(all_sstables);

				        return make_ready_future<>();

				    });

				    if (all_sstables.empty()) {

				        co_return compaction_stats_opt{};

				    }

				    co_return co_await perform_compaction<validate_sstables_compaction_task_executor>(throw_if_stopping::no, info, &t, info.id, std::move(all_sstables), quarantine_sstables);

				}

				@@ -1940,7 +1972,7 @@ bool needs_cleanup(const sstables::shared_sstable& sst,

				    dht::token_range sst_token_range = dht::token_range::make(first_token, last_token);

				    auto r = std::lower_bound(sorted_owned_ranges.begin(), sorted_owned_ranges.end(), first_token,

				            [] (const wrapping_interval<dht::token>& a, const dht::token& b) {

				            [] (const interval<dht::token>& a, const dht::token& b) {

				        // check that range a is before token b.

				        return a.after(b, dht::token_comparator());

				    });

				@@ -2178,7 +2210,8 @@ future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sst

				}

				compaction::compaction_state::compaction_state(table_state& t)

				    : backlog_tracker(t.get_compaction_strategy().make_backlog_tracker())

				    : gate(format("compaction_state for table {}.{}", t.schema()->ks_name(), t.schema()->cf_name()))

				    , backlog_tracker(t.get_compaction_strategy().make_backlog_tracker())

				{

				}

				@@ -2294,11 +2327,11 @@ strategy_control& compaction_manager::get_strategy_control() const noexcept {

				}

				void compaction_manager::plug_system_keyspace(db::system_keyspace& sys_ks) noexcept {

				    _sys_ks = sys_ks.shared_from_this();

				    _sys_ks.plug(sys_ks.shared_from_this());

				}

				void compaction_manager::unplug_system_keyspace() noexcept {

				    _sys_ks = nullptr;

				future<> compaction_manager::unplug_system_keyspace() noexcept {

				    co_await _sys_ks.unplug();

				}

				double compaction_backlog_tracker::backlog() const {
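Several hunks in this file replace the raw `seastar::shared_ptr<db::system_keyspace>` with a `utils::pluggable<db::system_keyspace>` that is accessed through `plug()`, `get_permit()`, `unplug()` and `close()`. The sketch below illustrates only the general idea of a permit that keeps the service alive while it is in use. It is a simplified, synchronous stand-in, not the real `utils::pluggable`: `pluggable_sketch` and `history_service` are hypothetical names, and unlike the futures-based `unplug()` in the diff, this version returns immediately without waiting for outstanding permits.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical service used only for this illustration.
struct history_service {
    int updates = 0;
    void update() { ++updates; }
};

// Simplified, single-threaded stand-in for a pluggable service handle.
template <typename T>
class pluggable_sketch {
    std::shared_ptr<T> _service;
public:
    // A permit holds its own reference, so the service outlives unplug()
    // for as long as the permit exists.
    class permit {
        std::shared_ptr<T> _ptr;
    public:
        explicit permit(std::shared_ptr<T> p) : _ptr(std::move(p)) {}
        explicit operator bool() const { return static_cast<bool>(_ptr); }
        T* operator->() const { return _ptr.get(); }
    };

    void plug(std::shared_ptr<T> s) { _service = std::move(s); }
    void unplug() { _service.reset(); }
    // Returns an empty permit when nothing is plugged in.
    permit get_permit() const { return permit(_service); }
};

// Usage mirrors the `if (auto sys_ks = _sys_ks.get_permit())` form in the diff.
inline int update_if_plugged(pluggable_sketch<history_service>& handle) {
    if (auto svc = handle.get_permit()) {
        svc->update();
        return 1;
    }
    return 0;
}
```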

									
compaction/compaction_manager.hh
												
				@@ -32,10 +32,11 @@

				#include "seastarx.hh"

				#include "sstables/exceptions.hh"

				#include "tombstone_gc.hh"

				#include "utils/pluggable.hh"

				namespace db {

				class system_keyspace;

				class compaction_history_entry;

				class system_keyspace;

				}

				namespace sstables { class test_env_compaction_manager; }

				@@ -138,7 +139,7 @@ private:

				    // being picked more than once.

				    seastar::named_semaphore _off_strategy_sem = {1, named_semaphore_exception_factory{"off-strategy compaction"}};

				    seastar::shared_ptr<db::system_keyspace> _sys_ks;

				    utils::pluggable<db::system_keyspace> _sys_ks;

				    std::function<void()> compaction_submission_callback();

				    // all registered tables are reevaluated at a constant interval.

				@@ -300,6 +301,11 @@ public:

				    // unless it is moved back to enabled state.

				    future<> drain();

				    // Check if compaction manager is running, i.e. it was enabled or drained

				    bool is_running() const noexcept {

				        return _state == state::enabled || _state == state::disabled;

				    }

				    using compaction_history_consumer = noncopyable_function<future<>(const db::compaction_history_entry&)>;

				    future<> get_compaction_history(compaction_history_consumer&& f);

				@@ -391,7 +397,7 @@ public:

				    future<> run_with_compaction_disabled(compaction::table_state& t, std::function<future<> ()> func);

				    void plug_system_keyspace(db::system_keyspace& sys_ks) noexcept;

				    void unplug_system_keyspace() noexcept;

				    future<> unplug_system_keyspace() noexcept;

				    // Adds a table to the compaction manager.

				    // Creates a compaction_state structure that can be used for submitting

				@@ -548,7 +554,7 @@ protected:

				    future<sstables::compaction_result> compact_sstables(sstables::compaction_descriptor descriptor, sstables::compaction_data& cdata, on_replacement&,

				                                compaction_manager::can_purge_tombstones can_purge = compaction_manager::can_purge_tombstones::yes,

				                                sstables::offstrategy offstrategy = sstables::offstrategy::no);

				    future<> update_history(::compaction::table_state& t, const sstables::compaction_result& res, const sstables::compaction_data& cdata);

				    future<> update_history(::compaction::table_state& t, sstables::compaction_result&& res, const sstables::compaction_data& cdata);

				    bool should_update_history(sstables::compaction_type ct) {

				        return ct == sstables::compaction_type::Compaction;

				    }

									
compaction/compaction_state.hh
												
				@@ -22,7 +22,7 @@ namespace compaction {

				struct compaction_state {

				    // Used both by compaction tasks that refer to the compaction_state

				    // and by any function running under run_with_compaction_disabled().

				    seastar::gate gate;

				    seastar::named_gate gate;

				    // Prevents table from running major and minor compaction at the same time.

				    seastar::rwlock lock;

									
compaction/compaction_strategy.cc
												
				@@ -77,7 +77,7 @@ uint64_t compaction_strategy_impl::adjust_partition_estimate(const mutation_sour

				    return partition_estimate;

				}

				reader_consumer_v2 compaction_strategy_impl::make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) const {

				mutation_reader_consumer compaction_strategy_impl::make_interposer_consumer(const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const {

				    return end_consumer;

				}

				@@ -741,7 +741,7 @@ uint64_t compaction_strategy::adjust_partition_estimate(const mutation_source_me

				    return _compaction_strategy_impl->adjust_partition_estimate(ms_meta, partition_estimate, std::move(schema));

				}

				reader_consumer_v2 compaction_strategy::make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) const {

				mutation_reader_consumer compaction_strategy::make_interposer_consumer(const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const {

				    return _compaction_strategy_impl->make_interposer_consumer(ms_meta, std::move(end_consumer));

				}

				@@ -789,8 +789,8 @@ future<reshape_config> make_reshape_config(const sstables::storage& storage, res

				    };

				}

				std::unique_ptr<sstable_set_impl> incremental_compaction_strategy::make_sstable_set(schema_ptr schema) const {

				    return std::make_unique<partitioned_sstable_set>(std::move(schema), false);

				std::unique_ptr<sstable_set_impl> incremental_compaction_strategy::make_sstable_set(const table_state& ts) const {

				    return std::make_unique<partitioned_sstable_set>(ts.schema(), ts.token_range());

				}

				}

									
compaction/compaction_strategy.hh
												
				@@ -105,13 +105,13 @@ public:

				        return name(type());

				    }

				    sstable_set make_sstable_set(schema_ptr schema) const;

				    sstable_set make_sstable_set(const table_state& ts) const;

				    compaction_backlog_tracker make_backlog_tracker() const;

				    uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr) const;

				    reader_consumer_v2 make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) const;

				    mutation_reader_consumer make_interposer_consumer(const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const;

				    // Returns whether or not interposer consumer is used by a given strategy.

				    bool use_interposer_consumer() const;

									
compaction/compaction_strategy_impl.hh
												
				@@ -56,7 +56,7 @@ public:

				        return true;

				    }

				    virtual int64_t estimated_pending_compactions(table_state& table_s) const = 0;

				    virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const;

				    virtual std::unique_ptr<sstable_set_impl> make_sstable_set(const table_state& ts) const;

				    bool use_clustering_key_filter() const {

				        return _use_clustering_key_filter;

				@@ -82,7 +82,7 @@ public:

				    /// @return A new functor that wraps the end consumer with additional processing capabilities

				    /// @note The returned functor preserves the original consumer's semantics while allowing

				    ///       preprocessing of data

				    virtual reader_consumer_v2 make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) const;

				    virtual mutation_reader_consumer make_interposer_consumer(const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const;

				    virtual bool use_interposer_consumer() const {

				        return false;

									
compaction/incremental_compaction_strategy.hh
												
				@@ -98,7 +98,7 @@ public:

				    virtual compaction_descriptor get_reshaping_job(std::vector<shared_sstable> input, schema_ptr schema, reshape_config cfg) const override;

				    virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const override;

				    virtual std::unique_ptr<sstable_set_impl> make_sstable_set(const table_state& ts) const override;

				    friend class ::incremental_backlog_tracker;

				};

									
compaction/leveled_compaction_strategy.hh
												
				@@ -70,7 +70,7 @@ public:

				    virtual compaction_strategy_type type() const override {

				        return compaction_strategy_type::leveled;

				    }

				    virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const override;

				    virtual std::unique_ptr<sstable_set_impl> make_sstable_set(const table_state& ts) const override;

				    virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() const override;

									
compaction/table_state.hh
												
				@@ -33,6 +33,7 @@ namespace compaction {

				class table_state {

				public:

				    virtual ~table_state() {}

				    virtual dht::token_range token_range() const noexcept = 0;

				    virtual const schema_ptr& schema() const noexcept = 0;

				    // min threshold as defined by table.

				    virtual unsigned min_compaction_threshold() const noexcept = 0;

									
compaction/time_window_compaction_strategy.cc
												
				@@ -208,7 +208,7 @@ uint64_t time_window_compaction_strategy::adjust_partition_estimate(const mutati

				    return partition_estimate / std::max(1UL, uint64_t(estimated_window_count));

				}

				reader_consumer_v2 time_window_compaction_strategy::make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) const {

				mutation_reader_consumer time_window_compaction_strategy::make_interposer_consumer(const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const {

				    if (ms_meta.min_timestamp && ms_meta.max_timestamp

				            && get_window_for(_options, *ms_meta.min_timestamp) == get_window_for(_options, *ms_meta.max_timestamp)) {

				        return end_consumer;

									
compaction/time_window_compaction_strategy.hh
												
				@@ -150,13 +150,13 @@ public:

				        return compaction_strategy_type::time_window;

				    }

				    virtual std::unique_ptr<sstable_set_impl> make_sstable_set(schema_ptr schema) const override;

				    virtual std::unique_ptr<sstable_set_impl> make_sstable_set(const table_state& ts) const override;

				    virtual std::unique_ptr<compaction_backlog_tracker::impl> make_backlog_tracker() const override;

				    virtual uint64_t adjust_partition_estimate(const mutation_source_metadata& ms_meta, uint64_t partition_estimate, schema_ptr s) const override;

				    virtual reader_consumer_v2 make_interposer_consumer(const mutation_source_metadata& ms_meta, reader_consumer_v2 end_consumer) const override;

				    virtual mutation_reader_consumer make_interposer_consumer(const mutation_source_metadata& ms_meta, mutation_reader_consumer end_consumer) const override;

				    virtual bool use_interposer_consumer() const override {

				        return true;

									
compound.hh
												
				@@ -255,6 +255,9 @@ public:

				    // Returns true iff given prefix has no missing components

				    bool is_full(managed_bytes_view v) const {

				        SCYLLA_ASSERT(AllowPrefixes == allow_prefixes::yes);

				        if (_types.size() == 0) {

				            return v.empty();

				        }

				        return std::distance(begin(v), end(v)) == (ssize_t)_types.size();

				    }

				    bool is_empty(managed_bytes_view v) const {

compress.cc

File diff suppressed because it is too large.

									
compress.hh
												
				@@ -10,17 +10,25 @@

				#include <map>

				#include <optional>

				#include <set>

				#include <seastar/core/future.hh>

				#include <seastar/core/shared_ptr.hh>

				#include <seastar/core/sstring.hh>

				#include <seastar/util/bool_class.hh>

				#include "seastarx.hh"

				class compression_parameters;

				class compressor {

				    sstring _name;

				public:

				    compressor(sstring);

				    enum class algorithm {

				        lz4,

				        lz4_with_dicts,

				        zstd,

				        zstd_with_dicts,

				        snappy,

				        deflate,

				        none,

				    };

				    virtual ~compressor() {}

				@@ -42,44 +50,38 @@ public:

				    virtual size_t compress_max_size(size_t input_len) const = 0;

				    /**

				     * Returns accepted option names for this compressor

				     */

				    virtual std::set<sstring> option_names() const;

				    /**

				     * Returns original options used in instantiating this compressor

				     * Returns metadata which must be written together with the compressed

				     * data and used to construct a corresponding decompressor.

				     */

				    virtual std::map<sstring, sstring> options() const;

				    /**

				     * Compressor class name.

				     */

				    const sstring& name() const {

				        return _name;

				    }

				    static bool is_hidden_option_name(std::string_view sv);

				    // to cheaply bridge sstable compression options / maps

				    using opt_string = std::optional<sstring>;

				    using opt_getter = std::function<opt_string(const sstring&)>;

				    using ptr_type = shared_ptr<compressor>;

				    std::string name() const;

				    static ptr_type create(const sstring& name, const opt_getter&);

				    static ptr_type create(const std::map<sstring, sstring>&);

				    virtual algorithm get_algorithm() const = 0;

				    static thread_local const ptr_type lz4;

				    static thread_local const ptr_type snappy;

				    static thread_local const ptr_type deflate;

				    virtual std::optional<unsigned> get_dict_owner_for_test() const;

				    static sstring make_name(std::string_view short_name);

				    using ptr_type = std::unique_ptr<compressor>;

				};

				template<typename BaseType, typename... Args>

				class class_registry;

				using compressor_ptr = compressor::ptr_type;

				using compressor_registry = class_registry<compressor, const typename compressor::opt_getter&>;

				compressor_ptr make_lz4_sstable_compressor_for_tests();

				// Per-table compression options, parsed and validated.

				//

				// Compression options are configured through the JSON-like `compression` entry in the schema.

				// The CQL layer parses the text of that entry to a `map<string, string>`.

				// A `compression_parameters` object is constructed from this map.

				// and the passed keys and values are parsed and validated in the constructor.

				// This object can be then used to create a `compressor` objects for sstable readers and writers.

				class compression_parameters {

				public:

				    using algorithm = compressor::algorithm;

				    static constexpr std::string_view name_prefix = "org.apache.cassandra.io.compress.";

				    static constexpr int32_t DEFAULT_CHUNK_LENGTH = 4 * 1024;

				    static constexpr double DEFAULT_CRC_CHECK_CHANCE = 1.0;

				@@ -88,26 +90,47 @@ public:

				    static const sstring CHUNK_LENGTH_KB_ERR;

				    static const sstring CRC_CHECK_CHANCE;

				private:

				    compressor_ptr _compressor;

				    algorithm _algorithm;

				    std::optional<int> _chunk_length;

				    std::optional<double> _crc_check_chance;

				    std::optional<int> _zstd_compression_level;

				public:

				    compression_parameters();

				    compression_parameters(compressor_ptr);

				    compression_parameters(algorithm);

				    compression_parameters(const std::map<sstring, sstring>& options);

				    ~compression_parameters();

				    compressor_ptr get_compressor() const { return _compressor; }

				    int32_t chunk_length() const { return _chunk_length.value_or(int(DEFAULT_CHUNK_LENGTH)); }

				    double crc_check_chance() const { return _crc_check_chance.value_or(double(DEFAULT_CRC_CHECK_CHANCE)); }

				    algorithm get_algorithm() const { return _algorithm; }

				    std::optional<int> zstd_compression_level() const { return _zstd_compression_level; }

				    using dicts_feature_enabled = bool_class<struct dicts_feature_enabled_tag>;

				    using dicts_usage_allowed = bool_class<struct dicts_usage_allowed_tag>;

				    void validate(dicts_feature_enabled, dicts_usage_allowed) const;

				    void validate();

				    std::map<sstring, sstring> get_options() const;

				    bool operator==(const compression_parameters& other) const;

				    static compression_parameters no_compression() {

				        return compression_parameters(nullptr);

				    bool compression_enabled() const { 

				        return _algorithm != algorithm::none;

				    }

				    static compression_parameters no_compression() {

				        return compression_parameters(algorithm::none);

				    }

				    bool operator==(const compression_parameters&) const = default;

				    static std::string_view algorithm_to_name(algorithm);

				    static std::string algorithm_to_qualified_name(algorithm);

				private:

				    void validate_options(const std::map<sstring, sstring>&);

				    static void validate_options(const std::map<sstring, sstring>&);

				    static algorithm name_to_algorithm(std::string_view name);

				};

				// Stream operator for boost::program_options support

				std::istream& operator>>(std::istream& is, compression_parameters& cp);

				template <>

				struct fmt::formatter<compression_parameters> : fmt::formatter<std::string_view> {

				    auto format(const compression_parameters& cp, fmt::format_context& ctx) const -> decltype(ctx.out()) {

				        return fmt::format_to(ctx.out(), "{}", cp.get_options());

				    }

				};
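
The header comment above describes the flow from a `map<string, string>` of options to a validated parameters object with defaulted getters. A minimal sketch of that flow follows. It is not the ScyllaDB implementation: the type `compression_options_sketch`, the function `parse_compression_options`, and the option key strings `sstable_compression`, `chunk_length_in_kb`, and `crc_check_chance` are assumptions made for illustration. Only the name prefix and the two default values are taken from the header.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical, simplified stand-in for compression_parameters.
struct compression_options_sketch {
    // These three constants mirror the values shown in the header.
    static constexpr std::string_view name_prefix = "org.apache.cassandra.io.compress.";
    static constexpr int default_chunk_length = 4 * 1024;
    static constexpr double default_crc_check_chance = 1.0;

    std::string algorithm;                    // short class name; empty means no compression
    std::optional<int> chunk_length;          // bytes; unset means "use the default"
    std::optional<double> crc_check_chance;   // unset means "use the default"

    // Getters fall back to the defaults, like the value_or() calls in the header.
    int effective_chunk_length() const { return chunk_length.value_or(default_chunk_length); }
    double effective_crc_check_chance() const { return crc_check_chance.value_or(default_crc_check_chance); }
    bool compression_enabled() const { return !algorithm.empty(); }
};

// Parses and validates the options in one pass, rejecting unknown keys.
// The key strings are assumed; the header shows only the constant names.
inline compression_options_sketch parse_compression_options(
        const std::map<std::string, std::string>& options) {
    constexpr std::string_view prefix = compression_options_sketch::name_prefix;
    compression_options_sketch p;
    for (const auto& [key, value] : options) {
        if (key == "sstable_compression") {
            std::string_view name = value;
            // Accept both the qualified and the short class name.
            if (name.substr(0, prefix.size()) == prefix) {
                name.remove_prefix(prefix.size());
            }
            p.algorithm = std::string(name);
        } else if (key == "chunk_length_in_kb") {
            const int kb = std::stoi(value);
            if (kb <= 0) {
                throw std::invalid_argument("chunk_length_in_kb must be positive");
            }
            p.chunk_length = kb * 1024;
        } else if (key == "crc_check_chance") {
            const double chance = std::stod(value);
            if (chance < 0.0 || chance > 1.0) {
                throw std::invalid_argument("crc_check_chance must be within [0, 1]");
            }
            p.crc_check_chance = chance;
        } else {
            throw std::invalid_argument("unknown compression option: " + key);
        }
    }
    return p;
}
```

Keeping the parsed values in `std::optional` members lets the sketch tell "not configured" apart from "configured to the default value", which is the same distinction the `_chunk_length` and `_crc_check_chance` members in the diff appear to preserve.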

									
										37

conf/scylla.yaml
									
												
				@@ -825,7 +825,9 @@ maintenance_socket: ignore

				# Guardrail to enable the deprecated feature of CREATE TABLE WITH COMPACT STORAGE.

				# enable_create_table_with_compact_storage: false

				# Enable tablets for new keyspaces.

				# Control tablets for new keyspaces.

				# Can be set to: disabled|enabled

				#

				# When enabled, newly created keyspaces will have tablets enabled by default.

				# That can be explicitly disabled in the CREATE KEYSPACE query

				# by using the `tablets = {'enabled': false}` replication option.

				@@ -834,6 +836,37 @@ maintenance_socket: ignore

				# unless tablets are explicitly enabled in the CREATE KEYSPACE query

				# by using the `tablets = {'enabled': true}` replication option.

				#

				# When set to `enforced`, newly created keyspaces will always have tablets enabled by default.

				# This prevents explicitly disabling tablets in the CREATE KEYSPACE query

				# using the `tablets = {'enabled': false}` replication option.

				# It also mandates a replication strategy supporting tablets, like

				# NetworkTopologyStrategy

				#

				# Note that creating keyspaces with tablets enabled or disabled is irreversible.

				# The `tablets` option cannot be changed using `ALTER KEYSPACE`.

				enable_tablets: true

				tablets_mode_for_new_keyspaces: enabled
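
The comments above describe three modes and how each one interacts with an explicit `tablets = {'enabled': ...}` option in `CREATE KEYSPACE`. The decision can be sketched as a small function. The names `tablets_mode` and `keyspace_uses_tablets` are hypothetical, and rejecting an explicit opt-out in `enforced` mode with an exception is an assumption about how "prevents explicitly disabling" might surface.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>

// Hypothetical mirror of the three values of tablets_mode_for_new_keyspaces.
enum class tablets_mode { disabled, enabled, enforced };

// explicit_option holds the value of `tablets = {'enabled': ...}` from the
// CREATE KEYSPACE statement, or std::nullopt if the option was not given.
inline bool keyspace_uses_tablets(tablets_mode mode, std::optional<bool> explicit_option) {
    switch (mode) {
    case tablets_mode::disabled:
        // Off by default, unless tablets are explicitly enabled.
        return explicit_option.value_or(false);
    case tablets_mode::enabled:
        // On by default, but can be explicitly disabled.
        return explicit_option.value_or(true);
    case tablets_mode::enforced:
        // Always on; an explicit opt-out is rejected (assumed behaviour).
        if (explicit_option.has_value() && !*explicit_option) {
            throw std::invalid_argument("tablets cannot be disabled in enforced mode");
        }
        return true;
    }
    throw std::logic_error("unhandled tablets_mode value");
}
```

Because the `tablets` option cannot be changed later with `ALTER KEYSPACE`, the result of this decision is fixed for the lifetime of the keyspace.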

				# Enforce RF-rack-valid keyspaces.

				rf_rack_valid_keyspaces: false

				#

				# Alternator options

				#

				# Maximum number of items in single BatchWriteItem command. Default is 100.

				# Note: DynamoDB has a hard-coded limit of 25.

				# alternator_max_items_in_batch_write: 100

				# 

				# io-streaming rate limiting

				# When setting this value to be non-zero scylla throttles disk throughput for

				# stream (network) activities such as backup, repair, tablet migration and more.

				# This limit is useful for user queries so the network interface does 

				# not get saturated by streaming activities.

				# The recommended value is 75% of network bandwidth

				# E.g. for i4i.8xlarge (https://github.com/scylladb/scylla-machine-image/tree/next/common/aws_net_params.json):

				# network: 18.75 Gbit/s --> 18750 Mbit/s --> 1875 MB/s (from network bits to network bytes: divide by 10, not 8)

				# Converted to disk bytes: 1875 * 1000 / 1024 = 1831 MB/s (disk wise)

				# 75% of disk bytes is: 0.75 * 1831 = 1373 megabytes/s

				# stream_io_throughput_mb_per_sec: 1373

				#
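
The arithmetic in the comment above can be reproduced with a short helper. This is only a sketch of the calculation as the comment states it; the function name `recommended_stream_io_throughput_mb_per_sec` is hypothetical and not part of ScyllaDB.

```cpp
#include <cassert>
#include <cmath>

// Follows the steps from the scylla.yaml comment:
//   1. network megabits/s -> network megabytes/s: divide by 10 (not 8)
//   2. network megabytes  -> disk megabytes: multiply by 1000 / 1024
//   3. take the recommended fraction (75% by default) and round down
inline int recommended_stream_io_throughput_mb_per_sec(double network_mbit_per_sec,
                                                       double fraction = 0.75) {
    const double network_mb_per_sec = network_mbit_per_sec / 10.0;
    const double disk_mb_per_sec = network_mb_per_sec * 1000.0 / 1024.0;
    return static_cast<int>(std::floor(fraction * disk_mb_per_sec));
}
```

For the i4i.8xlarge figure of 18750 megabits per second, the helper returns 1373, matching the commented-out `stream_io_throughput_mb_per_sec: 1373`.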

Compare commits

1863 Commits auto-backp ... next-2025.

14 .github/CODEOWNERS vendored
97 .github/ISSUE_TEMPLATE/bug_report.yml vendored
50 .github/scripts/auto-backport.py vendored
16 .github/seastar-bad-include.json vendored Normal file
2 .github/workflows/backport-pr-fixes-validation.yaml vendored
53 .github/workflows/call_backport_with_jira.yaml vendored Normal file
133 .github/workflows/conflict_reminder.yaml vendored
24 .github/workflows/iwyu.yaml vendored
7 .github/workflows/make-pr-ready-for-review.yaml vendored
2 .github/workflows/pr-require-backport-label.yaml vendored
5 .gitmodules vendored
15 CMakeLists.txt
25 HACKING.md
2 SCYLLA-VERSION-GEN
11 alternator/consumed_capacity.cc
6 alternator/consumed_capacity.hh
1124 alternator/executor.cc
57 alternator/executor.hh
24 alternator/expressions.cc
12 alternator/expressions.g
2 alternator/expressions.hh
3 alternator/rmw_operation.hh
45 alternator/server.cc
2 alternator/server.hh
153 alternator/stats.cc
18 alternator/stats.hh
15 alternator/streams.cc
111 alternator/ttl.cc
58 api/api-doc/compaction_manager.json
8 api/api-doc/gossiper.json
164 api/api-doc/storage_service.json
8 api/api-doc/tasks.json
27 api/api.cc
83 api/api.hh
47 api/column_family.cc
25 api/compaction_manager.cc
2 api/config.cc
24 api/failure_detector.cc
24 api/gossiper.cc
2 api/messaging_service.cc
2 api/service_levels.cc
399 api/storage_service.cc
10 api/storage_service.hh
96 api/tasks.cc
8 api/tasks.hh
22 api/token_metadata.cc
9 audit/audit.cc
63 audit/audit_syslog_storage_helper.cc
3 audit/audit_syslog_storage_helper.hh
4 auth/allow_all_authenticator.cc
3 auth/allow_all_authenticator.hh
5 auth/certificate_authenticator.cc
3 auth/certificate_authenticator.hh
5 auth/common.cc
3 auth/common.hh
19 auth/ldap_role_manager.cc
8 auth/ldap_role_manager.hh
8 auth/maintenance_socket_role_manager.cc
8 auth/maintenance_socket_role_manager.hh
108 auth/password_authenticator.cc
14 auth/password_authenticator.hh
14 auth/passwords.cc
20 auth/passwords.hh
13 auth/role_manager.hh
5 auth/saslauthd_authenticator.cc
3 auth/saslauthd_authenticator.hh
14 auth/service.cc
4 auth/service.hh
129 auth/standard_role_manager.cc
13 auth/standard_role_manager.hh
7 auth/transitional.cc
5 bytes.hh
5 cdc/cdc_extension.hh
31 cdc/generation.cc
9 cdc/generation_service.hh
76 cdc/log.cc
2 cmake/mode.Coverage.cmake
2 cmake/mode.Debug.cmake


2

cmake/mode.Dev.cmake

View File