Commit Graph

48669 Commits

Author SHA1 Message Date
Yaron Kaikov
f5a2fcab72 Add JIRA issue validation to backport PR fixes check
Extend the Fixes validation pattern to also accept JIRA issue references
(format: [A-Z]+-\d+) in addition to GitHub issue references. This allows
backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'.
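
A hypothetical sketch of the combined check; the real validation lives in the repository's CI tooling, and the helper name and the GitHub-side pattern below are illustrative:

```c++
#include <regex>
#include <string>

// Hypothetical helper, not the actual CI code: accept a Fixes:
// reference if it is a GitHub issue reference or a JIRA issue key.
bool is_valid_fixes_reference(const std::string& ref) {
    static const std::regex github_re(R"(#\d+|https://github\.com/[^\s]+/issues/\d+)");
    static const std::regex jira_re(R"([A-Z]+-\d+)");   // e.g. 'PROJECT-123'
    return std::regex_match(ref, github_re) || std::regex_match(ref, jira_re);
}
```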

Fixes: https://github.com/scylladb/scylladb/issues/27571

Closes scylladb/scylladb#27572

(cherry picked from commit 3dfa5ebd7f)

Closes scylladb/scylladb#27599
2025-12-12 09:35:49 +02:00
Jenkins Promoter
2fe49ce031 Update ScyllaDB version to: 2025.3.6 2025-12-10 10:46:07 +02:00
Anna Stuchlik
f9d19cab8a replace the Driver pages with a link to the new Drivers pages
This commit removes the now-redundant driver pages from
the ScyllaDB documentation and adds links to the new pages
where the driver information now lives.
The links are also updated across the ScyllaDB manual.

Redirections are added for all the removed pages.

Fixes https://github.com/scylladb/scylladb/issues/26871

Closes scylladb/scylladb#27277

(cherry picked from commit c5580399a8)

Closes scylladb/scylladb#27440
2025-12-10 09:26:11 +01:00
Tomasz Grabiec
750e9da1e8 Merge '[Backport 2025.3] tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true' from Scylladb[bot]
Greatly improves the performance of plan making, because we no longer
consider candidates in other racks, most of which would fail to be
selected due to replication constraints (no rack overload). It also
slightly reduces the overhead of candidate evaluation, since we don't
have to evaluate rack load.

Enabled only for rf_rack_valid_keyspaces, because such setups guarantee
that we will not need to (and indeed must not) move tablets across racks,
so we don't need to execute the general algorithm for the whole DC.
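
A minimal sketch of the idea, with illustrative types and helpers (not Scylla's actual API):

```c++
#include <cstddef>
#include <vector>

// Illustrative types, not Scylla's real ones.
struct migration_plan {
    size_t size = 0;
    void merge(const migration_plan& other) { size += other.size; }
};
struct rack_state {};
struct dc_state {
    bool rf_rack_valid_keyspaces = false;
    std::vector<rack_state> racks;
};

migration_plan make_rack_plan(const rack_state&) { return {}; }   // balance one rack in isolation
migration_plan make_general_plan(const dc_state&) { return {}; }  // pre-patch DC-wide algorithm

// When every keyspace is rf-rack-valid, tablets never move across racks,
// so each rack is balanced independently and the per-rack plans are
// concatenated; candidate evaluation never has to leave a rack.
migration_plan make_dc_plan(const dc_state& dc) {
    migration_plan plan;
    if (dc.rf_rack_valid_keyspaces) {
        for (const auto& rack : dc.racks) {
            plan.merge(make_rack_plan(rack));
        }
    } else {
        plan = make_general_plan(dc);
    }
    return plan;
}
```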

Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes with 88 shards each,
2 racks, RF=2, 70 tables, 256 tablets per table. Scale-out adds 6 new
nodes (same shard count). Time to rebalance the cluster (plan making
only, sum of all iterations, no streaming):

Before:  16 min 25 s
After:    0 min 25 s

Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):

  testlog - Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
  testlog - Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)

The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.

After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls.

Fixes #26016

- (cherry picked from commit c9f0a9d0eb)

- (cherry picked from commit 0dcaaa061e)

- (cherry picked from commit 2b03a69065)

Parent PR: #26017

Closes scylladb/scylladb#26218

* github.com:scylladb/scylladb:
  test: perf: perf-load-balancing: Add parallel-scaleout scenario
  test: perf: perf-load-balancing: Convert to tool_app_template
  tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true
  load_balancer: include dead nodes when calculating rack load
2025-12-09 23:54:20 +01:00
Tomasz Grabiec
ebc07c360f test: perf: perf-load-balancing: Add parallel-scaleout scenario
Simulates rebalancing during a single scale-out involving the
simultaneous addition of multiple nodes per rack.

Default parameters create a cluster with 2 racks, 70 tables, 256
tablets/table, 10 nodes, 88 shards/node.
Adds 6 nodes in parallel (3 per rack).

Current result on my laptop:

  testlog - Rebalance took 21.874 [s] after 82 iteration(s)

(cherry picked from commit 2b03a69065)
2025-12-09 14:04:19 +01:00
Tomasz Grabiec
b7db86611c test: perf: perf-load-balancing: Convert to tool_app_template
To support sub-commands for testing different scenarios.

The current scenario is given the name "rolling-add-dec".

(cherry picked from commit 0dcaaa061e)
2025-12-09 14:04:19 +01:00
Tomasz Grabiec
37824ec021 tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true
Greatly improves the performance of plan making, because we no longer
consider candidates in other racks, most of which would fail to be
selected due to replication constraints (no rack overload). It also
slightly reduces the overhead of candidate evaluation, since we don't
have to evaluate rack load.

Enabled only for rf_rack_valid_keyspaces, because such setups guarantee
that we will not need to (and indeed must not) move tablets across racks,
so we don't need to execute the general algorithm for the whole DC.

Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes with 88 shards each,
2 racks, RF=2, 70 tables, 256 tablets per table. Scale-out adds 6 new
nodes (same shard count). Time to rebalance the cluster (plan making
only, sum of all iterations, no streaming):

Before: 16 min 25 s
After: 0 min 25 s

Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):

  Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
  Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)

The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.

After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making).
The plan is twice as large because it combines the output of two subsequent (pre-patch)
plan-making calls.

Fixes #26016

(cherry picked from commit c9f0a9d0eb)
2025-12-09 14:04:19 +01:00
Wojciech Mitros
ef250e58dd load_balancer: include dead nodes when calculating rack load
The load balancer aims to preserve a balance in rack loads when generating
tablet migrations. However, this balance might get broken when dead nodes
are present. Currently, these nodes aren't included in rack load calculations,
even if they own tablet replicas. As a result, the load balancer treats racks
with dead nodes as racks with a lower load, so it generates migrations to these
racks.

This is incorrect, because a dead node might come back alive, which would result
in having multiple tablet replicas on the same rack. It's also inefficient
even if we know that the node won't come back - when it's being replaced or
removed. In that case we know we are going to rebuild the lost tablet replicas,
so migrating tablets to this rack just doubles the work. Allowing such migrations
to happen would also require adjustments in the materialized view pairing code,
because we'd temporarily allow having multiple tablet replicas on the same rack.

So in this patch we include dead nodes when calculating rack loads in the load
balancer (see the sketch below). The dead nodes still aren't treated as potential
migration sources or destinations.

We also add a test which verifies that no migrations are performed, by doing a
node replace with an MV workload in parallel. Before the patch, we'd get pairing
errors, and after the patch, no pairing errors are detected.
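
A minimal sketch of the two rules, with illustrative types (not Scylla's real ones):

```c++
#include <cstddef>
#include <vector>

// Illustrative node state, not Scylla's real type.
struct node_state {
    bool is_alive = true;
    size_t tablet_count = 0;
};

size_t rack_load(const std::vector<node_state>& rack_nodes) {
    size_t load = 0;
    for (const auto& n : rack_nodes) {
        load += n.tablet_count;   // counted whether the node is alive or dead
    }
    return load;
}

bool is_migration_endpoint(const node_state& n) {
    return n.is_alive;            // dead nodes still can't send or receive tablets
}
```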

Fixes https://github.com/scylladb/scylladb/issues/24485

Closes scylladb/scylladb#26028
2025-12-09 14:04:19 +01:00
Tomasz Grabiec
4acf082686 Merge '[Backport 2025.3] address_map: Use more efficient and reliable replication method' from Scylladb[bot]
The primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated, because we update mapping on each change of gossip
states. This made bootstrap impossible because nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
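
A minimal sketch of the batching idea, with illustrative types (the real mechanism is the helper introduced in 'utils: Introduce helper for replicated data structures' below):

```c++
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <utility>
#include <vector>

// Illustrative update type, not the real address_map entry.
struct entry { /* host id -> IP mapping update */ };

void apply_locally(const entry&) { /* update this shard's replica */ }

// One cross-shard call replicates the whole accumulated backlog,
// instead of one submit_to() per individual update.
seastar::future<> replicate_pending(unsigned shard, std::vector<entry> batch) {
    return seastar::smp::submit_to(shard, [batch = std::move(batch)] {
        for (const auto& e : batch) {
            apply_locally(e);
        }
    });
}
```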

It is also more reliable: if replication fails transiently for some
reason, we don't give up and fail all later updates.

Fixes #26865

- (cherry picked from commit ed8d127457)

- (cherry picked from commit 4a85ea8eb2)

- (cherry picked from commit f83c4ffc68)

Parent PR: #26941

Closes scylladb/scylladb#27188

* github.com:scylladb/scylladb:
  address_map: Use barrier() to wait for replication
  address_map: Use more efficient and reliable replication method
  utils: Introduce helper for replicated data structures
  utils: add "fatal" version of utils::on_internal_error()
2025-12-05 13:23:19 +01:00
Avi Kivity
3f343d70e4 database: fix overflow when computing data distribution over shards
We store the per-shard chunk count in a uint64_t vector
global_offset, and then convert the counts to offsets with
a prefix sum:

```c++
        // [1, 2, 3, 0] --> [0, 1, 3, 6]
        std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus());
```

However, std::exclusive_scan takes the accumulator type from the
initial value, 0, which is an int, instead of from the range being
iterated, which is uint64_t.

As a result, the prefix sum is computed as a 32-bit integer value. If
it exceeds 0x8000'0000, it becomes negative. It is then extended to
64 bits and stored. The result is a huge 64-bit number. Later on
we try to find an sstable with this chunk and fail, crashing on
an assertion.

An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57

The fix is simple: the initial value is passed as uint64_t instead of int.
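
For reference, the corrected call (the exact spelling of the 64-bit initial value is an assumption):

```c++
// With a uint64_t initial value the accumulator is 64-bit, so the
// prefix sum can no longer wrap around at 0x8000'0000.
std::exclusive_scan(global_offset.begin(), global_offset.end(),
                    global_offset.begin(), uint64_t(0), std::plus());
```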

Fixes https://github.com/scylladb/scylladb/issues/27417

Closes scylladb/scylladb#27418

(cherry picked from commit 9696ee64d0)
scylla-2025.3.5-candidate-20251205050220 scylla-2025.3.5
2025-12-04 20:18:13 +02:00
Tomasz Grabiec
fed0f95626 address_map: Use barrier() to wait for replication
More efficient than 100 pings.

There was one ping in the test which was done "so this shard notices the
clock advance". It's not necessary, since observing a completed SMP
call implies that the local shard sees the clock advancement done within it.

(cherry picked from commit f83c4ffc68)
2025-12-04 14:50:31 +01:00
Tomasz Grabiec
eba97f80e0 address_map: Use more efficient and reliable replication method
The primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls that
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.

It is also more reliable: if replication fails transiently for some
reason, we don't give up and fail all later updates.

Fixes #26865
Fixes #26835

(cherry picked from commit 4a85ea8eb2)
2025-12-04 14:50:31 +01:00
Tomasz Grabiec
777b54f072 utils: Introduce helper for replicated data structures
Key goals:
  - efficient (batching updates)
  - reliable (no lost updates)

Will be used in data structures maintained on one designated owning
shard and replicated to other shards.

(cherry picked from commit ed8d127457)
2025-12-04 14:50:31 +01:00
Nadav Har'El
6562e844f8 utils: add "fatal" version of utils::on_internal_error()
utils::on_internal_error() is a wrapper for Seastar's on_internal_error()
which does not require a logger parameter - because it always uses one
logger ("on_internal_error"). Not needing a unique logger is especially
important when using on_internal_error() in a header file, where we
can't define a logger.

Seastar also has another similar function, on_fatal_internal_error(),
for which we forgot to implement a "utils" version (without a logger
parameter). This patch fixes that oversight.
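
A sketch of the assumed shape of the new wrapper (the exact signature in the patch may differ):

```c++
#include <seastar/core/on_internal_error.hh>
#include <seastar/util/log.hh>
#include <string_view>

namespace utils {

// Assumed shape, mirroring the existing utils::on_internal_error():
// one shared logger, so the function is usable from header files
// that cannot define a logger of their own.
[[noreturn]] inline void on_fatal_internal_error(std::string_view msg) {
    static seastar::logger errlog("on_internal_error");
    seastar::on_fatal_internal_error(errlog, msg);
}

} // namespace utils
```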

In the next patch, we need to use on_fatal_internal_error() in a header
file, so the "utils" version will be useful. We will need the fatal
version because we will encounter an unexpected situation during server
destruction, and if we let the regular on_internal_error() just throw
an exception, we'll be left in an undefined state.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 33476c7b06)
2025-12-04 14:50:31 +01:00
Pavel Emelyanov
1d9e0c17a6 Update seastar submodule (SIGABRT on assertion)
* seastar 4431d974f...f61814a48 (1):
  > util: make SEASTAR_ASSERT() failure generate SIGABRT

Fixes #27127

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27403
2025-12-04 13:00:30 +03:00
Ernest Zaslavsky
aa8495d465 s3_client: handle additional transient network errors
Add handling for a broader set of transient network-related `std::errc` values in `aws_error::from_system_error`. Treat these conditions as retryable, since the client re-creates the socket for each request.
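
A sketch of the kind of mapping involved; the exact set of `std::errc` values handled by the patch is an assumption here:

```c++
#include <system_error>

// Illustrative predicate: transient network errors are retryable,
// since the client opens a fresh socket for every request anyway.
bool is_retryable_network_error(std::errc e) {
    switch (e) {
    case std::errc::connection_refused:
    case std::errc::connection_reset:
    case std::errc::connection_aborted:
    case std::errc::broken_pipe:
    case std::errc::timed_out:
    case std::errc::host_unreachable:
    case std::errc::network_unreachable:
        return true;
    default:
        return false;
    }
}
```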

Fixes: https://github.com/scylladb/scylladb/issues/27349

Closes scylladb/scylladb#27350

(cherry picked from commit 605f71d074)

Closes scylladb/scylladb#27390
scylla-2025.3.5-candidate-20251203022600
2025-12-03 12:25:15 +03:00
Calle Wilund
7eb51568e9 commitlog::read_log_file: Check for eof position on all data reads
Fixes #24346

When reading, we check, for each entry and each chunk, whether advancing
there will hit the EOF of the segment. However, IFF the last chunk being
read has its last entry _exactly_ matching the chunk size, and the chunk
ends _exactly_ at the segment size (preset, typically 32 MB), we did not
check the position, and instead complained about not being able to read.

This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.

The fix is simple: just check the file position against the size when
advancing said position, i.e. when reading (skipping already does).
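
A minimal sketch of the check, with assumed names:

```c++
#include <cstdint>

// Advancing the read position is now compared against the segment's
// size, so a chunk whose last entry ends exactly at the preset segment
// size is treated as a clean EOF rather than a failed read.
bool at_segment_eof(uint64_t next_read_pos, uint64_t segment_size) {
    return next_read_pos >= segment_size;   // checked on reads, not only on skips
}
```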

v2:

* Added unit test

Closes scylladb/scylladb#27236

(cherry picked from commit 59c87025d1)

Closes scylladb/scylladb#27343
2025-12-03 12:25:02 +03:00
Pavel Emelyanov
232cbe2f69 Merge '[Backport 2025.3] tablet: scheduler: Do not emit conflicting migration in merge colocation' from Scylladb[bot]
The tablet scheduler should not emit conflicting migrations for the same tablet. This was addressed initially in scylladb/scylladb#26038 but the check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a conflicting migration for a tablet that is already scheduled for migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer generates two migrations for a single tablet, both will be written as mutations, and the resulting mutation could contain mixed cells from both migrations.

Fixes scylladb/scylladb#27304

Backport to existing releases: this is a bug that can affect correctness.

- (cherry picked from commit 97b7c03709)

Parent PR: #27312

Closes scylladb/scylladb#27330

* github.com:scylladb/scylladb:
  tablet: scheduler: Do not emit conflicting migration in merge colocation
  tablet: scheduler: Do not emit conflicting migrations in the plan
2025-12-03 12:24:48 +03:00
Aleksandra Martyniuk
8a3932e4d9 replica: database: change type of tables_metadata::_ks_cf_to_uuid
If there are a lot of tables, a node reports an oversized allocation
in _ks_cf_to_uuid, which is of type flat_hash_map.

Change the type to std::unordered_map to prevent oversized allocations.
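
A sketch of the change, with assumed key, hash, and value types:

```c++
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// Assumed key/value/hash shapes, not Scylla's real ones. A flat hash
// map keeps all slots in one contiguous buffer, which becomes an
// oversized allocation once there are many tables; std::unordered_map
// allocates per element, so no single huge buffer is needed.
using ks_cf = std::pair<std::string, std::string>;   // (keyspace, table)
struct ks_cf_hash {
    size_t operator()(const ks_cf& k) const {
        return std::hash<std::string>{}(k.first) ^ (std::hash<std::string>{}(k.second) << 1);
    }
};
struct table_id { /* a UUID in the real code */ };

// before (conceptually): absl::flat_hash_map<ks_cf, table_id, ks_cf_hash>
std::unordered_map<ks_cf, table_id, ks_cf_hash> _ks_cf_to_uuid;
```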

Fixes: https://github.com/scylladb/scylladb/issues/26787

Closes scylladb/scylladb#27165

(cherry picked from commit 19a7d8e248)

Closes scylladb/scylladb#27198
2025-12-03 12:24:29 +03:00
Ernest Zaslavsky
99a51cf695 streaming: add more logging
Start logging all previously missed streaming options, such as the `scope`, `primary_replica`, and `skip_reshape` flags.

Fixes: https://github.com/scylladb/scylladb/issues/27299

Closes scylladb/scylladb#27311

(cherry picked from commit 1d5f60baac)

Closes scylladb/scylladb#27341
2025-12-02 12:13:21 +01:00
Jenkins Promoter
a99b7020dd Update pgo profiles - aarch64 2025-12-01 05:12:52 +02:00
Jenkins Promoter
8456b9520b Update pgo profiles - x86_64 2025-12-01 04:35:55 +02:00
Michael Litvak
4ac402c5c5 tablet: scheduler: Do not emit conflicting migration in merge colocation
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.

Fixes scylladb/scylladb#27304

Closes scylladb/scylladb#27312

(cherry picked from commit 97b7c03709)
2025-11-30 10:13:42 +01:00
Tomasz Grabiec
951d3f50ea tablet: scheduler: Do not emit conflicting migrations in the plan
Plan-making is invoked independently for different DCs (and in the
future, racks) and then plans are merged. It could be that the same
tablets are selected for migration in different DCs. Only one
migration will prevail and be committed to group0, so it's not a
correctness problem. The next cycle will recognize that the tablet is in
transition and will not select it as a candidate. But it makes
plan-making less efficient.

It may also surprise consumers of the plan, as we saw in #25912.

So we should make the plan-maker aware of already scheduled transitions
and not consider those tablets as candidates.
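
A minimal sketch of the candidate filter, with illustrative names:

```c++
#include <unordered_set>

using tablet_id = long;   // illustrative; the real type is richer

// Tablets that already have a scheduled transition are excluded from
// the candidate set, so two independently produced plans can't both
// pick the same tablet.
bool is_candidate(const std::unordered_set<tablet_id>& in_transition, tablet_id t) {
    return !in_transition.contains(t);
}
```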

Fixes #26038

Closes scylladb/scylladb#26048

(cherry picked from commit 981592bca5)
2025-11-30 10:00:22 +01:00
Patryk Jędrzejczak
a07e0d46ae Merge '[Backport 2025.3] locator/node: include _excluded in missing places' from Scylladb[bot]
We currently ignore the `_excluded` field in `node::clone()` and in the
verbose formatter of `locator::node`. The first is a bug that can have
unpredictable consequences for the system. The second is a minor
inconvenience during debugging.

We fix both places in this PR.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72

This PR is a bugfix that should be backported to all supported branches.

- (cherry picked from commit 4160ae94c1)

- (cherry picked from commit 287c9eea65)

Parent PR: #27265

Closes scylladb/scylladb#27290

* https://github.com/scylladb/scylladb:
  locator/node: include _excluded in verbose formatter
  locator/node: preserve _excluded in clone()
2025-11-27 12:29:19 +01:00
Avi Kivity
69871fe600 Merge '[Backport 2025.3] fix notification about expiring erm held for too long' from Scylladb[bot]
Commit 6e4803a750 broke the notification about expired erms held for too long, since it resets the tracker without calling its destructor (where the notification is triggered). Fix the assignment operator to call the destructor as it should.

Fixes https://github.com/scylladb/scylladb/issues/27141

- (cherry picked from commit 9f97c376f1)

- (cherry picked from commit 5dcdaa6f66)

Parent PR: #27140

Closes scylladb/scylladb#27275

* github.com:scylladb/scylladb:
  test: test that an expired erm held for too long triggers a notification
  token_metadata: fix notification about expiring erm held for too long
2025-11-27 12:10:38 +02:00
Patryk Jędrzejczak
2307bf891d locator/node: include _excluded in verbose formatter
It can be helpful during debugging.

(cherry picked from commit 287c9eea65)
2025-11-26 23:04:48 +00:00
Patryk Jędrzejczak
f288273ef0 locator/node: preserve _excluded in clone()
We currently ignore the `_excluded` field in `clone()`. Losing
information about exclusion can have unpredictable consequences. One
observed effect (that led to finding this issue) is that the
`/storage_service/nodes/excluded` API endpoint sometimes misses excluded
nodes.
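
A minimal sketch of the bug pattern and the fix (locator::node's real fields and clone() signature differ):

```c++
// Illustrative type: the bug pattern is a clone() that rebuilds the
// object from the fields it knows about and silently drops the rest.
struct node {
    bool _excluded = false;
    // ... other fields ...
    node clone() const {
        node n;                    // rebuilt field by field
        n._excluded = _excluded;   // previously lost, now preserved
        return n;
    }
};
```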

(cherry picked from commit 4160ae94c1)
2025-11-26 23:04:48 +00:00
Gleb Natapov
aa75444438 test: test that an expired erm held for too long triggers a notification
(cherry picked from commit 5dcdaa6f66)
2025-11-26 15:08:41 +00:00
Gleb Natapov
e2d59df166 token_metadata: fix notification about expiring erm held for too long
Commit 6e4803a750 broke the notification about expired erms held for too
long, since it resets the tracker without calling its destructor (where
the notification is triggered). Fix the assignment operator to call the destructor.
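
A sketch of the fix pattern, with an illustrative type (the real tracker differs):

```c++
#include <new>
#include <utility>

struct tracker {
    // Destructor side effect: triggers the "erm held for too long"
    // notification in the real code.
    ~tracker() { /* notify */ }
    tracker() = default;
    tracker(tracker&&) = default;

    // An assignment operator implemented via placement-new must destroy
    // the old value first, otherwise the destructor's side effects are
    // silently skipped, which is exactly what the commit fixes.
    tracker& operator=(tracker&& o) noexcept {
        if (this != &o) {
            this->~tracker();                 // run the notification logic
            new (this) tracker(std::move(o)); // then rebuild in place
        }
        return *this;
    }
};
```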

(cherry picked from commit 9f97c376f1)
2025-11-26 15:08:41 +00:00
Ernest Zaslavsky
7e6b653e5c streaming: fix loop break condition in tablet_sstable_streamer::stream
Correct the loop termination logic that previously caused
certain SSTables to be prematurely excluded, resulting in
lost mutations. This change ensures all relevant SSTables
are properly streamed and their mutations preserved.

(cherry picked from commit dedc8bdf71)

Closes scylladb/scylladb#27153
Fixes: #26979

Parent PR: #26980
Unfortunately, the pytest-based test cannot be backported because of changes made to the testing harness and scylla-tools.
2025-11-25 11:59:01 +03:00
Avi Kivity
84b7e06268 tools: toolchain: prepare: replace 'reg' with 'skopeo'
The prepare script uses 'reg' to verify that we're not going to
overwrite an existing image. The 'reg' command is not
available in Fedora 43. Use 'skopeo' instead; skopeo
is part of the podman ecosystem, so it will hopefully live longer.

Fixes #27178.

Closes scylladb/scylladb#27179

(cherry picked from commit d6ef5967ef)

Closes scylladb/scylladb#27199
2025-11-24 16:32:04 +02:00
Jenkins Promoter
812fc721cd Update ScyllaDB version to: 2025.3.5 2025-11-24 15:50:44 +02:00
Raphael S. Carvalho
867cb1e7ac replica: Fail timed-out single-key read on cleaned up tablet replica
Consider the following:
1) single-key read starts, blocks on replica e.g. waiting for memory.
2) the same replica is migrated away
3) single-key read expires, coordinator abandons it, releases erm.
4) migration advances to cleanup stage, barrier doesn't wait on
   timed-out read
5) compaction group of the replica is deallocated on cleanup
6) that single-key read resumes, but doesn't find the sstable set (post cleanup)
7) with abort-on-internal-error turned on, node crashes

It's fine for abandoned (= timed-out) reads to fail, since the
coordinator is gone.
For active (non-timed-out) reads, the barrier will wait for them,
since their coordinator holds the erm.
The solution consists of failing reads whose underlying tablet
replica has been cleaned up, by simply converting the internal error
into a plain exception.

Fixes #26229.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#27078

(cherry picked from commit 74ecedfb5c)

Closes scylladb/scylladb#27155
2025-11-21 17:48:21 +03:00
Patryk Jędrzejczak
a9fc235aee test: test_raft_recovery_stuck: ensure mutual visibility before using driver
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.

scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.

Fixes #27055

Closes scylladb/scylladb#27075

(cherry picked from commit e35ba974ce)

Closes scylladb/scylladb#27109
2025-11-20 10:41:58 +02:00
Botond Dénes
78ecb8854a Merge '[Backport 2025.3] Automatic cleanup improvements' from Scylladb[bot]
This series allows an operator to reset the 'cleanup needed' flag if they have already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to running cleanup on one node only (and resetting the 'cleanup needed' flag at the end), while the new '--global' option allows running cleanup on all nodes that need it simultaneously.

Fixes https://github.com/scylladb/scylladb/issues/26866

Backport to all supported versions, since the current automatic cleanup behaviour may create load that the operator does not expect during cluster resizing.

- (cherry picked from commit e872f9cb4e)

- (cherry picked from commit 0f0ab11311)

Parent PR: #26868

Closes scylladb/scylladb#27093

* github.com:scylladb/scylladb:
  cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
  cleanup: Add RESTful API to allow resetting the 'cleanup needed' flag
2025-11-20 10:41:04 +02:00
Botond Dénes
d2d9140029 Merge '[Backport 2025.3] encryption::kms_host: Add exponential backoff-retry for 503 errors' from Scylladb[bot]
Refs #26822
Fixes #27062

AWS says to treat 503 errors, at least in the case of the ec2 metadata query, as backoff-retry (generally, we do _not_ retry at the provider level, but delegate this to higher levels). This patch adds special treatment for 503s (service unavailable) for both the ec2 metadata and the actual endpoint, doing exponential backoff.

Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing (doh!). Maybe we should try to set up a mock ec2 metadata server with injected errors.

- (cherry picked from commit 190e3666cb)

- (cherry picked from commit d22e0acf0b)

Parent PR: #26934

Closes scylladb/scylladb#27063

* github.com:scylladb/scylladb:
  encryption::kms_host: Add exponential backoff-retry for 503 errors
  encryption::kms_host: Include http error code in kms_error
2025-11-20 10:40:20 +02:00
Botond Dénes
91e6efdde8 Merge '[Backport 2025.3] service/qos: Fall back to default scheduling group when using maintenance socket' from Scylladb[bot]
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).

Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla with fail an
assertion and crash.

A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.

To avoid unnecessary crashes, we add new branches for the cases where
the passed user is absent or corresponds to the anonymous role. Since
the role corresponding to a connection via the maintenance socket is
the anonymous role, that solves the problem.

Some accesses to `auth::service` are not affected and we do not modify
those.

Fixes scylladb/scylladb#26816

Backport: yes. This is a fix of a regression.

- (cherry picked from commit c0f7622d12)

- (cherry picked from commit 222eab45f8)

- (cherry picked from commit 394207fd69)

- (cherry picked from commit b357c8278f)

Parent PR: #26856

Closes scylladb/scylladb#27039

* github.com:scylladb/scylladb:
  test/cluster/test_maintenance_mode.py: Wait for initialization
  test: Disable maintenance mode correctly in test_maintenance_mode.py
  test: Fix keyspace in test_maintenance_mode.py
  service/qos: Do not crash Scylla if auth_integration absent
2025-11-20 10:39:48 +02:00
Botond Dénes
a067723f55 Merge '[Backport 2025.3] cdc: set column drop timestamp in the future' from Scylladb[bot]
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.

If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.

While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.

We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.

Fixes https://github.com/scylladb/scylladb/issues/26340

The issue affects all previous releases; backport to improve stability.

- (cherry picked from commit eefae4cc4e)

- (cherry picked from commit 48298e38ab)

- (cherry picked from commit 039323d889)

- (cherry picked from commit e85051068d)

Parent PR: #26533

Closes scylladb/scylladb#27036

* github.com:scylladb/scylladb:
  test: test concurrent writes with column drop with cdc preimage
  cdc: check if recreating a column too soon
  cdc: set column drop timestamp in the future
2025-11-20 10:39:18 +02:00
Gleb Natapov
b53bf43844 cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to running cleanup on the target node only
(and resetting the "cleanup needed" flag afterwards), and adds a "nodetool
cluster cleanup" command that runs cleanup on all dirty nodes in the
cluster.

(cherry picked from commit 0f0ab11311)
2025-11-19 10:53:42 +02:00
Gleb Natapov
3d60e5e825 cleanup: Add RESTful API to allow resetting the 'cleanup needed' flag
Cleaning up a node using the per-keyspace/table interface does not reset
the 'cleanup needed' flag in the topology. The assumption was that running
cleanup on an already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if the operator has already cleaned up the node
manually.

(cherry picked from commit e872f9cb4e)
2025-11-19 10:44:30 +02:00
Avi Kivity
e9e849c2bf Merge '[Backport 2025.3] Synchronize tablet split and load-and-stream' from Scylladb[bot]
Load-and-stream is broken when running concurrently with the finalization step of tablet split.

Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets

two possible fixes (maybe both):

1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets

This patch implements # 1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.

Fixes https://github.com/scylladb/scylladb/issues/26455.

- (cherry picked from commit 3abc66da5a)

- (cherry picked from commit 4654cdc6fd)

Parent PR: #26456

Closes scylladb/scylladb#26648

* github.com:scylladb/scylladb:
  sstables_loader: Don't bypass synchronization with busy topology
  test: Add reproducer for l-a-s and split synchronization issue
  sstables_loader: Synchronize tablet split and load-and-stream
2025-11-17 17:14:36 +02:00
Calle Wilund
484e7aed2c encryption::kms_host: Add exponential backoff-retry for 503 errors
Refs #26822

AWS says to treat 503 errors, at least in the case of the ec2 metadata
query, as backoff-retry (generally, we do _not_ retry at the provider
level, but delegate this to higher levels). This patch adds special
treatment for 503s (service unavailable) for both the ec2 metadata and
the actual endpoint, doing exponential backoff.
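
An illustrative sketch of the retry shape (the v2 patch uses utils::exponential_backoff_retry; the types and the attempt bound below are assumptions):

```c++
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>

struct response { int status = 200; };
struct request {};

seastar::future<response> send(request);   // assumed HTTP send

seastar::future<response> request_with_backoff(request req) {
    constexpr int max_attempts = 5;                  // illustrative bound; we don't retry forever
    auto delay = std::chrono::milliseconds(100);
    for (int attempt = 1;; ++attempt) {
        auto resp = co_await send(req);
        if (resp.status != 503 || attempt == max_attempts) {
            co_return resp;                          // success, non-503 error, or give up
        }
        co_await seastar::sleep(delay);
        delay *= 2;                                  // exponential backoff
    }
}
```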

Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing (doh!).
Maybe we should try to set up a mock ec2 metadata server with injected
errors.

v2:
* Use utils::exponential_backoff_retry

(cherry picked from commit d22e0acf0b)
2025-11-17 11:48:42 +00:00
Calle Wilund
77407fd704 encryption::kms_host: Include http error code in kms_error
Keep track of the actual HTTP failure.

(cherry picked from commit 190e3666cb)
2025-11-17 11:48:41 +00:00
Benny Halevy
898f193ef6 scylla-sstable: correctly dump sharding_metadata
This patch fixes two issues in one go:

First, sstables::load currently clears the sharding metadata
(via open_data()), so scylla-sstable always prints an empty
array for it.

Second, printing token values would generate invalid JSON,
as they are currently printed as binary bytes; they should be
printed simply as numbers, as we do elsewhere, for example
for the first and last keys.

Fixes #26982

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26991

(cherry picked from commit f9ce98384a)

Closes scylladb/scylladb#27037
scylla-2025.3.4-candidate-20251117021558 scylla-2025.3.4
2025-11-16 15:43:35 +02:00
Michael Litvak
ba40c1eba9 test: test concurrent writes with column drop with cdc preimage
Add a test that writes to a table concurrently with dropping a column,
where the table has CDC enabled with preimage.

The test reproduces issue #26340, where this results in a malformed
sstable.

(cherry picked from commit e85051068d)
2025-11-16 10:03:07 +01:00
Michael Litvak
28eaa12af9 cdc: check if recreating a column too soon
When we drop a column from a CDC log table, we set the column drop
timestamp a few seconds into the future. This can cause unexpected
problems if a user tries to recreate a CDC column too soon, before
the drop timestamp has passed.

To prevent this issue, when creating a CDC column we check its
creation timestamp against the existing drop timestamp, if any, and
fail with an informative error if the recreation attempt is too soon.
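
A minimal sketch of the guard, with illustrative names (the real code raises a CQL invalid-request error):

```c++
#include <cstdint>
#include <optional>
#include <stdexcept>

using api_timestamp = int64_t;   // microseconds since the epoch

// Re-adding a CDC log column is rejected while its recorded drop
// timestamp (set in the future by the drop) has not yet passed.
void check_column_recreation(api_timestamp creation_ts,
                             std::optional<api_timestamp> dropped_ts) {
    if (dropped_ts && creation_ts <= *dropped_ts) {
        throw std::runtime_error(
            "cannot re-add a CDC log column this soon after dropping it");
    }
}
```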

(cherry picked from commit 039323d889)
2025-11-16 10:03:07 +01:00
Michael Litvak
c37d224db6 cdc: set column drop timestamp in the future
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.

If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.

While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.

We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
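
A minimal sketch of the idea, with an assumed grace period (the real constant lives in the CDC code):

```c++
#include <chrono>
#include <cstdint>

using api_timestamp = int64_t;   // microseconds, like api::timestamp_type

// The drop timestamp is pushed a few seconds past "now", so writes
// racing with the drop still carry timestamps below it.
api_timestamp cdc_column_drop_timestamp() {
    using namespace std::chrono;
    constexpr auto grace = seconds(10);   // illustrative value
    auto now_us = duration_cast<microseconds>(system_clock::now().time_since_epoch());
    return (now_us + grace).count();
}
```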

Fixes scylladb/scylladb#26340

(cherry picked from commit 48298e38ab)
2025-11-16 09:34:51 +01:00
Dawid Mędrek
7b32c277fe test/cluster/test_maintenance_mode.py: Wait for initialization
If we try to perform queries too early, before the call to
`storage_service::start_maintenance_mode` has finished, we will
fail with the following error:

```
ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index!
```

To avoid that, we should wait until initialization is complete.

(cherry picked from commit b357c8278f)
2025-11-15 22:10:28 +00:00
Dawid Mędrek
6d6f870a5f test: Disable maintenance mode correctly in test_maintenance_mode.py
Although setting the value of `maintenance_mode` to the string `"false"`
disables maintenance mode, the testing framework misinterprets the value
and thinks that it's actually enabled. As a result, it might try to
connect to Scylla via the maintenance socket, which we don't want.

(cherry picked from commit 394207fd69)
2025-11-15 22:10:28 +00:00