Raft topology goes over all nodes in a 'left' state and triggers 'remove
node' notification in case id/ip mapping is available (meaning the node
left recently), but the problem is that, since the mapping is not removed
immediately, when multiple nodes are removed in succession a notification
for the same node can be sent several times. Fix that by sending
notification only if the node still exists in the peers table. It will
be removed by the first notification and following notification will not
be sent.
Closesscylladb/scylladb#27743
(cherry picked from commit 4a5292e815)
Closesscylladb/scylladb#27911
Commit d3efb3ab6f added streaming session for rebuild, but it set
the session and request submission time. The session should be set when
request starts the execution, so this patch moved it to the correct
place.
Closesscylladb/scylladb#27757
(cherry picked from commit 04976875cc)
Closesscylladb/scylladb#27865
The test test_truncate_during_topology_change tests TRUNCATE TABLE while
bootstrapping a new node. With tablets enabled TRUNCATE is a global
topology operation which needs to serialize with boostrap.
When TRUNCATE TABLE is issued, it first checks if there is an already
queued truncate for the same table. This can happen if a previous
TRUNCATE operation has timed out, and the client retried. The newly
issued truncate will only join the queued one if it is waiting to be
processed, and will fail immediatelly if the TRUNCATE is already being
processed.
In this test, TRUNCATE will be retried after a timeout (1 minute) due to
the default retry policy, and will be retried up to 3 times, while the
bootstrap is delayed by 2 minutes. This means that the test can validate
the result of a truncate which was started after bootstrap was
completed.
Because of the way truncate joins existing truncate operations, we can
also have the following scenario:
- TRUNCATE times out after one minute because the new node is being
bootstrapped
- the client retries the TRUNCATE command which also times out after 1m
- the third attempt is received during TRUNCATE being processed which
fails the test
This patch changes the retry policy of the TRUNCATE operation to
FallthroughRetryPolicy which guarantees that TRUNCATE will not be
retried on timeout. It also increases the timeout of the TRUNCATE from 1
to 4 minutes. This way the test will actually validate the performance
of the TRUNCATE operation which was issued during bootstrap, instead of
the subsequent, retried TRUNCATEs which could have been issued after the
bootstrap was complete.
Fixes: #26347Closesscylladb/scylladb#27245
(cherry picked from commit d883ff2317)
Closesscylladb/scylladb#27505
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.
Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.
Fixes: scylladb/scylladb#27136
Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.
Closesscylladb/scylladb#27737
(cherry picked from commit 2a75b1374e)
Closesscylladb/scylladb#27780
In several exception handlers, only `raft::request_aborted` was being caught and rethrown, while `seastar::abort_requested_exception` was falling through to the generic catch(...) block. This caused the exception to be incorrectly treated as a failure that triggers rollback, instead of being recognized as an abort signal.
For example, during tablet draining, the error log showed: "tablets draining failed with seastar::abort_requested_exception (abort requested). Aborting the topology operation"
This change adds `seastar::abort_requested_exception` handling alongside `raft::request_aborted` in all places where it was missing. When rethrown, these exceptions propagate up to the main `run()` loop where `handle_topology_coordinator_error()` recognizes them as normal abort signals and allows the coordinator to exit gracefully without triggering unnecessary rollback operations.
Fixes: scylladb/scylladb#27255
No backport: The problem was only seen in tests and not reported in customer tickets, so it's enough to fix it in the main branch.
- (cherry picked from commit 37e3dacf33)
Parent PR: #27314Closesscylladb/scylladb#27661
* https://github.com/scylladb/scylladb:
topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted
topology_coordinator: consistently rethrow `raft::request_aborted` for direct/global commands
We saw that in large clusters direct failure detector may cause large task queues to be accumulated. The series address this issue and also moves the code into the correct scheduling group.
Fixes https://github.com/scylladb/scylladb/issues/27142
Backport to all version where 60f1053087 was backported to since it should improve performance in large clusters.
- (cherry picked from commit 82f80478b8)
- (cherry picked from commit 6a6bbbf1a6)
- (cherry picked from commit 86dde50c0d)
Parent PR: #27387Closesscylladb/scylladb#27480
* https://github.com/scylladb/scylladb:
direct_failure_detector: run direct failure detector in the gossiper scheduling group
raft: drop invoke_on from the pinger verb handler
direct_failure_detector: pass timeout to direct_fd_ping verb
idl, message: make with_timeout and cancellable verb attributes composable
In several exception handlers, only raft::request_aborted was being
caught and rethrown, while seastar::abort_requested_exception was
falling through to the generic catch(...) block. This caused the
exception to be incorrectly treated as a failure that triggers
rollback, instead of being recognized as an abort signal.
For example, during tablet draining, the error log showed:
"tablets draining failed with seastar::abort_requested_exception
(abort requested). Aborting the topology operation"
This change adds seastar::abort_requested_exception handling
alongside raft::request_aborted in all places where it was missing.
When rethrown, these exceptions propagate up to the main run() loop
where handle_topology_coordinator_error() recognizes them as normal
abort signals and allows the coordinator to exit gracefully without
triggering unnecessary rollback operations.
Fixes: scylladb/scylladb#27255
(cherry picked from commit 37e3dacf33)
When draining the view builder, we abort ongoing operations using the
view builder's abort source, which may cause them to fail with
abort_requested_exception or raft::request_aborted exceptions.
Since these failures are expected during shutdown, reduce the log level
in add_new_view from 'error' to 'debug' for these specific exceptions
while keeping 'error' level for unexpected failures.
Closesscylladb/scylladb#26297
(cherry picked from commit 6bc41926e2)
Closesscylladb/scylladb#27538
Ensure all direct and global topology commands rethrow the
`raft::request_aborted` exception when aborted, typically due to
leadership changes. This makes abortion explicit to callers, enabling
proper handling such as retries or workflow termination.
This change completes the work started in PR scylladb/scylladb#23962,
covering all remaining cases where the exception was not rethrown.
Fixes: scylladb/scylladb#23589
(cherry picked from commit 943af1ef1c)
When waiting for the condition variable times out
we call on_internal_error, but unfortunately, the backtrace
it generates is obfuscated by
`coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`.
To make the log more useful, print the error injection name
and the caller's source_location in the timeout error message.
Fixes#27531
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27532
(cherry picked from commit 5f13880a91)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27581
This commit removes the now redundant driver pages from
the Scylla DB documentation. Instead, the link to the pages
where we moved the diver information is added.
Also, the links are updated across the ScyllaDB manual.
Redirections are added for all the removed pages.
Fixes https://github.com/scylladb/scylladb/issues/26871Closesscylladb/scylladb#27277
(cherry picked from commit c5580399a8)
Closesscylladb/scylladb#27438
When direct failure detector was introduces the idea was that it will
run on the same connection raft group0 verbs are running, but in
60f1053087 raft verbs were moved to run on the gossiper connection
while DIRECT_FD_PING was left where it was. This patch move it to
gossiper connection as well and fix the pinger code to run in gossiper
scheduling group.
(cherry picked from commit 86dde50c0d)
Currently raft direct pinger verb jumps to shard 0 to check if group0 is
alive before replying. The verb runs relatively often, so it is not very
efficient. The patch distributes group0 liveness information (as it
changes) to all shard instead, so that the handler itself does not need
to jump to shard 0.
(cherry picked from commit 6a6bbbf1a6)
Currently direct_fd_ping runs without timeout, but the verb is not
waited forever, the wait is canceled after a timeout, this timeout
simply is not passed to the rpc. It may create a situation where the
rpc callback can runs on a destination but it is no longer waited on.
Change the code to pass timeout to rpc as well and return earlier from
the rpc handler if the timeout is reached by the time the callback is
called. This is backwards compatible since timeout is passed as
optional.
(cherry picked from commit 82f80478b8)
And define `send_message_timeout_cancellable` in rpc_protocol_impl.hh
using the newly introduced rpc_handler entry point
in seastar that accepts both timeout and cancellable params.
Note that the interface to the user still uses abort_source
while internally the funtion allocates a seastar::rpc::cancellable
object. It is possible to provide an interface that will accept
a rpc::cancellable& from the caller, but the existing messaging api
uses abort_source. Changing it may be considered in the future.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0b97806771)
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated, because we update mapping on each change of gossip
states. This made bootstrap impossible because nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.
Fixes#26865
- (cherry picked from commit ed8d127457)
- (cherry picked from commit 4a85ea8eb2)
- (cherry picked from commit f83c4ffc68)
Parent PR: #26941Closesscylladb/scylladb#27187
* github.com:scylladb/scylladb:
address_map: Use barrier() to wait for replication
address_map: Use more efficient and reliable replication method
utils: Introduce helper for replicated data structures
utils: add "fatal" version of utils::on_internal_error()
More efficient than 100 pings.
There was one ping in test which was done "so this shard notices the
clock advance". It's not necessary, since obsering completed SMP
call implies that local shard sees the clock advancement done within in.
(cherry picked from commit f83c4ffc68)
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updated queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.
This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.
To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.
It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.
Fixes#26865Fixes#26835
(cherry picked from commit 4a85ea8eb2)
Key goals:
- efficient (batching updates)
- reliable (no lost updates)
Will be used in data structures maintained on one designed owning
shard and replicated to other shards.
(cherry picked from commit ed8d127457)
utils::on_internal_error() is a wrapper for Seastar's on_internal_error()
which does not require a logger parameter - because it always uses one
logger ("on_internal_error"). Not needing a unique logger is especially
important when using on_internal_error() in a header file, where we
can't define a logger.
Seastar also has a another similar function, on_fatal_internal_error(),
for which we forgot to implement a "utils" version (without a logger
parameter). This patch fixes that oversight.
In the next patch, we need to use on_fatal_internal_error() in a header
file, so the "utils" version will be useful. We will need the fatal
version because we will encounter an unexpected situation during server
destruction, and if we let the regular on_internal_error() just throw
an exception, we'll be left in an undefined state.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 33476c7b06)
We store the per-shard chunk count in a uint64_t vector
global_offset, and then convert the counts to offsets with
a prefix sum:
```c++
// [1, 2, 3, 0] --> [0, 1, 3, 6]
std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus());
```
However, std::exclusive_scan takes the accumulator type from the
initial value, 0, which is an int, instead of from the range being
iterated, which is of uint64_t.
As a result, the prefix sum is computed as a 32-bit integer value. If
it exceeds 0x8000'0000, it becomes negative. It is then extended to
64 bits and stored. The result is a huge 64-bit number. Later on
we try to find an sstable with this chunk and fail, crashing on
an assertion.
An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57
The fix is simple: the initial value is passed as uint64_t instead of int.
Fixes https://github.com/scylladb/scylladb/issues/27417Closesscylladb/scylladb#27418
(cherry picked from commit 9696ee64d0)
Fixes#24346
When reading, we check for each entry and each chunk, if advancing there
will hit EOF of the segment. However, IFF the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ending
at _exactly_ segment size (preset size, typically 32Mb), we did not check
the position, and instead complained about not being able to read.
This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.
Fix is simple, just check the file position against size when advancing
said position, i.e. when reading (skipping already does).
v2:
* Added unit test
Closesscylladb/scylladb#27236
(cherry picked from commit 59c87025d1)
Closesscylladb/scylladb#27340
We currently ignore the `_excluded` field in `node::clone()` and the verbose
formatter of `locator::node`. The first one is a bug that can have
unpredictable consequences on the system. The second one can be a minor
inconvenience during debugging.
We fix both places in this PR.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72
This PR is a bugfix that should be backported to all supported branches.
- (cherry picked from commit 4160ae94c1)
- (cherry picked from commit 287c9eea65)
Parent PR: #27265Closesscylladb/scylladb#27289
* https://github.com/scylladb/scylladb:
locator/node: include _excluded in verbose formatter
locator/node: preserve _excluded in clone()
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should.
Fixes https://github.com/scylladb/scylladb/issues/27141
- (cherry picked from commit 9f97c376f1)
- (cherry picked from commit 5dcdaa6f66)
Parent PR: #27140Closesscylladb/scylladb#27274
* github.com:scylladb/scylladb:
test: test that expired erm that held for too long triggers notification
token_metadata: fix notification about expiring erm held for to long
We currently ignore the `_excluded` field in `clone()`. Losing
information about exclusion can have unpredictable consequences. One
observed effect (that led to finding this issue) is that the
`/storage_service/nodes/excluded` API endpoint sometimes misses excluded
nodes.
(cherry picked from commit 4160ae94c1)
Commit 6e4803a750 broke notification about expired erms held for too
long since it resets the tracker without calling its destructor (where
notification is triggered). Fix assign operator to call destructor.
(cherry picked from commit 9f97c376f1)
Correct the loop termination logic that previously caused
certain SSTables to be prematurely excluded, resulting in
lost mutations. This change ensures all relevant SSTables
are properly streamed and their mutations preserved.
(cherry picked from commit dedc8bdf71)
Closesscylladb/scylladb#27150Fixes: #26979
Parent PR: #26980
Unfortunatelly the pytest based test cannot be ported back because of changes made to the testing harness and scylla-tools
The prepare scripts uses 'reg' to verify we're not going to
overwrite an existing image. The 'reg' command is not
available in Fedora 43. Use 'skopeo' instead. Skopeo
is part of the podman ecosystem so hopefully will live longer.
Fixes#27178.
Closesscylladb/scylladb#27179
(cherry picked from commit d6ef5967ef)
Closesscylladb/scylladb#27196
Consider the following:
1) single-key read starts, blocks on replica e.g. waiting for memory.
2) the same replica is migrated away
3) single-key read expires, coordinator abandons it, releases erm.
4) migration advances to cleanup stage, barrier doesn't wait on
timed-out read
5) compaction group of the replica is deallocated on cleanup
6) that single-key resumes, but doesn't find sstable set (post cleanup)
7) with abort-on-internal-error turned on, node crashes
It's fine for abandoned (= timed out) reads to fail, since the
coordinator is gone.
For active reads (non timed out), the barrier will wait for them
since their coordinator holds erm.
This solution consists of failing reads which underlying tablet
replica has been cleaned up, by just converting internal error
to plain exception.
Fixes#26229.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#27078
(cherry picked from commit 74ecedfb5c)
Closesscylladb/scylladb#27152
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.
scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.
Fixes#27055Closesscylladb/scylladb#27075
(cherry picked from commit e35ba974ce)
Closesscylladb/scylladb#27108
This series allows an operator to reset 'cleanup needed' flag if he already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset 'cleanup needed' flag in the end), but the new '--global' option allows to run cleanup on all nodes that needed it simultaneously.
Fixes https://github.com/scylladb/scylladb/issues/26866
Backport to all supported version since automatic cleanup behaviour as it is now may create unexpected by the operator load during cluster resizing.
- (cherry picked from commit e872f9cb4e)
- (cherry picked from commit 0f0ab11311)
Parent PR: #26868Closesscylladb/scylladb#27091
* github.com:scylladb/scylladb:
cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
cleanup: Add RESTful API to allow reset cleanup needed flag
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).
Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla with fail an
assertion and crash.
A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
Some accesses to `auth::service` are not affected and we do not modify
those.
Fixes scylladb/scylladb#26816
Backport: yes. This is a fix of a regression.
- (cherry picked from commit c0f7622d12)
- (cherry picked from commit 222eab45f8)
- (cherry picked from commit 394207fd69)
- (cherry picked from commit b357c8278f)
Parent PR: #26856Closesscylladb/scylladb#27034
* github.com:scylladb/scylladb:
test/cluster/test_maintenance_mode.py: Wait for initialization
test: Disable maintenance mode correctly in test_maintenance_mode.py
test: Fix keyspace in test_maintenance_mode.py
service/qos: Do not crash Scylla if auth_integration absent
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
reset "cleanup needed" flag afterwards) and it adds "nodetool cluster
cleanup" command that runs the cleanup on all dirty nodes in the
cluster.
(cherry picked from commit 0f0ab11311)
Cleaning up a node using per keyspace/table interface does not reset cleanup
needed flag in the topology. The assumption was that running cleanup on
already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if operator cleaned up the node manually
already.
(cherry picked from commit e872f9cb4e)
Load-and-stream is broken when running concurrently to the finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeed
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
two possible fixes (maybe both):
1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements # 1. By awaiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.
Fixes https://github.com/scylladb/scylladb/issues/26455.
- (cherry picked from commit 3abc66da5a)
- (cherry picked from commit 4654cdc6fd)
Parent PR: #26456Closesscylladb/scylladb#26647
* github.com:scylladb/scylladb:
sstables_loader: Don't bypass synchronization with busy topology
test: Add reproducer for l-a-s and split synchronization issue
sstables_loader: Synchronize tablet split and load-and-stream