The get_blob() method linearizes data by copying it into a
single buffer, which can trigger "oversized allocation" warnings.
This commit avoids that extra copy by creating an input stream
directly over the original fragmented managed bytes returned by
untyped_result_set_row::get_view().
Fixesscylladb/scylladb#23903
A decommissioned node is removed from a raft config after operation is
marked as completed. This is required since otherwise the decommissioned
node will not see that decommission has completed (the status is
propagated through raft). But right after the decommission is marked as
completed a decommissioned node may terminate, so in case of a two node
cluster, the configuration change that removes it from the raft will fail,
because there will no be quorum.
The solution is to mark the decommissioning node as non voter before
reporting the operation as completed.
Fixes: #24026
Backport to 2025.2 because it fixes a potential hang. Don't backport to
branches older than 2025.2 because they don't have
8b186ab0ff, which caused this issue.
Closesscylladb/scylladb#24027
Used host id to check if the update is for the node itself. Using IP is unreliable since if a node is restarted with different IP a gossiper message with previous IP can be misinterpreted as belonging to a different node.
Fixes: #22777
Backport to 2025.1 since this fixes a crash. Older version do not have the code.
Closesscylladb/scylladb#24000
* https://github.com/scylladb/scylladb:
test: add reproducer for #22777
storage_service: Do not remove gossiper entry on address change
storage_service: use id to check for local node
Before this change, if a read executor had just enough targets to
achieve query's CL, and there was a connection drop (e.g. node failure),
the read executor waited for the entire request timeout to give drivers
time to execute a speculative read in a meantime. Such behavior don't
work well when a very long query timeout (e.g. 1800s) is set, because
the unfinished request blocks topology changes.
This change implements a mechanism to thrown a new
read_failure_exception_with_timeout in the aforementioned scenario.
The exception is caught by CQL server which conducts the waiting, after
ERM is released. The new exception inherits from read_failure_exception,
because layers that don't catch the exception (such as mapreduce
service) should handle the exception just a regular read_failure.
However, when CQL server catch the exception, it returns
read_timeout_exception to the client because after additional waiting
such an error message is more appropriate (read_timeout_exception was
also returned before this change was introduced).
This change:
- Rewrite cql_server::connection::process_request_one to use
seastar::futurize_invoke and try_catch<> instead of utils::result_try
- Add new read_failure_exception_with_timeout and throws it in storage_proxy
- Add sleep in CQL server when the new exception is caught
- Catch local exceptions in Mapreduce Service and convert them
to std::runtime_error.
- Add get_cql_exclusive to manager_client.py
- Add test_long_query_timeout_erm
No backport needed - minor issue fix.
Closesscylladb/scylladb#23156
* github.com:scylladb/scylladb:
test: add test_long_query_timeout_erm
test: add get_cql_exclusive to manager_client.py
mapreduce: catch local read_failure_exception_with_timeout
transport: storage_proxy: release ERM when waiting for query timeout
transport: remove redundant references in process_request_one
transport: fix the indentation in process_request_one
transport: add futures in CQL server exception handling
test_tablet_repair_hosts_filter checks whether the host filter
specfied for tablet repair is correctly persisted. To check this,
we need to ensure that the repair is still ongoing and its data
is kept. The test achieves that by failing the repair on replica
side - as the failed repair is going to be retried.
However, if the filter does not contain any host (included_host_count = 0),
the repair is started on no replica, so the request succeeds
and its data is deleted. The test fails if it checks the filter
after repair request data is removed.
Fail repair on topology coordinator side, so the request is ongoing
regardless of the specified hosts.
Fixes: #23986.
Closesscylladb/scylladb#24003
When the topology coordinator is shut down while doing a long-running
operation, the current operation might throw a raft::request_aborted
exception. This is not a critical issue and should not be logged with
ERROR verbosity level.
Make sure that all the try..catch blocks in the topology coordinator
which:
- May try to acquire a new group0 guard in the `try` part
- Have a `catch (...)` block that print an ERROR-level message
...have a pass-through `catch (raft::request_aborted&)` block which does
not log the exception.
Fixes: scylladb/scylladb#22649Closesscylladb/scylladb#23962
Add sleep before starting gossiper to increase a chance of getting old
gossiper entry about yourself before updating local gossiper info with
new IP address.
The limited voters feature did not account for the existing topology
coordinator (Raft leader) when selecting voters to be removed.
As a result, the limited voters calculator could inadvertently remove
the votership of the current topology coordinator, triggering
an unnecessary Raft leader re-election.
This change ensures that the existing topology coordinator's votership
status is preserved unless absolutely necessary. When choosing between
otherwise equivalent voters, the node other than the topology coordinator
is prioritized for removal. This helps maintain stability in the cluster
by avoiding unnecessary leader re-elections.
Additionally, only the alive leader node is considered relevant for this
logic. A dead existing leader (topology coordinator) is excluded from
consideration, as it is already in the process of losing leadership.
Fixes: scylladb/scylladb#23588Fixes: scylladb/scylladb#23786
Fix an issue in the voter calculator where existing voters were not
retained across data centers and racks in certain scenarios. This
occurred when voters were distributed across more data centers and racks
than the maximum allowed number of voters.
Previously, the prioritization logic for data centers and racks did not
consider the number of existing assigned voters. It only prioritized
nodes within a single data center or rack, which could result in
unnecessary reassignment of voters.
Improved the prioritization logic to account for the number of existing
voters in each data center and rack.
This change ensures a more stable voter distribution and reduces
unnecessary voter reassignments.
Fixes: scylladb/scylladb#23950
Refactor the limited voters calculator to use a priority queue for
sorting nodes by their priorities. This change simplifies the voter
selection logic and makes it more extensible for future enhancements,
such as supporting more complex priority calculations.
The priority value is determined based on the node's existing status,
including whether it is alive, a voter, or any further criteria.
The output parameter cannot be `null`. Previously, a pointer was used to
make it explicit that the parameter is an output parameter being
modified. However, this is unnecessary, as references are more
appropriate for parameters that cannot be `null`.
Switching to a reference improves code readability and ensures the
parameter's non-null constraint is enforced at the type level.
Refactor the group0 voter handler by introducing a helper lambda to
handle the common logic for adding a node. This eliminates unnecessary
code duplication.
This refactor does not introduce any functional changes but prepares
the codebase for easier future modifications.
Refactor the code to use a consistent pattern for creating the
datacenter info list and the rack info list.
Both now use a map of vectors, which improves efficiency by reducing
temporary conversions to maps/sets during node list processing.
Also ensure the node descriptor is passed by reference instead of by
copy, leveraging the guaranteed lifetime of the descriptors.
This dependency is already there, topology coordinator doesn't need
to use database reference to get to the features.
Previous patch of the same kind: b79137eaa4
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#23777
When gossiper indexed entries by ip an old entry had to be removed on an
address change, but the index is id based, so even if ip was change the
entry should stay. Gossiper simply updates an ip address there.
Currently, test_tablet_resize_revoked tries to trigger split revoke
by deleting some rows. This method isn't deterministic and so a test
is flaky.
Use error injection to trigger resize revoke.
Fixes: #22570.
Closesscylladb/scylladb#23966
In case when dht::boot_strapper::get_boostrap_tokens fail to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expect that the bootstrap_tokens of the first node in the
cluster will not be empty.
Fix this by adding the missing break.
Fixes: scylladb/scylladb#23897
From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them.
Closesscylladb/scylladb#23914
* https://github.com/scylladb/scylladb:
test: cluster: add test_bad_initial_token
topology coordinator: do not proceed further on invalid boostrap tokens
cdc: add sanity check for generating an empty generation
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.
Fixes: https://github.com/scylladb/scylladb/issues/22514.
Needs backport to 2025.1 and 6.2 as they contain the bug.
Closesscylladb/scylladb#23787
* github.com:scylladb/scylladb:
test: add test for getting tasks children
tasks: check whether a node is alive before rpc
In case when dht::boot_strapper::get_boostrap_tokens fail to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expect that the bootstrap_tokens of the first node in the
cluster will not be empty.
Fix this by adding the missing break.
Fixes: scylladb/scylladb#23897
In the following patch we plan to remove the base schema from the base_info
to make the base_info immutable. To do that, we first prepare the schema
registry for the change; we need to be able to create view schemas from
frozen schemas there and frozen schemas have no information about the base
table. Unless we do this change, after base schemas are removed from the
base info, we'll no longer be able to load a view schema to the schema registry
without looking up the base schema in the database.
This change also required some updates to schema building:
* we add a method for unfreezing a view schema with base info instead of
a base schema
* we make it possible to use schema_builder with a base info instead of
a base schema
* we add a method for creating a view schema from mutations with a base info
instead of a base schema
* we add a view_info constructor withat base info instead of a base schema
* we update the naming in schema_registry to reflect the usage of base info
instead of base schema
Mapreduce Service exception handling differs for local and remote RPC
calls of dispatch_to_shards. Whereas local exceptions are handled
normally, the remote exceptions are converted to rpc::remote_verb_error
by the framework. This is a substantial difference when
read_failure_exception_with_timeout is thrown during mapreduce query
execution - CQL server waits for the exception from the local call but
not from the remote one.
As we don't want to wait for the timeout in CQL server in either of
the cases, this commit catches the local exception (especially
read_failure_exception_with_timeout) and converts it to
std::runtime_error (the one from which rpc::remote_verb_error inherits).
Ideally, Mapreduce Service should execute dispatch_to_shards through RPC
for both local and remote calls. However, such change negatively affects
tens of Unit Tests that rely on the possibility to run local mapreduce
service without any RPC.
This change:
- Catch local exceptions in Mapreduce Service and convert them
to std::runtime_error.
Before this change, if a read executor had just enough targets to
achieve query's CL, and there was a connection drop (e.g. node failure),
the read executor waited for the entire request timeout to give drivers
time to execute a speculative read in a meantime. Such behavior don't
work well when a very long query timeout (e.g. 1800s) is set, because
the unfinished request blocks topology changes.
This change implements a mechanism to thrown a new
read_failure_exception_with_timeout in the aforementioned scenario.
The exception is caught by CQL server which conducts the waiting, after
ERM is released. The new exception inherits from read_failure_exception,
because layers that don't catch the exception (such as mapreduce
service) should handle the exception just a regular read_failure.
However, when CQL server catch the exception, it returns
read_timeout_exception to the client because after additional waiting
such an error message is more appropriate (read_timeout_exception was
also returned before this change was introduced).
This change:
- Add new read_failure_exception_with_timeout exception
- Add throw of read_failure_exception_with_timeout in storage_proxy
- Add abort_source to CQL server, as well as to_stop() method for
the correct abort handling
- Add sleep in CQL server when the new exception is caught
Refs #21831
Changing DC or rack on a node which was already bootstrapped is, in
case of vnodes, very unsafe (almost guaranteed to cause data loss or
unavailability), and is outright not supported if the cluster has
a tablet-backed keyspaces. Moreover, the possibility of doing that
makes it impossible to uphold some of the invariants promised by
the RF-rack-valid flag, which is eventually going to become
unconditionally enabled.
Get rid of the above problems by removing the possibility of changing
the DC / rack of a node. A node will now fail to start if its snitch
reports a different DC or rack than the one that was reported during the
first boot.
Fixes: scylladb/scylladb#23278Fixes: scylladb/scylladb#22869
Marking for backport to 2025.1, as this is a necessary part of the RF-rack-valid saga
Closesscylladb/scylladb#23800
* github.com:scylladb/scylladb:
doc: changing topology when changing snitches is no longer supported
test: cluster: introduce test_no_dc_rack_change
storage_service: don't update DC/rack in update_topology_with_local_metadata
main: make dc and rack immutable after bootstrap
test: cluster: remove test_snitch_change
The DC/rack are now immutable and cannot be changed after restart, so
there is no need to update the node's system.topology entry with this
information on restart.
Add missing awaits for the rebuild_repair and repair background actions.
Although the background actions hold the _async_gate
which is closed in topology_coordinator::run(),
stop() still needs to await all background action futures
and handle any errors they may have left behind.
Fixes#23755
* The issue exists since 6.2
Closesscylladb/scylladb#17712
* github.com:scylladb/scylladb:
topology_coordinator: stop: await all background_action_holder:s
topology_coordinator: stop: improve error messages
topology_coordinator: stop: define stop_background_action helper
Add missing awaits for the rebuild_repair and repair background actions.
Although the background actions hold the _async_gate
which is closed in topology_coordinator::run(),
stop() still needs to await all background action futures
and handle any errors they may have left behind.
Fixes#23755
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor the code to use a helper to await background_action_holder
and handle any errors by printing a warning.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group.
For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore.
Fixesscylladb/scylladb#21637
Backport: 6.2 and 6.1
Closesscylladb/scylladb#22779
* github.com:scylladb/scylladb:
Ensure raft group0 RPCs use the gossip scheduling group
Move RAFT operations verbs to GOSSIP group.
Commit 876478b84f ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise.
Restrict the state change to when the topology state machine is idle.
In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way.
Unit test:
Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone.
Fixes https://github.com/scylladb/scylladb/issues/20073.
Commit 876478b84f was first released in scylla-6.0.0, so we might want to backport this patch accordingly.
Closesscylladb/scylladb#23751
* github.com:scylladb/scylladb:
storage_service: add unit test for mid-decommission transit_tablet()
storage_service: preserve state of busy topology when transiting tablet
This dependency is already there, storage service doesn't need to go
rounds via database reference to get to the features.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#23739
Start a two node cluster. Create a single tablet on one of the nodes.
Start decommissioning that node, but block decommissioning at once. In
that state (i.e., in "tablet_draining"), move the tablet manually to the
other node. Check that transit_tablet() leaves the topology transition
state alone.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
Commit 876478b84f ("storage_service: allow concurrent tablet migration
in tablets/move API", 2024-02-08) introduced a code path on which the
topology state machine would be busy -- in "tablet_draining" or
"tablet_migration" state -- at the time of starting tablet migration. The
pre-commit code would unconditionally transition the topology to
"tablet_migration" state, assuming the topology had been idle previously.
On the new code path, this state change would be idempotent if the
topology state machine had been busy in "tablet_migration", but the state
change would incorrectly overwrite the "tablet_draining" state otherwise.
Restrict the state change to when the topology state machine is idle.
In addition, add the topology update to the "updates" vector with plain
push_back(). emplace_back() is not helpful here, as
topology_mutation_builder::build() cannot construct in-place, and so we
invoke the "canonical_mutation" move constructor once, either way.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
The `_remaining_nodes` attribute of the data center information was not
initialized correctly. The parameter was passed by value to the
initialization function instead of by reference or pointer.
As a result, `_remaining_nodes` was left initialized to zero, causing an
underflow when decrementing its value.
This bug did not significantly impact behavior because other safeguards,
such as capping the maximum voters per data center by the total number
of nodes, masked the issue. However, it could lead to inefficiencies, as
the remaining nodes check would not trigger correctly.
Fixes: scylladb/scylladb#23702
No backport: The bug is only present in the master branch, so no backport
is required.
Closesscylladb/scylladb#23704
A recent commit 370707b111 (re)introduced
a timeout for every group0 Raft operation. This timeout was set to 60
seconds, which, paraphrasing Bill Gates, "ought to be enough for anybody".
However, one of the things we do as a group0 operation is schema
changes, and we already noticed a few years ago, see commit
0b2cf21932, that in some extremely
overloaded test machines where tests run hundreds of times (!) slower
than usual, a single big schema operation - such as Alternator's
DeleteTable deleting a table and multiple of its CDC or view tables -
sometimes takes more than 60 seconds. The above fix changed the
client's timeout to wait for 300 seconds instead of 60 seconds,
but now we also need to increase our Raft timeout, or the server can
time out. We've seen this happening recently making some tests flaky
in CI (issue #23543).
So let's make this timeout configurable, as a new configuration option
group0_raft_op_timeout_in_ms. This option defaults to 60000 (i.e,
60 seconds), the same as the existing default. The test framework
overrides this default with a a higher 300 second timeout, matching
the client-side timeout.
Before this patch, this timeout was already configurable in a strange
way, using injections. But this was a misstep: We already have more
than a dozen timeouts configurable through the normal configration,
and this one should have been configured in the same way. There is
nothing "holy" about the default of 60 seconds we chose, and who
knows maybe in the future we might need to tweek it in the field,
just like we made the other timeouts tweakable. Injections cannot
be used in release mode, but configuration options can.
Fixes#23543
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#23717
Scylla operations use concurrency semaphores to limit the number
of concurrent operations and prevent resource exhaustion. The
semaphore is selected based on the current scheduling group.
For Raft group operations, it is essential to use a system semaphore to
avoid queuing behind user operations.
This commit adds a check to ensure that the raft group0 RPCs are
executed with the `gossiper` scheduling group.