Before this change, if a read executor had just enough targets to
achieve query's CL, and there was a connection drop (e.g. node failure),
the read executor waited for the entire request timeout to give drivers
time to execute a speculative read in a meantime. Such behavior don't
work well when a very long query timeout (e.g. 1800s) is set, because
the unfinished request blocks topology changes.
This change implements a mechanism to thrown a new
read_failure_exception_with_timeout in the aforementioned scenario.
The exception is caught by CQL server which conducts the waiting, after
ERM is released. The new exception inherits from read_failure_exception,
because layers that don't catch the exception (such as mapreduce
service) should handle the exception just a regular read_failure.
However, when CQL server catch the exception, it returns
read_timeout_exception to the client because after additional waiting
such an error message is more appropriate (read_timeout_exception was
also returned before this change was introduced).
This change:
- Rewrite cql_server::connection::process_request_one to use
seastar::futurize_invoke and try_catch<> instead of utils::result_try
- Add new read_failure_exception_with_timeout and throws it in storage_proxy
- Add sleep in CQL server when the new exception is caught
- Catch local exceptions in Mapreduce Service and convert them
to std::runtime_error.
- Add get_cql_exclusive to manager_client.py
- Add test_long_query_timeout_erm
No backport needed - minor issue fix.
Closesscylladb/scylladb#23156
* github.com:scylladb/scylladb:
test: add test_long_query_timeout_erm
test: add get_cql_exclusive to manager_client.py
mapreduce: catch local read_failure_exception_with_timeout
transport: storage_proxy: release ERM when waiting for query timeout
transport: remove redundant references in process_request_one
transport: fix the indentation in process_request_one
transport: add futures in CQL server exception handling
Before this change, if a read executor had just enough targets to
achieve query's CL, and there was a connection drop (e.g. node failure),
the read executor waited for the entire request timeout to give drivers
time to execute a speculative read in a meantime. Such behavior don't
work well when a very long query timeout (e.g. 1800s) is set, because
the unfinished request blocks topology changes.
This change implements a mechanism to thrown a new
read_failure_exception_with_timeout in the aforementioned scenario.
The exception is caught by CQL server which conducts the waiting, after
ERM is released. The new exception inherits from read_failure_exception,
because layers that don't catch the exception (such as mapreduce
service) should handle the exception just a regular read_failure.
However, when CQL server catch the exception, it returns
read_timeout_exception to the client because after additional waiting
such an error message is more appropriate (read_timeout_exception was
also returned before this change was introduced).
This change:
- Add new read_failure_exception_with_timeout exception
- Add throw of read_failure_exception_with_timeout in storage_proxy
- Add abort_source to CQL server, as well as to_stop() method for
the correct abort handling
- Add sleep in CQL server when the new exception is caught
Refs #21831
Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error`
in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped
in a `nested_exception`, an error message is printed, causing tests to fail.
This commit adds handling for nested exceptions in this case to prevent unnecessary
error messages.
Fixesscylladb/scylladb#23325Fixesscylladb/scylladb#23305Fixesscylladb/scylladb#21815
Backport: looks like this is quite a frequent issue, therefore backport to 2025.1.
Closesscylladb/scylladb#23336
* github.com:scylladb/scylladb:
database: Pass schema_ptr as const ref in `wrap_commitlog_add_error`
database: Unify exception handling in `do_apply` and `apply_with_commitlog`
storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down.
exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.
Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error`
in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped
in a `nested_exception`, an error message is printed, causing tests to fail.
This commit adds handling for nested exceptions in this case to prevent unnecessary
error messages.
Fixesscylladb/scylladb#23325
Fix UBSan abort caused by integer overflow when calculating time difference
between read and write operations. The issue occurs when:
1. The queried partition on replicas is not purgeable (has no recorded
modified time)
2. Digests don't match across replicas
3. The system attempts to calculate timespan using missing/negative
last_modified timestamps
This change skips cross-DC repair optimization when write timestamp is
negative or missing, as this optimization is only relevant for reads
occurring within write_timeout of a write.
Error details:
```
service/storage_proxy.cc:5532:80: runtime error: signed integer overflow: -9223372036854775808 - 1741940132787203 cannot be represented in type 'int64_t' (aka 'long')
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior service/storage_proxy.cc:5532:80
Aborting on shard 1, in scheduling group sl:default
```
Related to previous fix 39325cf which handled negative read_timestamp cases.
Fixes#23314
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23359
Previously, the code used a find_if to compare each digest to the first
one to check for any mismatches. This was less readable. This change
replaces that with `std::ranges::all_of`, which checks if all elements
in the range are equal to the first digest, improving readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23332
"
This is series starts conversion of the gossiper to use host ids to
index nodes. It does not touch the main map yet, but converts a lot of
internal code to host id. There are also some unrelated cleanups that
were done while working on the series. On of which is dropping code
related to old shadow round. We replaced shadow round with explicit
GOSSIP_GET_ENDPOINT_STATES verb in cd7d64f588
which is in scylla-4.3.0, so there should be no compatibility problem.
We already dropped a lot of old shadow round code in previous patches
anyway.
I tested manually that old and new node can co-exist in the same
cluster,
"
* 'gleb/gossiper-host-id-v2' of github.com:scylladb/scylla-dev: (33 commits)
gossiper: drop unneeded code
gossiper: move _expire_time_endpoint_map to host_id
gossiper: move _just_removed_endpoints to host id
gossiper: drop unused get_msg_addr function
messaging_service: change connection dropping notification to pass host id only
messaging_service: pass host id to remove_rpc_client in down notification
treewide: pass host id to endpoint_lifecycle_subscriber
treewide: drop endpoint life cycle subscribers that do nothing
load_meter: move to host id
treewide: use host id directly in endpoint state change subscribers
treewide: pass host id to endpoint state change subscribers
gossiper: drop deprecated unsafe_assassinate_endpoint operation
storage_service: drop unused code in handle_state_removed
treewide: drop endpoint state change subscribers that do nothing
gossiper: drop ip address from handle_echo_msg and simplify code since host_id is now mandatory
gossiper: start using host ids to send messages earlier
messaging_service: add temporary address map entry on incoming connection
topology_coordinator: notify about IP change from sync_raft_topology_nodes as well
treewide: move everyone to use host id based gossiper::is_alive and drop ip based one
storage_proxy: drop unused template
...
Draining hints may occur in one of the two scenarios:
* a node leaves the cluster and the local node drains all of the hints
saved for that node,
* the local node is being decommissioned.
Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.
There are two reasons for that:
* Generally, in situations like that, we'd like to be able to shut down
nodes as fast as possible. The data stored in the hints won't
disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
have the highest priority and it's reflected in the scheduling groups we
use as well as the explicitly enforced throughput. If there are a large
number of hints to be replayed, it might affect our tests.
It's already happened, see: scylladb/scylladb#21949.
To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To amend for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.
That should ensure all data is retained, while possibly speeding up
the shutdown process.
There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay hints it might've before
these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.
We also provide tests verifying that the implementation works as intended.
Fixesscylladb/scylladb#21949Closesscylladb/scylladb#22811
This change adds two log messages. One for the creation of the truncate
global topology request, and another for the truncate timeout. This is
added in order to help with tracking truncate operation events.
It also extends the "Another global topology request is ongoing, please
retry." error message with more information: keyspace and table name.
Currently, we can not have more than one global topology operation at
the same time. This means that we can not have concurrent truncate
operations because truncate is implemented as a global topology
operation.
Truncate excludes with other topology operations, and has to wait for
those to complete before truncate starts executing. This can lead to
truncate timeouts. In these cases the client retries the truncate operation,
which will check for ongoing global topology operations, and will fail with
an "Another global topology request is ongoing, please retry." error.
This can be avoided by truncate checking if we have a truncate for the same
table already queued. In this case, we can wait for the ongoing truncate to
complete instead of immediatelly failing the operation, and provide a better
user experience.
In order to allow concurrent truncate table operations (for the time being,
only for a single table) we have to remove the limitation allowing only one
truncate table fiber per shard.
This change adds the ability to collect the active truncate fibers in
storage_proxy::remote into std::list<> instead of having just a single
truncate fiber. These fibers are waited for completion during
storage_proxy::remote::stop().
Said method passes down its `diff` input to `mutate_internal()`, after
some std::ranges massaging. Said massaging is destructive -- it moves
items from the diff. If the output range is iterated-over multiple
times, only the first time will see the actual output, further
iterations will get an empty range.
When trace-level logging is enabled, this is exactly what happens:
`mutate_internal()` iterates over the range multiple times, first to log
its content, then to pass it down the stack. This ends up resulting in
a range with moved-from elements being pased down and consequently write
handlers being created with nullopt mutations.
Make the range re-entrant by materializing it into a vector before
passing it to `mutate_internal()`.
Fixes: scylladb/scylladb#21907Fixes: scylladb/scylladb#21714Closesscylladb/scylladb#21910
Currently, the session ID under which the truncate for tablets request is
running is created during the request creation and queuing. This is a problem
because this could overwrite the session ID of any ongoing operation on
system.topology#session
This change moves the creation of the session ID for truncate from the request
creation to the request handling.
Fixes#22613Closesscylladb/scylladb#22615
It is used by truncate code only and even there it only check if the
returned set is not empty. Check for dead token owners in the truncation
code directly.
One of its caller is in the RESTful API which gets ips from the user, so
we convert ips to ids inside the API handler using gossiper before
calling the function. We need to deprecate ip based API and move to host
id based.
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
please note, because quite a few source files relied on
`utils/to_string.hh` to pull in the specialization of
`fmt::formatter<std::optional<T>>`, after removing
`#include <fmt/std.h>` from `utils/to_string.hh`, we have to
include `fmt/std.h` directly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This update addresses an issue in the mutation diff calculation
algorithm used during read repair. Previously, the algorithm used
`token` as the hashmap key. Since `token` is calculated basing on the
Murmur3 hash function, it could generate duplicate values for different
partition keys, causing corruption in the affected rows' values.
Fixesscylladb/scylladb#19101
Since the issue affects all the relevant scylla versions, backport to: 6.1, 6.2
Closesscylladb/scylladb#21996
* github.com:scylladb/scylladb:
storage_proxy/read_repair: Remove redundant 'schema' parameter from `data_read_resolver::resolve` function.
storage_proxy/read_repair: Use `partition_key` instead of `token` key for mutation diff calculation hashmap.
test: Add test case for checking read repair diff calculation when having conflicting keys.
replica writes are delayed according to the view update backlog in order
to apply backpressure and reduce the rate of incoming base writes when
the backlog is large, allowing slow replicas to catch up.
previously the backlog calculation considered only the pending targets,
excluding targets that replied successfuly, probably due to confusion in
the code. instead, we want to consider the backlog of all the targets
participating in the write.
Fixesscylladb/scylladb#21672Closesscylladb/scylladb#21935
function.
The `data_read_resolver` class inherits from `abstract_read_resolver`, which already includes the
`schema_ptr _schema` member. Therefore, using a separate function parameter in `data_read_resolver::resolve`
initialized with the same variable in `abstract_read_executor` is redundant.
diff calculation hashmap.
This update addresses an issue in the mutation diff calculation algorithm used during read repair.
Previously, the algorithm used `token` as the hashmap key. Since `token` is calculated basing on
the Murmur3 hash function, it could generate duplicate values for different partition keys, causing
corruption in the affected rows' values.
Fixesscylladb/scylladb#19101
topology::sort_by_proximity already sorts the local node
address first, if present, so look it up only when
using SimpleSnitch, where sort_by_proximity() is a no-op.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The data resolved has to apply all mutations from all replica to a
single mutation. In the extreme case, when all rows are dead, the
mutations can have around 10K rows in them. This is not a huge amount,
but it is enough to cause moderate stalls of <20ms.
To avoid this, use the gentle variant of apply(), which can yield in the
middle.
Fixes: scylladb/scylladb#21818Closesscylladb/scylladb#21884
This pull request is continuation of scylladb/scylladb#20688 - contents of the main commit are the same, the only change is the additional commit with a test.
Until this patch, the materialized view flow-control algorithm (https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/) used a constant delay_limit_us hard-coded to one second, which means that when the size of view-update backlog reached the maximum (10% of memory), we delay every request by an additional second - while smaller amounts of backlog will result in smaller delays.
This hard-coded one maximum second delay was considered *huge* - it will slow down a client with concurrency 1000 to just 1000 requests per second - but we already saw some workloads where it was not enough - such as a test workload running very slow reads at high concurrency on a slow machine, where a latency of over one second was expected for each read, so adding a one second latecy for writes wasn't having any noticable affect on slowing down the client.
So this patch replaces the hard-coded default with a live-updateable configuration parameter, `view_flow_control_delay_limit_in_ms`, which defaults to 1000ms as before.
Another useful way in which the new `view_flow_control_delay_limit_in_ms` can be used is to set it to 0. In that case, the view-update flow control always adds zero delay, and in effect - does absolutely nothing. This setting can be used in emergency situations where it is suspected that the MV flow control is not behaving properly, and the user wants to disable it.
The new parameter's help string mentions both these use cases of the parameter.
Fixes#18187
This is new functionality, no need to backport to any open source release.
Closesscylladb/scylladb#21647
* github.com:scylladb/scylladb:
materialized views: test for the MV delay configuration parameter
service: add injection for skipping view update backlog
materialized view: make flow-control maximum delay configurable
This change goes thru locator:topology to use node&
instead of node* where nullptr is not possible. There are
places where the node object is used in unordered_set, in
those cases the node is wrapped in std::reference_wrapper.
Fixesscylladb/scylladb#20357Closesscylladb/scylladb#21863
This commit adds the code needed to create a TRUNCATE global topology
request. It also adds the handler for this request to the topology
coordinator.
The execution of the truncate operation is not canceled on a timeout,
but the query coordinator side will return a timeout error.
Until this patch, the materialized view flow-control algorithm
(https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/)
used a constant delay_limit_us hard-coded to one second, which means
that when the size of view-update backlog reached the maximum (10%
of memory), we delay every request by an additional second - while
smaller amounts of backlog will result in smaller delays.
This hard-coded one maximum second delay was considered *huge* - it will
slow down a client with concurrency 1000 to just 1000 requests per
second - but we already saw some workloads where it was not enough -
such as a test workload running very slow reads at high concurrency
on a slow machine, where a latency of over one second was expected
for each read, so adding a one second latecy for writes wasn't having
any noticable affect on slowing down the client.
So this patch replaces the hard-coded default with a live-updateable
configuration parameter, `view_flow_control_delay_limit_in_ms`, which
defaults to 1000ms as before.
Another useful way in which the new `view_flow_control_delay_limit_in_ms`
can be used is to set it to 0. In that case, the view-update flow
control always adds zero delay, and in effect - does absolutely
nothing. This setting can be used in emergency situations where it
is suspected that the MV flow control is not behaving properly, and
the user wants to disable it.
The new parameter's help string mentions both these use cases of
the parameter.
Fixes#18187
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This change introduces a new truncate_with_tablets RPC with a parameter
of type service::frozen_topology_guard. This is materialized on replica
nodes into a topology_guard which guarantees that truncate is performed
under a global session, which, in turn, makes sure that we don't execute
truncate as a result of stale RPCs.
Also, this RPC does not have a timeout. Timeout will be handled on the
coordinator side, and the truncate operation will not be allowed to time
out.
This commit makes storage_proxy::remote dependent on raft_group0_client
and topology_state_machine. storage_proxy::remote gets references to these via
the call to start_remote(). These references will be needed to call
storage_service::truncate_table_with_tablets().