3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is a node-local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switch causes timeouts during upgrades in mixed clusters. Furthermore, switching to the new table unconditionally on upgraded nodes means that on rollback, the batches saved into the v2 table are lost.
This PR re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus remain rollback-compatible.
For simplicity, the re-introduced v1 support doesn't do post-replay cleanups. The cleanup in v1 was never particularly effective anyway, and we ended up disabling it for heavy batchlog users, so I don't think the lack of cleanup support is a problem.
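As a rough illustration of the gating (names and structure here are invented, not ScyllaDB's actual code), the table choice can hinge on a cluster-wide feature bit so that nodes keep writing to v1 until every node supports v2:
```cpp
// Illustrative sketch only: pick the batchlog table based on a cluster-wide
// feature bit, so mixed-version clusters keep writing to the v1 table.
#include <string_view>

struct feature_service_stub {
    bool batchlog_v2_enabled; // set only once all nodes advertise the feature
};

std::string_view batchlog_table_name(const feature_service_stub& features) {
    // While the feature is not yet cluster-wide, stick to the v1 table so a
    // rollback to the previous version does not lose saved batches.
    return features.batchlog_v2_enabled ? "batchlog_v2" : "batchlog";
}
```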
Fixes: https://github.com/scylladb/scylladb/issues/27886
Needs backport to 2026.1, to fix upgrades for clusters using batches
Closes scylladb/scylladb#28736
* github.com:scylladb/scylladb:
test/boost/batchlog_manager_test: add tests for v1 batchlog
test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
test/boost/batchlog_manager_test: fix indentation
test/boost/batchlog_manager_test: extract prepare_batches() method
test/lib/cql_assertions: is_rows(): add dump parameter
tools/scylla-sstable: extract query result printers
tools/scylla-sstable: add std::ostream& arg to query result printers
repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
db/batchlog_manager: re-add v1 support
db/batchlog_manager: return all_replayed from process_batch()
db/batchlog_manager: process_batch(): fix indentation
db/batchlog_manager: make batch() a standalone function
db/batchlog_manager: make structs stats public
db/batchlog_manager: allocate limiter on the stack
db/batchlog_manager: add feature_service dependency
gms/feature_service: add batchlog_v2 feature
The PR removes most of the code that assumes that group0 and raft topology are not enabled. It also makes sure that joining a cluster in non-raft mode, or upgrading a node to this version in a cluster that does not yet use raft topology, will fail.
Refs #15422
No backport needed since this removes functionality.
Closes scylladb/scylladb#28514
* https://github.com/scylladb/scylladb:
group0: fix indentation after previous patch
raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more
raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream
raft_group0: remove unused code from raft_group0
node_ops: remove topology over node ops code
topology: fix indentation after the previous patch
topology: drop topology_change_enabled parameter from raft_group0 code
storage_service: remove unused handle_state_* functions
gossiper: drop wait_for_gossip_to_settle and deprecate the corresponding option
storage_service: fix indentation after the last patch
storage_service: remove gossiper bootstrapping code
storage_service: drop get_group_server_if_raft_topolgy_enabled
storage_service: drop is_topology_coordinator_enabled and its uses
storage_service: drop run_with_api_lock_in_gossiper_mode_only
topology: remove code that assumes raft_topology_change_enabled() may return false
test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode
test: schema_change_test: drop schema tests relevant for no raft mode only
topology: remove upgrade to raft topology code
group0: remove upgrade to group0 code
group0: refuse to boot if a cluster is still not in raft topology mode
storage_service: refuse to join a cluster in legacy mode
This series closes a gap in how CQL request and response sizes are reported.
Previously, request_size and response_size were tracked as simple counters,
providing only cumulative totals per shard. This made it difficult to understand
the distribution of message sizes and identify potential issues with very large
or very small requests.
After this series, the CQL transport reports detailed histogram metrics showing
the distribution of request and response sizes. These histograms are tracked
per-instance, per-type (per op), and per-scheduling-group, providing
much better visibility into CQL traffic patterns.
The histograms are collected for QUERY, EXECUTE, and BATCH operations, which are
the primary data path operations where message size distribution is most relevant.
This data can help identify:
- Clients sending unexpectedly large requests
- Operations with oversized result sets
- Scheduling group differences in traffic patterns
To support this, the series extends the approx_exponential_histogram template to
track an accurate sum, and adds a bytes_histogram type alias optimized for byte-range measurements (1KB to 1GB).
The existing per-shard counter metrics are maintained for backward compatibility.
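For a feel of what such a histogram looks like, here is a hedged sketch (the class name and internal layout are invented; the real approx_exponential_histogram template differs) of an exponential-bucket byte histogram that keeps an exact sum alongside the bucket counts, with boundaries matching the 1 KiB to 1 GiB range in the example output below:
```cpp
// Sketch of an exponential-bucket histogram with an exact running sum.
#include <array>
#include <cstddef>
#include <cstdint>

class bytes_histogram_sketch {
    static constexpr uint64_t first_bucket = 1024; // 1 KiB
    static constexpr size_t num_buckets = 21;      // 1 KiB * 2^20 = 1 GiB
    std::array<uint64_t, num_buckets> _buckets{};
    uint64_t _count = 0;
    uint64_t _sum = 0;                              // exact sum, not an estimate
public:
    void add(uint64_t bytes) {
        size_t i = 0;
        uint64_t upper = first_bucket;
        // Find the first bucket whose upper bound covers the value (doubling bounds).
        while (i + 1 < num_buckets && bytes > upper) {
            upper *= 2;
            ++i;
        }
        ++_buckets[i];
        ++_count;
        _sum += bytes;                              // accumulate the true value
    }
    uint64_t count() const { return _count; }
    uint64_t sum() const { return _sum; }
};
```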
Metrics example:
```
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
**The field sees it as an important issue**
Fixes #14850
Closes scylladb/scylladb#28419
* github.com:scylladb/scylladb:
test/boost/estimated_histogram_test.cc: Switch to real Sum
transport/server: to bytes_histogram
approx_exponential_histogram: Add sum() method for accurate value tracking
utils/estimated_histogram.hh: Add bytes_histogram
When a permit is preemptively aborted, store the corresponding
exception in the permit's member `reader_permit::impl::_ex`.
This makes preemptively-aborted permits consistently report aborted()
and prevents them from being treated as eligible for inactive
registration in `register_inactive_read()`, avoiding assertion
failures on unexpected permit state.
Closes scylladb/scylladb#28591
Refs: SCYLLADB-193
Adds a "snapshot_table" topology operation and associated data structure/table columns to support dispatching a snapshot operation as a topo coordinator op.
The logic is similar to truncation, and is thus broken out and semi-shared with it.
Also adds optional tablet metadata to the manifest, listing all tablets present in a given snapshot, as well as
tablet sstable ownership, repair status, and token ranges.
As per the description in SCYLLADB-193, the alternative snapshot mechanism lives in
a separate namespace under 'tablets', which, while dubious, is the desired destination.
The API is accessed via `nodetool cluster snapshot`, which more or less mirrors `nodetool snapshot`, but uses the topology operation.
TTL is added to message propagation as a separate patch here, since it is not (yet) used from the API (or nodetool);
it requires a syntax for both the API and the command line.
Closes scylladb/scylladb#28525
* github.com:scylladb/scylladb:
topology::snapshot: Add expiry (ttl) to RPC/topo op
test_snapshot_with_tablets: Extend test to check manifest content
table::manifest: Add tablet info to manifest.json
test::test_snapshot_with_tablets: Add small test for topo coordinated snapshot
scylla-nodetool: Add "cluster snapshot" command
api::storage_service: Add tablets/snapshots command for cluster level snapshot
db::snapshot-ctl: Add method to do snapshot using topo coordinator
storage_proxy: Add snapshot_keyspace method
topology_coordinator: Add handler for snapshot_tables
storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS
messaging_service: Add SNAPSHOT_WITH_TABLETS verb
feature_service: Add SNAPSHOT_AS_TOPOLOGY_OPERATION feature
topology_mutation: Add setter for snapshot part of row
system_keyspace::topology_requests_entry: Add snapshot info to table
topology_state_machine: Add snapshot_tables operation
topology_coordinator: Break out logic from handle_truncate_table
storage_proxy: Break out logic from request_truncate_with_tablets
test/object_store: Remove create_ks_and_cf() helper
test/object_store: Replace create_ks_and_cf() usage with standard methods
test/object_store: Shift indentation right for test cases
This patch adds a unit test for tablet_map::get_secondary_replica().
It was never officially defined how the "primary" and "secondary"
replicas were chosen, and their implementation changed over time,
but the one invariant that this test verifies is that the secondary
replica and the primary replica must be a different node.
This test reproduces issue SCYLLADB-777, where we discovered that
get_primary_replica() changed without a corresponding change to
get_secondary_replica(). So before the previous patch this test failed,
and after the previous patch it passes.
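For reference, the invariant the new unit test checks boils down to something like the following sketch (the stub type and its placement logic are invented, not the real locator::tablet_map API):
```cpp
// Hedged sketch of the invariant under test: for every tablet, the secondary
// replica must live on a different host than the primary replica.
#include <cassert>

struct tablet_map_stub {
    // Pretend placement: three hosts, replicas chosen round-robin.
    unsigned get_primary_replica(unsigned tablet) const   { return tablet % 3; }
    unsigned get_secondary_replica(unsigned tablet) const { return (tablet + 1) % 3; }
    unsigned tablet_count() const { return 128; }
};

void check_secondary_differs_from_primary(const tablet_map_stub& tmap) {
    for (unsigned t = 0; t < tmap.tablet_count(); ++t) {
        // The one invariant the test verifies.
        assert(tmap.get_primary_replica(t) != tmap.get_secondary_replica(t));
    }
}
```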
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
They were running in recovery mode to reuse existing system tables without
a group0 id, but since we want to remove recovery mode we need to
re-generate the tables.
Make the actual table name a parameter and add logic to adapt to the
variant used.
Also pass dump_to_log::yes to the is_rows() invocation to help with debugging
tests.
There is no point running repair for tables using RF one. Row-level
repair will skip it, but the auto repair scheduler will keep scheduling
such repairs since repair_time cannot be updated.
Skip such repairs at the scheduler level for auto repair.
If the request is issued by a user, we still have to schedule the
repair; otherwise the user request would never finish.
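A minimal sketch of that scheduler-level decision (names invented for illustration, not the actual repair scheduler code):
```cpp
// Auto-scheduled repairs skip RF=1 tables; user-initiated repairs still run
// so the user request can complete.
enum class repair_origin { auto_scheduled, user_requested };

bool should_skip_repair(unsigned replication_factor, repair_origin origin) {
    // With a single replica there is nothing to compare against, so an
    // automatically scheduled repair is pointless. A user request must still
    // be scheduled (and finished) so that the request itself can complete.
    return replication_factor <= 1 && origin == repair_origin::auto_scheduled;
}
```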
Fixes SCYLLADB-561
Closes scylladb/scylladb#28640
This patchset replaces the loading_cache-based permissions cache with a new unified (permissions and roles), full, coherent auth cache.
The reason for the change is that we want to improve behavior under stress and simplify operation manuals. The new cache doesn't require any tweaking, and it behaves particularly better in scenarios with lots of schema entities (e.g. tables) combined with unprepared queries. The old cache can generate a few thousand extra internal tps due to cache refresh.
A benchmark of unprepared statements (just to populate the cache) with 1000 tables shows a 3k tps reduction in internal reads and a 9.1% reduction in median instructions per op. That many tables were used to show the resource impact; the cache could be filled with other resource types to show the same improvement.
Backport: no, it's a new feature.
Fixes https://github.com/scylladb/scylladb/issues/7397
Fixes https://github.com/scylladb/scylladb/issues/3693
Fixes https://github.com/scylladb/scylladb/issues/2589
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-147
Closes scylladb/scylladb#28078
* github.com:scylladb/scylladb:
test: boost: add auth cache tests
auth: add cache size metrics
docs: conf: update permissions cache documentation
auth: remove old permissions cache
auth: use unified cache for permissions
auth: ldap: add permissions reload to unified cache
auth: add permissions cache to auth/cache
auth: add service::revoke_all as main entry point
auth: explicitly life-extend resource in auth_migration_listener
Today the S3 client has a well-established and (hopefully) well-tested HTTP request retry strategy; in the rest of the clients we keep trying to achieve the same by writing the same code over and over again, and of course missing corner cases that have already been addressed in the S3 client.
This PR extracts code that helps other clients detect the retryability of an error originating from the HTTP client, reuses the built-in seastar HTTP client retryability, and minimizes the boilerplate of HTTP client exception handling.
No backport needed since it is only refactoring of the existing code
Closes scylladb/scylladb#28250
* github.com:scylladb/scylladb:
exceptions: add helper to build a chain of error handlers
http: extract error classification code
aws_error: extract `retryable` from aws_error
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits
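To make the constraint concrete, here is a hedged sketch of a part-size computation that respects S3's documented limits of at most 10,000 parts and a 5 MiB to 5 GiB per-part range (the function name and rounding policy are assumptions, not the actual s3_client code):
```cpp
// Grow the part size until the object fits into at most 10,000 parts,
// while keeping each part within the allowed per-part size range.
#include <algorithm>
#include <cstdint>
#include <stdexcept>

constexpr uint64_t s3_min_part_size = 5ull * 1024 * 1024;        // 5 MiB (minimum, except last part)
constexpr uint64_t s3_max_part_size = 5ull * 1024 * 1024 * 1024; // 5 GiB (maximum per part)
constexpr uint64_t s3_max_parts = 10'000;                        // maximum number of parts

uint64_t calc_part_size_sketch(uint64_t object_size, uint64_t preferred_part_size) {
    // Start from the preferred size, clamped into the allowed per-part range.
    uint64_t part_size = std::clamp(preferred_part_size, s3_min_part_size, s3_max_part_size);
    // Ceiling division avoids ending up with 10,001 parts just above a boundary.
    while ((object_size + part_size - 1) / part_size > s3_max_parts) {
        part_size *= 2;
        if (part_size > s3_max_part_size) {
            throw std::runtime_error("object too large for S3 multipart limits");
        }
    }
    return part_size;
}
```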
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters
Closes scylladb/scylladb#28592
* github.com:scylladb/scylladb:
s3_client: add more constrains to the calc_part_size
s3_client: add tests for calc_part_size
s3_client: correct multipart part-size logic to respect 10k limit
Improves performance of deserialization of vector data for calculating similarity functions.
Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float>
and then pass it to similarity functions as a std::span<const float>.
This avoids the overhead of data_value allocations and conversions.
Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`:
client concurrency 1: before: ~135 QPS, after: ~1005 QPS
client concurrency 20: before: ~280 QPS, after: ~2097 QPS
Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search)
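The shape of the optimization is roughly the following (function names and the big-endian wire layout are assumptions for the example, not ScyllaDB's actual code):
```cpp
// Decode a fixed-dimension vector<float> column straight into a contiguous
// float buffer and hand it to the similarity function as a span, skipping
// per-element data_value objects.
#include <cmath>
#include <cstdint>
#include <cstring>
#include <span>
#include <vector>

std::vector<float> deserialize_float_vector(const uint8_t* data, size_t dimensions) {
    std::vector<float> out(dimensions);
    for (size_t i = 0; i < dimensions; ++i) {
        const uint8_t* p = data + i * 4;
        // Assemble the big-endian 32-bit pattern and reinterpret it as a float.
        uint32_t bits = (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                        (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
        std::memcpy(&out[i], &bits, sizeof(float));
    }
    return out;
}

float similarity_cosine(std::span<const float> a, std::span<const float> b) {
    float dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```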
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471
Closes scylladb/scylladb#28615
The cache is already covered by general auth
dtests, but some cases are trickier and easier
to express directly as calls to the cache class.
A boost test file was added for such tests.
Fixes #28398
Fixes #28399
When used as path elements in Google Storage paths, object names need to be URL-encoded. This was missed due to
a.) tests not really using prefixes that include non-URL-valid characters (i.e. / etc.)
and
b.) the mock server used for most testing not enforcing this particular aspect.
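A minimal sketch of the encoding itself, assuming the RFC 3986 unreserved character set (illustrative only, not the actual utils/gcp/object_storage implementation):
```cpp
// Percent-encode an object name before embedding it as a path element in a URL.
#include <cstdio>
#include <string>
#include <string_view>

std::string url_encode_path_segment(std::string_view name) {
    std::string out;
    out.reserve(name.size());
    for (unsigned char c : name) {
        bool unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                          (c >= '0' && c <= '9') || c == '-' || c == '_' || c == '.' || c == '~';
        if (unreserved) {
            out.push_back(static_cast<char>(c));
        } else {
            char buf[4];
            std::snprintf(buf, sizeof(buf), "%%%02X", static_cast<unsigned>(c)); // e.g. '/' -> "%2F"
            out.append(buf);
        }
    }
    return out;
}
```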
Modified unit tests to use prefixing for all names, so that when running against real GS, any errors like this will show up.
"Real" GCS also behaves a bit differently from the mock when listing with a pager:
it will not return a pager token for the last page, only for the penultimate one.
Adds handling for this.
Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected.
Closes scylladb/scylladb#28400
* github.com:scylladb/scylladb:
utils/gcp/object_storage: URL-encode object names in URLs
utils::gcp::object_storage: Fix list object pager end condition detection
It turns out that the cdc driver requires permissions on two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with the vector store was tested manually; integration tests will be added in the vector-store repository in a follow-up PR.
Fixes: SCYLLADB-522
Closes scylladb/scylladb#28519
related PR: https://github.com/scylladb/scylladb/pull/27527
This PR changes test.py's logic for parsing boost test cases to use `-- --list_json_content`
and to pass boost labels as pytest markers.
Using `-- --list_json_content` is not ideal and currently requires implementing several [workarounds](https://github.com/scylladb/scylladb/pull/27527#issuecomment-3765499812), but having the ability to support boost labels in pytest is worth it, because now we can apply the tiering mechanism to boost tests as well.
Fixes SCYLLADB-246
Closes scylladb/scylladb#28232
* github.com:scylladb/scylladb:
test: add nightly label
test.py: support boost labels in test.py
add nightly label for test
test_foreign_reader_as_mutation_source
as an example of using boost labels as pytest markers.
Command to test:
./tools/toolchain/dbuild pytest --test-py-init --collect-only -q -m=nightly test/boost
output:
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.debug.1
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.release.1
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.dev.1
Move to `replica/`, drop `flat` from name and drop unused usages as well as unused includes.
Code cleanup, no backport
Closes scylladb/scylladb#28353
* github.com:scylladb/scylladb:
replica/partition_snapshot_reader: remove unused includes
partition_snapshot_reader: remove "flat" from name
mv partition_snapshot_reader.hh -> replica/
This test case was observed to take over 2 minutes to run on CI
machines, contributing to already bloated CI run times.
Disable this test in debug mode. This test checks for memtable flush
being slowed down when compaction can't keep up. So this test needs to
overwhelm the CPU by definition. On the other hand, this is not a
correctness test; there are such tests for the memtable and compaction
already, so it is not critical to run this in debug mode: it is not
expected to catch any use-after-free and such.
Closes scylladb/scylladb#28407
When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued. They can time out while in the queue, but will not time
out once admitted.
Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes
for a queued read to be admitted is around that of the timeout.
If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and aborting it if not. It uses the following criterion:
if the read's remaining time <= the read's timeout on arrival at the semaphore * preemptive_abort_factor (a live-updatable setting),
the read is rejected and the next one from the wait list is considered.
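A hypothetical illustration of that admission-time check (names are invented; the real reader_concurrency_semaphore code differs):
```cpp
// A read whose remaining time is at most a configurable fraction of its
// original timeout is aborted instead of being admitted.
#include <chrono>

using steady = std::chrono::steady_clock;

bool should_preemptively_abort(steady::time_point now,
                               steady::time_point deadline,
                               steady::duration original_timeout,
                               double preemptive_abort_factor /* clamped to [0.0, 1.0] */) {
    auto remaining = deadline - now;
    // factor 0.0: never abort (a live read always has remaining > 0);
    // factor 1.0: always abort at admission, since remaining <= original_timeout.
    return remaining <= original_timeout * preemptive_abort_factor;
}
```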
Fixes https://github.com/scylladb/scylladb/issues/14909
Fixes: SCYLLADB-353
Backport is not needed. Better to first observe its impact.
Closes scylladb/scylladb#21649
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: Check during admission if read may timeout
permit_reader::impl: Replace break with return after evicting inactive permit on timeout
reader_concurrency_semaphore: Add preemptive_abort_factor to constructors
config: Add parameters to control reads' preemptive_abort_factor
permit_reader: Add a new state: preemptive_aborted
reader_concurrency_semaphore: validate waiters counter when dequeueing a waiting permit
reader_concurrency_semaphore: Remove cpu_concurrency's default value
Now that the sum function in the histogram uses true values instead of
an estimate, the test should reflect that.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When a shard on a replica is overloaded, it breaks down completely,
throughput collapses, latencies go through the roof and the
node/shard can even become completely unresponsive to new connection
attempts.
When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued and thus they can time out while being in the queue or during
the execution. In the latter case, the timeout does not always
result in the read being aborted.
Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes
for a queued read to be admitted is around that of the timeout.
If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and aborting it if not. It uses the following criterion:
if the read's remaining time <= the read's timeout on arrival at the semaphore * preemptive factor,
the read is rejected and the next one from the wait list is
considered.
The new parameter parametrizes the factor used to reject a read
during admission. Its value shall be between 0.0 and 1.0 where
+ 0.0 means a read will never get rejected during admission
+ 1.0 means a read will immediately get rejected during admission
Although passing values outside the interval is possible, they
will have the exact same effect as if they were clamped to [0.0, 1.0].
Fixes #28398
When used as path elements in Google Storage paths, object names
need to be URL-encoded. This was missed because a.) tests were not really using
prefixes that include non-URL-valid characters (i.e. / etc.) and b.) the mock server
used for most testing does not enforce this particular aspect.
Modified unit tests to use prefixing for all names, so that when run
against real GS, any errors like this will show up.
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.
Introduce RR TTL and connection-failure retry to:
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses on connection failure
- handle the special case when the TTL is 0 and the record must be resolved for every request
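Conceptually, the TTL-driven part boils down to something like this sketch (types and names are invented; the real dns_connection_factory differs):
```cpp
// A cached resolution result expires after `ttl`; a TTL of zero means
// "resolve for every request".
#include <chrono>
#include <string>
#include <vector>

struct resolved_addresses {
    std::vector<std::string> addresses;                       // all resolved records, not just the first
    std::chrono::steady_clock::time_point resolved_at;
};

bool needs_re_resolution(const resolved_addresses& cached,
                         std::chrono::seconds ttl,
                         std::chrono::steady_clock::time_point now) {
    if (ttl == std::chrono::seconds::zero()) {
        return true;                                          // TTL 0: resolve on every request
    }
    return now - cached.resolved_at >= ttl;                   // stale entry: refresh the record
}
```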
Fixes: CUSTOMER-96
Fixes: CUSTOMER-139
Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3
Closes scylladb/scylladb#27891
* github.com:scylladb/scylladb:
connection_factory: includes cleanup
dns_connection_factory: refine the move constructor
connection_factory: retry on failure
connection_factory: introduce TTL timer
connection_factory: get rid of shared_future in dns_connection_factory
connection_factory: extract connection logic into a member
connection_factory: remove unnecessary `else`
connection_factory: use all resolved DNS addresses
s3_test: remove client double-close
Commit 59faa6d introduces a new parameter called cpu_concurrency
and sets its default value to 1, which violates commit fbb83dd, which
removed all default values from constructors except the one used by the unit
tests.
The patch removes the default value of the cpu_concurrency parameter
and alters tests to use the test dedicated reader_concurrency_semaphore
constructor wherever possible.
consistent_cluster_management has been deprecated since scylla-5.2 and is no
longer used by ScyllaDB, so it should not be used by tests either.
Closes scylladb/scylladb#28340
These two streams mostly play together. The former provides an input_stream reading from in-memory temporary buffers; the latter wraps it to limit the size of the provided temporary buffers. Both are used to test the contiguous data consumer, and buffer_input_stream also has a caller in the sstables reversing reader.
This PR removes the buffer_input_stream in favor of seastar memory_data_source, and moves the limiting_input_stream into test/lib.
Enhancing testing code, not backporting.
Closes scylladb/scylladb#28352
* github.com:scylladb/scylladb:
code: Move limiting data source to test/lib
util: Simplify limiting_data_source API
util: Remove buffer_input_stream
test: Use seastar::util::temporary_buffer_data_source in data consumer test
sstables: Use seastar::util::as_input_stream() in mx reader
This compaction group testing is useless because the machinery for it
to work was removed. This was useful in the early tablet days, when
we wanted to test compaction groups directly. Today, groups are stressed
and tested in every tablet test.
I see a ~40% reduction in run time after this patch, since database_test is
one of the most (if not the most) time-consuming tests in the boost suite.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#28324
The partition snapshot lives in mutation/, however mutation/ is a lower
level concept than a mutation reader. The next best place for this
reader is the replica/ directory, where the memtable, its main user,
also lives.
Also move the code to the replica namespace.
test/boost/mvcc_test.cc includes this header but doesn't use anything
from it. Instead of updating the include path, just drop the unused
include.
Only two tests use it now -- the limiting-data-source-test itself and a test
that validates the continuous_data_consumer template.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The source maintains a "limit generator" -- a function that returns the
maximum number of bytes to return from the next buffer.
Currently all callers just return constant numbers from it. Passing a
function that returns a non-constant value could, probably, be used for a
fuzz test, but even the limiting-data-source-test itself doesn't do it,
so what's the point...
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test creates buffer_data_source_impl and wraps it with limiting data
source. The former data_source duplicates the functionality of the
existing seastar temporary_buffer_data_source.
This patch makes the test code use seastar facility. The
buffer_data_source_impl will be removed soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.
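A simplified sketch of that unwrap-and-reclassify loop (the classifier type and function names are invented for illustration):
```cpp
// Walk a chain of std::nested_exception, re-applying every registered
// classifier at each level until one recognizes the error or the chain ends.
#include <exception>
#include <functional>
#include <vector>

using classifier = std::function<bool(const std::exception&)>; // returns true if retryable

bool is_retryable(const std::exception& e, const std::vector<classifier>& classifiers) {
    // Apply every known classifier to the current exception.
    for (const auto& c : classifiers) {
        if (c(e)) {
            return true;
        }
    }
    // Unwrap one level of std::nested_exception, if any, and recurse.
    if (const auto* nested = dynamic_cast<const std::nested_exception*>(&e);
            nested && nested->nested_ptr()) {
        try {
            std::rethrow_exception(nested->nested_ptr());
        } catch (const std::exception& inner) {
            return is_retryable(inner, classifiers);
        } catch (...) {
            return false; // nested exception of unknown type: treat as non-retryable
        }
    }
    return false;
}
```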
Closes scylladb/scylladb#28240
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.
Allows multiple decommission/removenode operations to happen in parallel, which
is important for efficiency.
Flow of decommission/removenode request:
1) pending and paused, has tablet replicas on target node.
Tablet scheduler will start draining tablets.
2) No tablets on target node, request is pending but not paused
3) Request is scheduled, node is in transition
4) Request is done
Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.
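A rough sketch of that paused-state computation (types invented for illustration, not the actual topology state machine code):
```cpp
// A leave/remove request stays paused while the target node still holds any
// tablet replicas; once draining empties it, the request becomes schedulable.
#include <unordered_map>
#include <unordered_set>

using host_id = unsigned;
using tablet_id = unsigned;

bool request_is_paused(host_id target,
                       const std::unordered_map<tablet_id, std::unordered_set<host_id>>& tablet_replicas) {
    for (const auto& [tablet, hosts] : tablet_replicas) {
        if (hosts.contains(target)) {
            return true;  // keep draining; the topology coordinator must not pick the request yet
        }
    }
    return false;         // no tablet replicas left on the node: ready to be scheduled
}
```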
When a request is not paused, its execution starts in the
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).
Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.
Fixes #21452
Closes scylladb/scylladb#24129
* https://github.com/scylladb/scylladb:
docs: Document parallel decommission and removenode and relevant task API
test: Add tests for parallel decommission/removenode
test: util: Introduce ensure_group0_leader_on()
test: tablets: Check that there are no migrations scheduled on draining nodes
test: lib: topology_builder: Introduce add_draining_request()
topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
tablets: topology_coordinator: Refactor to propagate reason for migration rollback
tablet_allocator: Skip co-location on draining nodes
node_ops: task_manager_module: Populate entity field also for active requests
tasks: node_ops: Put node id in the entity field
tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
topology: Protect against empty cancelation reason
tasks, topology: Make pending node operations abortable
doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
raft_topology, tablets: Drain tablets in parallel with other topology operations
virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
locator: topology: Add "draining" flag to a node
topology_coordinator: Extract generate_cancel_request_update()
storage_service: Drop dependency in topology_state_machine.hh in the header
locator: Extract common code in assert_rf_rack_valid_keyspace()
topology_coordinator, storage_service: Validate node removal/decommission at request submission time