Update the regular expression in `check_node_log_for_failed_mutations` to avoid
false test failures when DEBUG-level logging is enabled.
Fixes scylladb/scylladb#23688
Closes scylladb/scylladb#23658
Fixes #23774
The test code falls into the same when_all issue as the http client did.
Avoid passing exceptions through it; instead, catch and report them in
the worker lambda.
Closes scylladb/scylladb#23778
Implement the CopyObject API to directly copy an S3 object from one location to another. This implementation incurs zero networking overhead on the client side, since the object is copied internally by the S3 machinery.
Usage example: backing up tiered SSTables. If you already have SSTables on S3, CopyObject is the ideal way to go.
No need to backport, since we are adding new functionality for future use.
Closes scylladb/scylladb#23779
* github.com:scylladb/scylladb:
s3_client: implement S3 copy object
s3_client: improve exception message
s3_client: reposition local function for future use
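To illustrate the zero-client-overhead point above, here is a minimal sketch of the shape of the underlying S3 REST call (the helper name is illustrative, not the actual `s3_client` API): CopyObject is a PUT to the destination object carrying an `x-amz-copy-source` header, so the object bytes never pass through the client.

```python
# Illustrative sketch of the S3 CopyObject request shape. The function name
# and tuple layout are assumptions for this example, not the s3_client API.

def build_copy_object_request(src_bucket, src_key, dst_bucket, dst_key):
    """Return (method, path, headers) for an S3 CopyObject call."""
    return (
        "PUT",
        f"/{dst_bucket}/{dst_key}",
        # The copy source is encoded as "/<bucket>/<key>"; S3 performs the
        # copy server-side, so the request body stays empty.
        {"x-amz-copy-source": f"/{src_bucket}/{src_key}"},
    )

method, path, headers = build_copy_object_request(
    "backups", "old/sst-1", "backups", "new/sst-1")
print(method, path, headers["x-amz-copy-source"])
```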
This PR enhances S3 throughput by leveraging every available shard to upload backup files concurrently. By distributing the load across multiple shards, we significantly improve the upload performance. Each shard retrieves an SSTable and processes its files sequentially, ensuring efficient, file-by-file uploads.
To prevent uncontrolled fiber creation and potential resource exhaustion, the backup task employs a directory semaphore from the sstables_manager. This mechanism helps regulate concurrency at the directory level, ensuring stable and predictable performance during large-scale backup operations.
Refs #22460
Fixes #22520
```
===========================================
Release build, master, smp-16, mem-32GiB
Bytes: 2342880184, backup time: 9.51 s
===========================================
Release build, this PR, smp-16, mem-32GiB
Bytes: 2342891015, backup time: 1.23 s
===========================================
```
It looks to be at least 7.7x faster (9.51 s / 1.23 s ≈ 7.7).
No backport needed since it (native backup) is still unused functionality
Closes scylladb/scylladb#23727
* github.com:scylladb/scylladb:
backup: Add test for invalid endpoint
backup_task: upload on all shards
backup_task: integrate sharded storage manager for upload
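The upload scheme described above can be sketched with a toy asyncio model (all names here are illustrative, not the `backup_task` code): each "shard" pulls SSTables from a shared queue and uploads their files sequentially, while a semaphore regulates how many directories are processed concurrently.

```python
# Toy model of sharded, semaphore-regulated backup uploads (illustrative).
import asyncio

async def upload_file(name):
    await asyncio.sleep(0)          # stand-in for the actual S3 upload
    return name

async def shard_worker(queue, dir_sem, uploaded):
    while True:
        sstable = await queue.get()
        if sstable is None:         # sentinel: no more work for this shard
            queue.task_done()
            return
        async with dir_sem:         # regulate directory-level concurrency
            for f in sstable:       # files of one SSTable go sequentially
                uploaded.append(await upload_file(f))
        queue.task_done()

async def backup(sstables, nr_shards=4, dir_concurrency=2):
    queue, uploaded = asyncio.Queue(), []
    dir_sem = asyncio.Semaphore(dir_concurrency)
    for sst in sstables:
        queue.put_nowait(sst)
    for _ in range(nr_shards):      # one sentinel per worker
        queue.put_nowait(None)
    await asyncio.gather(*(shard_worker(queue, dir_sem, uploaded)
                           for _ in range(nr_shards)))
    return uploaded

files = asyncio.run(backup([["a-Data.db", "a-Index.db"], ["b-Data.db"]]))
print(sorted(files))
```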
Commit 14bf09f447 added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer.
But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too.
But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes.
In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator).
In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments.
Consequences of the bug:
1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2.
2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though).
3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory.
There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew.
But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments.
If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation.
Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781
This is a regression fix, should be backported to all affected releases.
Closes scylladb/scylladb#23782
* github.com:scylladb/scylladb:
managed_bytes_test: add a reproducer for #23781
managed_bytes: in the copy constructor, respect the target preferred allocation size
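The copy-constructor fix above can be illustrated with a toy model (illustrative names, not the `managed_bytes` code): when copying, the chunk layout must follow the *target* allocator's preferred maximum contiguous allocation size, rather than blindly preserving the source's single-chunk layout.

```python
# Toy model: re-chunk copied bytes for the target allocator's preferred size.

def copy_chunks(source_chunks, target_max_chunk):
    """Copy a chunked byte buffer, re-chunking for the target's limit."""
    data = b"".join(source_chunks)
    # A single-chunk source must not stay single-chunk if it exceeds the
    # target's preferred max contiguous allocation (e.g. an LSA segment).
    return [data[i:i + target_max_chunk]
            for i in range(0, len(data), target_max_chunk)] or [b""]

# A 16 KiB single-chunk buffer copied into an allocator with a 4 KiB limit
# becomes four 4 KiB chunks instead of one oversized allocation.
chunks = copy_chunks([b"x" * 16384], 4096)
print(len(chunks), max(len(c) for c in chunks))
```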
There are two tests which exercise incremental read repair: one with row tombstones, the other with partition tombstones. The tests currently force vnodes by creating the test keyspace with {'enabled': false}. Even so, the tests were found to be flaky, so one of them is marked for skip. This commit makes the following changes:
* Make the tests use tablets by creating the test keyspace with tablets.
* Change the way the tests write data so it works with tablets: currently the tests use scylla-sstable write + upload, but this won't work with tablets, since upload with tablets implies --load-and-stream, which means data is streamed to all replicas (no difference created between nodes). Switch to the classic stop-node + write to the other replica with CL=ONE.
* Remove the skip added to the partition-tombstone test variant.
Fixes: #21179
Test improvement, no backport required.
Closes scylladb/scylladb#23167
* github.com:scylladb/scylladb:
wip
test/cluster/test_read_repair: make incremental test work with tablets
* During the development phase, the backup functionality broke because we lacked a test that runs backup with an invalid endpoint. This commit adds a test to cover that scenario.
* Add a check that the expected error is propagated from a failing/aborted backup.
Use all shards to upload snapshot files to S3, by using the
sharded sstables_manager_for_table infrastructure.
Refs #22460
Quick perf comparison
===========================================
Release build, master, smp-16, mem-32GiB
Bytes: 2342880184, backup time: 9.51 s
===========================================
Release build, this PR, smp-16, mem-32GiB
Bytes: 2342891015, backup time: 1.23 s
===========================================
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Co-authored-by: Ernest Zaslavsky <ernest.zaslavsky@scylladb.com>
Add missing awaits for the rebuild_repair and repair background actions.
Although the background actions hold the _async_gate
which is closed in topology_coordinator::run(),
stop() still needs to await all background action futures
and handle any errors they may have left behind.
Fixes #23755
* The issue has existed since 6.2
Closes scylladb/scylladb#17712
* github.com:scylladb/scylladb:
topology_coordinator: stop: await all background_action_holder:s
topology_coordinator: stop: improve error messages
topology_coordinator: stop: define stop_background_action helper
Instead of hardcoding PR_NUM=$1 and FORCE=$2: the current setup is
not very flexible, and one gets no feedback if the arguments are
incorrect or unrecognized.
Add proper position-independent argument parsing using a classic
while/case loop.
Closes scylladb/scylladb#23623
This series adds support for reporting consumed capacity in BatchGetItem operations in Alternator.
It includes changes to the RCU accounting logic, exposing internal functionality to support batch-specific behavior, and adds corresponding tests for both simple and complex use cases involving multiple tables and consistency modes.
Needs backporting to 2025.1, as RCU and WCU are not fully supported.
Fixes #23690
Closes scylladb/scylladb#23691
* github.com:scylladb/scylladb:
test_returnconsumedcapacity.py: test RCU for batch get item
alternator/executor: Add RCU support for batch get items
alternator/consumed_capacity: make functionality public
Add support for the CopyObject API to enable direct copying of S3
objects between locations. This approach eliminates networking
overhead on the client side, as the operation is handled internally
by S3.
There are two tests which exercise incremental read repair: one with row
tombstones, the other with partition tombstones. The tests currently
force vnodes by creating the test keyspace with {'enabled': false}.
Even so, the tests were found to be flaky, so one of them is marked for skip.
This commit does the following changes:
* Make the tests use tablets by creating the test keyspace with tablets.
* Change the way the tests write data so it works with tablets:
  currently the tests use scylla-sstable write + upload, but this won't
  work with tablets, since upload with tablets implies --load-and-stream,
  which means data is streamed to all replicas (no difference created
  between nodes). Switch to the classic stop-node + write to the other
  replica with CL=ONE.
* Remove the skip added to the partition-tombstone test variant.
Also add tracing to the read-repair query, to make debugging the test
easier if it fails.
Fixes: #21179
Commit 14bf09f447 added a single-chunk
layout to `managed_bytes`, which makes the overhead of `managed_bytes`
smaller in the common case of a small buffer.
But there was a bug in it. In the copy constructor of `managed_bytes`,
a copy of a single-chunk `managed_bytes` is made single-chunk too.
But this is wrong, because the source of the copy and the target
of the copy might have different preferred max contiguous allocation
sizes.
In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB
is copied from the standard allocator into LSA, the resulting
`managed_bytes` is a single chunk which violates LSA's preferred
allocation size. (And therefore is placed by LSA in the standard
allocator).
In other words, since Scylla 6.0, cache and memtable cells
between 13 kiB and 128 kiB are getting allocated in the standard allocator
rather than inside LSA segments.
Consequences of the bug:
1. Effective memory consumption of an affected cell is rounded up to the nearest
power of 2.
2. With a pathological-enough allocation pattern
(for example, one which somehow ends up placing a single 16 kiB
memtable-owned allocation in every aligned 128 kiB span),
memtable flushing could theoretically deadlock,
because the allocator might be too fragmented to let the memtable
grow by another 128 kiB segment, while keeping the sum of all
allocations small enough to avoid triggering a flush.
(Such an allocation pattern probably wouldn't happen in practice though).
3. It triggers a bug in reclaim which results in spurious
allocation failures despite ample evictable memory.
There is a path in the reclaimer procedure where we check whether
reclamation succeeded by checking that the number of free LSA
segments grew.
But in the presence of evictable non-LSA allocations, this is wrong
because the reclaim might have met its target by evicting the non-LSA
allocations, in which case memory is returned directly to the
standard allocator, rather than to the pool of free segments.
If that happens, the reclaimer wrongly returns `reclaimed_nothing`
to Seastar, which fails the allocation.
Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072
Fixes https://github.com/scylladb/scylladb/issues/22941
Fixes https://github.com/scylladb/scylladb/issues/22389
Fixes https://github.com/scylladb/scylladb/issues/23781
As noticed in issue #23687, if we shut down Scylla while a paged read is
in progress - or even a paged read that the client had no intention of
ever resuming - the shutdown pauses for 10 seconds.
The problem was the stop() order - we must stop the "querier cache"
before we can close sstables - the "querier cache" is what holds paged
readers alive waiting for clients to resume those reads, and while a
reader is alive it holds on to sstables so they can't be closed. The
querier cache's querier_cache::default_entry_ttl is set to 10 seconds,
which is why the shutdown was un-paused after 10 seconds.
The fix in this patch is obvious: we need to stop the querier cache
(and have it release all the readers it was holding) before we close
the sstables.
Fixes #23687
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#23770
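The stop() ordering described above can be sketched with a toy asyncio model (illustrative names, not the Scylla shutdown code): the querier cache must release its cached paged readers first, because closing the sstables is only possible once no reader pins them.

```python
# Toy model of the shutdown-ordering fix (illustrative names).
import asyncio

class QuerierCache:
    def __init__(self):
        # Readers kept alive waiting for clients to resume paged reads.
        self.readers = ["paged-read-1"]
    async def stop(self):
        self.readers.clear()        # release readers, dropping sstable refs

class SSTables:
    def __init__(self, cache):
        self.cache = cache
        self.closed = False
    async def close(self):
        # Closing is only possible once no reader holds the sstables.
        assert not self.cache.readers, "readers still pin the sstables"
        self.closed = True

async def shutdown(cache, sstables):
    await cache.stop()      # must come first: releases the paged readers
    await sstables.close()  # now the sstables are unpinned and can close

cache = QuerierCache()
sst = SSTables(cache)
asyncio.run(shutdown(cache, sst))
print(sst.closed)
```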
Continue the effort of normalizing reader names, stripping legacy qualifying terms like "flat" and "v2".
Flat and v2 readers are the default now; we only need to add qualifying terms to readers which differ from the norm.
One such reader remains: `make_generating_reader_v1()`.
This PR contains mostly mechanical changes, done with a sed script. Commits which only contain such mechanical renames are marked as such in the commitlog.
Code cleanup, no backport needed.
Closes scylladb/scylladb#23767
* github.com:scylladb/scylladb:
readers: mv reversing_v2.hh reversing.hh
readers: mv generating_v2.hh generating.hh
tree: s/make_generating_reader_v2/make_generating_reader/
readers: mv from_mutations_v2.hh from_mutations.hh
tree: s/make_mutation_reader_from_mutations_v2/make_mutation_reader_from_mutations/s
readers: mv from_fragments_v2.hh from_fragments.hh
readers: mv forwardable_v2.hh forwardable.hh
readers: mv empty_v2.hh empty.hh
tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/
readers/empty_v2.hh: replace forward declarations with include of fwd header
readers/mutation_reader_fwd.hh: forward declare reader_permit
readers: mv delegating_v2.hh delegating.hh
readers/delegating_v2.hh: move reader definition to _impl.hh file
This patch adds tests for consumed capacity in batch get item. It tests
both the simple case and the multi-item, multi-table case that combines
consistent and non-consistent reads.
db::config is the top-level configuration of Scylla; we generally try to
avoid using it even in Scylla components: each uses its own config,
initialized by the service creator out of db::config itself.
generic_server is no exception, all the more so since it already has its
own config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#23705
This patch adds RCU support for batch get items. With batch requests,
multiple objects are read from multiple tables. While the criterion for
adding the units is per the batch request, the units are calculated per
table, and so is the read consistency.
The consumed_capacity_counter is not completely applicable for batch
operations. This patch makes some of its functionality public so that
batch get item can use the components to decide if it needs to send
consumed capacity in the reply, to get the half units used by the
metrics and returned result, and to allow an empty constructor for the
RCU counter.
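The per-table accounting described above can be sketched with a toy model of DynamoDB-style read units (illustrative names, not the Alternator code): one unit per 4 KiB read, halved for eventually consistent reads, summed per table.

```python
# Toy model of per-table RCU accounting for a batch read (illustrative).
import math

def read_units(item_size_bytes, consistent):
    """DynamoDB-style RCU: 1 unit per 4 KiB read, halved for eventually
    consistent reads."""
    units = math.ceil(item_size_bytes / 4096) if item_size_bytes else 0
    return units if consistent else units / 2

def batch_read_units(reads):
    """Sum units per table; each read carries its own consistency mode."""
    per_table = {}
    for table, size, consistent in reads:
        per_table[table] = per_table.get(table, 0) + read_units(size, consistent)
    return per_table

caps = batch_read_units([
    ("t1", 100, True),     # strongly consistent small item: 1 RCU
    ("t1", 5000, True),    # spans two 4 KiB blocks: 2 RCU
    ("t2", 100, False),    # eventually consistent: 0.5 RCU
])
print(caps)
```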
Add missing awaits for the rebuild_repair and repair background actions.
Although the background actions hold the _async_gate
which is closed in topology_coordinator::run(),
stop() still needs to await all background action futures
and handle any errors they may have left behind.
Fixes #23755
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refactor the code to use a helper to await background_action_holder
and handle any errors by printing a warning.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
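The helper described above can be sketched in a toy asyncio model (illustrative names, not the topology_coordinator code): stop() awaits every background action's future, and any error left behind is downgraded to a warning rather than being lost or rethrown during shutdown.

```python
# Toy model of awaiting background action futures on stop() (illustrative).
import asyncio
import warnings

async def stop_background_action(name, fut):
    """Await one background action and report its error as a warning."""
    try:
        await fut
    except Exception as e:
        warnings.warn(f"background action {name} failed: {e}")

async def stop(actions):
    # stop() must await *all* background futures, including failed ones,
    # so no exception is left behind unhandled after shutdown.
    await asyncio.gather(*(stop_background_action(n, f) for n, f in actions))

async def main():
    async def ok():
        return "done"
    async def boom():
        raise RuntimeError("repair failed")
    actions = [("rebuild_repair", asyncio.ensure_future(ok())),
               ("repair", asyncio.ensure_future(boom()))]
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        await stop(actions)
    return [str(w.message) for w in caught]

msgs = asyncio.run(main())
print(msgs)
```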
The idea behind readers/ is that each reader has its minimal header with
just a factory method declaration. The delegating reader is defined in
the factory header because it has a derived class in row_cache_test.cc.
Move the definition to delegating_impl.hh so users not interested in
deriving from it don't pay the price in header include cost.
* seastar 099cf616...e44af9b0 (19):
> Add assertion to `get_local_service`
> http_client: Improve handling of server response parsing errors
> util: include used header
> core: Fix module linkage by using `inline constexpr` for shared constants
> build: fix P2582R1 detection for GCC compiler compatibility
> app-template: remove production warning
> ioinfo: Extend printed data a bit more
> reactor: Fix indentation after previous patch
> reactor: Configure multiple mountpoints per disk
> io_queue, resource, reactor: Rename dev_t -> unsigned
> resource: Rename mountpoint to disk in resources
> reactor: Keep queues as shared_ptr-s
> io_queue: Drop device ID
> io_intent: Use unsigned queue id as a key
> io_queue: Keep unsigned queue id on an io_queue
> file: Keep device_id on posix file impl
> io_queue: Print mountpoint in latency goal bump message
> io_intent: Rename qid to cid
> reactor: Move engine()._num_io_groups assignment and check
Changes in io-queue call for scylla-gdb update as well -- now the
reactor map of device to io-queue uses seastar::shared_ptr, not
std::unique_ptr.
Closes scylladb/scylladb#23733
Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group.
For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore.
Fixes scylladb/scylladb#21637
Backport: 6.2 and 6.1
Closes scylladb/scylladb#22779
* github.com:scylladb/scylladb:
Ensure raft group0 RPCs use the gossip scheduling group
Move RAFT operations verbs to GOSSIP group.
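The semaphore-selection behavior described above can be sketched as a toy lookup (illustrative names, not the Scylla code): the concurrency semaphore is picked from the current scheduling group, so moving the RAFT group0 verbs to the "gossip" group makes them use the system semaphore instead of queuing behind user operations.

```python
# Toy model of scheduling-group-based semaphore selection (illustrative).

SEMAPHORE_FOR_GROUP = {
    "statement": "user_semaphore",   # user reads/writes
    "gossip": "system_semaphore",    # system traffic, incl. RAFT group0 verbs
}

def semaphore_for(scheduling_group):
    # Unknown groups fall back to the user semaphore in this sketch.
    return SEMAPHORE_FOR_GROUP.get(scheduling_group, "user_semaphore")

print(semaphore_for("gossip"), semaphore_for("statement"))
```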
Commit 876478b84f ("storage_service: allow concurrent tablet migration in tablets/move API", 2024-02-08) introduced a code path on which the topology state machine would be busy -- in "tablet_draining" or "tablet_migration" state -- at the time of starting tablet migration. The pre-commit code would unconditionally transition the topology to "tablet_migration" state, assuming the topology had been idle previously. On the new code path, this state change would be idempotent if the topology state machine had been busy in "tablet_migration", but the state change would incorrectly overwrite the "tablet_draining" state otherwise.
Restrict the state change to when the topology state machine is idle.
In addition, add the topology update to the "updates" vector with plain push_back(). emplace_back() is not helpful here, as topology_mutation_builder::build() cannot construct in-place, and so we invoke the "canonical_mutation" move constructor once, either way.
Unit test:
Start a two node cluster. Create a single tablet on one of the nodes. Start decommissioning that node, but block decommissioning at once. In that state (i.e., in "tablet_draining"), move the tablet manually to the other node. Check that transit_tablet() leaves the topology transition state alone.
Fixes https://github.com/scylladb/scylladb/issues/20073.
Commit 876478b84f was first released in scylla-6.0.0, so we might want to backport this patch accordingly.
Closes scylladb/scylladb#23751
* github.com:scylladb/scylladb:
storage_service: add unit test for mid-decommission transit_tablet()
storage_service: preserve state of busy topology when transiting tablet
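The state-change restriction described above reduces to a small rule, sketched here as a toy model (illustrative, not the storage_service code): only an idle topology transitions to "tablet_migration"; a busy state such as "tablet_draining" is preserved.

```python
# Toy model of the transit_tablet() state fix (illustrative).

def transit_tablet_state(current):
    if current == "idle":
        return "tablet_migration"
    # Busy topology ("tablet_draining", or already "tablet_migration"):
    # leave the transition state alone instead of overwriting it.
    return current

print(transit_tablet_state("idle"),
      transit_tablet_state("tablet_draining"),
      transit_tablet_state("tablet_migration"))
```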
This dependency is already there; the storage service doesn't need to go
round via the database reference to get to the features.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#23739
Start a two node cluster. Create a single tablet on one of the nodes.
Start decommissioning that node, but block decommissioning at once. In
that state (i.e., in "tablet_draining"), move the tablet manually to the
other node. Check that transit_tablet() leaves the topology transition
state alone.
Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>