Adjust some of the existing tests in service_level_controller_test.cc
and add some more in order to test the workload prioritization features,
i.e. the service level shares.
Now, the CREATE statements generated for each service level by the
DESCRIBE SCHEMA WITH INTERNALS statement will account for the service
level's shares.
Introduce the "SHARES" keyword which can be used in conjunction with
existing CQL statements related to the service levels.
Adjust the CQL statements for service levels:
- CREATE/ALTER now allow setting shares (only if the cluster is fully
upgraded)
- LIST EFFECTIVE SERVICE LEVEL now returns the number of shares in a new
column
- LIST SERVICE LEVEL(S) also returns the number of shares, and has the
additional column "percentage of all service level shares"
In order to make sure that the scheduling group carries over RPC, and
also to prevent priority inversion issues between different service
levels, modify the messaging service to use separate RPC connections for
each service level in order to serve user traffic.
The above is achieved by reusing the existing concept of "tenants" in
the messaging service: when a new service level (or, more precisely, a
service-level-specific scheduling group) is first used in an RPC, a
new tenant is created.
In addition, extend the service level controller to be able to quickly
look up the service level name of the currently active scheduling group
in order to speed up the logic for choosing the tenant.
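As an illustration of the tenant-selection step, here is a minimal sketch of lazily mapping a scheduling-group name to a tenant index; all names here are hypothetical stand-ins, not the actual messaging_service API:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical sketch: each service-level scheduling group gets its own RPC
// "tenant" (i.e. its own set of connections), created lazily the first time
// that group sends an RPC.
class tenant_registry {
    std::unordered_map<std::string, std::size_t> _tenant_by_group;
public:
    // Returns the tenant index to use for the given scheduling-group name,
    // creating a new tenant on first use.
    std::size_t tenant_for(const std::string& scheduling_group_name) {
        auto [it, inserted] = _tenant_by_group.try_emplace(
            scheduling_group_name, _tenant_by_group.size());
        // When `inserted` is true, a new tenant (and therefore a new set of
        // RPC connections) would be established for this service level.
        return it->second;
    }
};
```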
Replace the reader concurrency semaphores for user reads and view
updates with the newly introduced reader concurrency semaphore group,
which assigns a semaphore for each service level.
Each group is statically assigned some pool of memory on startup and
dynamically distributes this memory between its semaphores, proportionally
to the number of shares of the corresponding scheduling group.
The intent of having a separate reader concurrency semaphore for each
scheduling group is to prevent priority inversion issues caused by reads
with different priorities waiting on the same semaphore, as well as to
make memory allocation fairer between service levels, in accordance with
their shares.
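A minimal sketch of the proportional split described above; the function and parameter names are illustrative, not the actual semaphore-group API:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch of how a semaphore group could split a fixed memory
// pool between per-service-level semaphores, proportionally to the shares of
// the corresponding scheduling groups.
std::map<std::string, uint64_t>
distribute_memory(uint64_t pool_bytes, const std::map<std::string, uint32_t>& shares) {
    uint64_t total_shares = 0;
    for (const auto& [name, s] : shares) {
        total_shares += s;
    }
    std::map<std::string, uint64_t> result;
    for (const auto& [name, s] : shares) {
        // Each semaphore gets a slice of the pool proportional to its shares.
        result[name] = total_shares ? pool_bytes * s / total_shares : 0;
    }
    return result;
}
```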
Introduce the core logic of workload prioritization, responsible for
assigning scheduling groups to service levels.
The service level controller maintains a pool of scheduling groups for
the currently present service levels, as well as a pool of unused
scheduling groups which were previously used by some service level that
was deleted during the node's lifetime.
When a new service level is created, the SL controller either assigns a
scheduling group from the unused SG pool, or creates a new one if the
pool is empty. The scheduling group is renamed to "sl:<scheduling group
name>".
When updating shares of a service level (and also when creating a new
service level), the shares of the corresponding scheduling group are
synchronized with those of the service level.
When a service level is deleted, its group is released to the
aforementioned pool of unused scheduling groups and the prefix of its
name is changed from "sl:" to "sl_deleted:".
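A rough sketch of the pool logic above; the types and the naming scheme in the comments are illustrative, while the actual controller works on Seastar scheduling groups:

```cpp
#include <deque>
#include <string>

// Illustrative model of the scheduling-group pool: reuse a previously
// released group if one exists, otherwise create a new one; rename on
// assignment and on release.
struct scheduling_group { std::string name; float shares = 0; };

class sg_pool {
    std::deque<scheduling_group> _unused;
public:
    scheduling_group acquire(const std::string& service_level, float shares) {
        scheduling_group sg;
        if (!_unused.empty()) {
            sg = _unused.front();          // reuse a group left by a deleted service level
            _unused.pop_front();
        }
        sg.name = "sl:" + service_level;   // assumed naming, per the description above
        sg.shares = shares;                // keep shares in sync with the service level
        return sg;
    }
    void release(scheduling_group sg) {
        sg.name = "sl_deleted:" + sg.name.substr(3); // "sl:" -> "sl_deleted:"
        _unused.push_back(std::move(sg));
    }
};
```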
For now, these scheduling groups are not used by any user operations.
This will be changed in subsequent commits.
This adds to the grammar the option to SELECT a specific element in a collection (map/set/list).
For example:
`SELECT map['key'] FROM table`
`SELECT map['key1']['key2'] FROM table`
This feature was implemented in Cassandra 4.0 and was requested by Scylla users.
The behavior is mostly compatible with Cassandra, except:
1. in SELECT, we allow a list subscript in a selector, while Cassandra allows only map and set.
2. in UPDATE, we allow a set subscript in a column condition, while Cassandra allows only map and list.
3. the slice syntax `SELECT m[a..b]` is not implemented yet
4. null subscript - `SELECT m[null]` returns null in Scylla, while Cassandra returns an error
Fixes #7751
A backport was requested so that a user can use this feature.
Closes scylladb/scylladb#22051
* github.com:scylladb/scylladb:
cql3: allow SELECT of specific collection key
cql3: allow set subscript
This is a forward port (from scylla-enterprise) of additional compression options (zstd, dictionaries shared across messages) for inter-node network traffic. It works as follows:
After the patch, messaging_service (Scylla's interface for all inter-node communication)
compresses its network traffic with compressors managed by
the new advanced_rpc_compression::tracker. Those compressors compress with lz4,
but can also be configured to use zstd as long as a CPU usage limit isn't crossed.
A precomputed compression dictionary can be fed to the tracker. Each connection
handled by the tracker will then start a negotiation with the other end to switch
to this dictionary, and when it succeeds, the connection will start being compressed using that dictionary.
All traffic going through the tracker is passed as a single merged "stream" through dict_sampler.
dictionary_service has access to the dict_sampler.
On chosen nodes (in the "usual" configuration: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the alien_worker thread) and publishes the new dictionary
to system.dicts via Raft's write_mutation command.
This update triggers (eventually) a callback on all nodes, which feeds the new dictionary
to advanced_rpc_compression::tracker, and this switches (eventually) all inter-node connections
to this dictionary.
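A heavily simplified model of that sample/train/publish loop; all types here are stand-ins, not the actual Scylla classes:

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative flow: the sampler accumulates a sample of the RPC byte
// stream, a periodic task trains a dictionary on a copy of the sample, and
// publishing the result eventually reaches every node's compression tracker.
struct dictionary { std::vector<uint8_t> data; };

struct dict_sampler_model {
    std::vector<uint8_t> sample;                       // random sample of the merged stream
    void ingest(const std::vector<uint8_t>& bytes) {   // the real sampler samples, not appends
        sample.insert(sample.end(), bytes.begin(), bytes.end());
    }
};

struct dictionary_service_model {
    dict_sampler_model& sampler;
    std::function<void(const dictionary&)> publish;    // in Scylla: write to system.dicts via Raft

    void periodic_training_tick() {
        auto snapshot = sampler.sample;                // copy the sample
        dictionary d{std::move(snapshot)};             // the real code calls zstd's trainer here
        publish(d);                                    // triggers the tracker callback on all nodes
    }
};
```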
Closes scylladb/scylladb#22032
* github.com:scylladb/scylladb:
messaging_service: use advanced_rpc_compression::tracker for compression
message/dictionary_service: introduce dictionary_service
service: make Raft group 0 aware of system.dicts
db/system_keyspace: add system.dicts
utils: add advanced_rpc_compressor
utils: add dict_trainer
utils: introduce reservoir_sampling
utils: introduce alien_worker
utils: add stream_compressor
Logging randomization parameters in the pytest_generate_tests hook doesn't
work well for us. To make these parameters more visible, move the logging
to the test level.
Closes scylladb/scylladb#22055
This changeset ports LTO and PGO support from scylla-enterprise.git to scylladb.git.
Add support for Link-Time Optimization (LTO) and Profile-Guided Optimization (PGO)
to improve performance. LTO provides ~7% performance gain and enables crucial
binary layout optimizations for PGO.
LTO Changes:
- Add `-flto` flag to compile and link steps
- Use `-ffat-lto-objects` to generate both LLVM IR and machine code
- Enable cross-object optimization while maintaining fast test linking
PGO Implementation:
- Implement three-stage build process:
1. Context-free profiling (`-fprofile-generate`)
2. Context-sensitive profiling (`-fprofile-use` + `-fcs-profile-generate`)
3. Final optimization using merged profiles
- Add release-pgo and release-cs-pgo build stages
- Integrate with ninja build system
- Stages can be enabled independently
Profile Management:
- Add `pgo/pgo.py` for workload profile collection
- Store default profile in `pgo/profiles/profile.profdata.xz` using Git LFS
- Add configure.py integration for profile detection and validation
- Support custom profiles via `--use-profile` flag
- Add profile regeneration script
Both optimizations are recommended for maximum performance, though each PGO
stage adds a full build cycle. Future optimization may allow dropping one
PGO stage if performance impact is minimal.
---
This is a forward port, hence no need to backport.
Closes scylladb/scylladb#22039
* github.com:scylladb/scylladb:
build: cmake: add CMake options for PGO support
build: cmake: add "Scylla_ENABLE_LTO" option
build: set LTO and PGO flags for Seastar in cmake build
build: collect scylla libraries with `scylla_libs` variable
build: Unify Abseil CXX flags configuration
configure.py: prepare the build for a default PGO profile in version control
configure.py: introduce profile-guided optimization
pgo: add alternator workloads training
pgo: add a repair workload
pgo: add a counters workload
pgo: add a secondary index workload
pgo: add a LWT workload
pgo: add a decommission workload
pgo: add a clustering workload
pgo: add a basic workload
pgo: introduce a PGO training script
configure.py: don't include non-default modes in dist-server-* rules
configure.py: enable LTO in release builds by default
configure.py: introduce link-time optimization
configure.py: add a `default` to `add_tristate`.
configure.py: unify build rules for cxxbridge .cc files and regular .cc files
Commit f2ff701489 introduced
a yield in update_effective_replication_map that might
cause the storage_group manager to be inconsistent with the
new effective_replication_map (e.g. if yielding right
before calling `handle_tablet_split_completion`).
Also, yielding inside the storage_service::replicate_to_all_cores
update loop means that base tables and their views
aren't updated atomically, which caused scylladb/scylladb#17786.
This change essentially reverts f2ff701489
and makes handle_tablet_split_completion synchronous too.
The stopped compaction groups future is kept as a member and
storage_group_manager::stop() consumes this future during table::stop().
- storage_service: replicate_to_all_cores: update base and view tables atomically
Currently, the loop updating all tables (including views) with the
new effective_replication_map may yield, and therefore expose
a state where the base and view tables' effective_replication_map
and topology are out of sync (as seen in scylladb/scylladb#17786).
To prevent that, loop over all base tables and for each table
update the base table and all views atomically, without yielding,
and so allow yielding only between base tables.
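A minimal sketch of the intended update order, using hypothetical helper names rather than the real storage_service code:

```cpp
#include <vector>

struct table_ref { /* stand-in for a table */ };
struct erm_ref { /* stand-in for an effective_replication_map */ };

void update_base(table_ref&, const erm_ref&) { /* apply the new erm to the base table */ }
void update_views(table_ref&, const erm_ref&) { /* apply the same erm to all of its views */ }
void maybe_yield() { /* stand-in for a scheduling point between base tables */ }

void replicate_to_all_tables(std::vector<table_ref>& base_tables, const erm_ref& erm) {
    for (auto& base : base_tables) {
        // No yield between these two calls, so a base table and its views
        // always observe the same effective_replication_map and topology.
        update_base(base, erm);
        update_views(base, erm);
        maybe_yield();  // yielding is only allowed between base tables
    }
}
```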
* Regression was introduced in f2ff701489, so backport is required to 6.x, 2024.2
Closes scylladb/scylladb#21781
* github.com:scylladb/scylladb:
storage_service: replicate_to_all_cores: clear_gently pending erms
test_mv_topology_change: drop delay_after_erm_update injection case
storage_service: replicate_to_all_cores: update base and view tables atomically
table: make update_effective_replication_map sync again
This adds to the grammar the option to SELECT a specific key in a
collection column using subscript syntax.
For example:
SELECT map['key'] FROM table
SELECT map['key1']['key2'] FROM table
The key can also be parameterized in a prepared query. For this we need
to pass the query options to result_set_builder where we process the
selectors.
Fixes scylladb/scylladb#7751
If authentication is enabled, but STARTUP isn't followed by REGISTER (which is optional, and in practice happens on only one of a driver's connections — because there's no point listening for the same events on multiple connections), connections are wrongly displayed in system.clients as AUTHENTICATING instead of READY, even when they are ready.
This commit fixes this problem.
Fixes: scylladb/scylladb#12640
Closes scylladb/scylladb#21774
`record_property` generates XML which is not compatible with xunit2,
so pytest decided to deprecate its use when generating xunit reports,
and pytest emits the following warning when a test failure is
reported using this fixture:
```
object_store/test_backup.py:337: PytestWarning: record_property is incompatible with junit_family 'xunit2' (use 'legacy' or 'xunit1')
```
This warning is not related to the test itself, but to how we
report a failure using pytest. It is distracting, so let's silence it.
See also https://github.com/pytest-dev/pytest/issues/5202
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#22067
There are many CI failures (repros of https://github.com/scylladb/scylladb/issues/21534) which are caused by the `stop_after_setting_mode_to_normal_raft_topology` and `stop_before_becoming_raft_voter` error injections in combination with some cluster events.
We need to deselect them for now to make CI more stable. The first batch was deselected in https://github.com/scylladb/scylladb/pull/21658
Also, add the handling of topology state rollback caused by `stop_before_streaming` or `stop_after_updating_cdc_generation` error injections as a separate commit.
See also https://github.com/scylladb/scylladb/issues/21872 and https://github.com/scylladb/scylladb/issues/21957
Closes scylladb/scylladb#22044
* github.com:scylladb/scylladb:
test.py: topology_random_failures: more deselects for #21534
test.py: topology_random_failures: handle more node's hangs during 30s sleep
This patch sets up an `alien_worker`, `advanced_rpc_compression::tracker`,
`dict_sampler` and `dictionary_service` in `main()`, and wires them to each other
and to `messaging_service`.
`messaging_service` compresses its network traffic with compressors managed by
the `advanced_rpc_compression::tracker`. All this traffic is passed as a single
merged "stream" through `dict_sampler`.
`dictionary_service` has access to `dict_sampler`.
On chosen nodes (by default: the Raft leader), it uses the sampler to maintain
a random multi-megabyte sample of the sampler's stream. Every several minutes,
it copies the sample, trains a compression dictionary on it (by calling zstd's
training library via the `alien_worker` thread) and publishes the new dictionary
to `system.dicts` via Raft.
This update triggers a callback into `advanced_rpc_compression::tracker` on all nodes,
which updates the dictionary used by the compressors it manages.
Add an option named "Scylla_ENABLE_LTO", which is off by default.
If it is on, build the whole tree with ThinLTO enabled.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Now that we are allowed to use C++23, we have the luxury of using
`std::ranges::to`.
In this change, we:
- replace `boost::copy_range` with `std::ranges::to`
- remove unused `#include`s of Boost headers
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21880
When embedding HTML documents in pytest reports with links to test artifacts,
parameterized test names containing special characters like "[" and "]" can
cause URL encoding issues. These characters, when used verbatim in URLs, can
trigger HTTP 400 errors on web servers.
This commit resolves the issue by percent-encoding the URLs for artifact links,
ensuring compatibility with servers like Jenkins and preventing "HTTP ERROR 400
Illegal Path Character" errors.
Changes:
- Percent-encode test artifact URLs to handle special characters
- Improve link robustness for parameterized test names
Fixes scylladb/scylla-pkg#4599
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21963
This enables tablets in topology_custom, so explicitly
disable them where tests don't support tablets.
As part of this change, rename a few imports.
Importing dependencies from another test is a bad idea -
please use shared libraries instead.
Fixed #20193
Closes scylladb/scylladb#22014
The main purpose of this change is to enhance the restore from object storage usage.
Currently, restore uses the load-and-stream facility. When triggered, the restoring task opens the provided sstables directory from the remote bucket and then feeds the list of sstables to the load_and_stream() method. The method, in turn, iterates over this list, reads mutations, and for each mutation decides where to send it by checking the replication map (it's pretty much the same for both vnodes and tablets, but for tablets that are "fully contained" by a range there's a plan to stream faster).
As described above, restore is governed by a single node, and this single node reads all sstables from the object store, which can be very slow. This PR allows speeding things up. For that, the load-and-stream code is equipped with a "scope" filter which limits where mutations can be streamed to. There are four options for that -- all, dc, rack and node. The "all" scope is how things work currently; "dc" and "rack" filter out target nodes that don't belong to this node's dc/rack respectively. The "node" scope only streams mutations to the local node.
With the "node" scope it's possible to make all nodes in the cluster load the mutations that belong to them in parallel, without re-sending them to peers. The last patch in this PR is a test that shows how this can be done.
Closes scylladb/scylladb#21169
* github.com:scylladb/scylladb:
test: Add scope-streaming test (for restore from backup)
api: New "scope" API param to load-and-stream calls
sstables_loader: Propagate scope from API down
sstables_loader: Filter tablets based on scope
streamer: Disable scoped streaming of primary replica only
sstables_loader: Introduce streaming scope
sstables_loader: Wrap get_endpoints()
It turned out that the aforementioned APIs use slightly different sources of information about view build progress/status, which sometimes results in different reporting of whether an index is built. It's good to make those two APIs consistent. Also add a test for the REST API endpoint (the system table test was addressed by #21677).
Closes scylladb/scylladb#21814
* github.com:scylladb/scylladb:
test: Add tests for MVs and indexes reporting by API endpoint(s)
api: Use built_views table in get_built_indexes API
The node is hanging and the coordinator just rolls back a topology state. It's different from
`stop_after_sending_join_node_request` and `stop_after_bootstrapping_initial_raft_configuration`
because in those cases the coordinator is simply not able to start the topology change at all and
the message in the coordinator's log is different.
Error injections handled:
- `stop_after_updating_cdc_generation`
- `stop_before_streaming`
And, actually, it can be any cluster event which lasts more than 30s.
So far there's the /column_family/built_indexes endpoint that reports the
index names similarly to how system.IndexInfo does, but it's not tested.
This patch adds tests next to the existing system-table ones.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
To improve balancing when reading with 1 < CL < ALL.
This implementation has a moderate impact on
the function's performance, in contrast to a full
std::shuffle of the vector before stable_sort-ing it
(especially with a large number of nodes to sort).
Before:
test iterations median mad min max allocs tasks inst cycles
sort_by_proximity_topology.perf_sort_by_proximity 25541973 39.225ns 0.114ns 38.966ns 39.339ns 0.000 0.000 588.5 116.6
After:
sort_by_proximity_topology.perf_sort_by_proximity 19689561 50.195ns 0.119ns 50.076ns 51.145ns 0.000 0.000 622.5 150.6
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
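A sketch of the approach, with illustrative types rather than the actual locator::topology code: sort by precalculated distance, then shuffle only within each run of equally-distant nodes:

```cpp
#include <algorithm>
#include <random>
#include <vector>

struct node { int id; int distance; };  // distance from the reference node

// Illustrative sketch: stable-sort by precalculated distance, then randomize
// only the order of nodes at the same distance, instead of shuffling the
// whole vector before a stable sort.
void sort_by_proximity(std::vector<node>& nodes, std::mt19937& rng) {
    std::stable_sort(nodes.begin(), nodes.end(),
                     [](const node& a, const node& b) { return a.distance < b.distance; });
    auto run_begin = nodes.begin();
    while (run_begin != nodes.end()) {
        auto run_end = std::find_if(run_begin, nodes.end(),
                                    [&](const node& n) { return n.distance != run_begin->distance; });
        std::shuffle(run_begin, run_end, rng);  // randomize equally-distant nodes only
        run_begin = run_end;
    }
}
```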
Also use a temporary vector to hold the precalculated distances.
A later patch will add some randomization to shuffle nodes
at the same distance from the reference node.
This improves the function's performance by 50% for 3 replicas,
from 77.4 ns to 39.2 ns; larger replica sets show greater improvement
(over 4X for 15 nodes):
Before:
test iterations median mad min max allocs tasks inst cycles
sort_by_proximity_topology.perf_sort_by_proximity 12808773 77.368ns 0.062ns 77.300ns 77.873ns 0.000 0.000 1194.2 231.6
After:
sort_by_proximity_topology.perf_sort_by_proximity 25541973 39.225ns 0.114ns 38.966ns 39.339ns 0.000 0.000 588.5 116.6
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
benchmark sort_by_proximity
Baseline results on my desktop for sorting 3 nodes:
single run iterations: 0
single run duration: 1.000s
number of runs: 5
number of cores: 1
random seed: 20241224
test iterations median mad min max allocs tasks inst cycles
sort_by_proximity_topology.perf_sort_by_proximity 12808773 77.368ns 0.062ns 77.300ns 77.873ns 0.000 0.000 1194.2 231.6
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Adds glue needed to pass lz4 and zstd with streaming and/or dictionaries
as the network traffic compressors for Seastar's RPC servers.
The main jobs of this glue are:
1. Implementing the API expected by Seastar from RPC compressors.
2. Exposing metrics about the effectiveness of the compression.
3. Allowing algorithms and dictionaries to be switched dynamically on a
running connection, without any extra waits.
The biggest design decision here is that the choice of algorithm and dictionary
is negotiated by both sides of the connection, not dictated unilaterally by the
sender.
The negotiation algorithm is fairly complicated (a TLA+ model validating
it is included in the commit). Unilateral compression choice would be much simpler.
However, negotiation avoids re-sending the same dictionary over every
connection in the cluster after dictionary updates (with one-way communication,
it's the only reliable way to ensure that our receiver possesses the dictionary
we are about to start using), lets receivers ask for a cheaper compression mode
if they want, and lets them refuse to update a dictionary if they don't think
they have enough free memory for that.
In hindsight, those properties probably weren't worth the extra complexity and
extra development effort.
Zstd can be quite expensive, so this patch also includes a mechanism which
temporarily downgrades the compressor from zstd to lz4 if zstd has been
using too much CPU in a given slice of time. But it should be noted that
this can't be treated as a reliable "protection" from negative performance
effects of zstd, since a downgrade can happen on the sender side,
and receivers are at the mercy of senders.
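A simplified sketch of the CPU-budget idea; the names and the exact accounting are illustrative, not the tracker's real implementation:

```cpp
#include <chrono>

// Track how much CPU time zstd compression has consumed in the current time
// slice, and fall back to lz4 once it exceeds the configured fraction of the
// slice.
class cpu_limited_algorithm_picker {
    std::chrono::nanoseconds _zstd_cpu_in_slice{0};
    std::chrono::nanoseconds _slice_length;
    double _max_zstd_fraction;
public:
    cpu_limited_algorithm_picker(std::chrono::nanoseconds slice, double max_fraction)
        : _slice_length(slice), _max_zstd_fraction(max_fraction) {}

    void account_zstd_cpu(std::chrono::nanoseconds used) { _zstd_cpu_in_slice += used; }
    void start_new_slice() { _zstd_cpu_in_slice = std::chrono::nanoseconds{0}; }

    bool use_zstd() const {
        // Downgrade to lz4 for the rest of the slice if zstd went over budget.
        return _zstd_cpu_in_slice.count() < _slice_length.count() * _max_zstd_fraction;
    }
};
```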
We are planning to improve some usages of compression in Scylla
(in which we compress small blocks of data) by pre-training
compression dictionaries on similar data seen so far.
For example, many RPC messages have similar structure
(and likely similar data), so the similarity could be exploited
for better compression. This can be achieved e.g. by training
a dictionary on the RPC traffic, and compressing subsequent
RPC messages against that dictionary.
To work well, the training should be fed a representative sample
of the compressible data. Such a sample can be approached by
taking a random subset (of some given reasonable size) of the data,
with uniform probability.
For our purposes, we need an online algorithm for this -- one
which can select the random k-subset from a stream of arbitrary
size (e.g. all RPC traffic over an hour), while requiring only
the necessary minimum of memory.
This is a known problem, called "reservoir sampling".
This PR introduces `reservoir_sampler`, which implements
an optimal algorithm for reservoir sampling.
Additionally, it introduces `page_sampler` -- a wrapper for `reservoir_sampler`,
which uses it to select a random sample of pages from a stream of bytes.
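For reference, a minimal reservoir-sampling sketch (classic Algorithm R); the actual `reservoir_sampler` uses a more efficient variant, and this code is only illustrative:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Keeps a uniform random k-subset of a stream of unknown length using O(k)
// memory: after n items, every item is in the reservoir with probability k/n.
template <typename T>
class reservoir {
    std::vector<T> _sample;
    std::size_t _k;
    std::size_t _seen = 0;
    std::mt19937_64 _rng{12345};
public:
    explicit reservoir(std::size_t k) : _k(k) {}
    void update(const T& item) {
        ++_seen;
        if (_sample.size() < _k) {
            _sample.push_back(item);          // fill the reservoir first
        } else {
            std::uniform_int_distribution<std::size_t> dist(0, _seen - 1);
            if (auto j = dist(_rng); j < _k) {
                _sample[j] = item;            // replace a random slot with probability k/n
            }
        }
    }
    const std::vector<T>& sample() const { return _sample; }
};
```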
Adds utilities for "advanced" methods of compression with lz4
and zstd -- with streaming (a history buffer persisted across messages)
and/or precomputed dictionaries.
This patch is mostly just glue needed to use the underlying
libraries with discontiguous input and output buffers, and for reusing the
same compressor context objects across messages. It doesn't contain
any innovations of its own.
There is one "design decision" in the patch. The block format of LZ4
doesn't contain the length of the compressed blocks. At decompression
time, that length must be delivered to the decompressor by a channel
separate from the compressed block itself. In `lz4_cstream`, we deal
with that by prepending a variable-length integer containing the
compressed size to each compressed block. This is suboptimal for
single-fragment messages, since the user of lz4_cstream is likely
going to remember the length of the whole message anyway,
which makes the length prepended to the block redundant.
But a loss of 1 byte is probably acceptable for most uses.
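A sketch of that framing, assuming a simple LEB128-style varint; this is illustrative only, and the exact encoding used by `lz4_cstream` may differ:

```cpp
#include <cstdint>
#include <vector>

// Prepend the compressed block length as a variable-length integer (7 bits
// per byte, high bit set on continuation), so the decompressor knows how many
// bytes to feed to lz4.
void append_varint(std::vector<uint8_t>& out, uint32_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>(value) | 0x80);
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

std::vector<uint8_t> frame_block(const std::vector<uint8_t>& compressed_block) {
    std::vector<uint8_t> framed;
    append_varint(framed, static_cast<uint32_t>(compressed_block.size()));
    framed.insert(framed.end(), compressed_block.begin(), compressed_block.end());
    return framed;  // small blocks pay roughly 1 byte of overhead, as noted above
}
```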
- create
  - a cluster with given topology
  - keyspace with tablets and given rf value
  - table with some data
- backup
  - flush all nodes
  - kick backup API on every node
- re-create keyspace and table
  - drop it first
  - create again with the same parameters and schema, but don't
    populate table with data
- restore
  - collect nodes to contact and corresponding list of TOCs
    according to the preferred "scope"
  - ask selected nodes to restore, limiting its streaming scope
    and providing the specific list of sstables
- check
  - select mutation fragments from all nodes for random keys
  - make sure that the number of non-empty responses equals the
    expected rf value
Specific topologies, RFs and stream scopes used are:
rf = 1, nodes = 3, racks = 1, dcs = 1, scope = node
rf = 3, nodes = 5, racks = 1, dcs = 1, scope = node
rf = 1, nodes = 4, racks = 2, dcs = 1, scope = rack
rf = 3, nodes = 6, racks = 2, dcs = 1, scope = rack
rf = 3, nodes = 6, racks = 3, dcs = 1, scope = rack
rf = 2, nodes = 8, racks = 4, dcs = 2, scope = dc
Nodes are evenly distributed between racks, and racks between dcs.
In the last topology, RF effectively becomes 4 (2 in each dc).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two of those -- the POST /storage_service/keyspace that loads
and streams new sstables from /upload and POST /storage_service/restore
that does the same, but gets sstables from object store.
The new optional parameter allows users to tune the streaming phase
behavior. The test/pylib client part is also updated here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extract can_sort_by_proximity() out so it can be used
later by storage_proxy, and introduce do_sort_by_proximity
that sorts unconditionally.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To reduce test executable size and speed up compilation time, compile unit
tests into a single executable.
Here is a file size comparison of the unit test executable:
- Before applying the patch
$ du -h --exclude='*.o' --exclude='*.o.d' build/release/test/boost/ build/debug/test/boost/
11G build/release/test/boost/
29G build/debug/test/boost/
- After applying the patch
$ du -h --exclude='*.o' --exclude='*.o.d' build/release/test/boost/ build/debug/test/boost/
5.5G build/release/test/boost/
19G build/debug/test/boost/
It reduces executable sizes by 5.5 GB on release, and 10 GB on debug.
Closes #9155
Closes scylladb/scylladb#21443
Where the grammar supports IN, we add NOT IN. This includes the WHERE
clause and LWT IF clause.
Evaluation of NOT IN follows from IN.
In statement_restrictions analysis, they are different, as NOT IN
doesn't enable any clever query plan and must filter.
Some tests are added. An error message was changed ('in' changed to 'IN'),
so some tests are adjusted.
Closes scylladb/scylladb#21992
We already check `self.cmd` for null at the very beginning of
`ScyllaServer.stop()`, and in the `try` block we don't reset
`self.cmd`, hence there is no need to check it again.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21936
Currently, in tablet_map_to_mutation, repair's and migration's
tablet_task_info is always set.
Do not set the tablet_task_info if there is no running operation.
Closes scylladb/scylladb#22005
During the consolidation of per-suite pytest.ini files (commit 8bf62a086f),
the 'repair' marker was inadvertently dropped. This led to pytest warnings
for tests using the @pytest.mark.repair decorator.
This patch restores the marker declaration to eliminate the distracting
PytestUnknownMarkWarning:
```
test/topology_experimental_raft/test_tablets.py:396
/home/kefu/dev/scylladb/test/topology_experimental_raft/test_tablets.py:396: PytestUnknownMarkWarning: Unknown pytest.mark.repair - is this a typo? You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
@pytest.mark.repair
```
Restoring the marker allows tests to use the 'repair' mark without
generating warnings.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21931
This series converts the call site using compare_endpoints with gms::inet_address.
With that, both flavors of compare_endpoints and sort_by_proximity for inet_address
can be retired, as no other uses remain.
Also, add a unit test for topology::sort_by_proximity before further changes
to it are considered.
* Code cleanup, no backport is needed
Closes scylladb/scylladb#21976
* github.com:scylladb/scylladb:
test: network_topology_strategy_test: add test_topology_sort_by_proximity
locator/topology: retire sort_by_proximity/compare_endpoints for inet_address
test: test_topology_compare_endpoints: use host_id:s
Instead of reusing the variable name and overriding the parameter,
use a new name for the return value of `manager_internal()` for better
readability.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21932
When an sstable is unlinked, it remains in the _active list of the
sstable manager. Its memory might be reclaimed and later reloaded,
causing issues since the sstable is already unlinked. This patch updates
the on_unlink method to reclaim memory from the sstable upon unlinking,
remove it from memory tracking, and thereby prevent the issues described
above.
Added a testcase to verify the fix.
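A toy model of the behavioral change; the real code operates on the sstables_manager and its reclaim lists, and these types are stand-ins:

```cpp
// Once an sstable is unlinked, reclaim its in-memory components (e.g. the
// bloom filter) and drop it from memory tracking, so the reclaim/reload
// machinery can never reload a component of an already-deleted sstable.
struct sstable_model {
    bool components_loaded = true;
    bool reload_allowed = true;
};

struct manager_model {
    void reclaim_memory_and_stop_tracking(sstable_model& sst) {
        sst.components_loaded = false;  // free reclaimable components now
        sst.reload_allowed = false;     // and never reload them again
    }
    void on_unlink(sstable_model& sst) {
        reclaim_memory_and_stop_tracking(sst);
    }
};
```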
Fixes #21887
This is a bug fix in the bloom filter reload/reclaim mechanism and should be backported to older versions.
Closes scylladb/scylladb#21895
* github.com:scylladb/scylladb:
sstables_manager: reclaim memory from sstables on unlink
sstables_manager: introduce reclaim_memory_and_stop_tracking_sstable()
sstables: introduce disable_component_memory_reload()
sstables_manager: log sstable name when reclaiming components
Fixes #20717
Enables the abortable interface and propagates an abort_source to all s3 objects used for reading the restore data.
Note: because restore is done on each shard, we have to maintain a per-shard abort source proxy for each, and trigger a background per-shard abort when abort is called. This is synced at the end of "run()".
The abort source is added as an optional parameter to s3 storage and the s3 path in the distributed loader.
There is no attempt to "clean up" an aborted restore. As we read at the mutation level from remote sstables, we should not produce incomplete sstables as such, though of course we might end up with partial data restored.
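A simplified model of the per-shard abort fan-out; the real code uses seastar::abort_source and cross-shard calls, and this only illustrates the structure:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// The restore task owns one abort flag per shard; aborting the task flips
// every shard's flag (in the real code via a background cross-shard call),
// and each shard's readers only ever consult their local flag.
class sharded_abort {
    std::vector<std::atomic<bool>> _per_shard;
public:
    explicit sharded_abort(std::size_t shard_count) : _per_shard(shard_count) {}
    void request_abort() {
        for (auto& flag : _per_shard) {
            flag.store(true, std::memory_order_relaxed);  // fan out to all shards
        }
    }
    bool abort_requested(std::size_t shard) const {
        return _per_shard[shard].load(std::memory_order_relaxed);
    }
};
```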
Closes scylladb/scylladb#21567
* github.com:scylladb/scylladb:
test_backup: Add restore abort test case
sstables_loader: Make restore task abortable
distributed_loader: Add optional abort_source to get_sstables_from_object_store
s3_storage: Add optional abort_source to params/object
s3::client: Make "readable_file" abortable