Before this patch, the load balancer equalized tablet count per
shard, so it achieved balance only under the assumptions that:
1) tablets have the same size
2) shards have the same capacity
This can cause imbalanced utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One cause of the capacity difference is that
larger instances run with fewer shards because some vCPUs are dedicated
to IRQ handling. As a result, each of those shards has more disk
capacity and more CPU power.
After this patch, the load balancer equalizes per-shard storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that all tablets have equal size, so this is
an intermediate step towards full size-aware balancing.
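To make the difference concrete, here is a minimal sketch of the utilization metric described above, assuming equal tablet size. The `shard_load` struct and both function names are hypothetical and are not taken from the ScyllaDB code base.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-shard view; names are illustrative, not ScyllaDB's.
struct shard_load {
    std::uint64_t tablet_count;    // tablets currently placed on the shard
    std::uint64_t capacity_bytes;  // storage capacity attributed to the shard
};

// Storage utilization under the equal-tablet-size assumption:
// tablet_count * avg_tablet_size / capacity.
inline double utilization(const shard_load& s, std::uint64_t avg_tablet_size) {
    assert(s.capacity_bytes > 0);
    return static_cast<double>(s.tablet_count) * static_cast<double>(avg_tablet_size)
            / static_cast<double>(s.capacity_bytes);
}

// Index of the most utilized shard, i.e. a natural source for a migration.
inline std::size_t most_utilized(const std::vector<shard_load>& shards,
                                 std::uint64_t avg_tablet_size) {
    assert(!shards.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < shards.size(); ++i) {
        if (utilization(shards[i], avg_tablet_size) > utilization(shards[best], avg_tablet_size)) {
            best = i;
        }
    }
    return best;
}
```

Two shards holding 100 tablets each look balanced by tablet count, but if one has twice the capacity of the other, this metric reports the smaller shard as twice as utilized.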
One consequence is that, to be able to balance, the load balancer needs
to know every node's capacity, which is collected with the same
RPC that collects load_stats for average tablet size. This is not a
significant setback because migrations cannot proceed anyway if nodes
are down, due to barriers. We could make intra-node migration
scheduling work without capacity information, but that would be
pointless for the reason above, so it is not implemented.
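The capacity requirement can be pictured as a guard that refuses to plan when any node's capacity is unknown. This is only a sketch: `capacity_map` and `total_capacity` are hypothetical names, and the real load_stats structure is not shown in this message.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical view of collected stats: node id -> capacity in bytes, with
// std::nullopt for a node whose stats could not be collected.
using capacity_map = std::map<std::string, std::optional<std::uint64_t>>;

// Returns the total capacity if every node reported one, or std::nullopt if
// any node is missing, in which case no balancing plan should be made.
inline std::optional<std::uint64_t> total_capacity(const capacity_map& nodes) {
    std::uint64_t total = 0;
    for (const auto& [node, capacity] : nodes) {
        if (!capacity) {
            return std::nullopt;
        }
        total += *capacity;
    }
    return total;
}
```

A caller would skip balancing for the current round when the result is empty and retry once the stats have been refreshed.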
Also, the per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below the limit and nodes with more capacity will
be slightly above it. This shouldn't be a significant problem in practice; we could
compensate for it by increasing the limit.
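The effect of a uniform per-shard goal can be illustrated with a simplified model. It assumes that the total tablet count is `goal_per_shard` times the number of shards and that tablets end up spread in proportion to shard capacity. `tablets_per_shard` is a hypothetical helper, not the balancer's actual logic.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Spreads goal_per_shard * shard_count tablets over the shards in proportion
// to capacity. Simplified, hypothetical model of the outcome described above.
inline std::vector<double> tablets_per_shard(const std::vector<std::uint64_t>& shard_capacity,
                                             std::uint64_t goal_per_shard) {
    assert(!shard_capacity.empty());
    const double total_capacity = static_cast<double>(
            std::accumulate(shard_capacity.begin(), shard_capacity.end(), std::uint64_t{0}));
    assert(total_capacity > 0);
    const double total_tablets =
            static_cast<double>(goal_per_shard) * static_cast<double>(shard_capacity.size());
    std::vector<double> result;
    result.reserve(shard_capacity.size());
    for (std::uint64_t c : shard_capacity) {
        // Each shard's share of tablets equals its share of total capacity.
        result.push_back(total_tablets * static_cast<double>(c) / total_capacity);
    }
    return result;
}
```

With a goal of 100 tablets per shard and two shards whose capacities are in a 1:3 ratio, this model puts the smaller shard below the goal and the larger one above it, while the average stays at the goal.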
Refs #23042
Closes scylladb/scylladb#23079
* github.com:scylladb/scylladb:
tablets: Make load balancing capacity-aware
topology_coordinator: Fix confusing log message
topology_coordinator: Refresh load stats after adding a new node
topology_coordinator: Allow capacity stats to be refreshed with some nodes down
topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
test: boost: tablets_test: Always provide capacity in load_stats
test: perf_load_balancing: Set node capacity
test: perf_load_balancing: Convert to topology_builder
config, disk_space_monitor: Allow overriding capacity via config
storage_service, tablets: Collect per-node capacity in load_stats
(cherry picked from commit b1d9f80d85)
/*
 * Copyright (C) 2024-present ScyllaDB
 */

/*
 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
 */

#pragma once

#include <any>
#include <cstdint>
#include <filesystem>
#include <functional>

#include <boost/signals2/connection.hpp>
#include <boost/signals2/signal_type.hpp>
#include <boost/signals2/dummy_mutex.hpp>

#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/util/optimized_optional.hh>
#include <seastar/core/condition-variable.hh>

#include "seastarx.hh"
#include "utils/updateable_value.hh"
#include "utils/phased_barrier.hh"

namespace utils {

// Instantiated only on shard 0
class disk_space_monitor {
public:
    using clock_type = lowres_clock;
    using signal_type = boost::signals2::signal_type<void (), boost::signals2::keywords::mutex_type<boost::signals2::dummy_mutex>>::type;
    using signal_callback_type = std::function<future<>(const disk_space_monitor&)>;
    using signal_connection_type = boost::signals2::scoped_connection;
    using space_source_fn = std::function<future<std::filesystem::space_info>()>;

    struct config {
        scheduling_group sched_group;
        updateable_value<int> normal_polling_interval;
        updateable_value<int> high_polling_interval;
        // Use high_polling_interval above this threshold
        updateable_value<float> polling_interval_threshold;
        updateable_value<uint64_t> capacity_override; // 0 means no override.
    };

private:
    abort_source _as;
    optimized_optional<abort_source::subscription> _as_sub;
    future<> _poller_fut = make_ready_future();
    condition_variable _poll_cv;
    utils::phased_barrier _signal_barrier;
    signal_type _signal_source;
    std::filesystem::space_info _space_info;
    std::filesystem::path _data_dir;
    config _cfg;
    space_source_fn _space_source;
    std::any _capacity_observer;

public:
    disk_space_monitor(abort_source& as, std::filesystem::path data_dir, config cfg);
    ~disk_space_monitor();

    future<> start();

    future<> stop() noexcept;

    const std::filesystem::path& data_dir() const noexcept {
        return _data_dir;
    }

    std::filesystem::space_info space() const noexcept {
        return _space_info;
    }

    // Fraction of capacity that is not available, or -1 if capacity is 0.
    float disk_utilization() const noexcept {
        return _space_info.capacity ? (float)(_space_info.capacity - _space_info.available) / _space_info.capacity : -1;
    }

    signal_connection_type listen(signal_callback_type callback);

    // Replaces default way of obtaining file system usage information.
    void set_space_source(space_source_fn space_source) {
        _space_source = std::move(space_source);
    }

    void trigger_poll() noexcept;

private:
    future<> poll();

    future<std::filesystem::space_info> get_filesystem_space();

    clock_type::duration get_polling_interval() const noexcept;
};

} // namespace utils
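As an illustration of how `capacity_override` and `disk_utilization()` could interact, the standalone sketch below applies an override to a raw `std::filesystem::space_info` reading and then computes utilization with the same formula as the header. How the override is actually applied is not shown in this header. Preserving the used byte count and clamping it to the new capacity is an assumption, and `apply_capacity_override` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <filesystem>

// Hypothetical helper: replaces the reported capacity with an override while
// keeping the number of used bytes (assumption). 0 means no override.
inline std::filesystem::space_info apply_capacity_override(std::filesystem::space_info raw,
                                                           std::uint64_t capacity_override) {
    if (capacity_override == 0) {
        return raw;
    }
    assert(raw.available <= raw.capacity);
    const std::uintmax_t used = raw.capacity - raw.available;
    const std::uintmax_t capacity = capacity_override;
    // Clamp so that the subtraction cannot wrap around when used > capacity.
    const std::uintmax_t remaining = capacity - std::min(used, capacity);
    // Field order is capacity, free, available.
    return std::filesystem::space_info{capacity, remaining, remaining};
}

// Same formula as disk_space_monitor::disk_utilization(), as a free function.
inline float disk_utilization(const std::filesystem::space_info& s) {
    return s.capacity ? static_cast<float>(s.capacity - s.available) / s.capacity : -1.0f;
}
```

In this sketch, overriding a 1000-byte volume with 600 bytes used to a capacity of 2000 lowers the utilization, while an override smaller than the used space saturates it at 1.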