scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-29 11:10:40 +00:00

Author	SHA1	Message	Date
Benny Halevy	edce417036	token_metadata_impl: clear_gently: release version tracker early No need to wait for all members to be cleared gently. We can release the version earlier since the held version may be awaited for in barriers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `6e4803a750`)	2025-07-21 09:49:05 +03:00
Benny Halevy	29c33cb065	token_metadata: clear_and_destroy_impl when destroyed We have a lot of places in the code where a token_metadata_ptr is kept in an automatic variable and destroyed when it leaves the scope. since it's a referenced counted lw_shared_ptr, the token_metadata object is rarely destroyed in those cases, but when it is, it doesn't go through clear_gently, and in particular its tablet_metadata is not cleared gently, leading to inefficient destruction of potentially many foreign_ptr:s. This patch calls clear_and_destroy_impl that gently clears and destroys the impl object in the background using the shared_token_metadata. Fixes #13381 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2c0bafb934`)	2025-07-21 09:36:40 +03:00
Benny Halevy	4da9539831	token_metadata: keep a reference to shared_token_metadata To be used by a following patch to gently clean and destroy the token_data_impl in the background. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2b2cfaba6e`)	2025-07-21 09:36:40 +03:00
Benny Halevy	390ca79ae4	token_metadata: move make_token_metadata_ptr into shared_token_metadata class So we can use the local shared_token_metadata instance for safe background destroy of token_metadata_impl:s. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `e0a19b981a`)	2025-07-21 09:36:40 +03:00
Benny Halevy	a59a1b422f	locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction Sort all tablet_map_ptr:s by shard_id and then destroy them on each shard to prevent long cross-shard task queues for foreign_ptr destructions. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `3acca0aa63`)	2025-07-21 09:36:40 +03:00
Asias He	67375ecf14	storage_service: Use utils::chunked_vector to avoid big allocation The following was seen: ``` !WARNING \| scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at [Backtrace #0] void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99 seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848 seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911 operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706 std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596 locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80 seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635 std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684 ``` Fix by using chunked_vector. Fixes #24158 Closes scylladb/scylladb#24561 (cherry picked from commit `c5a136c3b5`)	2025-07-16 07:43:39 +08:00
Michael Litvak	c11a2e2aaf	tablets: deallocate storage state on end_migration When a tablet is migrated and cleaned up, deallocate the tablet storage group state on `end_migration` stage, instead of `cleanup` stage: * When the stage is updated from `cleanup` to `end_migration`, the storage group is removed on the leaving replica. * When the table is initialized, if the tablet stage is `end_migration` then we don't allocate a storage group for it. This happens for example if the leaving replica is restarted during tablet migration. If it's initialized in `cleanup` stage then we allocate a storage group, and it will be deallocated when transitioning to `end_migration`. This guarantees that the storage group is always deallocated on the leaving replica by `end_migration`, and that it is always allocated if the tablet wasn't cleaned up fully yet. It is a similar case also for the pending replica when the migration is aborted. We deallocate the state on `revert_migration` which is the stage following `cleanup_target`. Previously the storage group would be allocated when the tablet is initialized on any of the tablet replicas - also on the leaving replica, and when the tablet stage is `cleanup` or `end_migration`, and deallocated during `cleanup`. This fixes the following issue: 1. A migrating tablet enters cleanup stage 2. the tablet is cleaned up successfuly 3. The leaving replica is restarted, and allocates storage group 4. tablet cleanup is not called because it was already cleaned up 4. the storage group remains allocated on the leaving replica after the migration is completed - it's not cleaned up properly. Fixes scylladb/scylladb#23481 (cherry picked from commit `34f15ca871`)	2025-06-17 13:59:10 +00:00
Dawid Mędrek	7986ef73da	locator/production_snitch_base: Reduce log level when property file incomplete We're reducing the log level in case the provided property file is incomplete. The rationale behind this change is related to how CCM interacts with Scylla: * The `GossipingPropertyFileSnitch` reloads the `cassandra-rackdc.properties` configuration every 60 seconds. * When a new node is added to the cluster, CCM recreates the `cassandra-rackdc.properties` file for EVERY node. If those two processes start happening at about the same time, it may lead to Scylla trying to read a not-completely-recreated file, and an error will be produced. Although we would normally fix this issue and try to avoid the race, that behavior will be no longer relevant as we're making the rack and DC values immutable (cf. scylladb/scylladb#23278). What's more, trying to fix the problem in the older versions of Scylla could bring a more serious regression. Having that in mind, this commit is a compromise between making CI less flaky and having minimal impact when backported. We do the same for when the format of the file is invalid: the rationale is the same. We also do that for when there is a double declaration. Although it seems impossible that this can stem from the same scenario the other two errors can (since if the format of the file is valid, the error is justified; if the format is invalid, it should be detected sooner than a doubled declaration), let's stay consistent with the logging level. Fixes scylladb/scylladb#20092 Closes scylladb/scylladb#23956 (cherry picked from commit `9ebd6df43a`) Closes scylladb/scylladb#24143	2025-05-16 11:51:23 +03:00
Tomasz Grabiec	1e407ab4d2	tablets: Equalize per-table balance when allocating tablets for a new table Fixes the following scenario: 1. Scale out adds new nodes to each rack 2. Table is created - all tablets are allocated to new nodes because they have low load 3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed We're wrong to try to equalize global load when allocating tablets, and we should equalize per-table load instead, and let background load balancing fix it in a fair way. It will add to the allocated storage imbalance, but: 1. The table is initially empty, so doesn't impact actual storage imbalance. 2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately. 3. If the table was created before imbalance was formed, we would end up in the same situation in the problematic scenario after the patch. 4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in. Before we have CPU-aware tablet allocation, and thus can prove we have CPU capacity on the small nodes, we should respect per-table balance as this is the way in which we achieve full CPU utilization. Fixes #23631	2025-04-17 16:01:23 +02:00
Tomasz Grabiec	2597a7e980	load_sketch: Tolerate missing tablet_map when selecting for a given table To simplify future usage in network_topology_strategy::add_tablets_in_dc() which invokes populate() for a given table, which may be both new and preexisitng.	2025-04-17 16:01:16 +02:00
Benny Halevy	e1fe82ed33	utils: phased_barrier, pluggable: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Tomasz Grabiec	b5211cca85	Merge 'tablets: rebuild: use repair for tablet rebuild' from Aleksandra Martyniuk Currently, when we rebuild a tablet, we stream data from all replicas. This creates a lot of redundancy, wastes bandwidth and CPU resources. In this series, we split the streaming stage of tablet rebuild into two phases: first we stream tablet's data from only one replica and then repair the tablet. Fixes: https://github.com/scylladb/scylladb/issues/17174. Needs backport to 2025.1 to prevent out of space during streaming Closes scylladb/scylladb#23187 * github.com:scylladb/scylladb: test: add test for rebuild with repair locator: service: move to rebuild_v2 transition if cluster is upgraded locator: service: add transition to rebuild_repair stage for rebuild_v2 locator: service: add rebuild_repair tablet transition stage locator: add maybe_get_primary_replica locator: service: add rebuild_v2 tablet transition kind gms: add REPAIR_BASED_TABLET_REBUILD cluster feature	2025-04-09 21:35:37 +02:00
Benny Halevy	dfdca2d84e	locator: topology: drop unused calculate_datacenters Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23647	2025-04-08 19:04:56 +03:00
Aleksandra Martyniuk	acd32b24d3	locator: service: move to rebuild_v2 transition if cluster is upgraded If cluster is upgraded to version containing rebuild_v2 transition kind, move to this transition kind instead of rebuild.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	4a847df55c	locator: service: add rebuild_repair tablet transition stage Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. rebuild_repair is a stage that will be used to perform the repair phase. It executes the tablet repair on tablet_info::replicas. A primary replica out of migration_streraming_info::read_from is the repair master. If the repair succeeds, we move to streaming tablet transition stage, and to cleanup_target - if it fails. The repair bypasses the tablet repair scheduler and it does not update the repair_time. A transition to the rebuild_repair stage will be added in the following patches.	2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk	5d6041617b	locator: add maybe_get_primary_replica Add maybe_get_primary_replica to choose a primary replica out of custom replica set.	2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk	ed7b8bb787	locator: service: add rebuild_v2 tablet transition kind Currently, in the streaming stage of rebuild tablet transition, we stream tablet data from all replicas. This patch series splits the streaming stage into two phases: - repair phase, where we repair the tablet; - streaming phase, where we stream tablet data from one replica. To differentiate the two streaming methods, a new tablet transition kind - rebuild_v2 - is added. The transtions and stages for rebuild_v2 transition kind will be added in the following patches.	2025-04-08 10:42:01 +02:00
Botond Dénes	1198213000	Merge 'tablets: Make tablet allocation equalize per-shard load ' from Tomasz Grabiec Before, it was equalizing per-node load (tablet count), which is wrong in heterogeneous clusters. Nodes with fewer shards will end up with overloaded shards. Refs #23378 Closes scylladb/scylladb#23478 * github.com:scylladb/scylladb: tablets: Make tablet allocation equalize per-shard load tablets: load_balancer: Fix reporting of total load per node	2025-04-03 16:32:53 +03:00
Gleb Natapov	4609bbbbb2	treewide: move gossiper to index nodes by host id This patch changes gossiper to index nodes by host ids instead of ips. The main data structure that changes is _endpoint_state_map, but this results in a lot of changes since everything that uses the map directly or indirectly has to be changed. The big victim of this outside of the gossiper itself is topology over gossiper code. It works on IPs and assumes the gossiper does the same and both need to be changed together. Changes to other subsystems are much smaller since they already mostly work on host ids anyway.	2025-03-31 16:50:50 +03:00
Tomasz Grabiec	6bff596fce	tablets: Make tablet allocation equalize per-shard load Before, it was equalizing per-node load (tablet count), which is wrong in heterogenous clusters. Nodes with fewer shards will end up with overloaded shards. Refs #23378	2025-03-31 14:34:30 +02:00
Dawid Mędrek	32879ec0d5	db/config: Introduce RF-rack-valid keyspaces We introduce a new term in the glossary: RF-rack-valid keyspace. We also highlight in our user documentation that all keyspaces must remain RF-rack-valid throughout their lifetime, and failing to guarantee that may result in data inconsistencies or other issues. We base that information on our experience with materialized views in keyspaces using tablets, even though they remain an experimental feature. Along with the new term, we introduce a new configuration option called `rf_rack_valid_keyspaces`, which, when enabled, will enforce preserving all keyspaces RF-rack-valid. That functionality will be implemented in upcoming commits. For now, we materialize the restriction in form of a named requirement: a function verifying that the passed keyspace is RF-rack-valid. The option is disabled by default. That will change once we adjust the existing tests to the new semantics. Once that is done, the option will first be enabled by default, and then it will be removed. Fixes scylladb/scylladb#20356	2025-03-19 14:46:35 +01:00
Tomasz Grabiec	c4714180cc	tablets: Make load balancing capacity-aware Before this patch the load balancer was equalizing tablet count per shard, so it achieved balance assuming that: 1) tablets have the same size 2) shards have the same capacity That can cause imbalance of utilization if shards have different capacity, which can happen in heterogenous clusters with different instance types. One of the causes for capacity difference is that larger instances run with fewer shards due to vCPUs being dedicated to IRQ handling. This makes those shards have more disk capacity, and more CPU power. After this patch, the load balancer equalizes shard's storage utilization, so it no longer assumes that shards have the same capacity. It still assummes that each tablet has equal size. So it's a middle step towards full size-aware balancing. One consequence is that to be able to balance, the load balancer need to know about every node's capacity, which is collected with the same RPC which collects load_stats for average tablet size. This is not a significant set back because migrations cannot proceed anyway if nodes are down due to barriers. We could make intra-node migration scheduling work without capacity information, but it's pointless due to above, so not implemented.	2025-03-06 13:35:38 +01:00
Tomasz Grabiec	7e7f1e6f91	storage_service, tablets: Collect per-node capacity in load_stats New RPC is introduced becuase load_stats was marked "final" in the IDL. Will be needed by capacity-aware load balancing.	2025-03-06 12:17:32 +01:00
Aleksandra Martyniuk	2b538d228c	locator: add round-robin selection of filtered replicas	2025-02-28 12:32:55 +01:00
Aleksandra Martyniuk	fe4e99d7b3	locator: add tablet_task_info::selected_by_filters Extract dcs and hosts filters check to a method.	2025-02-28 12:02:21 +01:00
Avi Kivity	d99df7af6c	Merge 'Respect per-shard tablet goal and 10x default per-shard tablet count' from Tomasz Grabiec This series achieves two things: 1) changes default number of tablet replicas per shard to be 10 in order to reduce load imbalance between shards This will result in new tables having at least 10 tablet replicas per shard by default. We want this to reduce tablet load imbalance due to differences in tablet count per shard, where some shards have 1 tablet and some shards have 2 tablets. With higher tablet count per shard, this difference-by-one is less relevant. Fixes https://github.com/scylladb/scylladb/issues/21967 2) introduces a global goal for tablet replica count per shard and adds logic to tablet scheduler to respect it by controlling per-table tablet count The per-shard goal is enforced by controlling average per-shard tablet replica count in a given DC, which is controlled by per-table tablet count. This is effective in respecting the limit on individual shards as long as tablet replicas are distributed evenly between shards. There is no attempt to move tablets around in order to enforce limits on individual shards in case of imbalance between shards. If the average per-shard tablet count exceeds the limit, all tables which contribute to it (have replicas in the DC) are scaled down by the same factor. Due to rounding up to the nearest power of 2, we may overshoot the per-shard goal by at most a factor of 2. The scaling is applied after computing desired tablet count due to all other factors: per-table tablet count hints, defaults, average tablet size. If different DCs want different scale factors of a given table, the lowest scale factor is chosen for a given table. When creating a new table, its tablet count is determined by tablet scheduler using the scheduler logic, as if the table was already created. So any scaling due to per-shard tablet count goal is reflected immediately when creating a table. It may however still take some time for the system to shrink existing tables. We don't reject requests to create new tables. Fixes #21458 Closes scylladb/scylladb#22522 * github.com:scylladb/scylladb: config, tablets: Allow tablets_initial_scale_factor to be a fraction test: tablets_test: Test scaling when creating lots of tables test: tablets_test: Test tablet count changes on per-table option and config changes test: tablets_test: Add support for auto-split mode test: cql_test_env: Expose db config config: Make tablets_initial_scale_factor live-updateable tablets: load_balancer: Pick initial_scale_factor from config tablets, load_balancer: Fix and improve logging of resize decisions tablets, load_balancer: Log reason for target tablet count tablets: load_balancer: Move hints processing to tablet scheduler tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table tablets: load_balancer: Determine desired count from size separately from count from options tablets: load_balancer: Determine resize decision from target tablet count tablets: load_balancer: Allow splits even if table stats not available tablets: load_balancer: Extract make_sizing_plan() tablets: Add formatter for resize_decision::way_type tablets: load_balancer: Simplify resize_urgency_cmp() tablets: load_balancer: Keep config items as instance members locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology() tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard tablets: Set default initial tablet count scale to 10 tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology() tablets: load_balancer: Extract get_schema_and_rs() tablets: load_balancer: Drop test_mode	2025-02-24 17:59:26 +02:00
Raphael S. Carvalho	4d8a333a7f	storage_service: Don't retry split when table is dropped The split monitor wasn't handling the scenario where the table being split is dropped. The monitor would be unable to find the tablet map of such a table, and the error would be treated as a retryable one causing the monitor to fall into an endless retry loop, with sleeps in between. And that would block further splits, since the monitor would be busy with the retries. The fix is about detecting table was dropped and skipping to the next candidate, if any. Fixes #21859. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#22933	2025-02-20 10:13:55 +01:00
Tomasz Grabiec	029505b179	tablets: load_balancer: Move hints processing to tablet scheduler Hints have common meaning for all strategies, so the logic belongs more to make_sizing_plan(). As a side effect, we can reuse shard capacity computation across tables, which reduces computational complexity from O(tablesnodes) to O(tables DCs + nodes)	2025-02-19 16:29:07 +01:00
Tomasz Grabiec	f1bda8d4c1	tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal The limit is enforced by controlling average per-shard tablet replica count in a given DC, which is controlled by per-table tablet count. This is effective in respecting the limit on individual shards as long as tablet replicas are distributed evenly between shards. There is no attempt to move tablets around in order to enforce limits on individual shards in case of imbalance between shards. If the average per-shard tablet count exceeds the limit, all tables which contribute to it (have replicas in the DC) are scaled down by the same factor. Due to rounding up to the nearest power of 2, we may overshoot the per-shard goal by at most a factor of 2. If different DCs want different scale factors of a given table, the lowest scale factor is chosen for a given table. The limit is configurable. It's a global per-cluster config which controls how many tablet replicas per shard in total we consider to be still ok. It controls tablet allocator behavior, when choosing initial tablet count. Even though it's a per-node config, we don't support different limits per node. All nodes must have the same value of that config. It's similar in that regard to other scheduler config items like tablets_initial_scale_factor and target_tablet_size_in_bytes.	2025-02-19 16:29:07 +01:00
Tomasz Grabiec	94b5165ac7	tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table This makes decisions made by the scheduler consistent with decisions made on table creation, with regard to tablet count. We want to avoid over-allocation of tablets when table is created, which would then be reduced by the scheduler's scaling logic. Not just to avoid wasteful migrations post table creation, but to respect the per-shard goal. To respect the per-shard goal, the algorithm will no longer be as simple as looking at hints, and we want to share the algorithm between the scheduler and initial tablet allocator. So invoke the scheduler to get the tablet count when table is created.	2025-02-19 14:40:07 +01:00
Tomasz Grabiec	d3ffea77e6	tablets: load_balancer: Extract make_sizing_plan() Resize plan making will now happen in two stages: 1) Determine desired tablet counts per table (sizing plan) 2) Schedule resize decisions We need intermediate step in the resize plan making, which gives us the planned tablet counts, so that we can plug this part of the algorithm into initial tablet allocation on table construction. We want decisisons made by the scheduler to be consistent with decisions made on table creation. We want to avoid over-allocation of tablets when table is created, which would then be reduced by the scheduler. Not just to avoid wasteful migrations post table creation, but to respect the per-shard goal. To respect the per-shard goal, the algorithm will no longer be as simple as looking at hints, and we want to share the algorithm between the scheduler and initial tablet allocator. Also, this sizing plan will be later plugged into a virtual table for observability.	2025-02-19 14:40:06 +01:00
Tomasz Grabiec	33db0d4fea	tablets: Add formatter for resize_decision::way_type	2025-02-19 14:39:40 +01:00
Tomasz Grabiec	ce959818a3	locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology()	2025-02-19 14:38:50 +01:00
Tomasz Grabiec	f043c83ba5	tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard Currently the scale is applied post rounding up of tablet count so that tablet count per shard is at least 1. In order to be able to use the scale to increase tablet count per shard, we need to apply it prior to division by RF, otherwise we will overshoot per-shard tablet replica count. Example: 4 nodes, -c1, rf=3, initial_tablets_scale=10 Before: initial_tablet_count=20, tablet-per-shard=15 After: initial_tablet_count=14, tablets-per-shard=10.5	2025-02-19 14:38:50 +01:00
Tomasz Grabiec	8eedb551b5	tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology() To insert preemption points later.	2025-02-19 14:38:49 +01:00
Botond Dénes	3439d015cb	Merge 'repair: Introduce Host and DC filter support' from Aleksandra Martyniuk Currently, the tablet repair scheduler repairs all replicas of a tablet. It does not support hosts or DCs selection. It should be enough for most cases. However, users might still want to limit the repair to certain hosts or DCs in production. https://github.com/scylladb/scylladb/pull/21985 added the preparation work to add the config options for the selection. This patch adds the hosts or DCs selection support. Fixes https://github.com/scylladb/scylladb/issues/22417 New feature. No backport is needed. Closes scylladb/scylladb#22621 * github.com:scylladb/scylladb: test: add test to check dcs and hosts repair filter test: add repair dc selection to test_tablet_metadata_persistence repair: Introduce Host and DC filter support docs: locator: update the docs and formatter of tablet_task_info	2025-02-17 10:04:09 +02:00
Calle Wilund	342df0b1a8	network_topology_strategy/alter ks: Remove dc:s from options once rf=0 Fixes #22688 If we set a dc rf to zero, the options map will still retain a dc=0 entry. If this dc is decommissioned, any further alters of keyspace will fail, because the union of new/old options will now contained an unknown keyword. Change alter ks options processing to simply remove any dc with rf=0 on alter, and treat this as an implicit dc=0 in nw-topo strategy. This means we change the reallocate_tablets routine to not rely on the strategy objects dc mapping, but the full replica topology info for dc:s to consider for reallocation. Since we verify the input on attribute processing, the amount of rf/tablets moved should still be legal. v2: * Update docs as well. v3: * Simplify dc processing * Reintroduce options empty check, but do early in ks_prop_defs * Clean up unit test some Closes scylladb/scylladb#22693	2025-02-15 20:32:22 +02:00
Asias He	5545289bfa	repair: Introduce Host and DC filter support Currently, the tablet repair scheduler repairs all replicas of a tablet. It does not support hosts or DCs selection. It should be enough for most cases. However, users might still want to limit the repair to certain hosts or DCs in production. #21985 added the preparation work to add the config options for the selection. This patch adds the hosts or DCs selection support. Fixes #22417	2025-02-14 09:13:11 +01:00
Aleksandra Martyniuk	4c75701756	docs: locator: update the docs and formatter of tablet_task_info	2025-02-14 09:13:11 +01:00
Kefu Chai	a18069fad7	service: migrate from boost::range::remove_if() to std::ranges::remove_if Replace boost::range::remove_if() with the standard library's std::ranges::remove_if() to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-02-11 09:15:14 +08:00
Benny Halevy	021fc3c756	tablets: resize_decision: get rid of initial_decision Now, with tablet_hints calculation of min_tablet_count it is not used anymore. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 18:43:47 +02:00
Benny Halevy	20c6ca2813	tablet_allocator: consider tablet options for resize decision Do not merge tablets if that would drop the tablet_count below the minimum provided by hints. Split tablets if the current tablet_count is less than the minimum tablet count calculated using the table's tablet options. TODO: override min_tablet_count if the tablet count per shard is greater than the maximum allowed. In this case the tables tablet counts should be scaled down proportionally. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 18:43:35 +02:00
Benny Halevy	32c2f7579f	network_topology_strategy: allocate_tablets_for_new_table: consider tablet options Use the keyspace initial_tablets for min_tablet_count, if the latter isn't set, then take the maximum of the option-based tablet counts: - min_tablet_count - and expected_data_size_in_gb / target_tablet_size - min_per_shard_tablet_count (via calculate_initial_tablets_from_topology) If none of the hints produce a positive tablet_count, fall back to calculate_initial_tablets_from_topology * initial_scale. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:59:32 +02:00
Benny Halevy	86bcf4cffe	network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner Current implementation is inefficient as it calls get_datacenter_token_owners_ips and then find_node(ep) while for_each_node easily provides a host_id for is_normal_token_owner. Then, since we're interested only in datacenters configure with a replication factor (but it still might be 0), simply iterate over the dc->rf map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:59:30 +02:00
Benny Halevy	49dacb1d52	network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0 Currently, if a datacenter has no replication_factor option we consider its replication factor to be 1 in calculate_initial_tablets_from_topology, but since we're not going to have any replica on it, it should be 0. This is very minor since in the worst case, it will pessimize the calculation and calculate a value for initial_tablets that's higher than it could be. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-02-06 08:55:51 +02:00
Tomasz Grabiec	3bb19e9ac9	locator: network_topology_startegy: Ignore leaving nodes when computing capacity for new tables For example, nodes which are being decommissioned should not be consider as available capacity for new tables. We don't allocate tablets on such nodes. Would result in higher per-shard load then planned. Closes scylladb/scylladb#22657	2025-02-05 23:59:41 +02:00
Tomasz Grabiec	e22e3b21b1	locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes In that case, new_racks will be used, but when we discover no candidates, we try to pop from existing_racks. Fixes #22625 Closes scylladb/scylladb#22652	2025-02-05 20:13:05 +02:00
Pavel Emelyanov	83f3821f99	Merge 'cql: clean the code validating replication strategy options' from Piotr Smaron Clean the code validating if a replication strategy can be used. This PR consists of a bunch of unmerged https://github.com/scylladb/scylladb/pull/20088 commits - the solution to the problem that the linked PR tried to solve has been accomplished in another PR, leaving the refactor commits unmerged. The commits introduced in this PR have already been reviewed in the old PR. No need to backport, it's just a refactor. Closes scylladb/scylladb#22516 * github.com:scylladb/scylladb: cql: restore validating replication strategies options cql: change validating NetworkTopologyStrategy tags to internal_error cql: inline abstract_replication_strategy::validate_replication_strategy cql: clean redundant code validating replication strategy options	2025-02-05 11:18:50 +03:00
Piotr Smaron	2953d3ebe0	cql: restore validating replication strategies options `validate_options` needs to be extended with `topology` parameter, because NetworkTopologyStrategy needs to validate if every explicitly listed DC is really existing. I did cut corner a bit and trimmed the message thrown when it's not the case, just to avoid passing and extra parameter (ks name) to the `validate_options` function, as I find the longer message to be a bit redundant (the driver will receive info which KS modification failed). The tests that have been commented out in the previous commit have been restored.	2025-02-04 12:27:33 +01:00
Piotr Smaron	100e8d2856	cql: change validating NetworkTopologyStrategy tags to internal_error The check for `replication_factor` tag in `network_topology_strategy::validate_options` is redundant for 2 reasons: - before we reach this part of the code, the `replication_factor` tag is replaced with specific DC names - we actually do allow for `replication_factor` tag in NetworkTopologyStrategy for keyspaces that have tablets disabled. This code is unreachable, hence changing it to an internal error, which means this situation should never occur. The place that unrolls `replication_factor` tag checked for presence of this tag ignoring the case, which lead to an unexpected behaviour: - `replication_factor` tag (note the lowercase) was unrolled, as explained above, - the same tag but written in any other case resulted in throwing a vague message: "replication_factor is an option for SimpleStrategy, not NetworkTopologyStrategy". So we're changing this validation to accept and unroll only the lowercase version of this tag. We can't ignore the case here, as this tag is present inside a json, and json is case-sensitive, even though the CQL itself is case insensitive. Added a test that passes for both scylla and cassandra. Fixes: #15336	2025-02-04 12:27:29 +01:00

1 2 3 4 5 ...

997 Commits