To simplify future usage in
network_topology_strategy::add_tablets_in_dc(), which invokes
populate() for a given table that may be either new or preexisting.
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.
rebuild_repair is a stage that will be used to perform the repair
phase. It executes the tablet repair on tablet_info::replicas.
A primary replica out of migration_streaming_info::read_from is
the repair master. If the repair succeeds, we move to the streaming
tablet transition stage; if it fails, we move to cleanup_target.
The repair bypasses the tablet repair scheduler and it does not update
the repair_time.
A transition to the rebuild_repair stage will be added in the following
patches.
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.
To differentiate the two streaming methods, a new tablet transition
kind - rebuild_v2 - is added.
The transitions and stages for the rebuild_v2 transition kind will be
added in the following patches.
We introduce a new term in the glossary: RF-rack-valid keyspace.
We also highlight in our user documentation that all keyspaces
must remain RF-rack-valid throughout their lifetime, and failing
to guarantee that may result in data inconsistencies or other
issues. We base that information on our experience with materialized
views in keyspaces using tablets, even though they remain
an experimental feature.
Along with the new term, we introduce a new configuration option
called `rf_rack_valid_keyspaces`, which, when enabled, will enforce
that all keyspaces remain RF-rack-valid. That functionality will be
implemented in upcoming commits. For now, we materialize the
restriction in the form of a named requirement: a function verifying
that the passed keyspace is RF-rack-valid.
The option is disabled by default. That will change once we adjust
the existing tests to the new semantics. Once that is done, the option
will first be enabled by default, and then it will be removed.
Fixes scylladb/scylladb#20356
This series achieves two things:
1) changes default number of tablet replicas per shard to be 10 in order to reduce load imbalance between shards
This will result in new tables having at least 10 tablet replicas per
shard by default.
We want this to reduce tablet load imbalance due to differences in
tablet count per shard, where some shards have 1 tablet and some
shards have 2 tablets. With higher tablet count per shard, this
difference-by-one is less relevant.
Fixes https://github.com/scylladb/scylladb/issues/21967
2) introduces a global goal for tablet replica count per shard and adds logic to tablet scheduler to respect it by controlling per-table tablet count
The per-shard goal is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.
There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.
If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.
The scaling is applied after computing desired tablet count due to
all other factors: per-table tablet count hints, defaults, average tablet size.
If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.
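
The scale-down rule above can be sketched as follows. This is a minimal
illustration for a single DC, not the scheduler's actual code: the
`table_sizing` struct and both function names are hypothetical, and the
sketch assumes the goal is compared against the average replica count per
shard.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-table input: desired tablet count (after hints, defaults
// and size-based sizing) and the number of replicas per tablet in this DC.
struct table_sizing {
    std::size_t tablet_count;
    std::size_t replicas_in_dc;
};

// Smallest power of two that is >= n (for n >= 1).
inline std::size_t round_up_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

// Scales all tables by the same factor when the average per-shard replica
// count exceeds the goal, then rounds each count up to a power of two.
inline std::vector<std::size_t> scale_to_per_shard_goal(
        const std::vector<table_sizing>& tables,
        std::size_t shards_in_dc,
        double per_shard_goal) {
    assert(shards_in_dc > 0 && per_shard_goal > 0);
    double total_replicas = 0;
    for (const auto& t : tables) {
        total_replicas += static_cast<double>(t.tablet_count) * t.replicas_in_dc;
    }
    const double avg_per_shard = total_replicas / shards_in_dc;
    const double factor = avg_per_shard > per_shard_goal
            ? per_shard_goal / avg_per_shard
            : 1.0;

    std::vector<std::size_t> result;
    result.reserve(tables.size());
    for (const auto& t : tables) {
        auto scaled = static_cast<std::size_t>(std::ceil(t.tablet_count * factor));
        result.push_back(round_up_pow2(std::max<std::size_t>(scaled, 1)));
    }
    return result;
}
```

With tables of 256 and 128 tablets, 3 replicas each, 30 shards and a goal
of 10, the sketch yields 128 and 64 tablets. That is 19.2 replicas per
shard on average, which shows the overshoot caused by rounding up while
staying below twice the goal.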
When creating a new table, its tablet count is determined by tablet
scheduler using the scheduler logic, as if the table was already created.
So any scaling due to per-shard tablet count goal is reflected immediately
when creating a table. It may however still take some time for the system
to shrink existing tables. We don't reject requests to create new tables.
Fixes #21458
Closes scylladb/scylladb#22522
* github.com:scylladb/scylladb:
config, tablets: Allow tablets_initial_scale_factor to be a fraction
test: tablets_test: Test scaling when creating lots of tables
test: tablets_test: Test tablet count changes on per-table option and config changes
test: tablets_test: Add support for auto-split mode
test: cql_test_env: Expose db config
config: Make tablets_initial_scale_factor live-updateable
tablets: load_balancer: Pick initial_scale_factor from config
tablets, load_balancer: Fix and improve logging of resize decisions
tablets, load_balancer: Log reason for target tablet count
tablets: load_balancer: Move hints processing to tablet scheduler
tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table
tablets: load_balancer: Determine desired count from size separately from count from options
tablets: load_balancer: Determine resize decision from target tablet count
tablets: load_balancer: Allow splits even if table stats not available
tablets: load_balancer: Extract make_sizing_plan()
tablets: Add formatter for resize_decision::way_type
tablets: load_balancer: Simplify resize_urgency_cmp()
tablets: load_balancer: Keep config items as instance members
locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology()
tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard
tablets: Set default initial tablet count scale to 10
tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology()
tablets: load_balancer: Extract get_schema_and_rs()
tablets: load_balancer: Drop test_mode
The split monitor wasn't handling the scenario where the table being
split is dropped. The monitor would be unable to find the tablet map
of such a table, and the error would be treated as a retryable one
causing the monitor to fall into an endless retry loop, with sleeps
in between. And that would block further splits, since the monitor
would be busy with the retries. The fix detects that the table
was dropped and skips to the next candidate, if any.
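
A minimal sketch of the intended control flow, assuming a dropped table
surfaces as a distinct error. The `table_dropped_error` type and the
`split_next_candidate()` helper are hypothetical names used only for
illustration:

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical error raised when the tablet map of a table cannot be found
// because the table was dropped.
struct table_dropped_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Tries each candidate in order. A dropped table is skipped rather than
// retried. Returns the id of the first table that was split, or -1 if
// there was no usable candidate.
inline int split_next_candidate(const std::vector<int>& candidates,
                                const std::function<void(int)>& split_table) {
    assert(static_cast<bool>(split_table));
    for (int table_id : candidates) {
        try {
            split_table(table_id);
            return table_id;
        } catch (const table_dropped_error&) {
            // The table is gone: move on instead of retrying forever.
            continue;
        }
    }
    return -1;
}
```

Other exception types propagate to the caller, so genuinely retryable
errors can still be handled by the surrounding retry loop.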
Fixes #21859.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#22933
Currently, the tablet repair scheduler repairs all replicas of a tablet.
It does not support host or DC selection. This should be enough for most
cases. However, users might still want to limit the repair to certain
hosts or DCs in production. #21985 added the preparation work to add the
config options for the selection. This patch adds support for host and DC
selection.
Fixes #22417
Most of the code from `recognized_options` is either incorrect or lacks
any implementation, for example:
- comments for Everywhere and Local strategies are contradictory, first
says to allow all options, second says that the strategy doesn't accept
any options, even though both functions have the same implementation,
- for Local & Everywhere strategies the same logic is repeated in
`validate_options` member functions, i.e. this function does nothing,
- for NetworkTopology this function returns DC names and tablet options, but tablet
options are empty; OTOH this strategy also accepts 'replication_factor'
tag, which was omitted,
- for SimpleStrategy this function returns `replication_factor`, but this is also validated
in `validate_options` function called just before the removed
function.
All of it makes `validate_replication_strategy` work incorrectly.
That being said, 3 tests fail because of this logic's removal, so it did
something after all. The failing tests are commented out, so that the CI
passes, and will be restored in the next commit(s).
In this change, tablet_virtual_task starts supporting tablet
resize (i.e. split and merge).
Users can see running resize tasks - finished tasks are not
presented with the task manager API.
A new task state "suspended" is added. If a resize was revoked,
it will appear to users as suspended. We assume that the resize was revoked
when the tablet number didn't change.
Fixes: #21366.
Fixes: #21367.
No backport, new feature
Closes scylladb/scylladb#21891
* github.com:scylladb/scylladb:
test: boost: check resize_task_info in tablet_test.cc
test: add tests to check revoked resize virtual tasks
test: add tests to check the list of resize virtual tasks
test: add tests to check spilt and merge virtual tasks status
test: test_tablet_tasks: generalize functions
replica: service: add split virtual task's children
replica: service: pass parent info down to storage_group::split
tasks: children of virtual tasks aren't internal by default
tasks: initialize shard in task_info ctor
service: extend tablet_virtual_task::abort
service: retrun status_helper struct from tablet_virtual_task::get_status_helper
service: extend tablet_virtual_task::wait
tasks: add suspended task state
service: extend tablet_virtual_task::get_status
service: extend tablet_virtual_task::contains
service: extend tablet_virtual_task::get_stats
service: add service::task_manager_module::get_nodes
tasks: add task_manager::get_nodes
tasks: drop noexcept from module::get_nodes
replica: service: add resize_task_info static column to system.tablets
locator: extend tablet_task_info to cover resize tasks
Add resize_task_info static column to system.tablets. Set or delete
resize_task_info value when the resize_decision is changed.
Reflect the column content in tablet_map.
"
This series converts repair, streaming and node_ops (and some parts of
alternator) to work on host ids instead of ips. This allows removing
a lot of (but not all) functions that work on ips from effective
replication map.
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13830/
Refs: scylladb/scylladb#21777
"
* 'gleb/move-to-host-id-more' of github.com:scylladb/scylla-dev:
locator: topology: remove no longer use get_all_ips()
gossiper: change get_unreachable_nodes to host ids
locator: drop no longer used ip based functions from effective replication map and friends
test: move network_topology_strategy_test and token_metadata_test to use host id based APIs
replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges
replica/mutation_dump: use host ids instead of ips
alternator: move ttl to work with host ids instead of ips
storage_service: move node_ops code to use host ids instead of host ips
streaming: move streaming code to use host ids instead of host ips
repair: move repair code to use host ids instead of host ips
gossiper: add get_unreachable_host_ids() function
locator: topology: add more function that return host ids to effective replication map
locator: add more function that return host ids to effective replication map
Add migration_task_info column to system.tablets. Set migration_task_info
value on migration request if the feature is enabled in the cluster.
Reflect the column content in tablet_metadata.
The goal of merge is to reduce the tablet count for a shrinking table, similar to how split increases the count while the table is growing. The load balancer's decision to merge is already implemented (it came with the infrastructure introduced for split), but it wasn't handled until now.
The initial tablet count is respected while the table is in "growing mode". For example, the table leaves that mode if there was a need to split above the initial tablet count. After the table leaves the mode, the average size can be trusted to determine that the table is shrinking. A merge decision is emitted if the average tablet size is 50% of the target. Hysteresis is applied to avoid oscillations between splits and merges.
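
The decision rule can be sketched as below. The merge threshold (50% of the target) and the growing-mode guard come from the description above; the split threshold of twice the target and the name `decide_resize()` are assumptions made for this illustration only.

```cpp
#include <cassert>
#include <cstdint>

enum class resize_way { none, split, merge };

// Hypothetical decision helper. Merging at or below 50% of the target is
// from the description; splitting above 2x the target is an assumption.
inline resize_way decide_resize(std::uint64_t avg_tablet_size,
                                std::uint64_t target_tablet_size,
                                bool in_growing_mode) {
    assert(target_tablet_size > 0);
    if (avg_tablet_size > 2 * target_tablet_size) {
        return resize_way::split;
    }
    // While the table is in growing mode the initial tablet count is
    // respected, so a small average size does not trigger a merge.
    if (!in_growing_mode && avg_tablet_size <= target_tablet_size / 2) {
        return resize_way::merge;
    }
    return resize_way::none;
}
```

Under these assumed thresholds, a merge roughly doubles the average size to about the target, which is well below the split threshold, and a split halves it to about the target, which is above the merge threshold. That gap is one way to obtain the hysteresis mentioned above.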
Similar to split, the decision to merge is recorded in tablet map's resize_type field with the string "merge". This is important in case of coordinator failover, so new coordinator continues from where the old left off.
Unlike split, the preparation phase during merge is not done by the replica (with split compactions), but rather by the coordinator by co-locating sibling tablets in the same node's shard. We can define sibling tablets as tablets that have contiguous range and will become one after merge. The concept is based on the power-of-two constraint and token contiguity. For example, in a table with 4 tablets, tablets of ids 0 and 1 are siblings, 2 and 3 are also siblings.
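
A minimal sketch of the sibling relation described above, assuming tablet ids are assigned in token order and the tablet count is a power of two. The function names are illustrative and do not reflect the signatures used in the tree:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Returns the sibling pair that contains `id`, as (even id, odd id).
// Assumes the tablet count is a power of two and greater than 1.
inline std::pair<std::size_t, std::size_t> sibling_tablets(std::size_t id,
                                                           std::size_t tablet_count) {
    assert(tablet_count > 1 && (tablet_count & (tablet_count - 1)) == 0);
    assert(id < tablet_count);
    const std::size_t even = id & ~std::size_t(1);
    return {even, even + 1};
}

// Id of the tablet that the sibling pair becomes after the merge.
inline std::size_t merged_tablet_id(std::size_t id) {
    return id >> 1;
}
```

For a table with 4 tablets, ids 0 and 1 map to the pair (0, 1) and become tablet 0 after the merge, while ids 2 and 3 map to (2, 3) and become tablet 1.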
The algorithm for co-locating sibling tablets is very simple. The balancer is responsible for it, and it will emit migrations so that the "odd" tablet follows the "even" one. For example, tablet 1 will be migrated to where tablet 0 lives. Co-location is low in priority: it's not the end of the world to delay a merge, but it's not ideal to delay e.g. decommission or even regular load balancing, as that can translate into temporary imbalance, impacting user activities. So co-location migrations will happen when there is no more important work to do.
While regular balancing is higher in priority, it will not undo the co-location work done so far. It does that by treating co-located tablets as if they were already merged. The load inversion convergence check was adjusted so that the balancer understands when two tablets are being migrated instead of one, to avoid oscillations.
When the balancer completes co-location work for a table undergoing merge, it will put the id of the table into the resize_plan, which communicates to the topology coordinator that the table is ready for merge. With all sibling tablets co-located, the coordinator can resize the tablet map (reduce it by a factor of 2) and record the new map into group0. All the replicas will react to it (on token metadata update) by merging the storage (memtable(s) + sstables) of sibling tablets into one.
Fixes #18181.
system test details:
test: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/tablets_split_merge_test.py
yaml file: https://github.com/pehala/scylla-cluster-tests/blob/tablets_split_merge/test-cases/features/tablets/tablets-split-merge-test.yaml
instance type: i3.8xlarge
nodes: 3
target tablet size: 0.5G (scaled down by 10, to make it easier to trigger splits and merges)
description: multiple cycles of growing and shrinking the data set in order to trigger splits and merges.
data_set_size: ~100G
initial_tablets: 64, so it grew to 128 tablets on split, and back to 64 on merge.
latency of reads and writes that happened in parallel to split and merge:
```
$ for i in scylla-bench*; do cat $i | grep "Mode\|99th:\|99\.9th:"; done
Mode: write
99.9th: 3.145727ms
99th: 1.998847ms
99.9th: 3.145727ms
99th: 2.031615ms
Mode: read
99.9th: 3.145727ms
99th: 2.031615ms
99.9th: 3.145727ms
99th: 2.031615ms
Mode: write
99.9th: 3.047423ms
99th: 1.933311ms
99.9th: 3.047423ms
99th: 1.933311ms
Mode: read
99.9th: 3.145727ms
99th: 1.900543ms
99.9th: 3.145727ms
99th: 1.900543ms
Mode: write
99.9th: 5.079039ms
99th: 3.604479ms
99.9th: 35.389439ms
99th: 25.624575ms
Mode: write
99.9th: 3.047423ms
99th: 1.998847ms
99.9th: 3.047423ms
99th: 1.998847ms
Mode: read
99.9th: 3.080191ms
99th: 2.031615ms
99.9th: 3.112959ms
99th: 2.031615ms
```
Closes scylladb/scylladb#20572
* github.com:scylladb/scylladb:
docs: Document tablet merging
tests/boost: Add test to verify correctness of balancer decisions during merge
tests/topology_experimental_raft: Add tablet merge test
service: Handle exception when retrying split
service: Co-locate sibling tablets for a table undergoing merge
gms: Add cluster feature for tablet merge
service: Make merge of resize plan commutative
replica: Implement merging of compaction groups on merge completion
replica: Handle tablet merge completion
service: Implement tablet map resize for merge
locator: Introduce merge_tablet_info()
service: Rename topology::transition_state::tablet_split_finalization
service: Respect initial_tablet_count if table is in growing mode
service: Wire migration_tablet_set into the load balancer
locator: Add tablet_map::sibling_tablets()
service: Introduce sorted_replicas_for_tablet_load()
locator/tablets: Extend tablet_replica equality comparator to three-way
service: Introduce alias to per-table candidate map type
service: Add replication constraint check variant for migration_tablet_set
service: Add convergence check variant for migration_tablet_set
service: Add migration helpers for migration_tablet_set
service/tablet_allocator: Introduce migration_tablet_set
service: Introduce migration_plan::add(migrations_vector)
locator/tablets: Introduce tablet_map::for_each_sibling_tablets()
locator/tablets: Introduce tablet_map::needs_merge()
locator/tablets: Introduce resize_decision::initial_decision()
locator/tablets: Fix return type of three-way comparison operators
service: Extract update of node load on migrations
service: Extract converge check for intra-node migration
service: Extract erase of tablet replicas from candidate list
scripts/tablet-mon: Allow visualization of tablet id
Adding interface to iterate through sibling tablets for a given table,
one pair at a time.
Initially I thought of having for_each_sibling_tablet do nothing for single
tablet tables. But later I bumped into complications when wiring it into
load balancer for building candidate list, since single-tablet tables
have to be special cased.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Knowing whether a resize (e.g. split) decision was needed above the initial
tablet count will help guide the merge decision, since we don't
want a merge to happen while the table is still growing but hasn't left
the merge threshold yet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Modernize the codebase by replacing Boost range adaptors with C++23 standard library views,
reducing external dependencies and leveraging modern C++ language features.
Key Changes:
- Replace `boost::adaptors::filtered` with `std::views::filter`
- Remove `#include <boost/range/adaptor/filtered.hpp>`
- Utilize standard library range views
Motivation:
- Reduce project's external dependency footprint
- Leverage standard library's range and view capabilities
- Improve long-term code maintainability
- Align with modern C++ best practices
Implementation Challenges and Considerations:
1. Range Conversion and Move Semantics
- `std::ranges::to` adaptor requires rvalue references
- Necessitated updates to variable and parameter constness
- Example: `cql3/restrictions/statement_restrictions.cc` modified to remove `const`
from `common` to enable efficient range conversion
2. Range Iteration and Mutation
- Range views may mutate internal state during iteration
- Cannot pass ranges by const reference in some scenarios
- Solution: Pass ranges by rvalue reference to explicitly indicate
state invalidation
Limitations:
- One instance of `boost::adaptors::filtered` temporarily preserved
due to lack of a C++23 alternative for `boost::join()`
- A comprehensive replacement will be addressed in a follow-up change
This change is part of our ongoing effort to modernize the codebase,
reducing external dependencies and adopting modern C++ practices.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#21648
This adds a new tablet migration kind: repair. It allows tablet repair
scheduler to use this migration kind to schedule repair jobs.
The current repair scheduler implementation does the following:
- A tablet is picked to be repaired when the time since last repair is
bigger than a threshold (auto repair mode) or it is requested by user
(manual repair mode)
- The tablet repair can be scheduled along with tablet migration and
rebuild. It runs in the tablet_migration track.
- Repair jobs are scheduled so that at any point in time there are no
more than the configured number of jobs per shard, which is similar to
scylla manager's control.
In this patch, neither manual repair nor auto repair is
enabled yet.
Added `get_token_range_after_split()`, which returns the token range the
given token will belong to after a tablet split. This is required to
estimate the token ranges of resultant sstables after a split.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Implement `get_token_range()` to return the token range of the specified
tablet with the given `log2_tablets` size. This will be used to deduce
which range a token will end up in if the tablet is split.
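
A minimal sketch of the idea. It assumes tokens are signed 64-bit
integers, that a tablet id is formed by the top `log2_tablets` bits of the
token once the token space is shifted to start at zero, and that ranges
are reported as inclusive [first, last] pairs. The helper names and
signatures are hypothetical and simplified compared to the real ones:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Maps the signed token space onto [0, 2^64) while preserving order.
inline std::uint64_t to_unsigned_space(std::int64_t token) {
    return static_cast<std::uint64_t>(token) ^ (std::uint64_t(1) << 63);
}

// Inverse of to_unsigned_space().
inline std::int64_t to_token(std::uint64_t v) {
    return static_cast<std::int64_t>(v ^ (std::uint64_t(1) << 63));
}

// Tablet that owns `token` when there are 2^log2_tablets tablets.
inline std::uint64_t tablet_of(std::int64_t token, unsigned log2_tablets) {
    assert(log2_tablets < 64);
    return log2_tablets == 0 ? 0 : to_unsigned_space(token) >> (64 - log2_tablets);
}

// Inclusive [first, last] token range of `tablet` out of 2^log2_tablets.
inline std::pair<std::int64_t, std::int64_t> token_range(std::uint64_t tablet,
                                                         unsigned log2_tablets) {
    assert(log2_tablets < 64 && tablet < (std::uint64_t(1) << log2_tablets));
    if (log2_tablets == 0) {
        return {INT64_MIN, INT64_MAX};
    }
    const unsigned shift = 64 - log2_tablets;
    const std::uint64_t first = tablet << shift;
    const std::uint64_t last = first + ((std::uint64_t(1) << shift) - 1);
    return {to_token(first), to_token(last)};
}

// Range the token will belong to once every tablet is split in two.
inline std::pair<std::int64_t, std::int64_t> token_range_after_split(
        std::int64_t token, unsigned log2_tablets) {
    return token_range(tablet_of(token, log2_tablets + 1), log2_tablets + 1);
}
```

For a single-tablet table, token -1 lands in [INT64_MIN, -1] after the
split and token 0 lands in [0, INT64_MAX].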
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Implement `get_last_token()`, which returns the largest token owned
by the specified tablet with the given `log2_tablets` size. This will be
used to deduce token ranges for a tablet with any arbitrary
`tablet_count`.
Also, update the existing public `get_last_token()` to utilize the new
variant.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
When there are zero tablets, tablet_metadata::_balancing_enabled
is ignored in the copy.
The property not being preserved can result in balancer not
respecting user's wish to disable balancing when a replica is
created later on.
Fixes #21175.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#21177
There are two bits that control whether the replication strategy for a
keyspace will use tablets or not -- the configuration option and CQL
parameter. This patch tunes its parsing to implement the logic shown
below:
    if (strategy.supports_tablets) {
        if (cql.with_tablets == on) {
            if (cfg.enable_tablets) {
                return create_keyspace_with_tablets();
            } else {
                throw "tablets are not enabled";
            }
        } else if (cql.with_tablets == off) {
            return create_keyspace_without_tablets();
        } else { // cql.with_tablets is not specified
            if (cfg.enable_tablets) {
                return create_keyspace_with_tablets();
            } else {
                return create_keyspace_without_tablets();
            }
        }
    } else { // strategy doesn't support tablets
        if (cql.with_tablets == on) {
            throw "invalid cql parameter";
        } else if (cql.with_tablets == off) {
            return create_keyspace_without_tablets();
        } else { // cql.with_tablets is not specified
            return create_keyspace_without_tablets();
        }
    }
closes: #20088
In order to enable tablets "by default" for NetworkTopologyStrategy,
there's an explicit check near ks_prop_defs::get_initial_tablets(), which
is not very nice. Fixing it needs more care, e.g. providing a feature
service reference to the abstract_replication_strategy constructor. But
since the ks_prop_defs code already hijacks options specifically for that
strategy type (see the prepare_options() helper), it's OK for now.
There's also the #20768 misbehavior, which is preserved in this patch but
should be fixed eventually as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#20779
To prevent stalls due to a large number of tokens.
For example, a large cluster with, say, 70 nodes can have
more than 16K tokens.
Fixes #19757
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Extract a hint of what a tablet mutation changed. The hint can be later
used to selectively reload only the changed parts from disk.
Two variants are added:
* get_tablet_metadata_change_hint() - extracts a hint from a list of
tablet mutations
* update_tablet_metadata_change_hint() - updates an existing hint based
on a single mutation, allowing for incremental hint extraction
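
A minimal sketch of how the two variants can relate to each other. The
function names are the ones introduced here, but the `tablet_mutation`
struct, the hint layout and the signatures are hypothetical
simplifications for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical, simplified view of a tablet mutation: the table it belongs
// to and the tablet it touches, identified by the tablet's last token.
struct tablet_mutation {
    int table_id;
    std::int64_t last_token;
};

// Hypothetical hint layout: changed tablets grouped by table.
struct tablet_metadata_change_hint {
    std::map<int, std::set<std::int64_t>> tables;
};

// Incremental variant: folds a single mutation into an existing hint.
inline void update_tablet_metadata_change_hint(tablet_metadata_change_hint& hint,
                                               const tablet_mutation& m) {
    hint.tables[m.table_id].insert(m.last_token);
}

// Batch variant: builds a hint from a list of mutations.
inline tablet_metadata_change_hint get_tablet_metadata_change_hint(
        const std::vector<tablet_mutation>& mutations) {
    tablet_metadata_change_hint hint;
    for (const auto& m : mutations) {
        update_tablet_metadata_change_hint(hint, m);
    }
    return hint;
}
```

A reload step could then visit only the tables and tablets listed in the
hint instead of re-reading all tablet metadata.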
Keep lw_shared_ptr<tablet_map> in the tablet map and use COW semantics.
To prevent accidental changes to shared tablet_map instances, all
modifications to a tablet_map have to go through a new
`mutate_tablet_map()` method, which implements the copy-modify-swap
idiom.
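
A minimal sketch of the copy-modify-swap idiom, with std::shared_ptr
standing in for lw_shared_ptr. The class layout, the integer table id and
the signatures are simplified assumptions, not the actual definitions:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>

struct tablet_map {
    std::size_t tablet_count;
};

class tablet_metadata {
    // Pointers are to const maps, so a shared instance cannot be modified
    // by accident. Copies of tablet_metadata share the same instances.
    std::map<int, std::shared_ptr<const tablet_map>> _tablets;
public:
    void set_tablet_map(int table_id, tablet_map map) {
        _tablets[table_id] = std::make_shared<const tablet_map>(std::move(map));
    }

    std::shared_ptr<const tablet_map> get_tablet_map(int table_id) const {
        return _tablets.at(table_id);
    }

    // Copy-modify-swap: holders of the old pointer never see the change.
    template <typename Func>
    void mutate_tablet_map(int table_id, Func&& func) {
        auto copy = std::make_shared<tablet_map>(*_tablets.at(table_id));
        func(*copy);
        _tablets[table_id] = std::move(copy);
    }
};
```

Copying a `tablet_metadata` in this sketch copies only pointers; a tablet
map is duplicated only when it is mutated through `mutate_tablet_map()`.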
The API already promises this, the comment on effective_replication_map says:
"Excludes replicas which are in the left state".
Tablet replicas on the replaced node are rebuilt after the node
already left. We may no longer have the IP mapping for the left node
so we should not include that node in the replica set. Otherwise,
storage_proxy may try to use the empty IP and fail:
storage_proxy - No mapping for :: in the passed effective replication map
It's fine to not include it, because storage proxy uses keyspace RF
and not replica list size to determine quorum. The node is not coming
up, so no one should need to contact it.
Users who need replica list stability should use the host_id-based API.
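
A minimal sketch of the filtering and of why it does not affect quorum.
The `tablet_replica_info` struct and both helper names are hypothetical;
the only facts taken from the description are that left nodes are excluded
and that quorum is derived from the keyspace RF:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class node_state { normal, left };

// Hypothetical replica record: host identifier plus the node's state.
struct tablet_replica_info {
    int host_id;
    node_state state;
};

// Returns the hosts of all replicas except those in the left state.
inline std::vector<int> filter_left_replicas(
        const std::vector<tablet_replica_info>& replicas) {
    std::vector<int> result;
    for (const auto& r : replicas) {
        if (r.state != node_state::left) {
            result.push_back(r.host_id);
        }
    }
    return result;
}

// Quorum depends on the keyspace RF, not on the filtered list's size.
inline std::size_t quorum_for(std::size_t replication_factor) {
    return replication_factor / 2 + 1;
}
```

With RF=3 and one replica on a left node, the filtered list has two
entries while the quorum stays at 2.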
Fixes #18843
Closes scylladb/scylladb#18955
* github.com:scylladb/scylladb:
tablets: Filter-out left nodes in get_natural_endpoints()
test: pylib: Extract start_writes() load generator utility
Currently, the function needlessly copies the tablet_info
(all tablet replicas in particular) to a local variable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>