Commit Graph

101 Commits

Author SHA1 Message Date
Michael Litvak
c11a2e2aaf tablets: deallocate storage state on end_migration
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:

* When the stage is updated from `cleanup` to `end_migration`, the
  storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
  then we don't allocate a storage group for it. This happens for
  example if the leaving replica is restarted during tablet migration.
  If it's initialized in `cleanup` stage then we allocate a storage
  group, and it will be deallocated when transitioning to
  `end_migration`.

This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.

It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.

Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.

This fixes the following issue:

1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfuly
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it was already cleaned up
4. the storage group remains allocated on the leaving replica after the
   migration is completed - it's not cleaned up properly.

Fixes scylladb/scylladb#23481

(cherry picked from commit 34f15ca871)
2025-06-17 13:59:10 +00:00
Tomasz Grabiec
2597a7e980 load_sketch: Tolerate missing tablet_map when selecting for a given table
To simplify future usage in
network_topology_strategy::add_tablets_in_dc() which invokes
populate() for a given table, which may be both new and preexisitng.
2025-04-17 16:01:16 +02:00
Aleksandra Martyniuk
acd32b24d3 locator: service: move to rebuild_v2 transition if cluster is upgraded
If cluster is upgraded to version containing rebuild_v2 transition
kind, move to this transition kind instead of rebuild.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
4a847df55c locator: service: add rebuild_repair tablet transition stage
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

rebuild_repair is a stage that will be used to perform the repair
phase. It executes the tablet repair on tablet_info::replicas.
A primary replica out of migration_streraming_info::read_from is
the repair master. If the repair succeeds, we move to streaming
tablet transition stage, and to cleanup_target - if it fails.

The repair bypasses the tablet repair scheduler and it does not update
the repair_time.

A transition to the rebuild_repair stage will be added in the following
patches.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
5d6041617b locator: add maybe_get_primary_replica
Add maybe_get_primary_replica to choose a primary replica out of
custom replica set.
2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk
ed7b8bb787 locator: service: add rebuild_v2 tablet transition kind
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

To differentiate the two streaming methods, a new tablet transition
kind - rebuild_v2 - is added.

The transtions and stages for rebuild_v2 transition kind will be
added in the following patches.
2025-04-08 10:42:01 +02:00
Dawid Mędrek
32879ec0d5 db/config: Introduce RF-rack-valid keyspaces
We introduce a new term in the glossary: RF-rack-valid keyspace.

We also highlight in our user documentation that all keyspaces
must remain RF-rack-valid throughout their lifetime, and failing
to guarantee that may result in data inconsistencies or other
issues. We base that information on our experience with materialized
views in keyspaces using tablets, even though they remain
an experimental feature.

Along with the new term, we introduce a new configuration option
called `rf_rack_valid_keyspaces`, which, when enabled, will enforce
preserving all keyspaces RF-rack-valid. That functionality will be
implemented in upcoming commits. For now, we materialize the
restriction in form of a named requirement: a function verifying
that the passed keyspace is RF-rack-valid.

The option is disabled by default. That will change once we adjust
the existing tests to the new semantics. Once that is done, the option
will first be enabled by default, and then it will be removed.

Fixes scylladb/scylladb#20356
2025-03-19 14:46:35 +01:00
Tomasz Grabiec
7e7f1e6f91 storage_service, tablets: Collect per-node capacity in load_stats
New RPC is introduced becuase load_stats was marked "final" in the IDL.

Will be needed by capacity-aware load balancing.
2025-03-06 12:17:32 +01:00
Aleksandra Martyniuk
2b538d228c locator: add round-robin selection of filtered replicas 2025-02-28 12:32:55 +01:00
Aleksandra Martyniuk
fe4e99d7b3 locator: add tablet_task_info::selected_by_filters
Extract dcs and hosts filters check to a method.
2025-02-28 12:02:21 +01:00
Avi Kivity
d99df7af6c Merge 'Respect per-shard tablet goal and 10x default per-shard tablet count' from Tomasz Grabiec
This series achieves two things:

1) changes default number of tablet replicas per shard to be 10 in order to reduce load imbalance between shards

    This will result in new tables having at least 10 tablet replicas per
    shard by default.

    We want this to reduce tablet load imbalance due to differences in
    tablet count per shard, where some shards have 1 tablet and some
    shards have 2 tablets. With higher tablet count per shard, this
    difference-by-one is less relevant.

    Fixes https://github.com/scylladb/scylladb/issues/21967

2) introduces a global goal for tablet replica count per shard and adds logic to tablet scheduler to respect it by controlling per-table tablet count

    The per-shard goal is enforced by controlling average per-shard tablet replica
    count in a given DC, which is controlled by per-table tablet
    count. This is effective in respecting the limit on individual shards
    as long as tablet replicas are distributed evenly between shards.
    There is no attempt to move tablets around in order to enforce limits
    on individual shards in case of imbalance between shards.

    If the average per-shard tablet count exceeds the limit, all tables
    which contribute to it (have replicas in the DC) are scaled down
    by the same factor. Due to rounding up to the nearest power of 2,
    we may overshoot the per-shard goal by at most a factor of 2.

    The scaling is applied after computing desired tablet count due to
    all other factors: per-table tablet count hints, defaults, average tablet size.

    If different DCs want different scale factors of a given table, the
    lowest scale factor is chosen for a given table.

    When creating a new table, its tablet count is determined by tablet
    scheduler using the scheduler logic, as if the table was already created.
    So any scaling due to per-shard tablet count goal is reflected immediately
    when creating a table. It may however still take some time for the system
    to shrink existing tables. We don't reject requests to create new tables.

    Fixes #21458

Closes scylladb/scylladb#22522

* github.com:scylladb/scylladb:
  config, tablets: Allow tablets_initial_scale_factor to be a fraction
  test: tablets_test: Test scaling when creating lots of tables
  test: tablets_test: Test tablet count changes on per-table option and config changes
  test: tablets_test: Add support for auto-split mode
  test: cql_test_env: Expose db config
  config: Make tablets_initial_scale_factor live-updateable
  tablets: load_balancer: Pick initial_scale_factor from config
  tablets, load_balancer: Fix and improve logging of resize decisions
  tablets, load_balancer: Log reason for target tablet count
  tablets: load_balancer: Move hints processing to tablet scheduler
  tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
  tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table
  tablets: load_balancer: Determine desired count from size separately from count from options
  tablets: load_balancer: Determine resize decision from target tablet count
  tablets: load_balancer: Allow splits even if table stats not available
  tablets: load_balancer: Extract make_sizing_plan()
  tablets: Add formatter for resize_decision::way_type
  tablets: load_balancer: Simplify resize_urgency_cmp()
  tablets: load_balancer: Keep config items as instance members
  locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology()
  tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard
  tablets: Set default initial tablet count scale to 10
  tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology()
  tablets: load_balancer: Extract get_schema_and_rs()
  tablets: load_balancer: Drop test_mode
2025-02-24 17:59:26 +02:00
Raphael S. Carvalho
4d8a333a7f storage_service: Don't retry split when table is dropped
The split monitor wasn't handling the scenario where the table being
split is dropped. The monitor would be unable to find the tablet map
of such a table, and the error would be treated as a retryable one
causing the monitor to fall into an endless retry loop, with sleeps
in between. And that would block further splits, since the monitor
would be busy with the retries. The fix is about detecting table
was dropped and skipping to the next candidate, if any.

Fixes #21859.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#22933
2025-02-20 10:13:55 +01:00
Tomasz Grabiec
d3ffea77e6 tablets: load_balancer: Extract make_sizing_plan()
Resize plan making will now happen in two stages:

  1) Determine desired tablet counts per table (sizing plan)
  2) Schedule resize decisions

We need intermediate step in the resize plan making, which gives us
the planned tablet counts, so that we can plug this part of the
algorithm into initial tablet allocation on table construction.

We want decisisons made by the scheduler to be consistent with
decisions made on table creation. We want to avoid over-allocation of
tablets when table is created, which would then be reduced by the
scheduler. Not just to avoid wasteful migrations post table creation,
but to respect the per-shard goal. To respect the per-shard goal, the
algorithm will no longer be as simple as looking at hints, and we want
to share the algorithm between the scheduler and initial tablet
allocator.

Also, this sizing plan will be later plugged into a virtual table for
observability.
2025-02-19 14:40:06 +01:00
Tomasz Grabiec
33db0d4fea tablets: Add formatter for resize_decision::way_type 2025-02-19 14:39:40 +01:00
Asias He
5545289bfa repair: Introduce Host and DC filter support
Currently, the tablet repair scheduler repairs all replicas of a tablet.
It does not support hosts or DCs selection. It should be enough for most
cases. However, users might still want to limit the repair to certain
hosts or DCs in production. #21985 added the preparation work to add the
config options for the selection. This patch adds the hosts or DCs
selection support.

Fixes #22417
2025-02-14 09:13:11 +01:00
Benny Halevy
021fc3c756 tablets: resize_decision: get rid of initial_decision
Now, with tablet_hints calculation of min_tablet_count
it is not used anymore.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-06 18:43:47 +02:00
Benny Halevy
20c6ca2813 tablet_allocator: consider tablet options for resize decision
Do not merge tablets if that would drop the tablet_count
below the minimum provided by hints.
Split tablets if the current tablet_count is less than
the minimum tablet count calculated using the table's tablet options.

TODO: override min_tablet_count if the tablet count per shard
is greater than the maximum allowed.  In this case
the tables tablet counts should be scaled down proportionally.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-06 18:43:35 +02:00
Botond Dénes
47989b1503 Merge 'tasks: add tablet resize virtual task' from Aleksandra Martyniuk
In this change, tablet_virtual_task starts supporting tablet
resize (i.e. split and merge).

Users can see running resize tasks - finished tasks are not
presented with the task manager API.

A new task state "suspended" is added. If a resize was revoked,
it will appear to users as suspended. We assume that the resize was revoked
when the tablet number didn't change.

Fixes: #21366.
Fixes: #21367.

No backport, new feature

Closes scylladb/scylladb#21891

* github.com:scylladb/scylladb:
  test: boost: check resize_task_info in tablet_test.cc
  test: add tests to check revoked resize virtual tasks
  test: add tests to check the list of resize virtual tasks
  test: add tests to check spilt and merge virtual tasks status
  test: test_tablet_tasks: generalize functions
  replica: service: add split virtual task's children
  replica: service: pass parent info down to storage_group::split
  tasks: children of virtual tasks aren't internal by default
  tasks: initialize shard in task_info ctor
  service: extend tablet_virtual_task::abort
  service: retrun status_helper struct from tablet_virtual_task::get_status_helper
  service: extend tablet_virtual_task::wait
  tasks: add suspended task state
  service: extend tablet_virtual_task::get_status
  service: extend tablet_virtual_task::contains
  service: extend tablet_virtual_task::get_stats
  service: add service::task_manager_module::get_nodes
  tasks: add task_manager::get_nodes
  tasks: drop noexcept from module::get_nodes
  replica: service: add resize_task_info static column to system.tablets
  locator: extend tablet_task_info to cover resize tasks
2025-01-17 14:24:07 +02:00
Asias He
cd96fb5a78 repair: Add repair_hosts_filter and repair_dcs_filter
They will be useful for hosts and DCs selection for the repair
scheduler. It is not implemented yet. Adding it earlier, so we do not
need to change the system tabler later.

Closes scylladb/scylladb#21985
2025-01-14 08:46:26 +02:00
Aleksandra Martyniuk
18b829add8 replica: service: add resize_task_info static column to system.tablets
Add resize_task_info static column to system.tablets. Set or delete
resize_task_info value when the resize_decision is changed.
Reflect the column content in tablet_map.
2025-01-10 10:03:07 +01:00
Aleksandra Martyniuk
b6b4b767de locator: extend tablet_task_info to cover resize tasks 2025-01-10 10:03:07 +01:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Aleksandra Martyniuk
9fad3a621a replica: service: add migration_task_info column to system.tablets
Add migration_task_info column to system.tablets. Set migration_task_info
value on migration request if the feature is enabled in the cluster.
Reflect the column content in tablet_metadata.
2024-12-11 12:07:36 +01:00
Aleksandra Martyniuk
332347490c locator: extend tablet_task_info to cover migration tasks 2024-12-11 12:07:36 +01:00
Aleksandra Martyniuk
dee6404aa4 locator: rename tablet_task_info methods 2024-12-11 12:07:36 +01:00
Raphael S. Carvalho
014e1c9a0f locator: Introduce merge_tablet_info()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 21:47:00 -03:00
Raphael S. Carvalho
a5cc6fb297 locator: Add tablet_map::sibling_tablets()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
fd6bf7b357 locator/tablets: Extend tablet_replica equality comparator to three-way
Will be needed later for sorting tablet replicas.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
3082ff992c locator/tablets: Introduce tablet_map::for_each_sibling_tablets()
Adding interface to iterate through sibling tablets for a given table,
one pair at a time.

Initially I thought of having for_each_sibling_tablet do nothing for single
tablet tables. But later I bumped into complications when wiring it into
load balancer for building candidate list, since single-tablet tables
have to be special cased.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
47c8237de0 locator/tablets: Introduce tablet_map::needs_merge()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
93990eb162 locator/tablets: Introduce resize_decision::initial_decision()
Know whether resize (e.g. split) decision was needed above initial tablet
count will be helpful for guiding the merge decision, since we don't
want a merge to happen while table is still growing, but hasn't left
the merge threshold yet.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Raphael S. Carvalho
61f694acf5 locator/tablets: Fix return type of three-way comparison operators
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 20:45:20 -03:00
Kefu Chai
4bc7e068ff locator: remove unused "#include"s
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21754
2024-12-03 11:05:35 +02:00
Asias He
b71a563030 repair: Add core tablet repair scheduler support
This adds a new tablet migration kind: repair. It allows tablet repair
scheduler to use this migration kind to schedule repair jobs.

The current repair scheduler implementation does the following:

- A tablet is picked to be repaired when the time since last repair is
  bigger than a threshold (auto repair mode) or it is requested by user
  (manual repair mode)

- The tablet repair can be scheduled along with tablet migration and
  rebuild. It runs in the tablet_migration track.

- Repair jobs are scheduled in a smart way so that at any point in time,
  there are no more than configured jobs per shard, which is similar to
  scylla manager's control.

In this patch, both the manual repair and the auto repair are not
enabled yet.
2024-11-20 09:42:41 +08:00
Asias He
82a10eca55 tablet_allocator: Introduce stream_weight for tablet_migration_streaming_info
The stream_weight for repair migration is set to 2, because it requires
more work than just moving the tablet around. The stream_weight for all
other migrations are set to 1.
2024-11-19 10:04:41 +08:00
Lakshmi Narayanan Sreethar
4130187e78 tablet_map: introduce get_token_range_after_split()
Added `get_token_range_after_split()`, which returns the token range the
given token will belong to after a tablet split. This is required to
estimate the token ranges of resultant sstables after a split.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-11 12:23:47 +05:30
Lakshmi Narayanan Sreethar
f536c7d15b tablet_map: introduce get_token_range() variant
Implement `get_token_range()` to return the token range of the specified
tablet with the given `log2_tablets` size. This will be used to deduce
which range a token will end up in if the tablet is split.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-11 12:21:05 +05:30
Lakshmi Narayanan Sreethar
f655136091 tablet_map: introduce get_last_token() variant
Implement `get_last_token()`, which returns the largest token owned
by the specified tablet with the given `log2_tablets` size. This will be
used to deduce token ranges for a tablet with any arbitrary
`tablet_count`.

Also, update the existing public `get_last_token()` to utilize the new
variant.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-11 12:18:04 +05:30
Kefu Chai
f9091066b7 treewide: replace boost::irange with std::views::iota where possible
when building scylla with the standard library from GCC-14.2, shipped by
fedora 41, we have following build failure:

```
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=x86-64-v3 -mpclmul -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -MD -MT CMakeFiles/scylla-main.dir/Debug/init.cc.o -MF CMakeFiles/scylla-main.dir/Debug/init.cc.o.d -o CMakeFiles/scylla-main.dir/Debug/init.cc.o -c /home/kefu/dev/scylladb/init.cc
In file included from /home/kefu/dev/scylladb/init.cc:12:
In file included from /home/kefu/dev/scylladb/db/config.hh:20:
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:26:
/home/kefu/dev/scylladb/locator/tablets.hh:410:30: error: unexpected type name 'size_t': expected expression
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                              ^
/home/kefu/dev/scylladb/locator/tablets.hh:410:23: error: no member named 'irange' in namespace 'boost'
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                ~~~~~~~^
/home/kefu/dev/scylladb/locator/tablets.hh:410:38: error: left operand of comma operator has no effect [-Werror,-Wunused-value]
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                                      ^
3 errors generated.
[16/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/keys.cc.o
[17/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/counters.cc.o
[18/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/partition_slice_builder.cc.o
[19/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o
FAILED: CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o
/home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. -march=x86-64-v3 -mpclmul -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -MD -MT CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o -MF CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o.d -o CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o -c /home/kefu/dev/scylladb/mutation_query.cc
In file included from /home/kefu/dev/scylladb/mutation_query.cc:12:
In file included from /home/kefu/dev/scylladb/schema/schema_registry.hh:17:
In file included from /home/kefu/dev/scylladb/replica/database.hh:11:
In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:26:
/home/kefu/dev/scylladb/locator/tablets.hh:410:30: error: unexpected type name 'size_t': expected expression
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                              ^
/home/kefu/dev/scylladb/locator/tablets.hh:410:23: error: no member named 'irange' in namespace 'boost'
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                ~~~~~~~^
/home/kefu/dev/scylladb/locator/tablets.hh:410:38: error: left operand of comma operator has no effect [-Werror,-Wunused-value]
  410 |         return boost::irange<size_t>(0, tablet_count()) | boost::adaptors::transformed([] (size_t i) {
      |                                      ^
In file included from /home/kefu/dev/scylladb/mutation_query.cc:12:
In file included from /home/kefu/dev/scylladb/schema/schema_registry.hh:17:
In file included from /home/kefu/dev/scylladb/replica/database.hh:37:
In file included from /home/kefu/dev/scylladb/db/snapshot-ctl.hh:20:
/home/kefu/dev/scylladb/tasks/task_manager.hh:403:54: error: no member named 'irange' in namespace 'boost'
  403 |         co_await coroutine::parallel_for_each(boost::irange(0u, smp::count), [&tm, id, &res, &func] (unsigned shard) -> future<> {
      |                                               ~~~~~~~^
4 errors generated.
```

so let's take the opportunity to switch from `boost::irange` to
`std::views::iota`.

in this change, we:

- switch from boost::irange to std::views::iota for better standard library compatibility
- retain boost::irange where step parameter is used, as std::views::iota doesn't support it
- this change partially modernizes our range usage while maintaining
- existing functionality

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20924
2024-10-03 10:33:33 +03:00
Botond Dénes
f5976aa87b replica/tablets: add get_tablet_metadata_change_hint() and update_tablet_metadata_change_hint()
Extract a hint of what a tablet mutation changed. The hint can be later
used to selectively reload only the changed parts from disk.
Two variants are added:
* get_tablet_metadata_change_hint() - extracts a hint from a list of
  tablet mutations
* update_tablet_metadata_change_hint() - updates an existing hint based
  on a single mutation, allowing for incremental hint extraction
2024-08-11 09:52:37 -04:00
Botond Dénes
54ea71f8a6 locator/tablets: add tablet_map::clear_tablet_transition_info() 2024-08-11 09:52:37 -04:00
Botond Dénes
0254cfc7d3 locator/tablets: make tablet_metadata cheap to copy
Keep lw_shared_ptr<tablet_map> in the tablet map and use COW semantics.
To prevent accidental changes to shared tablet_map instances, all
modifications to a tablet_map have to go through a new
`mutate_tablet_map()` method, which implements the copy-modify-swap
idiom.
2024-08-11 09:52:37 -04:00
Tomasz Grabiec
98323be296 repair: Exclude tablet migrations with tablet repair
We want to exclude repair with tablet migrations to avoid races
between repair reads and writes with replica movement. Repair is not
prepared to handle topology transitions in the middle.

One reason why it's not safe is that repair may successfully write to
a leaving replica post streaming phase and consider all replicas to be
repaired, but in fact they are not, the new replica would not be
repaired.

Other kinds of races could result in repair failures. If repair writes
to a leaving replica which was already cleaned up, such writes will
fail, causing repair to fail.

Excluding works by keeping effective_replication_map_ptr in a version
which doesn't have table's tablets in transitions. That prevents later
transitions from starting because topology coordinator's barrier will
wait for that erm before moving to a stage later than
allow_write_both_read_old, so before any requets start using the new
topology. Also, if transitions are already running, repair waits for
them to finish.

Fixes #17658.
Fixes #18561.
2024-06-05 16:11:22 +02:00
Benny Halevy
84761acc31 locator: tablet_map: add get_primary_replica_within_dc
Will be needed by repair in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-06-02 20:26:09 +03:00
Benny Halevy
c52f70f92c locator: tablet_map: get_primary_replica: return tablet_replica
This is required by repair when it will start using get_primary_replica
in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-06-02 20:26:09 +03:00
Avi Kivity
52fe351c31 Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec
This is needed to avoid severe imbalance between shards which can
happen when some table grows and is split. The inter-node balance can
be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N
then there is not even a possibility of moving tablets around to fix the imbalance.
The only way to bring the system to balance is to move tablets within the nodes.

The system is not prepared for intra-node migration currently. Request coordination
is host-based, while for intra-node migration it should be (also) shard-based.
The solution employed here is to keep the coordination between nodes as-is,
and for intra-node migration storage_proxy-level coordinator is not aware of
the migration (no pending host). The replica-side request handler will be a
second-level coordinator which routes requests to shards, similar to how
the first-level coordinator routes them to hosts.

Tablet sharder is adjusted to handle intra-migration where a tablet
can have two replicas on the same host. For reads, sharder uses the
read selector to resolve the conflict. For writes, the write selector
is used.

The old shard_of() API is kept to represent shard for reads, and new
method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.

The request handler on replica side acts as a second-level
coordinator, using sharder to determine routing to shards. A given
sharder has a scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.

perf-simple-query test results show no signs of regression:

Command: perf-simple-query -c1 -m1G --write --tablets --duration=10

Before:

> 83294.81 tps ( 59.5 allocs/op,  14.3 tasks/op,   53725 insns/op,        0 errors)
> 87756.72 tps ( 59.5 allocs/op,  14.3 tasks/op,   54049 insns/op,        0 errors)
> 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> 86211.38 tps ( 59.7 allocs/op,  14.3 tasks/op,   54219 insns/op,        0 errors)
> 86559.89 tps ( 59.6 allocs/op,  14.3 tasks/op,   54188 insns/op,        0 errors)
> 86609.39 tps ( 59.6 allocs/op,  14.3 tasks/op,   54117 insns/op,        0 errors)
> 87464.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   54039 insns/op,        0 errors)
> 86185.43 tps ( 59.6 allocs/op,  14.3 tasks/op,   54169 insns/op,        0 errors)
> 86254.71 tps ( 59.6 allocs/op,  14.3 tasks/op,   54139 insns/op,        0 errors)
> 83395.35 tps ( 60.2 allocs/op,  14.4 tasks/op,   54693 insns/op,        0 errors)
>
> median 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> median absolute deviation: 243.04
> maximum: 87756.72
> minimum: 83294.81
>

After:

> 85523.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   53872 insns/op,        0 errors)
> 89362.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54226 insns/op,        0 errors)
> 88167.55 tps ( 59.7 allocs/op,  14.3 tasks/op,   54400 insns/op,        0 errors)
> 87044.40 tps ( 59.7 allocs/op,  14.3 tasks/op,   54310 insns/op,        0 errors)
> 88344.50 tps ( 59.6 allocs/op,  14.3 tasks/op,   54289 insns/op,        0 errors)
> 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> 88725.46 tps ( 59.6 allocs/op,  14.3 tasks/op,   54230 insns/op,        0 errors)
> 88640.08 tps ( 59.6 allocs/op,  14.3 tasks/op,   54210 insns/op,        0 errors)
> 90306.31 tps ( 59.4 allocs/op,  14.3 tasks/op,   54043 insns/op,        0 errors)
> 87343.62 tps ( 59.8 allocs/op,  14.3 tasks/op,   54496 insns/op,        0 errors)
>
> median 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> median absolute deviation: 1007.41
> maximum: 90306.31
> minimum: 85523.06

Command (reads): perf-simple-query -c1 -m1G  --tablets --duration=10

Before:

> 95860.18 tps ( 63.1 allocs/op,  14.1 tasks/op,   42476 insns/op,        0 errors)
> 97537.69 tps ( 63.1 allocs/op,  14.1 tasks/op,   42454 insns/op,        0 errors)
> 97549.23 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97511.29 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97227.32 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 94031.94 tps ( 63.1 allocs/op,  14.1 tasks/op,   42441 insns/op,        0 errors)
> 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> 96401.70 tps ( 63.1 allocs/op,  14.1 tasks/op,   42473 insns/op,        0 errors)
> 96573.77 tps ( 63.1 allocs/op,  14.1 tasks/op,   42440 insns/op,        0 errors)
> 96340.54 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
>
> median 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> median absolute deviation: 571.20
> maximum: 97549.23
> minimum: 94031.94
>

After:

> 99794.67 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 101244.99 tps ( 63.1 allocs/op,  14.1 tasks/op,   42472 insns/op,        0 errors)
> 101128.37 tps ( 63.1 allocs/op,  14.1 tasks/op,   42485 insns/op,        0 errors)
> 101065.27 tps ( 63.1 allocs/op,  14.1 tasks/op,   42465 insns/op,        0 errors)
> 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> 101413.31 tps ( 63.1 allocs/op,  14.1 tasks/op,   42463 insns/op,        0 errors)
> 101464.92 tps ( 63.1 allocs/op,  14.1 tasks/op,   42466 insns/op,        0 errors)
> 101086.74 tps ( 63.1 allocs/op,  14.1 tasks/op,   42488 insns/op,        0 errors)
> 101559.09 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
> 100742.58 tps ( 63.1 allocs/op,  14.1 tasks/op,   42491 insns/op,        0 errors)
>
> median 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> median absolute deviation: 200.33
> maximum: 101559.09
> minimum: 99794.67
>

Fixes #16594

Closes scylladb/scylladb#18026

* github.com:scylladb/scylladb:
  Implement fast streaming for intra-node migration
  test: tablets_test: Test sharding during intra-node migration
  test: tablets_test: Check sharding also on the pending host
  test: py: tablets: Test writes concurrent with migration
  test: py: tablets: Test crash during intra-node migration
  api, storage_service: Introduce API to wait for topology to quiesce
  dht, replica: Remove deprecated sharder APIs
  test: Avoid using deprecated sharded API
  db: do_apply_many() avoid deprecated sharded API
  replica: mutation_dump: Avoid deprecated sharder API
  repair: Avoid deprecated sharder API
  table: Remove optimization which returns empty reader when key is not owned by the shard
  dht: is_single_shard: Avoid deprecated sharder API
  dht: split_range_to_single_shard: Work with static_sharder only
  dht: ring_position_range_sharder: Avoid deprecated sharder APIs
  dht: token: Avoid use of deprecated sharder API by switching to static_sharder
  selective_token_sharder: Avoid use of deprecated sharder API
  docs: Document tablet sharding vs tablet replica placement
  readers/multishard.cc: use shard_for_reads() instead of shard_of()
  multishard_mutation_query.cc: use shard_for_reads() instead of shard_of()
  storage_proxy: Extract common code to apply mutations on many shards according to sharder
  storage_proxy: Prepare per-partition rate-limiting for intra-node migration
  storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate()
  storage_proxy: Prepare mutate_hint() for intra-node tablet migration
  commitlog_replayer: Avoid deprecated sharder::shard_of()
  lwt: Avoid deprecated sharder::shard_of()
  compaction: Avoid deprecated sharder::shard_of()
  dht: Extract dht::static_sharder
  replica: Deprecate table::shard_of()
  locator: Deprecate effective_replication_map::shard_of()
  dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard
  tests: tablets: py: Add intra-node migration test
  tests: tablets: Test that drained nodes are not balanced internally
  tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load
  tests: tablets: Verify that disabling balancing results in no intra-node migrations
  tests: tablets: Check that nodes are internally balanced
  tests: tablets: Improve debuggability by showing which rows are missing
  tablets, storage_service: Support intra-node migration in move_tablet() API
  tablet_allocator: Generate intra-node migration plan
  tablet_allocator: Extract make_internode_plan()
  tablet_allocator: Maintain candidate list and shard tablet count for target nodes
  tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions
  tablets, streaming: Implement tablet streaming for intra-node migration
  dht, auto_refreshing_sharder: Allow overriding write selector
  multishard_writer: Handle intra-node migration
  storage_proxy: Handle intra-node tablet migration for writes
  tablets: Get rid of tablet_map::get_shard()
  tablets: Avoid tablet_map::get_shard in cleanup
  tablets: test: Use sharder instead of tablet_map::get_shard()
  tablets: tablet_sharder: Allow working with non-local host
  sharding: Prepare for intra-node-migration
  docs: Document sharder use for tablets
  tablets: Introduce tablet transition kind for intra-node migration
  tests: tablets: Fix use-after-move of skiplist in rebalance_tablets()
  sstables, gdb: Track readers in a linked list
  raft topology: Fix global token metadata barrier to not fence ahead of what is drained
2024-05-20 16:13:01 +03:00
Tomasz Grabiec
aafeacc8d9 dht, auto_refreshing_sharder: Allow overriding write selector
During streaming for intra-node migration we want to write only to the
new shard. To achieve that, allow altering write selector in
sharder::shard_for_writes() and per-instance of
auto_refreshing_sharder.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
6c6ce2d928 tablets: Get rid of tablet_map::get_shard()
Its semantics do not fit well with intra-node migration which allow
two owning shards. Replace uses with the new has_replica() API.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
82b34d34d8 tablets: Introduce tablet transition kind for intra-node migration
We need a separate transition kind for intra node migration so that we
don't have to recover this information from replica set in an
expensive way. This information is needed in the hot path - in
effective_replicaiton_map, to not return the pending tablet replica to
the coordinator. From its perspective, replica set is not
transitional.

The transition will also be used to alter the behavior of the
sharder. When not in intra-node migration, the sharder should
advertise the shard which is either in the previous or next replica
set. During intra-node migration, that's not possible as there may be
two such shards. So it will return the shard according to the current
read selector.
2024-05-16 00:28:46 +02:00
Botond Dénes
32a0867b38 locator/tablets: introduce the primary replica concept
The primary replica is an arbitrary replica of the tablet's, which is
considered to tbe the "main" owner of the tablet, similar to how
replicas own tokens in the vnode world.
To avoid aliasing the primary replicas with a certain DC or rack,
primary replicas are rotated among the tablet's replicas, selecting
tablet_id % replica_count as the primary replica.
2024-05-13 01:35:05 -04:00