Commit Graph

73 Commits

Author SHA1 Message Date
Tomasz Grabiec
7479167af2 tablets: Filter-out left nodes in get_natural_endpoints()
The API already promises this, the comment on effective_replication_map says:
"Excludes replicas which are in the left state".

Tablet replicas on the replaced node are rebuilt after the node
already left. We may no longer have the IP mapping for the left node
so we should not include that node in the replica set. Otherwise,
storage_proxy may try to use the empty IP and fail:

  storage_proxy - No mapping for :: in the passed effective replication map

It's fine to not include it, because storage proxy uses keyspace RF
and not replica list size to determine quorum. The node is not coming
up, so noone should need to contact it.

Users which need replica list stability should use the host_id-based API.

Fixes #18843

(cherry picked from commit 0d596a425c)
2024-06-11 12:18:17 +02:00
Benny Halevy
bdf3e71f62 locator: tablet_map: add get_primary_replica_within_dc
Will be needed by repair in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 84761acc31)
2024-06-03 19:50:40 +00:00
Benny Halevy
ec30bdc483 locator: tablet_map: get_primary_replica: do not copy tablet info
Currently, the function needlessly copies the tablet_info
(all tablet replicas in particular) to a local variable.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2de79c39dc)
2024-06-03 19:50:40 +00:00
Benny Halevy
21f87c9cfa locator: tablet_map: get_primary_replica: return tablet_replica
This is required by repair when it will start using get_primary_replica
in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c52f70f92c)
2024-06-03 19:50:39 +00:00
Avi Kivity
52fe351c31 Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec
This is needed to avoid severe imbalance between shards which can
happen when some table grows and is split. The inter-node balance can
be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N
then there is not even a possibility of moving tablets around to fix the imbalance.
The only way to bring the system to balance is to move tablets within the nodes.

The system is not prepared for intra-node migration currently. Request coordination
is host-based, while for intra-node migration it should be (also) shard-based.
The solution employed here is to keep the coordination between nodes as-is,
and for intra-node migration storage_proxy-level coordinator is not aware of
the migration (no pending host). The replica-side request handler will be a
second-level coordinator which routes requests to shards, similar to how
the first-level coordinator routes them to hosts.

Tablet sharder is adjusted to handle intra-migration where a tablet
can have two replicas on the same host. For reads, sharder uses the
read selector to resolve the conflict. For writes, the write selector
is used.

The old shard_of() API is kept to represent shard for reads, and new
method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.

The request handler on replica side acts as a second-level
coordinator, using sharder to determine routing to shards. A given
sharder has a scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.

perf-simple-query test results show no signs of regression:

Command: perf-simple-query -c1 -m1G --write --tablets --duration=10

Before:

> 83294.81 tps ( 59.5 allocs/op,  14.3 tasks/op,   53725 insns/op,        0 errors)
> 87756.72 tps ( 59.5 allocs/op,  14.3 tasks/op,   54049 insns/op,        0 errors)
> 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> 86211.38 tps ( 59.7 allocs/op,  14.3 tasks/op,   54219 insns/op,        0 errors)
> 86559.89 tps ( 59.6 allocs/op,  14.3 tasks/op,   54188 insns/op,        0 errors)
> 86609.39 tps ( 59.6 allocs/op,  14.3 tasks/op,   54117 insns/op,        0 errors)
> 87464.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   54039 insns/op,        0 errors)
> 86185.43 tps ( 59.6 allocs/op,  14.3 tasks/op,   54169 insns/op,        0 errors)
> 86254.71 tps ( 59.6 allocs/op,  14.3 tasks/op,   54139 insns/op,        0 errors)
> 83395.35 tps ( 60.2 allocs/op,  14.4 tasks/op,   54693 insns/op,        0 errors)
>
> median 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> median absolute deviation: 243.04
> maximum: 87756.72
> minimum: 83294.81
>

After:

> 85523.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   53872 insns/op,        0 errors)
> 89362.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54226 insns/op,        0 errors)
> 88167.55 tps ( 59.7 allocs/op,  14.3 tasks/op,   54400 insns/op,        0 errors)
> 87044.40 tps ( 59.7 allocs/op,  14.3 tasks/op,   54310 insns/op,        0 errors)
> 88344.50 tps ( 59.6 allocs/op,  14.3 tasks/op,   54289 insns/op,        0 errors)
> 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> 88725.46 tps ( 59.6 allocs/op,  14.3 tasks/op,   54230 insns/op,        0 errors)
> 88640.08 tps ( 59.6 allocs/op,  14.3 tasks/op,   54210 insns/op,        0 errors)
> 90306.31 tps ( 59.4 allocs/op,  14.3 tasks/op,   54043 insns/op,        0 errors)
> 87343.62 tps ( 59.8 allocs/op,  14.3 tasks/op,   54496 insns/op,        0 errors)
>
> median 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> median absolute deviation: 1007.41
> maximum: 90306.31
> minimum: 85523.06

Command (reads): perf-simple-query -c1 -m1G  --tablets --duration=10

Before:

> 95860.18 tps ( 63.1 allocs/op,  14.1 tasks/op,   42476 insns/op,        0 errors)
> 97537.69 tps ( 63.1 allocs/op,  14.1 tasks/op,   42454 insns/op,        0 errors)
> 97549.23 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97511.29 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97227.32 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 94031.94 tps ( 63.1 allocs/op,  14.1 tasks/op,   42441 insns/op,        0 errors)
> 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> 96401.70 tps ( 63.1 allocs/op,  14.1 tasks/op,   42473 insns/op,        0 errors)
> 96573.77 tps ( 63.1 allocs/op,  14.1 tasks/op,   42440 insns/op,        0 errors)
> 96340.54 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
>
> median 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> median absolute deviation: 571.20
> maximum: 97549.23
> minimum: 94031.94
>

After:

> 99794.67 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 101244.99 tps ( 63.1 allocs/op,  14.1 tasks/op,   42472 insns/op,        0 errors)
> 101128.37 tps ( 63.1 allocs/op,  14.1 tasks/op,   42485 insns/op,        0 errors)
> 101065.27 tps ( 63.1 allocs/op,  14.1 tasks/op,   42465 insns/op,        0 errors)
> 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> 101413.31 tps ( 63.1 allocs/op,  14.1 tasks/op,   42463 insns/op,        0 errors)
> 101464.92 tps ( 63.1 allocs/op,  14.1 tasks/op,   42466 insns/op,        0 errors)
> 101086.74 tps ( 63.1 allocs/op,  14.1 tasks/op,   42488 insns/op,        0 errors)
> 101559.09 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
> 100742.58 tps ( 63.1 allocs/op,  14.1 tasks/op,   42491 insns/op,        0 errors)
>
> median 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> median absolute deviation: 200.33
> maximum: 101559.09
> minimum: 99794.67
>

Fixes #16594

Closes scylladb/scylladb#18026

* github.com:scylladb/scylladb:
  Implement fast streaming for intra-node migration
  test: tablets_test: Test sharding during intra-node migration
  test: tablets_test: Check sharding also on the pending host
  test: py: tablets: Test writes concurrent with migration
  test: py: tablets: Test crash during intra-node migration
  api, storage_service: Introduce API to wait for topology to quiesce
  dht, replica: Remove deprecated sharder APIs
  test: Avoid using deprecated sharded API
  db: do_apply_many() avoid deprecated sharded API
  replica: mutation_dump: Avoid deprecated sharder API
  repair: Avoid deprecated sharder API
  table: Remove optimization which returns empty reader when key is not owned by the shard
  dht: is_single_shard: Avoid deprecated sharder API
  dht: split_range_to_single_shard: Work with static_sharder only
  dht: ring_position_range_sharder: Avoid deprecated sharder APIs
  dht: token: Avoid use of deprecated sharder API by switching to static_sharder
  selective_token_sharder: Avoid use of deprecated sharder API
  docs: Document tablet sharding vs tablet replica placement
  readers/multishard.cc: use shard_for_reads() instead of shard_of()
  multishard_mutation_query.cc: use shard_for_reads() instead of shard_of()
  storage_proxy: Extract common code to apply mutations on many shards according to sharder
  storage_proxy: Prepare per-partition rate-limiting for intra-node migration
  storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate()
  storage_proxy: Prepare mutate_hint() for intra-node tablet migration
  commitlog_replayer: Avoid deprecated sharder::shard_of()
  lwt: Avoid deprecated sharder::shard_of()
  compaction: Avoid deprecated sharder::shard_of()
  dht: Extract dht::static_sharder
  replica: Deprecate table::shard_of()
  locator: Deprecate effective_replication_map::shard_of()
  dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard
  tests: tablets: py: Add intra-node migration test
  tests: tablets: Test that drained nodes are not balanced internally
  tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load
  tests: tablets: Verify that disabling balancing results in no intra-node migrations
  tests: tablets: Check that nodes are internally balanced
  tests: tablets: Improve debuggability by showing which rows are missing
  tablets, storage_service: Support intra-node migration in move_tablet() API
  tablet_allocator: Generate intra-node migration plan
  tablet_allocator: Extract make_internode_plan()
  tablet_allocator: Maintain candidate list and shard tablet count for target nodes
  tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions
  tablets, streaming: Implement tablet streaming for intra-node migration
  dht, auto_refreshing_sharder: Allow overriding write selector
  multishard_writer: Handle intra-node migration
  storage_proxy: Handle intra-node tablet migration for writes
  tablets: Get rid of tablet_map::get_shard()
  tablets: Avoid tablet_map::get_shard in cleanup
  tablets: test: Use sharder instead of tablet_map::get_shard()
  tablets: tablet_sharder: Allow working with non-local host
  sharding: Prepare for intra-node-migration
  docs: Document sharder use for tablets
  tablets: Introduce tablet transition kind for intra-node migration
  tests: tablets: Fix use-after-move of skiplist in rebalance_tablets()
  sstables, gdb: Track readers in a linked list
  raft topology: Fix global token metadata barrier to not fence ahead of what is drained
2024-05-20 16:13:01 +03:00
Tomasz Grabiec
6c6ce2d928 tablets: Get rid of tablet_map::get_shard()
Its semantics do not fit well with intra-node migration which allow
two owning shards. Replace uses with the new has_replica() API.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
82b34d34d8 tablets: Introduce tablet transition kind for intra-node migration
We need a separate transition kind for intra node migration so that we
don't have to recover this information from replica set in an
expensive way. This information is needed in the hot path - in
effective_replicaiton_map, to not return the pending tablet replica to
the coordinator. From its perspective, replica set is not
transitional.

The transition will also be used to alter the behavior of the
sharder. When not in intra-node migration, the sharder should
advertise the shard which is either in the previous or next replica
set. During intra-node migration, that's not possible as there may be
two such shards. So it will return the shard according to the current
read selector.
2024-05-16 00:28:46 +02:00
Botond Dénes
32a0867b38 locator/tablets: introduce the primary replica concept
The primary replica is an arbitrary replica of the tablet's, which is
considered to tbe the "main" owner of the tablet, similar to how
replicas own tokens in the vnode world.
To avoid aliasing the primary replicas with a certain DC or rack,
primary replicas are rotated among the tablet's replicas, selecting
tablet_id % replica_count as the primary replica.
2024-05-13 01:35:05 -04:00
Kefu Chai
a439ebcfce treewide: include fmt/ranges.h and/or fmt/std.h
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:56:16 +08:00
Pavel Emelyanov
725b2863d2 tablet: Make pending replica optional
Just like leaving replica could be optional when adding replica to
tablet, the pending replica can be optional too if we're removing a
replica from tablet

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-15 16:31:07 +03:00
Pavel Emelyanov
b0cba57e29 tablet: Make leaving replica optional
When getting leaving replica from from tablet info and transition info,
the getter code assumes that this replica always exists. It's not going
to be the case soon, so make the return value be optional.

There are four places that mess with leaving replica:

- stream tablet handler: this place checks that the leaving replica is
  _not_ current host. If leaving replica is missing, the check should
  pass

- cleanup tablet handler: this place checks that the leaving replica
  _is_ current host. If leaving replica is missing, the check should
  fail as well

- topology coordinator: it gets leaving replica to call cleanup on. If
  leaving replica is missing, the cleanup call is short-circuited to
  succeed immediately

- load-stats calculator: it checks if the leaving replica is self. This
  check is not patched as it's automatically satisfied by std::optional
  comparison operator overload for wrapped type

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-04 09:03:36 +03:00
Raphael S. Carvalho
12714a4123 locator: Avoid tablet map lookup on every write for getting replicas
We can cache tablet map in erm, to avoid looking it up on every write for
getting write replicas. We do that in tablet_sharder, but not in tablet
erm. Tablet map is immutable in the context of a given erm, so the
address of the map is stable during erm lifetime.

This caught my attention when looking at perf diff output
(comparing tablet and vnode modes).

It also helps when erm is called again on write completion for
checking locality, used for forwarding info to the driver if needed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18158
2024-04-03 10:28:04 +02:00
Pavel Emelyanov
04370dc8a4 tablets: Introduce substract_sets()
There are several places in code that calculate replica sets associated
with specific tablet transision. Having a helper to substract two sets
improves code readability.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18033
2024-03-26 23:33:06 +02:00
Avi Kivity
4ddf82e58b treewide: don't #include "gms/feature_service.hh" from other headers
feature_service.hh is a high-level header that integrates much
of the system functionality, so including it in lower-level headers
causes unnecessary rebuilds. Specifically, when retiring features.

Fix by removing feature_service.hh from headers, and supply forward
declarations and includes in .cc where needed.

Closes scylladb/scylladb#18005
2024-03-26 15:31:18 +02:00
Raphael S. Carvalho
6bdb456fad sstables_loader: Fix loader when write selector is previous during tablet migration
The loader is writing to pending replica even when write selector is set
to previous. If migration is reverted, then the writes won't be rolled
back as it assumes pending replicas weren't written to yet. That can
cause data resurrection if tablet is later migrated back into the same
replica.

NOTE: write selector is handled correctly when set to next, because
get_natural_endpoints() will return the next replica set, and none
of the replicas will be considered leaving. And of course, selector
set to both is also handled correctly.

Fixes #17892.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#17902
2024-03-24 01:20:50 +01:00
Tomasz Grabiec
1c71f44e63 tablets, raft topology: Rebuild tablets after replacing node is normal
This fixes a problem with replacing a node with tablets when
RF=N. Currently, this will fail because new tablet replica allocation
will not be able to find a viable destination, as the replacing node
is not considered a candidate. It cannot be a candidate because
replace rolls back on failure and we cannot roll back after tablets
were migrated.

The solution taken here is to not drain tablet replicas from replaced
node during topology request but leave it to happen later after the
replaced node is left and replacing node is normal.

The replacing node waits for this draining to be complete on boot
before the node is considered booted.

Fixes #17025
2024-03-15 13:20:08 +01:00
Tomasz Grabiec
888dc41d66 effective_replication_map: Introduce host_id-based get_replicas() 2024-03-15 11:05:29 +01:00
Patryk Wrobel
75aadeb32f locator/effective_replication_map: make 'get_ranges(inet_address ep)' virtual
Before this patch, the mentioned function was a specific
member of vnode_effective_replication_strategy class.
To allow its usage also when tablets are enabled it was
shifted to the base class - effective_replication_strategy
and made pure virtual to force the derived classes to
implement it.

It is used by 'storage_service::get_ranges_for_endpoint()'
that is used in calculation of effective ownership. Such
calculation needs to be performed also when tablets are
enabled.

Refs: scylladb#17342
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-03-11 09:50:20 +01:00
Patryk Wrobel
3fff6bd407 locator/tablets: add tablet_map::get_sorted_tokens()
This change introudces a new member function that
returns a vector of sorted tokens where each pair of adjacent
elements depicts a range of tokens that belong to tablet.

It will be used to produce the equivalent of sorted_tokens() of
vnodes when trying to use dht::describe_ownership() for tablets.

Refs: scylladb#17342
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
2024-03-11 09:50:20 +01:00
Kefu Chai
64e14d21db locator/tablets: add fmt::formatter for tablet_*
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for

* tablet_id
* tablet_replica
* tablet_metadata
* tablet_map

their operator<<:s are dropped

Refs scylladb/scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17504
2024-03-07 09:00:49 +03:00
Botond Dénes
7bdd0c2cae locator: introduce tablet_range_spliter
Given a list of partition-ranges, yields the intersection of this
range-list, with that of that tablet-ranges, for tablets located on the
given host.
This will be used in multishard_mutation_query.cc, to obtain the ranges
to read from the local node: given the read ranges, obtain the ranges
belonging to tablets who have replicas on the local node.
2024-02-21 02:08:48 -05:00
Pavel Emelyanov
72f3b1d5fe topology.tablets_migration: Add cleanup_target transition stage
The new stage will be used to revert migration that fails at some
stages. The goal is to cleanup the pending replica, which may already
received some writes by doing the cleanup RPC to the pending replica,
then jumping to "revert_migration" stage introduced earlier.

If pending node is dead, the call to cleanup RPC is skipped.

Coordinators use old replicas.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-20 08:59:06 +03:00
Pavel Emelyanov
ced5bf56eb topology.tablets_migration: Add revert_migration transition stage
It's like end_migration, but old replicas intact just removing the
transition (including new replicas).

Coordinators use old replicas.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-20 08:53:36 +03:00
Botond Dénes
3f2d7e8b25 tree: remove unnecessary yields around for_each_tablet()
Commit 904bafd069 consolidated the two
existing for_each_tablet() overloads, to the one which has a future<>
returning callback. It also added yields to the bodies of said
callbacks. This is unnecessary, the loop in for_each_tablet() already
has a yield per tablet, which should be enough to prevent stalls.

This patch is a follow-up to #17118

Closes scylladb/scylladb#17284
2024-02-12 17:10:25 +01:00
Asias He
904bafd069 tablets: Convert to use the new version of for_each_tablet
It is more gently than the old one.
2024-02-05 18:45:40 +08:00
Asias He
fab0d33d08 tablets: Add for_each_tablet_gently
In this version, the callback returns a future<>, so it can yield itself
to avoid stalls in func itself.
2024-02-05 13:42:08 +08:00
Avi Kivity
c8397f0287 Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho
The motivation for tablet resizing is that we want to keep the average tablet size reasonable, such that load rebalancing can remain efficient. Too large tablet makes migration inefficient, therefore slowing down the balancer.

If the avg size grows beyond the upper bound (split threshold), then balancer decides to split. Split spans all tablets of a table, due to power-of-two constraint.

Likewise, if the avg size decreases below the lower bound (merge threshold), then merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays foundation for it to be impĺemented later on.

A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say table is being split and avg size drops below the target size (which is 50% of split threshold and 100% of merge one). That means after split, the avg size would drop below the merge threshold, causing a merge after split, which is wasteful, so it's better to just cancel the split.

Tablet metadata gains 2 new fields for managing this:
resize_type: resize decision type, can be either of "merge", "split", or "none".
resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator).

A new RPC was implemented to pull stats from each table replica, such that load balancer can calculate the avg tablet size and know the "split status", for a given table. Avg size is aggregated carefully while taking RF of each DC into account (which might differ).
When a table is done splitting its storage, it loads (mirror) the resize_seq_number from tablet metadata into its local state (in another words, my split status is ready). If a table is split ready, coordinator will see that table's seq number is the same as the one in tablet metadata. Helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emited later on). Also, it's aggregated carefully, by taking the minimum among all replicas, so coordinator will only update topology when all replicas are ready.

When load balancer emits split decision, replicas will listen to need to split with a "split monitor" that is awakened once a table has replication metadata updated and detects the need for split (i.e. resize_type field is "split").
The split monitor will start splitting of compaction groups (using mechanism introduced here: 081f30d149) for the table. And once splitting work is completed, the table updates its local state as having completed split.

When coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which is about updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can update appropriately their set of compaction groups (that were previously split in the preparation step).

Fixes #16536.

Closes scylladb/scylladb#16580

* github.com:scylladb/scylladb:
  test/topology_experimental_raft: Add tablet split test
  replica: Bypass reshape on boot with tablets temporarily
  replica: Fix table::compaction_group_for_sstable() for tablet streaming
  test/topology_experimental_raft: Disable load balancer in test fencing
  replica: Remap compaction groups when tablet split is finalized
  service: Split tablet map when split request is finalized
  replica: Update table split status if completed split compaction work
  storage_service: Implement split monitor
  topology_cordinator: Generate updates for resize decisions made by balancer
  load_balancer: Introduce metrics for resize decisions
  db: Make target tablet size a live-updateable config option
  load_balancer: Implement resize decisions
  service: Wire table_resize_plan into migration_plan
  service: Introduce table_resize_plan
  tablet_mutation_builder: Add set_resize_decision()
  topology_coordinator: Wire load stats into load balancer
  storage_service: Allow tablet split and migration to happen concurrently
  topology_coordinator: Periodically retrieve table_load_stats
  locator: Introduce topology::get_datacenter_nodes()
  storage_service: Implement table_load_stats RPC
  replica: Expose table_load_stats in table
  replica: Introduce storage_group::live_disk_space_used()
  locator: Introduce table_load_stats
  tablets: Add resize decision metadata to tablet metadata
  locator: Introduce resize_decision
2024-01-31 13:59:56 +02:00
Tomasz Grabiec
36f218c83b Merge 'main: refuse startup when tablet resharding is required' from Botond Dénes
We do not support tablet resharding yet. All tablet-related code assumes that the (host_id, shard) tablet replica is always valid. Violating this leads to undefined behaviour: errors in the tablet load balancer and potential crashes.
Avoid this by refusing to start if the need to resharding is detected. Be as lenient as possible: check all tablets with a replica on this node, and only refuse startup if at least one tablet has an invalid replica shard.

Startup will fail as:

    ERROR 2024-01-26 07:03:06,931 [shard 0:main] init - Startup failed: std::runtime_error (Detected a tablet with invalid replica shard, reducing shard count with tablet-enabled tables is not yet supported. Replace the node instead.)

Refs: #16739
Fixes: #16843

Closes scylladb/scylladb#17008

* github.com:scylladb/scylladb:
  test/topolgy_experimental_raft: test_tablets.py: add test for resharding
  test/pylib: manager[_client]: add update_cmdline()
  main: refuse startup when tablet resharding is required
  locator: tablets: add check_tablet_replica_shards()
2024-01-29 23:39:41 +01:00
Botond Dénes
95b6aeebae locator: tablets: add check_tablet_replica_shards()
Checks that all tablets with a replica on the this node, have a valid
replica shard (< smp::count).
Will be used to check whether the node can start-up with the current
shard-count.
2024-01-29 07:04:33 -05:00
Pavel Emelyanov
3abdb3c7ee tablets: Remove tablet_aware_replication_strategy::parse_initial_tablets
It's now unused, string with initial tablets its parsed elsewhere

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#17010
2024-01-29 10:03:38 +02:00
Raphael S. Carvalho
e0de3dd844 topology_cordinator: Generate updates for resize decisions made by balancer
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:58:40 -03:00
Raphael S. Carvalho
6c74fc4b82 locator: Introduce table_load_stats
This is per table stats that will be aggregated from all nodes, by
the coordinator, in order to help load balancer make resize
decisions.

size_in_bytes is the total aggregated table size, so coordinator
becomes responsible for taking into account RF of each DC and
also tablet count, for computing an accurate average size.

split_ready_seq_number is the minimum sequence number among all
replicas. If coordinator sees all replicas store the seq number
of current split, then it knows all replicas are ready for the
next stage in the split process.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:36:08 -03:00
Raphael S. Carvalho
0d5ba1ee4b tablets: Add resize decision metadata to tablet metadata
The new metadata describes the ongoing resize operation (can be either
of merge, split or none) that spans tablets of a given table.
That's managed by group0, so down nodes will be able to see the
decision when they come back up and see the changes to the
metadata.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:36:06 -03:00
Raphael S. Carvalho
57582ac9c4 locator: Introduce resize_decision
resize_decision is the metadata the says whether tablets of a table
needs split, merge, or none. That will be recorded in tablet metadata,
and therefore stored in group0.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:31:12 -03:00
Tomasz Grabiec
e5dcf03b88 tablets: Add support for removenode and replace handling
New tablet replicas are allocated synchronously with node
operations. They are safely rebuilt from all existing replicas.
The list of ignored nodes passed to node operations is respected.

Tablet scheduler is responsible for scheduling tablet transition which
changes the replicas set. The infrastructure for handling decommission
in tablet scheduler is reused for this.

Scheduling is done incrementally, respecting per-shard load
limits. Rebuilding transitions are recognized by load calculation to
affect all tablet replicas.

New kind of tablet transition is introduced called "rebuild" which
adds new tablet replica and rebuilds it from existing replicas. Other
than that, the transition goes through the same stages as regular
migration to ensure safe synchronization with request coordinators.

In this PR we simply stream from all tablet replicas. Later we should
switch to calling repair to avoid sending excessive amounts of data.

Fixes #16690.
2024-01-23 01:19:42 +01:00
Tomasz Grabiec
649ca0e46c tablets: Introduce get_migration_streaming_info() which works on migration request
Will be used by tablet load balancer to compute impact on load of
planned migrations. Currently, the logic is hard coded in the load
balancer and may get out of sync with the logic we have in
get_migration_streaming_info() for already running tablet transitions.

The logic will become more complex for rebuild transition, so use
shared code to compute it.
2024-01-23 01:12:57 +01:00
Tomasz Grabiec
6dc56fd80b tablets: Move migration_to_transition_info() to tablets.hh 2024-01-23 01:12:57 +01:00
Tomasz Grabiec
1df256221c tablets: Extract get_new_replicas() which works on migraiton request
Now we have a single place which translates tablet migration request to new
replicas.

Will be reused in other places.
2024-01-23 01:12:57 +01:00
Tomasz Grabiec
4a06ffb43c tablets: Store transition kind per tablet
Will be used to distinguish regular migration from rebuild, repair and
RF change.
2024-01-23 01:12:57 +01:00
Pavel Emelyanov
941f6d8fca cql: Move initial_tablets from REPLICATION to TABLETS in DDL
This patch changes the syntax of enabling tablets from

  CREATE KEYSPACE ... WITH REPLICATION = { ..., 'initial_tablets': <int> }

to be

  CREATE KEYSPACE ... WITH TABLETS = { 'initial': <int> }

and updates all tests accordingly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-01-15 13:04:48 +03:00
Petr Gusev
1928dc73a8 erm: has_pending_ranges: switch to host_id
In the next patches we are going to change erm data structures
(replication_map and ring_mapping) from IP to host_id. Having
locator::host_id instead of IP in has_pending_ranges arguments
makes this transition easier.
2024-01-12 12:23:19 +04:00
Kefu Chai
2c394e3f6f tablets: remove unused #includes
the removed #include headers are not used, so let's drop their
`#include`s.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16619
2024-01-03 15:30:40 +01:00
Pavel Emelyanov
c43501d973 locator,schema: Move initial tablets from r.s. options to params
The option is kepd in DDL, but is _not_ stored in
system_schema.keyspaces. Instead, it's removed from the provided options
and kept in scylla_keyspaces table in its own column. All the places
that had optional initial_tablets disengaged now set this value up the
way the find appropriate.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-12-25 16:07:10 +03:00
Pavel Emelyanov
45f4276de6 locator: Pass abstract_replication_strategy& into validate_tablet_options()
It will need to check if the r.s. in question had been marked as
per-table one in next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-12-25 15:56:49 +03:00
Pavel Emelyanov
bf824d79d9 locator: Carry r.s. params into process_tablet_options()
The latter method is the one that will need extended params in next
patches. It's called from network_topology_strategy() constructor which
already has params at hand.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-12-25 15:56:02 +03:00
Raphael S. Carvalho
bcbba9a5e3 locator: Introduce tablet_map::get_tablet_id_and_range_side(token)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-12-17 11:26:32 -03:00
Petr Gusev
7b55ccbd8e token_metadata: drop the template
Replace token_metadata2 ->token_metadata,
make token_metadata back non-template.

No behavior changes, just compilation fixes.
2023-12-12 23:19:54 +04:00
Petr Gusev
11cc21d0a9 erm: switch to the new token_metadata
In this commit we replace token_metadata with token_metadata2
in the erm interface and field types. To accommodate the change
some of strategy-related methods are also updated.

All the boost and topology tests pass with this change.
2023-12-12 23:19:53 +04:00
Petr Gusev
d9283bd025 tablets: switch to token_metadata2
locator_topology_test, network_topology_strategy_test and
tablets_test are fully switched to the host_id-based token_metadata,
meaning they no longer populate the old token_metadata.

All the boost and topology tests pass with this change.
2023-12-12 23:19:53 +04:00
Petr Gusev
5a1418fdba token_metadata: get_endpoint_for_host_id -> get_endpoint_for_host_id_if_known
This commit fixes an inconsistency in method names:
get_host_id and get_host_id_if_known are
(internal_error, returns null), but there was only
one method for the opposite conversion - get_endpoint_for_host_id,
and it returns null. In this commit we change it to on_internal_error
if it can't find the argument and add another method
get_endpoint_for_host_id_if_known which returns null in this case.

We can't use get_endpoint_for_host_id/get_host_id
in host_id_or_endpoint::resolve since it's called
from storage_service::parse_node_list
-> token_metadata::parse_host_id_and_endpoint,
and exceptions are caught and handled in
`storage_service::parse_node_list`.
2023-12-11 12:51:34 +04:00