Commit Graph

42652 Commits

Author SHA1 Message Date
Tomasz Grabiec
6d809c75fb test: py: tablets: Test writes concurrent with migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
ad02d85c16 test: py: tablets: Test crash during intra-node migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7956a2991e api, storage_service: Introduce API to wait for topology to quiesce 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
679baff25a dht, replica: Remove deprecated sharder APIs 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
32a191384a test: Avoid using deprecated sharded API
There is not tablet migration in unit tests, so shard_of() can be
safely replaced with shard_for_reads(). Even if it's used for writes.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
539460dd71 db: do_apply_many() avoid deprecated sharded API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
0f50504c39 replica: mutation_dump: Avoid deprecated sharder API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7bf5733fa5 repair: Avoid deprecated sharder API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7c03646f99 table: Remove optimization which returns empty reader when key is not owned by the shard
This check would lead to correctness issues with intra-node migration
because the shard may switch during read, from "read old" to "read
new". If the coordinator used "read old" for shard routing, but table
on the old shard is already using "read new" erm, such a read would
observe empty result, which is wrong.

Drop the optimization. In the scenario above, read will observe all
past writes because:

  1) writes are still using "write both"

  2) writes are switched to "write new" only after all requests which
  might be using "read old" are done

Replica-side coordinators should already route single-key requests to
the correct shard, so it's not important as an optimization.

This issue shows how assumptions about static sharding are embedded in
the current code base and how intra-node migration, by violating those
assumptions, can lead to correctness issues.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
26f2e6aa8e dht: is_single_shard: Avoid deprecated sharder API
All current uses are used in the read path.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c9e6b4dca7 dht: split_range_to_single_shard: Work with static_sharder only
In preparation for intra-node tablet migration, to avoid
using deprecated sharder APIs.

This function is used for generating sstable sharding metadata.
For tablets, it is not invoked, so we can safely work with the
static sharder. The call site already passes static_sharder only.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c380aecf64 dht: ring_position_range_sharder: Avoid deprecated sharder APIs
In preparation for tablet intra-node migration.

Existing uses are for reads, so it's safe to use shard_for_reads():
  - in multishard reader
  - in forward_service

The ring_position_range_vector_sharder is used when computing sstable
shards, which for intra-node migration should use the view for
reads. If we haven't completed streaming, sstables should be attached
to the old shard (used by reads). When in write-both-read-new stage,
streaming is complete, reads are using the new shard, and we should
attach sstables to the new shard.

When not in intra-node migration, the view for reads on the pending
node will return the pending shard even if read selector is "read old".
So if pending node restarts during streaming, we will attach to sstables
to the shard which is used by writes even though we're using the selector
for reads.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
a1aac409bf dht: token: Avoid use of deprecated sharder API by switching to static_sharder
The touched APIs are used only with static_sharder.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
dd4a086b87 selective_token_sharder: Avoid use of deprecated sharder API
I analyzed all the uses and all except the alternator/ttl.cc seem to
be interested in the result for the purpose of reading.

Alternator is not supported with tablets yet, so the use was annotated
with a relevant issue.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
eb3a22d5a8 docs: Document tablet sharding vs tablet replica placement 2024-05-16 00:28:47 +02:00
Botond Dénes
635aba435b readers/multishard.cc: use shard_for_reads() instead of shard_of()
The latter is deprecated.
2024-05-16 00:28:47 +02:00
Botond Dénes
bc779ed00c multishard_mutation_query.cc: use shard_for_reads() instead of shard_of()
The latter is deprecated.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
3b7d7088d1 storage_proxy: Extract common code to apply mutations on many shards according to sharder 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
660b3d1765 storage_proxy: Prepare per-partition rate-limiting for intra-node migration
Note: there is a potential problem with rate-limit count going out of sync
during intra-node migration between old and the new shard.

Before this patch, when coordinator accounted and admitted the
request, so the rate_limit_info passed to apply_locally() is
account_only, it was converted to std::monostate for requests to the
local replia. This makes sense because the request was already
accounted by the coordinator.

However, during intra-node migration when we do double writes to two
shards locally, that means that the new shard will not account the
write, it will have lower count than the limiter on the old
shard. This means that the new shard may accept writes which will end
up being rejected. This is not desirable, but not the end of the world
since it's temporary, and the new shard will still protect itself from
overload based on its own rate limiter.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7c3291b5ea storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate()
Cunters are not supported with tablets, so we should not reach this path.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
db2809317d storage_proxy: Prepare mutate_hint() for intra-node tablet migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
feafe0f6a7 commitlog_replayer: Avoid deprecated sharder::shard_of()
shard_for_writes() is appropriate, because we're writing.  It can
happen that the tablet was migrated away and no shard is the owner. In
that case the mutation is dropped, as it should be, because "shards"
is empty.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c9294b1642 lwt: Avoid deprecated sharder::shard_of()
Instead, use shard_for_reads(). The justification is that:

 1) In cas_shard(), we need to pick a single request coordinator.
    shard_for_reads() gives that, which is equivalent to shard_of()
    if there is no intra-node migration.

 2) In paxos handler for prepare(), the shard we execute it on is
    the shard from which we read, so shard_for_reads() is the one.

 3) Updates of paxos state are separate CQL requests, and use their
    own sharding.

 4) Handler for learn is executing updates using calls to
    storage_proxy::mutate_locally() which will use the right sharder for writes

However, the code is still not prepared for intra-node migration, and
possibly regular migration too in case of abandoned requests, because
the locking of paxos state assumes that the shard is static. That
would have to be fixed separately, e.g. by locking both shards
(shard_for_writes()) during migration, so that the set of locked
shards always intersects during migration and local serialization of
paxos state updates is achieved. I left FIXMEs for that.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
1631bab658 compaction: Avoid deprecated sharder::shard_of() 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
9da3bd84c7 dht: Extract dht::static_sharder
Before the patch, dht::sharder could be instantiated and it would
behave like a static sharder. This is not safe with regards to
extensions of the API because if a derived implementation forgets to
override some method, it would incorrectly default to the
implementation from static sharder. Better to fail the compilation in
this case, so extract static sharder logic to dht::static_sharder
class and make all methods in dht::sharder pure virtual.

This also allows us to have algorithms indicate that they only work
with static sharder by accepting the type, and have compile-time
safety for this requirement.

schema::get_sharder() is changed to return the static_sharder&.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
dbca598e99 replica: Deprecate table::shard_of() 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
a1bee16ee9 locator: Deprecate effective_replication_map::shard_of() 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
10a4903d0c dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard
Require users to specify whether we want shard for reads or for writes
by switching to appropriate non-deprecated variant.

For example, shard_of() can be replaced with shard_for_reads() or
shard_for_writes().

The next_shard/token_for_next_shard APIs have only for-reads variant,
and the act of switching will be a testimony to the fact that the code
is valid for intra-node migration.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
b3cdf9a379 tests: tablets: py: Add intra-node migration test 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
d26cd97633 tests: tablets: Test that drained nodes are not balanced internally
It would be a waste of effort to do so, since we migrate tablets away
anyway.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
04f0088679 tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c76ba52c70 tests: tablets: Verify that disabling balancing results in no intra-node migrations 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
0addca88b9 tests: tablets: Check that nodes are internally balanced
Existing tests are augmented with a check which verifies that
all nodes are internally balanced.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
0e2617336a tests: tablets: Improve debuggability by showing which rows are missing 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
329342bfb2 tablets, storage_service: Support intra-node migration in move_tablet() API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
db9d3f0128 tablet_allocator: Generate intra-node migration plan
Intra-node migrations are scheduled for each node independently with
the aim to equalize per-shard tablet count on each node.

This is needed to avoid severe imbalance between shards which can
happen when some table grows and is split. The inter-node balance can
be equal, so inter-node migration cannot fix the imbalance. Also, if
RF=N then there is not even a possibility of moving tablets around to
fix the imbalance.  The only way to bring the system to balance is to
move tablets within the nodes.

After scheduling inter-node migrations, the algorithm schedules
intra-node migrations. This means that across-node migrations can
proceed in parallel with intra-node migrations if there is free
capacity to carry them out, but across-node migrations have higher
priority.

Fixes #16594
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
793af3d6e1 tablet_allocator: Extract make_internode_plan()
Currently the load balancer is only generting an inter-node plan, and
the algorithm is embedded in make_plan(). The method will become even
harder to follow once we add more kinds of plan generating steps,
e.g. inter-node plan. Extract the inter-node plan to make it easier to
add other plans and see the grand flow.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
f95a0f0182 tablet_allocator: Maintain candidate list and shard tablet count for
target nodes

The node_load datastructure was not updated to reflect migration
decisions on the target node. This is not needed for inter-node
migration because target nodes are not considered as sources. But we
want it to reflect migration decisions so that later inter-node
migration sees an accurate picture with earlier migrations reflected
in node_load.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
c86f659421 tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions
Will be needed by member methods which generate migration plans.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
fdcaaea91a tablets, streaming: Implement tablet streaming for intra-node migration 2024-05-16 00:28:46 +02:00
Tomasz Grabiec
aafeacc8d9 dht, auto_refreshing_sharder: Allow overriding write selector
During streaming for intra-node migration we want to write only to the
new shard. To achieve that, allow altering write selector in
sharder::shard_for_writes() and per-instance of
auto_refreshing_sharder.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
dfed4efcc5 multishard_writer: Handle intra-node migration
This writer is used by streaming, on tablet migration and
load-and-stream.

The caller of distribute_reader_and_consume_on_shards(), which provides
a sharder, is supposed to ensure that effective_replication_map is kept
alive around it, in order for topology coordinator to wait for any writes
which may be in flight to reach their shards before tablet replica starts
another migration. This is already the case:

  1) repair and load-and-stream keep the erm around writing.

  2) tablet migration uses autorefreshing_sharder, so it does not, but
     it keeps the topology_guard around the operation in the consumer,
     which serves the same purpose.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
4df818db98 storage_proxy: Handle intra-node tablet migration for writes
When sharder says that the write should go to multiple shards,
we need to consider the write as applied only if it was applied
to all those shards.

This can happen during intra-node tablet migration. During such migration,
the request coordinator on storage_proxy side is coordinating to hosts
as if no migration was in progress. The replica-side coordinator coordinates
to shards based on sharder response.

One way to think about it is that
effective_replication_map::get_natural_endpoints()/get_pending_endpoints()
tells how to coordinate between nodes, and sharder tells how to
coordinate between shards. Both work with some snapshot of tablet
metadata, which should be kept alive around the operation. Sharder is
associated with its own effective_replication_map, which marks the
topology version as used and allows barriers to synchronize with
replica-side operations.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
6c6ce2d928 tablets: Get rid of tablet_map::get_shard()
Its semantics do not fit well with intra-node migration which allow
two owning shards. Replace uses with the new has_replica() API.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
d000ad0325 tablets: Avoid tablet_map::get_shard in cleanup
In preparation for intra-node migration for which get_shard() is not
prepared.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
daaceda963 tablets: test: Use sharder instead of tablet_map::get_shard()
tablet_map::get_shard() will go away as it is not prepared for
intra-node migration.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
d47dfceb34 tablets: tablet_sharder: Allow working with non-local host
Will be used in tests.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
6946ad2a45 sharding: Prepare for intra-node-migration
Tablet sharder is adjusted to handle intra-migration where a tablet
can have two replicas on the same host. For reads, sharder uses the
read selector to resolve the conflict. For writes, the write selector
is used.

The old shard_of() API is kept to represent shard for reads, and new
method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.

The request handler on replica side acts as a second-level
coordinator, using sharder to determine routing to shards. A given
sharder has a scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
b5bb46357b docs: Document sharder use for tablets 2024-05-16 00:28:46 +02:00
Tomasz Grabiec
82b34d34d8 tablets: Introduce tablet transition kind for intra-node migration
We need a separate transition kind for intra node migration so that we
don't have to recover this information from replica set in an
expensive way. This information is needed in the hot path - in
effective_replicaiton_map, to not return the pending tablet replica to
the coordinator. From its perspective, replica set is not
transitional.

The transition will also be used to alter the behavior of the
sharder. When not in intra-node migration, the sharder should
advertise the shard which is either in the previous or next replica
set. During intra-node migration, that's not possible as there may be
two such shards. So it will return the shard according to the current
read selector.
2024-05-16 00:28:46 +02:00