Commit Graph

765 Commits

Author SHA1 Message Date
Botond Dénes
42a76ca568 Merge 'Improve printing of nodes and backtraces in topology' from Pavel Emelyanov
There's a bunch of debug- and trace-level logging of locator::node-s that also include current_backtrace(). Printing node is done via debug_format() helper that generates and returns an sstring to print. Backtrace printing is not very lightweight on its own because of backtrace collecting. Not to slow things down in info log level, which is default, all such prints are wrapped with explicit if-s about log-level being enabled or not.

This PR removes those level checks by introducing lazy_backtrace() helper and by providing a formatter for nodes that also results in lazy node format string calculation.

Closes scylladb/scylladb#17235

* github.com:scylladb/scylladb:
  topology: Restore indentation after previous patch
  topology: Drop if_enabled checks for logging
  topology: Add lazy_backtrace() helper
  topology: Add printer wrapper for node* and formatter for it
  topology: Expand formatter<locator::node>
2024-02-19 09:32:53 +02:00
Patryk Wrobel
a3fb44cbca Rename keyspace::get_effective_replication_map()
This commit renames keyspace::get_effective_replication_map()
to keyspace::get_vnode_effective_replication_map(). This change
is required to ease the analysis of the usage of this function.

When tablets are enabled, then this function shall not be used.
Instead of per-keyspace, per-table replication map should be used.
The rename was performed to distinguish between those two calls.
The next step will be an audit of usages of
keyspace::get_vnode_effective_replication_map().

Refs: scylladb#16626
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>

Closes scylladb/scylladb#17314
2024-02-13 20:22:02 +02:00
Botond Dénes
3f2d7e8b25 tree: remove unnecessary yields around for_each_tablet()
Commit 904bafd069 consolidated the two
existing for_each_tablet() overloads, to the one which has a future<>
returning callback. It also added yields to the bodies of said
callbacks. This is unnecessary, the loop in for_each_tablet() already
has a yield per tablet, which should be enough to prevent stalls.

This patch is a follow-up to #17118

Closes scylladb/scylladb#17284
2024-02-12 17:10:25 +01:00
Pavel Emelyanov
309d34a147 topology: Restore indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-09 13:49:15 +03:00
Pavel Emelyanov
f7a13b9bb0 topology: Drop if_enabled checks for logging
Now all the logged arguments are lazily evaluated (node* format string
and backtrace) so the preliminary log-level checks are not needed.

indentation is deliberately left broken

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-09 13:49:15 +03:00
Pavel Emelyanov
c1ea6c8acf topology: Add lazy_backtrace() helper
This helper returns lazy_eval-ed current_backtrace(), so it will be
generated and printed only if logger is really going to do it with its
current log-level.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-09 13:49:15 +03:00
Pavel Emelyanov
da53854b66 topology: Add printer wrapper for node* and formatter for it
Currently to print node information there's a debug_format(node*) helper
function that returns back an sstring object. Here's the formatter
that's more flexible and convenient, and a node_printer wrapper, since
formatters cannot format non-void pointers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-09 13:49:15 +03:00
Pavel Emelyanov
aa0293f411 topology: Expand formatter<locator::node>
Equip it with :v specifier that turns verbose mode on and prints much
more data about the node. Main user will appear in the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-09 13:49:15 +03:00
Botond Dénes
35da9551fb Merge 'storage_service: Add describe_ring support for tablet table' from Asias He
The table query param is added to get the describe_ring result for a
given table.

Both vnode table and tablet table can use this table param, so it is
easier for users to user.

If the table param is not provided by user and the keyspace contains
tablet table, the request will be rejected.

E.g.,
curl "http://127.0.0.1:10000/storage_service/describe_ring/system_auth?table=roles"
curl "http://127.0.0.1:10000/storage_service/describe_ring/ks1?table=standard1"

Refs #16509

Closes scylladb/scylladb#17118

* github.com:scylladb/scylladb:
  tablets: Convert to use the new version of for_each_tablet
  storage_service: Add describe_ring support for tablet table
  storage_service: Mark host2ip as const
  tablets: Add for_each_tablet_gently
2024-02-07 10:41:36 +02:00
Tomasz Grabiec
032c1a3d04 Merge 'tablets: Make sure topology has enough endpoints for RF' from Pavel Emelyanov
When creating a keyspace, scylla allows setting RF value smaller than there are nodes in the DC. With vnodes, when new nodes are bootstrapped, new tokens are inserted thus catching up with RF. With tablets, it's not the case as replica set remains unchanged.

With tablets it's good chance not to mimic the vnodes behavior and require as many nodes to be up and running as the requested RF is. This patch implementes this in a lazy manned -- when creating a keyspace RF can be any, but when a new table is created the topology should meet RF requirements. If not met, user can bootstrap new nodes or ALTER KEYSPACE.

closes: #16529

Closes scylladb/scylladb#17079

* github.com:scylladb/scylladb:
  tablets: Make sure topology has enough endpoints for RF
  cql-pytest: Disable tablets when RF > nodes-in-DC
  test: Remove test that configures RF larger than the number of nodes
  keyspace_metadata: Include tablets property in DESCRIBE
2024-02-06 22:38:11 +01:00
Botond Dénes
a3d4131918 Merge 'Sanitize replication factor parsing by strategies' from Pavel Emelyanov
RF values appear as strings and strategies classes convert them to integers. This PR removes some duplication of efforts in converting code.

Closes scylladb/scylladb#17132

* github.com:scylladb/scylladb:
  network_topology_strategy: Do not walk list of datacenters twice
  replication_strategy: Do not convert string RF into int twise
  abstract_replication_strategy: Make validate_replication_factor return value
2024-02-06 13:26:31 +02:00
Asias He
904bafd069 tablets: Convert to use the new version of for_each_tablet
It is more gently than the old one.
2024-02-05 18:45:40 +08:00
Pavel Emelyanov
45dbe38658 tablets: Make sure topology has enough endpoints for RF
When creating a keyspace, scylla allows setting RF value smaller than
there are nodes in the DC. With vnodes, when new nodes are bootstrapped,
new tokens are inserted thus catching up with RF. With tablets, it's not
the case as replica set remains unchanged.

With tablets it's good chance not to mimic the vnodes behavior and
require as many nodes to be up and running as the requested RF is. This
patch implementes this in a lazy manned -- when creating a keyspace RF
can be any, but when a new table is created the topology should meet RF
requirements. If not met, user can bootstrap new nodes or ALTER KEYSPACE.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-05 12:50:04 +03:00
Asias He
fab0d33d08 tablets: Add for_each_tablet_gently
In this version, the callback returns a future<>, so it can yield itself
to avoid stalls in func itself.
2024-02-05 13:42:08 +08:00
Avi Kivity
784c2f8ad2 Merge 'treewide: replace calls to future::get0() by calls to future::get()' from Kefu Chai
get0() dates back from the days where Seastar futures carried tuples, and get0() was a way to get the first (and usually only) element. Now it's a distraction, and Seastar is likely to deprecate and remove it.

Replace with seastar::future::get(), which does the same thing.

Closes scylladb/scylladb#17130

* github.com:scylladb/scylladb:
  treewide: replace seastar::future::get0() with seastar::future::get()
  sstable: capture return value of get0() using auto
  utils: result_loop: define result_type with decayed type

[avi: add another one that snuck in while this was cooking]
2024-02-04 15:23:33 +02:00
Avi Kivity
7cb1c10fed treewide: replace seastar::future::get0() with seastar::future::get()
get0() dates back from the days where Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.

Replace with seastar::future::get(), which does the same thing.
2024-02-02 22:12:57 +08:00
Pavel Emelyanov
afda0f6ddf network_topology_strategy: Do not walk list of datacenters twice
Construct of that class walks the provided options to get per-DC
replication factors. It does it twice -- first to populate the dc:rf
map, second to calculate the sum of provided RF values. The latter loop
can be optimized away.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-02 14:39:24 +03:00
Pavel Emelyanov
06f9e7367c replication_strategy: Do not convert string RF into int twise
There are two replication strategy classes that validate string RF and
then convert it into integer. Since validation helper returns the parsed
value, it can be just used avoiding the 2nd conversion.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-02 14:38:17 +03:00
Pavel Emelyanov
a8cd3bc636 abstract_replication_strategy: Make validate_replication_factor return value
The helper in question checks if string RF is indeed an integer. Make
this helper return the "checked" integer value, because it does this
conversion. And rename it to parse_... to reflect what it now does. Next
patches will make use of this change.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-02-02 14:36:47 +03:00
Kefu Chai
b45af994c2 locator/utils: remove stale comment
this comment has already served its purpose when rewriting
C* in C++. since we've re-implemented it, there is no need to keep it
around.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17120
2024-02-02 11:07:35 +02:00
Avi Kivity
c8397f0287 Merge 'Implement tablet splitting' from Raphael "Raph" Carvalho
The motivation for tablet resizing is that we want to keep the average tablet size reasonable, such that load rebalancing can remain efficient. Too large tablet makes migration inefficient, therefore slowing down the balancer.

If the avg size grows beyond the upper bound (split threshold), then balancer decides to split. Split spans all tablets of a table, due to power-of-two constraint.

Likewise, if the avg size decreases below the lower bound (merge threshold), then merge takes place in order to grow the avg size. Merge is not implemented yet, although this series lays foundation for it to be impĺemented later on.

A resize decision can be revoked if the avg size changes and the decision is no longer needed. For example, let's say table is being split and avg size drops below the target size (which is 50% of split threshold and 100% of merge one). That means after split, the avg size would drop below the merge threshold, causing a merge after split, which is wasteful, so it's better to just cancel the split.

Tablet metadata gains 2 new fields for managing this:
resize_type: resize decision type, can be either of "merge", "split", or "none".
resize_seq_number: a sequence number that works as the global identifier of the decision (monotonically increasing, increased by 1 on every new decision emitted by the coordinator).

A new RPC was implemented to pull stats from each table replica, such that load balancer can calculate the avg tablet size and know the "split status", for a given table. Avg size is aggregated carefully while taking RF of each DC into account (which might differ).
When a table is done splitting its storage, it loads (mirror) the resize_seq_number from tablet metadata into its local state (in another words, my split status is ready). If a table is split ready, coordinator will see that table's seq number is the same as the one in tablet metadata. Helps to distinguish stale decisions from the latest one (in case decisions are revoked and re-emited later on). Also, it's aggregated carefully, by taking the minimum among all replicas, so coordinator will only update topology when all replicas are ready.

When load balancer emits split decision, replicas will listen to need to split with a "split monitor" that is awakened once a table has replication metadata updated and detects the need for split (i.e. resize_type field is "split").
The split monitor will start splitting of compaction groups (using mechanism introduced here: 081f30d149) for the table. And once splitting work is completed, the table updates its local state as having completed split.

When coordinator pulls the split status of all replicas for a table via RPC, the balancer can see whether that table is ready for "finalizing" the decision, which is about updating tablet metadata to split each tablet into two. Once table replicas have their replication metadata updated with the new tablet count, they can update appropriately their set of compaction groups (that were previously split in the preparation step).

Fixes #16536.

Closes scylladb/scylladb#16580

* github.com:scylladb/scylladb:
  test/topology_experimental_raft: Add tablet split test
  replica: Bypass reshape on boot with tablets temporarily
  replica: Fix table::compaction_group_for_sstable() for tablet streaming
  test/topology_experimental_raft: Disable load balancer in test fencing
  replica: Remap compaction groups when tablet split is finalized
  service: Split tablet map when split request is finalized
  replica: Update table split status if completed split compaction work
  storage_service: Implement split monitor
  topology_cordinator: Generate updates for resize decisions made by balancer
  load_balancer: Introduce metrics for resize decisions
  db: Make target tablet size a live-updateable config option
  load_balancer: Implement resize decisions
  service: Wire table_resize_plan into migration_plan
  service: Introduce table_resize_plan
  tablet_mutation_builder: Add set_resize_decision()
  topology_coordinator: Wire load stats into load balancer
  storage_service: Allow tablet split and migration to happen concurrently
  topology_coordinator: Periodically retrieve table_load_stats
  locator: Introduce topology::get_datacenter_nodes()
  storage_service: Implement table_load_stats RPC
  replica: Expose table_load_stats in table
  replica: Introduce storage_group::live_disk_space_used()
  locator: Introduce table_load_stats
  tablets: Add resize decision metadata to tablet metadata
  locator: Introduce resize_decision
2024-01-31 13:59:56 +02:00
Tomasz Grabiec
36f218c83b Merge 'main: refuse startup when tablet resharding is required' from Botond Dénes
We do not support tablet resharding yet. All tablet-related code assumes that the (host_id, shard) tablet replica is always valid. Violating this leads to undefined behaviour: errors in the tablet load balancer and potential crashes.
Avoid this by refusing to start if the need to resharding is detected. Be as lenient as possible: check all tablets with a replica on this node, and only refuse startup if at least one tablet has an invalid replica shard.

Startup will fail as:

    ERROR 2024-01-26 07:03:06,931 [shard 0:main] init - Startup failed: std::runtime_error (Detected a tablet with invalid replica shard, reducing shard count with tablet-enabled tables is not yet supported. Replace the node instead.)

Refs: #16739
Fixes: #16843

Closes scylladb/scylladb#17008

* github.com:scylladb/scylladb:
  test/topolgy_experimental_raft: test_tablets.py: add test for resharding
  test/pylib: manager[_client]: add update_cmdline()
  main: refuse startup when tablet resharding is required
  locator: tablets: add check_tablet_replica_shards()
2024-01-29 23:39:41 +01:00
Botond Dénes
95b6aeebae locator: tablets: add check_tablet_replica_shards()
Checks that all tablets with a replica on the this node, have a valid
replica shard (< smp::count).
Will be used to check whether the node can start-up with the current
shard-count.
2024-01-29 07:04:33 -05:00
Pavel Emelyanov
3abdb3c7ee tablets: Remove tablet_aware_replication_strategy::parse_initial_tablets
It's now unused, string with initial tablets its parsed elsewhere

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#17010
2024-01-29 10:03:38 +02:00
Raphael S. Carvalho
e0de3dd844 topology_cordinator: Generate updates for resize decisions made by balancer
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:58:40 -03:00
Raphael S. Carvalho
2209c7440c topology_coordinator: Periodically retrieve table_load_stats
This implements the fiber that aggregates per-table stats that will
be feeded into load balancer to make resize decisions (split,
merge, or revoke ongoing ones).

Initially, the stats will be refreshed every 60s, but the idea
is that eventually we make the frequency table based, where
the size of each table is taken into account.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:36:08 -03:00
Raphael S. Carvalho
489a527e20 locator: Introduce topology::get_datacenter_nodes()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:36:08 -03:00
Raphael S. Carvalho
6c74fc4b82 locator: Introduce table_load_stats
This is per table stats that will be aggregated from all nodes, by
the coordinator, in order to help load balancer make resize
decisions.

size_in_bytes is the total aggregated table size, so coordinator
becomes responsible for taking into account RF of each DC and
also tablet count, for computing an accurate average size.

split_ready_seq_number is the minimum sequence number among all
replicas. If coordinator sees all replicas store the seq number
of current split, then it knows all replicas are ready for the
next stage in the split process.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:36:08 -03:00
Raphael S. Carvalho
0d5ba1ee4b tablets: Add resize decision metadata to tablet metadata
The new metadata describes the ongoing resize operation (can be either
of merge, split or none) that spans tablets of a given table.
That's managed by group0, so down nodes will be able to see the
decision when they come back up and see the changes to the
metadata.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:36:06 -03:00
Raphael S. Carvalho
57582ac9c4 locator: Introduce resize_decision
resize_decision is the metadata the says whether tablets of a table
needs split, merge, or none. That will be recorded in tablet metadata,
and therefore stored in group0.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-01-25 18:31:12 -03:00
Avi Kivity
69d597075a Merge 'tablets: Add support for removenode and replace handling' from Tomasz Grabiec
New tablet replicas are allocated and rebuilt synchronously with node
operations. They are safely rebuilt from all existing replicas.
The list of ignored nodes passed to node operations is respected.

Tablet scheduler is responsible for scheduling tablet rebuilding transition which
changes the replicas set. The infrastructure for handling decommission
in tablet scheduler is reused for this.

Scheduling is done incrementally, respecting per-shard load
limits. Rebuilding transitions are recognized by load calculation to
affect all tablet replicas.

New kind of tablet transition is introduced called "rebuild" which
adds new tablet replica and rebuilds it from existing replicas. Other
than that, the transition goes through the same stages as regular
migration to ensure safe synchronization with request coordinators.

In this PR we simply stream from all tablet replicas. Later we should
switch to calling repair to avoid sending excessive amounts of data.

Fixes https://github.com/scylladb/scylladb/issues/16690.

Closes scylladb/scylladb#16894

* github.com:scylladb/scylladb:
  tests: tablets: Add tests for removenode and replace
  tablets: Add support for removenode and replace handling
  topology_coordinator: tablets: Do not fail in a tight loop
  topology_coordinator: tablets: Avoid warnings about ignored failured future
  storage_service, topology: Track excluded state in locator::topology
  raft topology: Introduce param-less topology::get_excluded_nodes()
  raft topology: Move get_excluded_nodes() to topology
  tablets: load_balancer: Generalize load tracking
  tablets: Introduce get_migration_streaming_info() which works on migration request
  tablets: Move migration_to_transition_info() to tablets.hh
  tablets: Extract get_new_replicas() which works on migraiton request
  tablets: Move tablet_migration_info to tablets.hh
  tablets: Store transition kind per tablet
2024-01-25 14:49:43 +02:00
Botond Dénes
26d814d8be Merge 'Configure initial tablets count scaling' from Pavel Emelyanov
There are currently two options how to "request" the number of initial tables for a table

1. specify it explicitly when creating a keyspace
2. let scylla calculate it on its own

Both are not very nice. The former doesn't take cluster layout into consideration. The latter does, but starts with one tablet per shard, which can be too low if the amount of data grows rapidly.

Here's a (maybe temporary) proposal to facilitate at least perf tests -- the --tablets-initial-scale-factor option that enhances the option number two above by multiplying the calculated number of tablets by the configured number. This is what we currently do to run perf tests by patching scylla, with the option it going to be more convenient.

Closes scylladb/scylladb#16919

* github.com:scylladb/scylladb:
  config: Add --tablets-initial-scale-factor
  tablet_allocator: Add initial tablets scale to config
  tablet_allocator: Add config
2024-01-23 13:25:12 +02:00
Kefu Chai
76b9e4f4f4 locator: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16914
2024-01-23 09:12:23 +02:00
Tomasz Grabiec
e5dcf03b88 tablets: Add support for removenode and replace handling
New tablet replicas are allocated synchronously with node
operations. They are safely rebuilt from all existing replicas.
The list of ignored nodes passed to node operations is respected.

Tablet scheduler is responsible for scheduling tablet transition which
changes the replicas set. The infrastructure for handling decommission
in tablet scheduler is reused for this.

Scheduling is done incrementally, respecting per-shard load
limits. Rebuilding transitions are recognized by load calculation to
affect all tablet replicas.

New kind of tablet transition is introduced called "rebuild" which
adds new tablet replica and rebuilds it from existing replicas. Other
than that, the transition goes through the same stages as regular
migration to ensure safe synchronization with request coordinators.

In this PR we simply stream from all tablet replicas. Later we should
switch to calling repair to avoid sending excessive amounts of data.

Fixes #16690.
2024-01-23 01:19:42 +01:00
Tomasz Grabiec
5fccee3a13 storage_service, topology: Track excluded state in locator::topology
Will be used by tablet load balancer to avoid excluded nodes in
scheduling.
2024-01-23 01:12:58 +01:00
Tomasz Grabiec
649ca0e46c tablets: Introduce get_migration_streaming_info() which works on migration request
Will be used by tablet load balancer to compute impact on load of
planned migrations. Currently, the logic is hard coded in the load
balancer and may get out of sync with the logic we have in
get_migration_streaming_info() for already running tablet transitions.

The logic will become more complex for rebuild transition, so use
shared code to compute it.
2024-01-23 01:12:57 +01:00
Tomasz Grabiec
6dc56fd80b tablets: Move migration_to_transition_info() to tablets.hh 2024-01-23 01:12:57 +01:00
Tomasz Grabiec
1df256221c tablets: Extract get_new_replicas() which works on migraiton request
Now we have a single place which translates tablet migration request to new
replicas.

Will be reused in other places.
2024-01-23 01:12:57 +01:00
Tomasz Grabiec
ae382196f1 tablets: Move tablet_migration_info to tablets.hh
Will add methods which operate on it to tablets.hh where they belong.
2024-01-23 01:12:57 +01:00
Tomasz Grabiec
4a06ffb43c tablets: Store transition kind per tablet
Will be used to distinguish regular migration from rebuild, repair and
RF change.
2024-01-23 01:12:57 +01:00
Pavel Emelyanov
eb3b237e05 tablet_allocator: Add initial tablets scale to config
When allocating tablets for table for the frist time their initial count
is calculated so that each shard in a cluster gets one tablet. It may
happen that more than one initial tablet per shard is better, e.g. perf
tests typically rely on that.

It's possible to specify the initial tablets count when creating a
keyspace, this number doesn't take the cluster topology into
consideration and may also be not very nice.

As a temporary solution (e.g. for perf tests) we may add a configurable
that scales the initial number of calculated tablets by some factor

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-01-22 19:14:45 +03:00
Kefu Chai
ce076b5ae3 gossiping_property_file_snitch: drop unused using namespace
we don't use any symbol in this namespace, in this function, so drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16893
2024-01-21 16:48:37 +02:00
Pavel Emelyanov
8595d64d01 locator: Handle replication factor of 0 for initial_tablets calculations
When calculating per-DC tablets the formula is shards_in_dc / rf_in_dc,
but the denominator in it can be configured to be literally zero and the
division doesn't work.

Fix by assuming zero tablets for dcs with zero rf

fixes: #16844

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#16861
2024-01-18 19:42:08 +02:00
Kefu Chai
0ae81446ef ./: not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16766
2024-01-17 16:30:14 +02:00
Pavel Emelyanov
941f6d8fca cql: Move initial_tablets from REPLICATION to TABLETS in DDL
This patch changes the syntax of enabling tablets from

  CREATE KEYSPACE ... WITH REPLICATION = { ..., 'initial_tablets': <int> }

to be

  CREATE KEYSPACE ... WITH TABLETS = { 'initial': <int> }

and updates all tests accordingly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-01-15 13:04:48 +03:00
Pavel Emelyanov
4c4a9679d8 network_topology_strategy: Estimate initial_tablets if 0 is set
If user configured zero initial tablets (spoiler: or this value was set
automagically when enabling tablets begind the scenes) we still need
some value to start with and this patch calculates one.

The math is based on topology and RF so that all shards are covered:

initial_tablets = max(nr_shards_in(dc) / RF_in(dc) for dc in datacenters)

The estimation is done when a table is created, not when the keyspace is
created. For that, the keyspace is configured with zero initial tabled,
and table-creation time zero is converted into auto-estimated value.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-01-15 13:04:48 +03:00
Gleb Natapov
23a27ccc24 vnode_effective_replication_map: add get_all_pending_nodes() function
Add a function that returns all nodes that have vnode been moved to them
during a topology change operation. Needed to know which nodes need to
do cleanup in case of failed topology change operation.
2024-01-14 14:37:16 +02:00
Gleb Natapov
a8f11852da vnode_effective_replication_map: pre calculate dirty endpoints during topology change
Some topology change operations causes some nodes loose ranges. This
information is needed to know which nodes need to do cleanup after
topology operation completes. Pre calculate it during erm creation.
2024-01-14 14:11:19 +02:00
Petr Gusev
41c15814e6 erm: for_each_natural_endpoint_until: use is_vnode == true
This is an optimisation - for_each_natural_endpoint_until is
called only for vnode tokens, we don't need to run the
binary search for it in tm.first_token.

Also the function is made private since it's only used
in erm itself.
2024-01-12 12:23:22 +04:00
Petr Gusev
07f2ec63c7 erm: switch the internal data structures to host_id-s
Before this patch the host_id -> IP mapping was done
in calculate_effective_replication_map. This function
is called from mutate_token_metadata, which means we
have to have an IP for each host_id in topology_state_load,
otherwise we get an error. We are going to remove
the IP waiting loop from topology_state_load, so
we need to get rid of IPs resolution from
calculate_effective_replication_map.

In this patch we move the host_id -> IP resolution to
the data plane. When a write or read request is sent
the target endpoints are requested from erm through
get_natural_endpoints_without_node_being_replaced,
get_pending_endpoints and get_endpoints_for_reading
methods and this is where the IP resolution
will now occur.
2024-01-12 12:23:22 +04:00