Commit Graph

989 Commits

Author SHA1 Message Date
Tomasz Grabiec
1e407ab4d2 tablets: Equalize per-table balance when allocating tablets for a new table
Fixes the following scenario:

1. Scale out adds new nodes to each rack
2. Table is created - all tablets are allocated to new nodes because they have low load
3. Rebalancing moves tablets from old nodes to new nodes - table balance for the new table is not fixed

We're wrong to try to equalize global load when allocating tablets,
and we should equalize per-table load instead, and let background load
balancing fix it in a fair way. It will add to the allocated storage
imbalance, but:

1. The table is initially empty, so doesn't impact actual storage imbalance.
2. It's more important to avoid overloading CPU on the nodes - imbalance hurts this aspect immediately.
3. If the table was created before imbalance was formed, we would end up in the same situation in the problematic scenario after the patch.
4. It's the job of the load balancing to keep up with storage growing, and if it's not, scale out should kick in.

Before we have CPU-aware tablet allocation, and thus can prove we have
CPU capacity on the small nodes, we should respect per-table balance
as this is the way in which we achieve full CPU utilization.

Fixes #23631
2025-04-17 16:01:23 +02:00
Tomasz Grabiec
2597a7e980 load_sketch: Tolerate missing tablet_map when selecting for a given table
To simplify future usage in
network_topology_strategy::add_tablets_in_dc() which invokes
populate() for a given table, which may be both new and preexisitng.
2025-04-17 16:01:16 +02:00
Benny Halevy
e1fe82ed33 utils: phased_barrier, pluggable: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:47:00 +03:00
Tomasz Grabiec
b5211cca85 Merge 'tablets: rebuild: use repair for tablet rebuild' from Aleksandra Martyniuk
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.

In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream tablet's data from only one replica
and then repair the tablet.

Fixes: https://github.com/scylladb/scylladb/issues/17174.

Needs backport to 2025.1 to prevent out of space during streaming

Closes scylladb/scylladb#23187

* github.com:scylladb/scylladb:
  test: add test for rebuild with repair
  locator: service: move to rebuild_v2 transition if cluster is upgraded
  locator: service: add transition to rebuild_repair stage for rebuild_v2
  locator: service: add rebuild_repair tablet transition stage
  locator: add maybe_get_primary_replica
  locator: service: add rebuild_v2 tablet transition kind
  gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
2025-04-09 21:35:37 +02:00
Benny Halevy
dfdca2d84e locator: topology: drop unused calculate_datacenters
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#23647
2025-04-08 19:04:56 +03:00
Aleksandra Martyniuk
acd32b24d3 locator: service: move to rebuild_v2 transition if cluster is upgraded
If cluster is upgraded to version containing rebuild_v2 transition
kind, move to this transition kind instead of rebuild.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
4a847df55c locator: service: add rebuild_repair tablet transition stage
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

rebuild_repair is a stage that will be used to perform the repair
phase. It executes the tablet repair on tablet_info::replicas.
A primary replica out of migration_streraming_info::read_from is
the repair master. If the repair succeeds, we move to streaming
tablet transition stage, and to cleanup_target - if it fails.

The repair bypasses the tablet repair scheduler and it does not update
the repair_time.

A transition to the rebuild_repair stage will be added in the following
patches.
2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
5d6041617b locator: add maybe_get_primary_replica
Add maybe_get_primary_replica to choose a primary replica out of
custom replica set.
2025-04-08 10:42:01 +02:00
Aleksandra Martyniuk
ed7b8bb787 locator: service: add rebuild_v2 tablet transition kind
Currently, in the streaming stage of rebuild tablet transition,
we stream tablet data from all replicas.
This patch series splits the streaming stage into two phases:
- repair phase, where we repair the tablet;
- streaming phase, where we stream tablet data from one replica.

To differentiate the two streaming methods, a new tablet transition
kind - rebuild_v2 - is added.

The transtions and stages for rebuild_v2 transition kind will be
added in the following patches.
2025-04-08 10:42:01 +02:00
Botond Dénes
1198213000 Merge 'tablets: Make tablet allocation equalize per-shard load ' from Tomasz Grabiec
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogeneous clusters. Nodes with fewer shards will end up with
overloaded shards.

Refs #23378

Closes scylladb/scylladb#23478

* github.com:scylladb/scylladb:
  tablets: Make tablet allocation equalize per-shard load
  tablets: load_balancer: Fix reporting of total load per node
2025-04-03 16:32:53 +03:00
Gleb Natapov
4609bbbbb2 treewide: move gossiper to index nodes by host id
This patch changes gossiper to index nodes by host ids instead of ips.
The main data structure that changes is _endpoint_state_map, but this
results in a lot of changes since everything that uses the map directly
or indirectly has to be changed. The big victim of this outside of the
gossiper itself is topology over gossiper code. It works on IPs and
assumes the gossiper does the same and both need to be changed together.
Changes to other subsystems are much smaller since they already mostly
work on host ids anyway.
2025-03-31 16:50:50 +03:00
Tomasz Grabiec
6bff596fce tablets: Make tablet allocation equalize per-shard load
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogenous clusters. Nodes with fewer shards will end up with
overloaded shards.

Refs #23378
2025-03-31 14:34:30 +02:00
Dawid Mędrek
32879ec0d5 db/config: Introduce RF-rack-valid keyspaces
We introduce a new term in the glossary: RF-rack-valid keyspace.

We also highlight in our user documentation that all keyspaces
must remain RF-rack-valid throughout their lifetime, and failing
to guarantee that may result in data inconsistencies or other
issues. We base that information on our experience with materialized
views in keyspaces using tablets, even though they remain
an experimental feature.

Along with the new term, we introduce a new configuration option
called `rf_rack_valid_keyspaces`, which, when enabled, will enforce
preserving all keyspaces RF-rack-valid. That functionality will be
implemented in upcoming commits. For now, we materialize the
restriction in form of a named requirement: a function verifying
that the passed keyspace is RF-rack-valid.

The option is disabled by default. That will change once we adjust
the existing tests to the new semantics. Once that is done, the option
will first be enabled by default, and then it will be removed.

Fixes scylladb/scylladb#20356
2025-03-19 14:46:35 +01:00
Tomasz Grabiec
c4714180cc tablets: Make load balancing capacity-aware
Before this patch the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
 1) tablets have the same size
 2) shards have the same capacity

That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogenous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.

After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assummes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.

One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.
2025-03-06 13:35:38 +01:00
Tomasz Grabiec
7e7f1e6f91 storage_service, tablets: Collect per-node capacity in load_stats
New RPC is introduced becuase load_stats was marked "final" in the IDL.

Will be needed by capacity-aware load balancing.
2025-03-06 12:17:32 +01:00
Aleksandra Martyniuk
2b538d228c locator: add round-robin selection of filtered replicas 2025-02-28 12:32:55 +01:00
Aleksandra Martyniuk
fe4e99d7b3 locator: add tablet_task_info::selected_by_filters
Extract dcs and hosts filters check to a method.
2025-02-28 12:02:21 +01:00
Avi Kivity
d99df7af6c Merge 'Respect per-shard tablet goal and 10x default per-shard tablet count' from Tomasz Grabiec
This series achieves two things:

1) changes default number of tablet replicas per shard to be 10 in order to reduce load imbalance between shards

    This will result in new tables having at least 10 tablet replicas per
    shard by default.

    We want this to reduce tablet load imbalance due to differences in
    tablet count per shard, where some shards have 1 tablet and some
    shards have 2 tablets. With higher tablet count per shard, this
    difference-by-one is less relevant.

    Fixes https://github.com/scylladb/scylladb/issues/21967

2) introduces a global goal for tablet replica count per shard and adds logic to tablet scheduler to respect it by controlling per-table tablet count

    The per-shard goal is enforced by controlling average per-shard tablet replica
    count in a given DC, which is controlled by per-table tablet
    count. This is effective in respecting the limit on individual shards
    as long as tablet replicas are distributed evenly between shards.
    There is no attempt to move tablets around in order to enforce limits
    on individual shards in case of imbalance between shards.

    If the average per-shard tablet count exceeds the limit, all tables
    which contribute to it (have replicas in the DC) are scaled down
    by the same factor. Due to rounding up to the nearest power of 2,
    we may overshoot the per-shard goal by at most a factor of 2.

    The scaling is applied after computing desired tablet count due to
    all other factors: per-table tablet count hints, defaults, average tablet size.

    If different DCs want different scale factors of a given table, the
    lowest scale factor is chosen for a given table.

    When creating a new table, its tablet count is determined by tablet
    scheduler using the scheduler logic, as if the table was already created.
    So any scaling due to per-shard tablet count goal is reflected immediately
    when creating a table. It may however still take some time for the system
    to shrink existing tables. We don't reject requests to create new tables.

    Fixes #21458

Closes scylladb/scylladb#22522

* github.com:scylladb/scylladb:
  config, tablets: Allow tablets_initial_scale_factor to be a fraction
  test: tablets_test: Test scaling when creating lots of tables
  test: tablets_test: Test tablet count changes on per-table option and config changes
  test: tablets_test: Add support for auto-split mode
  test: cql_test_env: Expose db config
  config: Make tablets_initial_scale_factor live-updateable
  tablets: load_balancer: Pick initial_scale_factor from config
  tablets, load_balancer: Fix and improve logging of resize decisions
  tablets, load_balancer: Log reason for target tablet count
  tablets: load_balancer: Move hints processing to tablet scheduler
  tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
  tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table
  tablets: load_balancer: Determine desired count from size separately from count from options
  tablets: load_balancer: Determine resize decision from target tablet count
  tablets: load_balancer: Allow splits even if table stats not available
  tablets: load_balancer: Extract make_sizing_plan()
  tablets: Add formatter for resize_decision::way_type
  tablets: load_balancer: Simplify resize_urgency_cmp()
  tablets: load_balancer: Keep config items as instance members
  locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology()
  tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard
  tablets: Set default initial tablet count scale to 10
  tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology()
  tablets: load_balancer: Extract get_schema_and_rs()
  tablets: load_balancer: Drop test_mode
2025-02-24 17:59:26 +02:00
Raphael S. Carvalho
4d8a333a7f storage_service: Don't retry split when table is dropped
The split monitor wasn't handling the scenario where the table being
split is dropped. The monitor would be unable to find the tablet map
of such a table, and the error would be treated as a retryable one
causing the monitor to fall into an endless retry loop, with sleeps
in between. And that would block further splits, since the monitor
would be busy with the retries. The fix is about detecting table
was dropped and skipping to the next candidate, if any.

Fixes #21859.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#22933
2025-02-20 10:13:55 +01:00
Tomasz Grabiec
029505b179 tablets: load_balancer: Move hints processing to tablet scheduler
Hints have common meaning for all strategies, so the logic
belongs more to make_sizing_plan().

As a side effect, we can reuse shard capacity computation across
tables, which reduces computational complexity from O(tables*nodes) to
O(tables * DCs + nodes)
2025-02-19 16:29:07 +01:00
Tomasz Grabiec
f1bda8d4c1 tablets: load_balancer: Scale down tablet count to respect per-shard tablet count goal
The limit is enforced by controlling average per-shard tablet replica
count in a given DC, which is controlled by per-table tablet
count. This is effective in respecting the limit on individual shards
as long as tablet replicas are distributed evenly between shards.

There is no attempt to move tablets around in order to enforce limits
on individual shards in case of imbalance between shards.

If the average per-shard tablet count exceeds the limit, all tables
which contribute to it (have replicas in the DC) are scaled down
by the same factor. Due to rounding up to the nearest power of 2,
we may overshoot the per-shard goal by at most a factor of 2.

If different DCs want different scale factors of a given table, the
lowest scale factor is chosen for a given table.

The limit is configurable. It's a global per-cluster config which
controls how many tablet replicas per shard in total we consider to be
still ok. It controls tablet allocator behavior, when choosing initial
tablet count. Even though it's a per-node config, we don't support
different limits per node. All nodes must have the same value of that
config. It's similar in that regard to other scheduler config items
like tablets_initial_scale_factor and target_tablet_size_in_bytes.
2025-02-19 16:29:07 +01:00
Tomasz Grabiec
94b5165ac7 tablets: Use scheduler's make_sizing_plan() to decide about tablet count of a new table
This makes decisions made by the scheduler consistent with decisions
made on table creation, with regard to tablet count.

We want to avoid over-allocation of tablets when table is created,
which would then be reduced by the scheduler's scaling logic. Not just
to avoid wasteful migrations post table creation, but to respect the
per-shard goal. To respect the per-shard goal, the algorithm will no
longer be as simple as looking at hints, and we want to share the
algorithm between the scheduler and initial tablet allocator. So
invoke the scheduler to get the tablet count when table is created.
2025-02-19 14:40:07 +01:00
Tomasz Grabiec
d3ffea77e6 tablets: load_balancer: Extract make_sizing_plan()
Resize plan making will now happen in two stages:

  1) Determine desired tablet counts per table (sizing plan)
  2) Schedule resize decisions

We need intermediate step in the resize plan making, which gives us
the planned tablet counts, so that we can plug this part of the
algorithm into initial tablet allocation on table construction.

We want decisisons made by the scheduler to be consistent with
decisions made on table creation. We want to avoid over-allocation of
tablets when table is created, which would then be reduced by the
scheduler. Not just to avoid wasteful migrations post table creation,
but to respect the per-shard goal. To respect the per-shard goal, the
algorithm will no longer be as simple as looking at hints, and we want
to share the algorithm between the scheduler and initial tablet
allocator.

Also, this sizing plan will be later plugged into a virtual table for
observability.
2025-02-19 14:40:06 +01:00
Tomasz Grabiec
33db0d4fea tablets: Add formatter for resize_decision::way_type 2025-02-19 14:39:40 +01:00
Tomasz Grabiec
ce959818a3 locator: network_topology_strategy: Simplify calculate_initial_tablets_from_topology() 2025-02-19 14:38:50 +01:00
Tomasz Grabiec
f043c83ba5 tablets: Change the meaning of initial_scale to mean min-avg-tablets-per-shard
Currently the scale is applied post rounding up of tablet count so
that tablet count per shard is at least 1. In order to be able to use
the scale to increase tablet count per shard, we need to apply it
prior to division by RF, otherwise we will overshoot per-shard tablet
replica count.

Example:

  4 nodes, -c1, rf=3, initial_tablets_scale=10

  Before: initial_tablet_count=20, tablet-per-shard=15
  After: initial_tablet_count=14, tablets-per-shard=10.5
2025-02-19 14:38:50 +01:00
Tomasz Grabiec
8eedb551b5 tablets: network_topology_stragy: Coroutinize calculate_initial_tablets_from_topology()
To insert preemption points later.
2025-02-19 14:38:49 +01:00
Botond Dénes
3439d015cb Merge 'repair: Introduce Host and DC filter support' from Aleksandra Martyniuk
Currently, the tablet repair scheduler repairs all replicas of a tablet. It does not support hosts or DCs selection. It should be enough for most cases. However, users might still want to limit the repair to certain hosts or DCs in production. https://github.com/scylladb/scylladb/pull/21985 added the preparation work to add the config options for the selection. This patch adds the hosts or DCs selection support.

Fixes https://github.com/scylladb/scylladb/issues/22417

New feature. No backport is needed.

Closes scylladb/scylladb#22621

* github.com:scylladb/scylladb:
  test: add test to check dcs and hosts repair filter
  test: add repair dc selection to test_tablet_metadata_persistence
  repair: Introduce Host and DC filter support
  docs: locator: update the docs and formatter of tablet_task_info
2025-02-17 10:04:09 +02:00
Calle Wilund
342df0b1a8 network_topology_strategy/alter ks: Remove dc:s from options once rf=0
Fixes #22688

If we set a dc rf to zero, the options map will still retain a dc=0 entry.
If this dc is decommissioned, any further alters of keyspace will fail,
because the union of new/old options will now contained an unknown keyword.

Change alter ks options processing to simply remove any dc with rf=0 on
alter, and treat this as an implicit dc=0 in nw-topo strategy.
This means we change the reallocate_tablets routine to not rely on
the strategy objects dc mapping, but the full replica topology info
for dc:s to consider for reallocation. Since we verify the input
on attribute processing, the amount of rf/tablets moved should still
be legal.

v2:
* Update docs as well.
v3:
* Simplify dc processing
* Reintroduce options empty check, but do early in ks_prop_defs
* Clean up unit test some

Closes scylladb/scylladb#22693
2025-02-15 20:32:22 +02:00
Asias He
5545289bfa repair: Introduce Host and DC filter support
Currently, the tablet repair scheduler repairs all replicas of a tablet.
It does not support hosts or DCs selection. It should be enough for most
cases. However, users might still want to limit the repair to certain
hosts or DCs in production. #21985 added the preparation work to add the
config options for the selection. This patch adds the hosts or DCs
selection support.

Fixes #22417
2025-02-14 09:13:11 +01:00
Aleksandra Martyniuk
4c75701756 docs: locator: update the docs and formatter of tablet_task_info 2025-02-14 09:13:11 +01:00
Kefu Chai
a18069fad7 service: migrate from boost::range::remove_if() to std::ranges::remove_if
Replace boost::range::remove_if() with the standard library's
std::ranges::remove_if() to reduce external dependencies and simplify
the codebase. This change eliminates the requirement for boost::range
and makes the implementation more maintainable.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-02-11 09:15:14 +08:00
Benny Halevy
021fc3c756 tablets: resize_decision: get rid of initial_decision
Now, with tablet_hints calculation of min_tablet_count
it is not used anymore.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-06 18:43:47 +02:00
Benny Halevy
20c6ca2813 tablet_allocator: consider tablet options for resize decision
Do not merge tablets if that would drop the tablet_count
below the minimum provided by hints.
Split tablets if the current tablet_count is less than
the minimum tablet count calculated using the table's tablet options.

TODO: override min_tablet_count if the tablet count per shard
is greater than the maximum allowed.  In this case
the tables tablet counts should be scaled down proportionally.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-06 18:43:35 +02:00
Benny Halevy
32c2f7579f network_topology_strategy: allocate_tablets_for_new_table: consider tablet options
Use the keyspace initial_tablets for min_tablet_count, if the latter
isn't set, then take the maximum of the option-based tablet counts:
- min_tablet_count
- and expected_data_size_in_gb / target_tablet_size
- min_per_shard_tablet_count (via
  calculate_initial_tablets_from_topology)
If none of the hints produce a positive tablet_count,
fall back to calculate_initial_tablets_from_topology * initial_scale.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-06 08:59:32 +02:00
Benny Halevy
86bcf4cffe network_topology_strategy: calculate_initial_tablets_from_topology: precalculate shards per dc using for_each_token_owner
Current implementation is inefficient as it calls
get_datacenter_token_owners_ips and then find_node(ep)
while for_each_node easily provides a host_id for
is_normal_token_owner.

Then, since we're interested only in datacenters
configure with a replication factor (but it still might be 0),
simply iterate over the dc->rf map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-06 08:59:30 +02:00
Benny Halevy
49dacb1d52 network_topology_strategy: calculate_initial_tablets_from_topology: set default rf to 0
Currently, if a datacenter has no replication_factor option
we consider its replication factor to be 1 in
calculate_initial_tablets_from_topology, but since
we're not going to have any replica on it, it should
be 0.

This is very minor since in the worst case, it
will pessimize the calculation and calculate a value
for initial_tablets that's higher than it could be.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-02-06 08:55:51 +02:00
Tomasz Grabiec
3bb19e9ac9 locator: network_topology_startegy: Ignore leaving nodes when computing capacity for new tables
For example, nodes which are being decommissioned should not be
consider as available capacity for new tables. We don't allocate
tablets on such nodes.

Would result in higher per-shard load then planned.

Closes scylladb/scylladb#22657
2025-02-05 23:59:41 +02:00
Tomasz Grabiec
e22e3b21b1 locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes
In that case, new_racks will be used, but when we discover no
candidates, we try to pop from existing_racks.

Fixes #22625

Closes scylladb/scylladb#22652
2025-02-05 20:13:05 +02:00
Pavel Emelyanov
83f3821f99 Merge 'cql: clean the code validating replication strategy options' from Piotr Smaron
Clean the code validating if a replication strategy can be used.
This PR consists of a bunch of unmerged https://github.com/scylladb/scylladb/pull/20088 commits - the solution to the problem that the linked PR tried to solve has been accomplished in another PR, leaving the refactor commits unmerged. The commits introduced in this PR have already been reviewed in the old PR.

No need to backport, it's just a refactor.

Closes scylladb/scylladb#22516

* github.com:scylladb/scylladb:
  cql: restore validating replication strategies options
  cql: change validating NetworkTopologyStrategy tags to internal_error
  cql: inline abstract_replication_strategy::validate_replication_strategy
  cql: clean redundant code validating replication strategy options
2025-02-05 11:18:50 +03:00
Piotr Smaron
2953d3ebe0 cql: restore validating replication strategies options
`validate_options` needs to be extended with
`topology` parameter, because NetworkTopologyStrategy needs to validate if every
explicitly listed DC is really existing. I did cut corner a bit and
trimmed the message thrown when it's not the case, just to avoid passing
and extra parameter (ks name) to the `validate_options`
function, as I find the longer message to be a bit redundant (the driver will
receive info which KS modification failed).
The tests that have been commented out in the previous commit have been
restored.
2025-02-04 12:27:33 +01:00
Piotr Smaron
100e8d2856 cql: change validating NetworkTopologyStrategy tags to internal_error
The check for `replication_factor` tag in
`network_topology_strategy::validate_options` is redundant for 2 reasons:
- before we reach this part of the code, the `replication_factor` tag
  is replaced with specific DC names
- we actually do allow for `replication_factor` tag in
  NetworkTopologyStrategy for keyspaces that have tablets disabled.

This code is unreachable, hence changing it to an internal error, which
means this situation should never occur.
The place that unrolls `replication_factor` tag checked for presence of
this tag ignoring the case, which lead to an unexpected behaviour:
- `replication_factor` tag (note the lowercase) was unrolled, as
  explained above,
- the same tag but written in any other case resulted in throwing a vague
  message: "replication_factor is an option for SimpleStrategy, not
NetworkTopologyStrategy".

So we're changing this validation to accept and unroll only the
lowercase version of this tag. We can't ignore the case here, as this
tag is present inside a json, and json is case-sensitive, even though the
CQL itself is case insensitive.

Added a test that passes for both scylla and cassandra.

Fixes: #15336
2025-02-04 12:27:29 +01:00
Botond Dénes
af46894bb7 Merge 'Rack aware view pairing' from Benny Halevy
Enabled with the tablets_rack_aware_view_pairing cluster feature
rack-aware pairing pairs base to view replicas that are in the
same dc and rack, using their ordinality in the replica map

We distinguish between 2 cases:
- Simple rack-aware pairing: when the replication factor in the dc
  is a multiple of the number of racks and the minimum number of nodes
  per rack in the dc is greater than or equal to rf / nr_racks.

  In this case (that includes the single rack case), all racks would
  have the same number of replicas, so we first filter all replicas
  by dc and rack, retaining their ordinality in the process, and
  finally, we pair between the base replicas and view replicas,
  that are in the same rack, using their original order in the
  tablet-map replica set.

  For example, nr_racks=2, rf=4:
  base_replicas = { N00, N01, N10, N11 }
  view_replicas = { N11, N12, N01, N02 }
  pairing would be: { N00, N01 }, { N01, N02 }, { N10, N11 }, { N11, N12 }
  Note that we don't optimize for self-pairing if it breaks pairing ordinality.

- Complex rack-aware pairing: when the replication factor is not
  a multiple of nr_racks.  In this case, we attempt best-match
  pairing in all racks, using the minimum number of base or view replicas
  in each rack (given their global ordinality), while pairing all the other
  replicas, across racks, sorted by their ordinality.

  For example, nr_racks=4, rf=3:
  base_replicas = { N00, N10, N20 }
  view_replicas = { N11, N21, N31 }
  pairing would be: { N00, N31 }\*, { N10, N11 }, { N20, N21 }
  \* cross-rack pair

  If we'd simply stable-sort both base and view replicas by rack,
  we might end up with much worse pairing across racks:
  { N00, N11 }\*, { N10, N21 }\*, { N20, N31 }\*
  \* cross-rack pair

Fixes scylladb/scylladb#17147

* This is an improvement so no backport is required

Closes scylladb/scylladb#21453

* github.com:scylladb/scylladb:
  network_topology_strategy_test: add tablets rack_aware_view_pairing tests
  view: get_view_natural_endpoint: implement rack-aware pairing for tablets
  view: get_view_natural_endpoint: handle case when there are too few view replicas
  view: get_view_natural_endpoint: track replica locator::nodes
  locator: topology: consult local_dc_rack if node not found by host_id
  locator: node: add dc and rack getters
  feature_service: add tablet_rack_aware_view_pairing feature
  view: get_view_natural_endpoint: refactor predicate function
  view: get_view_natural_endpoint: clarify documentation
  view: mutate_MV: optimize remote_endpoints filtering check
  view: mutate_MV: lookup base and view erms synchronously
  view: mutate_MV: calculate keyspace-dependent flags once
2025-01-30 11:32:19 +02:00
Piotr Smaron
2a77405093 cql: inline abstract_replication_strategy::validate_replication_strategy
This function is only called from 1 place and only contains 2 lines of
code, just keeping it increases the code bloat
2025-01-27 12:13:45 +01:00
Piotr Smaron
3848293a43 cql: clean redundant code validating replication strategy options
Most of the code from `recognized_options` is either incorrect or lacks
any implementation, for example:
- comments for Everywhere and Local strategies are contradictory, first
  says to allow all options, second says that the strategy doesn't accept
any options, even though both functions have the same implementation,
- for Local & Everywhere strategies the same logic is repeated in
  `validate_options` member functions, i.e. this function does nothing,
- for NetworkTopology this function returns DC names and tablet options, but tablet
  options are empty; OTOH this strategy also accepts 'replication_factor'
tag, which was ommitted,
- for SimpleStrategy this function returns `replication_factor`, but this is also validated
  in `validate_options` function called just before the removed
function.
All of it makes `validate_replication_strategy` work incorrectly.
That being said, 3 tests fail because of this logic's removal, so it did
something after all. The failing tests are commented out, so that the CI
passes, and will be restored in the next commit(s).
2025-01-27 12:01:59 +01:00
Aleksandra Martyniuk
54e7f2819c test: add test to check if repair handles no_such_keyspace 2025-01-27 09:49:50 +01:00
Benny Halevy
2589115337 locator: topology: consult local_dc_rack if node not found by host_id
Like get_location by inet_address, there is a case, when
a node is replaced that the node cannot be found by host_id.
Currently get_location would return a reference based on the nullptr
which might cause a segfault as seen in testing.  Instead,
if the host_id is of the location, revert to calling
get_location() which consults this_node or _cfg.local_dc_rack.
Otherwise, throw a runtime_error.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-22 09:04:24 +02:00
Benny Halevy
2bfebc1f62 locator: node: add dc and rack getters
To simplify searching and sorting using std::ranges
projection using std::mem_fn.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-01-22 09:04:24 +02:00
Botond Dénes
47989b1503 Merge 'tasks: add tablet resize virtual task' from Aleksandra Martyniuk
In this change, tablet_virtual_task starts supporting tablet
resize (i.e. split and merge).

Users can see running resize tasks - finished tasks are not
presented with the task manager API.

A new task state "suspended" is added. If a resize was revoked,
it will appear to users as suspended. We assume that the resize was revoked
when the tablet number didn't change.

Fixes: #21366.
Fixes: #21367.

No backport, new feature

Closes scylladb/scylladb#21891

* github.com:scylladb/scylladb:
  test: boost: check resize_task_info in tablet_test.cc
  test: add tests to check revoked resize virtual tasks
  test: add tests to check the list of resize virtual tasks
  test: add tests to check spilt and merge virtual tasks status
  test: test_tablet_tasks: generalize functions
  replica: service: add split virtual task's children
  replica: service: pass parent info down to storage_group::split
  tasks: children of virtual tasks aren't internal by default
  tasks: initialize shard in task_info ctor
  service: extend tablet_virtual_task::abort
  service: retrun status_helper struct from tablet_virtual_task::get_status_helper
  service: extend tablet_virtual_task::wait
  tasks: add suspended task state
  service: extend tablet_virtual_task::get_status
  service: extend tablet_virtual_task::contains
  service: extend tablet_virtual_task::get_stats
  service: add service::task_manager_module::get_nodes
  tasks: add task_manager::get_nodes
  tasks: drop noexcept from module::get_nodes
  replica: service: add resize_task_info static column to system.tablets
  locator: extend tablet_task_info to cover resize tasks
2025-01-17 14:24:07 +02:00
Gleb Natapov
1e4b2f25dc locator: token_metadata: drop update_host_id() function that does nothing now 2025-01-16 16:37:08 +02:00