The `table::do_apply()` method verifies if the compaction group's async
gate is open to determine if the compaction group is active. Closing
this async gate prevents any new operations but waits for existing
holders to exit, allowing their operations to complete. When holding a
gate, holders will observe the gate as closed when it is being closed,
but this is irrelevant as they are already inside the gate and are
allowed to complete. All the callers of `table::do_apply()` already
enter the gate before calling the method. So, the async gate check
inside `table::do_apply()` will erroneously throw an exception when the
compaction group is closing despite holding the gate. This commit
removes the check to prevent this from happening.
Fixes#23348
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#23579
(cherry picked from commit 750f4baf44)
Closesscylladb/scylladb#23645
Currently, repair_writer_impl::create_writer keeps erm to ensure that a sharder is valid. If we repair a tablet, erm blocks the state machine and no operation on any tablet of this table might be performed.
Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked.
Fixes: #23453.
Needs backport to 2025.1 that introduces the tablet repair scheduler.
- (cherry picked from commit 1dc29ddc86)
- (cherry picked from commit bae6711809)
Parent PR: #23455Closesscylladb/scylladb#23580
* github.com:scylladb/scylladb:
\test: add test to check concurrent migration and repair of two different tablets
repair: release erm in repair_writer_impl::create_writer when possible
The "make-pr-ready-for-review" workflow was failing with an "Input
required and not supplied: token" error. This was due to GitHub Actions
security restrictions preventing access to the token when the workflow
is triggered in a fork:
```
Error: Input required and not supplied: token
```
This commit addresses the issue by:
- Running the workflow in the base repository instead of the fork. This
grants the workflow access to the required token with write permissions.
- Simplifying the workflow by using a job-level `if` condition to
controlexecution, as recommended in the GitHub Actions documentation
(https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/using-conditions-to-control-job-execution).
This is cleaner than conditional steps.
- Removing the repository checkout step, as the source code is not required for this workflow.
This change resolves the token error and ensures the
"make-pr-ready-for-review" workflow functions correctly.
Fixesscylladb/scylladb#22765
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22766
(cherry picked from commit ca832dc4fb)
Closesscylladb/scylladb#23561
Before this patch, granting a user MODIFY permissions on ALL KEYSPACES allowed the user to write to system tables, where the user could also set himself to "superuser" granting him all other permissions. After this patch, MODIFY permissions on ALL KEYSPACES is limited only to non-system keyspaces.
Fixes: scylladb/scylladb#23218
(cherry picked from commit fee50f287c)
Parent PR: #23219Closesscylladb/scylladb#23594
On our testing infrastructure, tests often run a hundred times (!)
slower than usual, for various reasons that we can't always avoid.
This is why all our test frameworks drastically increase the default
timeouts.
We forgot to increase the timeout in one place - where Alternator tests
use CQL. This is needed for the Alternator role-based access control
(RBAC) tests, which is configured via CQL and therefore the Alternator
test unusually uses CQL.
So in this patch we increase the timeout of CQL driver used by
Alternator tests to the same high timeouts (60-120 seconds) used by
the regular CQL tests. As the famous saying goes, these timeouts should
be enough for anyone.
Fixes#23569.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#23578
(cherry picked from commit a9a6f9eecc)
Closesscylladb/scylladb#23601
The member in question is unconditionally .stop()-ed in task's
release_resources() method, however, it may happen that the thing wasn't
.start()-ed in the first place. Start happens in the middle of the
task's .run() method and there can be several reasons why it can be
skipped -- e.g. the task is aborted early, or collecting sstables from
S3 throws.
fixes: #23231
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#23483
(cherry picked from commit 832d83ae4b)
Closesscylladb/scylladb#23557
Function modification_statement::add_raw() is never called, which
makes query string in audit_info of batch queries empty. In enterprise
branch, add_raw is called in Cql.g and those changes were never merged
to master.
This changes:
- Add missing call of add_raw() to Cql.g
- Include other related changes (from PR#3228 in scylla-enterprise)
Fixes scylladb#23311
Closesscylladb/scylladb#23315
(cherry picked from commit b8adbcbc84)
Closesscylladb/scylladb#23495
Currently, repair_writer_impl::create_writer keeps erm to ensure
that a sharder is valid. If we repair a tablet, erm blocks the state
machine and no operation on any tablet of this table might be performed.
Use auto_refreshing_sharder and topology_guard to ensure that the
operation is safe and that tablet operations on the whole table
aren't blocked.
Fixes: #23453.
(cherry picked from commit 1dc29ddc86)
Draining hints may occur in one of the two scenarios:
* a node leaves the cluster and the local node drains all of the hints
saved for that node,
* the local node is being decommissioned.
Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.
There are two reasons for that:
* Generally, in situations like that, we'd like to be able to shut down
nodes as fast as possible. The data stored in the hints won't
disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
have the highest priority and it's reflected in the scheduling groups we
use as well as the explicitly enforced throughput. If there are a large
number of hints to be replayed, it might affect our tests.
It's already happened, see: scylladb/scylladb#21949.
To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To amend for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.
That should ensure all data is retained, while possibly speeding up
the shutdown process.
There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay hints it might've before
these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.
We also provide tests verifying that the implementation works as intended.
Fixesscylladb/scylladb#21949Closesscylladb/scylladb#22811
(cherry picked from commit 0a6137218a)
Closesscylladb/scylladb#23370
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
1) tablets have the same size
2) shards have the same capacity
That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.
After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.
One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.
Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.
Fixes#23042
* github.com:scylladb/scylladb:
tablets: Make load balancing capacity-aware
topology_coordinator: Fix confusing log message
topology_coordinator: Refresh load stats after adding a new node
topology_coordinator: Allow capacity stats to be refreshed with some nodes down
topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
test: boost: tablets_test: Always provide capacity in load_stats
test: perf_load_balancing: Set node capacity
test: perf_load_balancing: Convert to topology_builder
config, disk_space_monitor: Allow overriding capacity via config
storage_service, tablets: Collect per-node capacity in load_stats
test: tablets_test: Add support for auto-split mode
test: cql_test_env: Expose db config
Closesscylladb/scylladb#23443
* github.com:scylladb/scylladb:
Merge 'tablets: Make load balancing capacity-aware' from Tomasz Grabiec
test: tablets_test: Add support for auto-split mode
test: cql_test_env: Expose db config
Refs scylla-enterprise#5185
Fixes#22901
If a tls socket gets EPIPE the error is not translated to a specific
gnutls error code, but only a generic ERROR_PULL/PUSH. Since we treat
EPIPE as ignorable for plain sockets, we need to unwind nested exception
here to detect that the error was in fact due to this, so we can suppress
log output for this.
Closesscylladb/scylladb#22888
(cherry picked from commit e49f2046e5)
Closesscylladb/scylladb#23045
The series fixes a regression and demotes a barrier_and_drain logging error to a warning since this particular condition may happen during normal operation.
We want to backport both since one is a bug fix and another is trivial and reduces CI flakiness.
- (cherry picked from commit 1da7d6bf02)
- (cherry picked from commit fe45ea505b)
Parent PR: #22650Closesscylladb/scylladb#22923
* https://github.com/scylladb/scylladb:
topology_coordinator: demote barrier_and_drain rpc failure to warning
topology_coordinator: read peers table only once during topology state application
In the current scenario, the problem discovered is that there is a time
gap between group0 creation and raft_initialize_discovery_leader call.
Because of that, the group0 snapshot/apply entry enters wrong values
from the disk(null) and updates the in-memory variables to wrong values.
During the above time gap, the in-memory variables have wrong values and
perform absurd actions.
This PR removes the variable `_manage_topology_change_kind_from_group0`
which was used earlier as a work around for correctly handling
`topology_change_kind` variable, it was brittle and had some bugs
(causing issues like scylladb/scylladb#21114). The reason for this bug
that _manage_topology_change_kind used to block reading from disk and
was enabled after group0 initialization and starting raft server for the
restart case. Similarly, it was hard to manage `topology_change_kind`
using `_manage_topology_change_kind_from_group0` correctly in bug free
anner.
Post `_manage_topology_change_kind_from_group0` removal, careful
management of `topology_change_kind` variable was needed for maintaining
correct `topology_change_kind` in all scenarios. So this PR also performs
a refactoring to populate all init data to system tables even before
group0 creation(via `raft_initialize_discovery_leader` function). Now
because `raft_initialize_discovery_leader` happens before the group 0
creation, we write mutations directly to system tables instead of a
group 0 command. Hence, post group0 creation, the node can read the
correct values from system tables and correct values are maintained
throughout.
Added a new function `initialize_done_topology_upgrade_state` which
takes care of updating the correct upgrade state to system tables before
starting group0 server. This ensures that the node can read the correct
values from system tables and correct values are maintained throughout.
By moving `raft_initialize_discovery_leader` logic to happen before
starting group0 server, and not as group0 command post server start, we
also get rid of the potential problem of init group0 command not being
the 1st command on the server. Hence ensuring full integrity as expected
by programmer.
This PR fixes a bug. Hence we need to backport it.
Fixes: scylladb/scylladb#21114
- (cherry picked from commit 4748125a48)
- (cherry picked from commit e491950c47)
- (cherry picked from commit 623e01344b)
- (cherry picked from commit d7884cf651)
Parent PR: #22484Closesscylladb/scylladb#22966
* https://github.com/scylladb/scylladb:
storage_service: Remove the variable _manage_topology_change_kind_from_group0
storage_service: fix indentation after the previous commit
raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
service/raft: Refactor mutation writing helper functions.
This commit removes the outdated information about seed nodes.
We no longer need it in the docs, as a) the documentation is versioned,
and b) the ScyllaDB Open Source 4.3 and ScyllaDB Enterprise 2021.1 versions
mentioned in the docs are no longer supported.
In addition, some clarification has been added to the existing sections.
Fixes https://github.com/scylladb/scylladb/issues/22400Closesscylladb/scylladb#23282
(cherry picked from commit dbbf9e19e4)
Closesscylladb/scylladb#23327
Moving a PR out of draft is only allowed to users with write access,
adding a github action to switch PR to `ready for review` once the
`conflicts` label was removed
Closesscylladb/scylladb#22446
(cherry picked from commit ed4bfad5c3)
Closesscylladb/scylladb#23023
The test fails sporadically with:
cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test3.test2 - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
That's becase a server is stopped in the middle of the workload.
The server is stopped ungracefully which will cause some requests to
time out. We should stop it gracefully to allow in-flight requests to
finish.
Fixes#20492Closesscylladb/scylladb#23451
(cherry picked from commit 8e506c5a8f)
Closesscylladb/scylladb#23469
After recent changes #18640 and #19151 started to reproduce for
stop_after_sending_join_node_request and
stop_after_bootstrapping_initial_raft_configuration error injections too.
The solution is the same: deselect the tests.
Fixes#23302Closesscylladb/scylladb#23405
(cherry picked from commit 574c81eac6)
Closesscylladb/scylladb#23460
Before this patch, the load balancer was equalizing tablet count per
shard, so it achieved balance assuming that:
1) tablets have the same size
2) shards have the same capacity
That can cause imbalance of utilization if shards have different
capacity, which can happen in heterogeneous clusters with different
instance types. One of the causes for capacity difference is that
larger instances run with fewer shards due to vCPUs being dedicated to
IRQ handling. This makes those shards have more disk capacity, and
more CPU power.
After this patch, the load balancer equalizes shard's storage
utilization, so it no longer assumes that shards have the same
capacity. It still assumes that each tablet has equal size. So it's a
middle step towards full size-aware balancing.
One consequence is that to be able to balance, the load balancer need
to know about every node's capacity, which is collected with the same
RPC which collects load_stats for average tablet size. This is not a
significant set back because migrations cannot proceed anyway if nodes
are down due to barriers. We could make intra-node migration
scheduling work without capacity information, but it's pointless due
to above, so not implemented.
Also, per-shard goal for tablet count is still the same for all nodes in the cluster,
so nodes with less capacity will be below limit and nodes with more capacity will
be slightly above limit. This shouldn't be a significant problem in practice, we could
compensate for this by increasing the limit.
Refs #23042Closesscylladb/scylladb#23079
* github.com:scylladb/scylladb:
tablets: Make load balancing capacity-aware
topology_coordinator: Fix confusing log message
topology_coordinator: Refresh load stats after adding a new node
topology_coordinator: Allow capacity stats to be refreshed with some nodes down
topology_coordinator: Refactor load status refreshing so that it can be triggered from multiple places
test: boost: tablets_test: Always provide capacity in load_stats
test: perf_load_balancing: Set node capacity
test: perf_load_balancing: Convert to topology_builder
config, disk_space_monitor: Allow overriding capacity via config
storage_service, tablets: Collect per-node capacity in load_stats
(cherry picked from commit b1d9f80d85)
rebalance_tablets() was performing migrations and merges automatically
but not splits, because splits need to be acked by replicas via
load_stats. It's inconvenient in tests which want to rebalance to the
equilibrium point. This patch changes rebalance_tablets() to split
automatically by default, can be disabled for tests which expect
differently.
shared_load_stats was introduced to provide a stable holder of
load_stats which can be reused across rebalance_tablets() calls.
(cherry picked from commit 5e471c6f1b)
Do not hold erm during repair of a tablet that is started with tablet
repair scheduler. This way two different tablets can be repaired
and migrated concurrently. The same tablet won't be migrated while
being repaired as it is provided by topology coordinator.
Use topology_guard to maintain safety.
Fixes: https://github.com/scylladb/scylladb/issues/22408.
Needs backport to 2025.1 that introduces the tablet repair scheduler.
Closesscylladb/scylladb#23362
* github.com:scylladb/scylladb:
test: add test to check concurrent tablets migration and repair
repair: do not hold erm for repair scheduled by scheduler
repair: get total rf based on current erm
repair: make shard_repair_task_impl::erm private
repair: do not pass erm to put_row_diff_with_rpc_stream when unnecessary
repair: do not pass erm to flush_rows_in_working_row_buf when unnecessary
repair: pass session_id to repair_writer_impl::create_writer
repair: keep materialized topology guard in shard_repair_task_impl
repair: pass session_id to repair_meta
This PR is an introductory step towards enforcing
RF-rack-valid keyspaces in Scylla.
The scope of changes:
* defining RF-rack-valid keyspaces,
* introducing a configuration option enforcing RF-rack-valid
keyspaces,
* restricting the CREATE and ALTER KEYSPACE statements
so that they never lead to RF-rack invalid keyspaces,
* during the initialization of a node, it verifies that all existing
keyspaces are RF-rack-valid. If not, the initialization fails.
We provide tests verifying that the changes behave as intended.
---
Note that there are a number of things that still need to be implemented.
That includes, for instance, restricting topology operations too.
---
Implementation strategy (going beyond the scope of this PR):
1. Introduce the new configuration option `rf_rack_valid_keyspaces`.
2. Start enforcing RF-rack-validity in keyspaces if the option is enabled.
3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests.
4. Once the tests have been adjusted, change the default value of the option to enabled.
5. Stop explicitly enabling the option in tests.
6. Get rid of the option.
---
Fixes scylladb/scylladb#20356
Fixes scylladb/scylladb#23276
Fixes scylladb/scylladb#23300
---
Backport: this is part of the requirements for releasing 2025.1.
- (cherry picked from commit 32879ec0d5)
- (cherry picked from commit 41f862d7ba)
- (cherry picked from commit 0e04a6f3eb)
Parent PR: #23138Closesscylladb/scylladb#23398
* github.com:scylladb/scylladb:
main: Refuse to start node when RF-rack-invalid keyspace exists
cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
db/config: Introduce RF-rack-valid keyspaces
When a node is started with the option `rf_rack_valid_keyspaces`
enabled, the initialization will fail if there is an RF-rack-invalid
keyspace. We want to force the user to adjust their existing
keyspaces when upgrading to 2025.* so that the invariant that
every keyspace is RF-rack-valid is always satisfied.
Fixesscylladb/scylladb#23300
(cherry picked from commit 0e04a6f3eb)
In this commit, we refuse to create or alter a keyspace when that operation
would make it RF-rack-invalid if the option `rf_rack_valid_keyspaces` is
enabled.
We provide two tests verifying that the changes work as intended.
Fixesscylladb/scylladb#23276
(cherry picked from commit 41f862d7ba)
We introduce a new term in the glossary: RF-rack-valid keyspace.
We also highlight in our user documentation that all keyspaces
must remain RF-rack-valid throughout their lifetime, and failing
to guarantee that may result in data inconsistencies or other
issues. We base that information on our experience with materialized
views in keyspaces using tablets, even though they remain
an experimental feature.
Along with the new term, we introduce a new configuration option
called `rf_rack_valid_keyspaces`, which, when enabled, will enforce
preserving all keyspaces RF-rack-valid. That functionality will be
implemented in upcoming commits. For now, we materialize the
restriction in form of a named requirement: a function verifying
that the passed keyspace is RF-rack-valid.
The option is disabled by default. That will change once we adjust
the existing tests to the new semantics. Once that is done, the option
will first be enabled by default, and then it will be removed.
Fixesscylladb/scylladb#20356
(cherry picked from commit 32879ec0d5)
Do not hold erm for tablet repair scheduled by scheduler. Thanks to
that one tablet repair won't exclude migration of other tablets.
Concurrent repair and migration of the same tablet isn't possible,
since a tablet can be in one type of transition only at the time.
Hence the change is safe.
Refs: https://github.com/scylladb/scylladb/issues/22408.
(cherry picked from commit 5b792bdc98)
Get total rf based on erm. Currently, it does not change anything
because erm stays the same during the whole repair.
(cherry picked from commit a1375896df)
When small_table_optimization isn't enabled, put_row_diff_with_rpc_stream
does not access erm. Pass small_table_optimization_params containing erm
only when small_table_optimization is enabled.
This is safe as erm is kept by shard_repair_task_impl.
(cherry picked from commit 444c7eab90)
When small_table_optimization isn't enabled, flush_rows_in_working_row_buf
does not access erm. Add small_table_optimization_params containing erm and
pass it only when small_table_optimization is enabled.
This is safe as erm is kept by shard_repair_task_impl.
(cherry picked from commit e56bb5b6e2)
Keep materialized topology guard in shard_repair_task_impl and check
it in check_in_abort_or_shutdown and before each range repair.
(cherry picked from commit 47bb9dcf78)
Pass session_id of tablet repair down the stack from the repair request
to repair_meta.
The session_id will be utiziled in the following patches.
(cherry picked from commit 928f92c780)
service: Introduce rack-aware co-location migrations for tablet merge
Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.
Given replica set of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]
Merge will co-locate T1:rack2 into T0:rack1, T1 will be temporarily only at
only a subset of racks, reducing availability.
This is the main problem fixed by this patch.
It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.
Fixes#22994.
Refs #17265.
This patch is based on Raphael's work done in PR #23081. The main differences are:
1) Instead of sorting replicas by rack, we try to find
replicas in sibling tablets which belong to the same rack.
This is similar to how we match replicas within the same host.
It reduces number of across-rack migrations even if RF!=#racks,
which the original patch didn't handle.
Unlike the original patch, it also avoids rack-overloaded in case
RF!=#racks
2) We emit across-rack co-locating migrations if we have no other choice
in order to finalize the merge
This is ok, since views are not supported with tablets yet. Later,
we will disallow this for tables which have views, and we will
allow creating views in the first place only when no such migrations
can happen (RF=#racks).
3) Added boost unit test which checks that rack overload is avoided during merge
in case RF<#racks
4) Moved logging of across-rack migration to debug level
5) Exposed metric for across-rack co-locating migrations
(cherry picked from commit af949f3b6a)
Also backports dependent patches:
- locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes
- locator: network_topology_startegy: Ignore leaving nodes when computing capacity for new tables
- Merge 'test: tablets_test: Create proper schema in load balancer tests' from Tomasz Grabiec
Closesscylladb/scylladb#22657Closesscylladb/scylladb#22652Closesscylladb/scylladb#23297
* github.com:scylladb/scylladb:
service: Introduce rack-aware co-location migrations for tablet merge
Merge 'test: tablets_test: Create proper schema in load balancer tests' from Tomasz Grabiec
locator: network_topology_startegy: Ignore leaving nodes when computing capacity for new tables
locator: network_topology_strategy: Fix SIGSEGV when creating a table when there is a rack with no normal nodes
In commit c24bc3b we decided that creating a new table in Alternator
will by default use vnodes - not tablets - because of all the missing
features in our tablets implementation that are important for
Alternator, namely - LWT, CDC and Alternator TTL.
We never documented this, or the fact that we support a tag
`experimental:initial_tablets` which allows to override this decision
and create an Alternator table using tablets. We also never documented
what exactly doesn't work when Alternator uses tablet.
This patch adds the missing documentation in docs/alternator/new-apis.md
(which is a good place for describing the `experimental:initial_tablets`
tag). The patch also adds a new test file, test_tablets.py, which
includes tests for all the statements made in the document regarding
how `experimental:initial_tablets` works and what works or doesn't
work when tablets are enabled.
Two existing tests - for TTL and Streams non-support with tablets -
are moved to the new test file.
When the tablets feature will finally be completed, both the document
and the tests will need to be modified (some of the tests should be
outright deleted). But it seems this will not happen for at least
several months, and that is too long to wait without accurate
documentation.
Fixes#21629
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#22462
(cherry picked from commit c0821842de)
Closesscylladb/scylladb#23298
Merge co-location can emit migrations across racks even when RF=#racks,
reducing availability and affecting consistency of base-view pairing.
Given replica set of sibling tablets T0 and T1 below:
[T0: (rack1,rack3,rack2)]
[T1: (rack2,rack1,rack3)]
Merge will co-locate T1:rack2 into T0:rack1, T1 will be temporarily only at
only a subset of racks, reducing availability.
This is the main problem fixed by this patch.
It also lays the ground for consistent base-view replica pairing,
which is rack-based. For tables on which views can be created we plan
to enforce the constraint that replicas don't move across racks and
that all tablets use the same set of racks (RF=#racks). This patch
avoids moving replicas across racks unless it's necessary, so if the
constraint is satisfied before merge, there will be no co-locating
migrations across racks. This constraint of RF=#racks is not enforced
yet, it requires more extensive changes.
Fixes#22994.
Refs #17265.
This patch is based on Raphael's work done in PR #23081. The main differences are:
1) Instead of sorting replicas by rack, we try to find
replicas in sibling tablets which belong to the same rack.
This is similar to how we match replicas within the same host.
It reduces number of across-rack migrations even if RF!=#racks,
which the original patch didn't handle.
Unlike the original patch, it also avoids rack-overloaded in case
RF!=#racks
2) We emit across-rack co-locating migrations if we have no other choice
in order to finalize the merge
This is ok, since views are not supported with tablets yet. Later,
we will disallow this for tables which have views, and we will
allow creating views in the first place only when no such migrations
can happen (RF=#racks).
3) Added boost unit test which checks that rack overload is avoided during merge
in case RF<#racks
4) Moved logging of across-rack migration to debug level
5) Exposed metric for across-rack co-locating migrations
(cherry picked from commit af949f3b6a)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>