Docs: Describe driver issue with tablet RF increase

The current protocol extension that sends tablet info to drivers does so
only when the driver selects a non-replica coordinator for a routable
request. This works well if a node on the replica list is replaced by
another node, or if some replicas are removed from the list: the driver
will at some point send a request to a stale replica and receive the new
list in response.

The issue is with extending the list with new replicas. In that case the
old replicas are all still correct, so the driver will not select a wrong
replica and will not receive the new list. As far as I know, the only
scenario where this can happen is an RF increase.
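
The difference between the two cases can be sketched with a toy model.
This is not driver or server code: the function name, the node names and
the set-based replica lists are all made up for illustration, and the
only rule taken from the description above is that a new list comes back
when the chosen coordinator is not a replica.

```python
def route_all(cached, actual):
    """Send one request to each cached replica and return the updated cache.

    cached: replica set the (hypothetical) driver currently believes in.
    actual: replica set the cluster really uses for the tablet.
    """
    for coordinator in sorted(cached):
        # Assumption from the text: tablet info is attached to the
        # response only when the coordinator is NOT a replica.
        if coordinator not in actual:
            return set(actual)
    # Every request hit a valid replica, so no update was ever sent.
    return set(cached)


old = {"n1", "n2", "n3"}

# A replica is replaced: the request to stale n3 triggers an update.
after_replace = route_all(old, {"n1", "n2", "n4"})

# A replica is removed: same mechanism, the cache shrinks.
after_removal = route_all(old, {"n1", "n2"})

# The list is extended (RF increase): no cached replica is wrong,
# so the cache never learns about n4.
after_extend = route_all(old, {"n1", "n2", "n3", "n4"})
```

In this model only the last case leaves the cache out of date, which
matches the RF increase scenario.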

It could be worked around to some degree in the drivers, but that would
add significant complexity (definitely more than any other invalidation
we have introduced) while still not being an ideal solution. This scenario
should be rare enough, and the consequences of not handling it minor
enough (new replicas not being used as coordinators), that it does not
warrant a driver-side solution. Instead, this commit adds a note about it
to the documentation, advising users to restart applications after replica
lists are extended.

It is worth noting that if the new tablet feedback protocol extension is
implemented, this problem goes away. See issue #21664.

Closes scylladb/scylladb#23447
Karol Baryła
2025-04-01 09:14:33 +02:00
committed by Tomasz Grabiec
parent cf11d5eb69
commit df64985a4e
2 changed files with 8 additions and 2 deletions


@@ -302,6 +302,7 @@ Modifying a keyspace with tablets enabled is possible and doesn't require any sp
- The ``ALTER`` statement may take longer than the regular query timeout, and even if it times out, it will continue to execute in the background.
- The replication strategy cannot be modified, as keyspaces with tablets only support ``NetworkTopologyStrategy``.
- The ``ALTER`` statement will fail if it would make the keyspace :term:`RF-rack-invalid <RF-rack-valid keyspace>`.
- After the ``ALTER`` statement that increases the RF finishes, client applications should be restarted. Without a restart, drivers will not know about new replicas, which may cause request imbalance.
.. _drop-keyspace-statement:


@@ -9,13 +9,16 @@ How to Safely Increase the Replication Factor
**Audience: ScyllaDB administrators**
Issue
-----
Issues
------
When a Replication Factor (RF) is increased, using the :ref:`ALTER KEYSPACE <alter-keyspace-statement>` command, the data consistency is effectively dropped
by the difference of the RF_new value and the RF_old value for all pre-existing data.
Consistency will only be restored after running a repair.
Another issue occurs in keyspaces with tablets enabled and is driver-related. Due to limitations in the current protocol used to pass tablet data to drivers, drivers will not pick
up new replicas after the replication factor is increased. As a result, they will not route requests to those replicas, which causes request imbalance.
Resolution
----------
@@ -27,6 +30,8 @@ As a result, in order to make sure that you can keep on reading the old data wit
After you run a repair, you can decrease the CL. If RF has only been changed in a particular Data Center (DC) only the nodes in that DC have to be repaired.
To resolve the driver-related issue, restart the client applications after the ALTER statement that changes the RF completes successfully.
Example
=======