From df64985a4eff504d3d99741ace2b599e40d16f2c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Karol=20Bary=C5=82a?=
Date: Tue, 1 Apr 2025 09:14:33 +0200
Subject: [PATCH] Docs: Describe driver issue with tablet RF increase

The current protocol extension that sends tablet info to drivers only
does so if the driver selects a non-replica coordinator for a routable
request. This works well if some node on the replica list is replaced
by another node, or if some replicas are removed from the list. The
driver will at some point send a request to a stale replica and receive
the new list in response.

The issue is with extending the list with new replicas. In that case
the old replicas are all still correct, so the driver will not select
any wrong replica and will not receive the new list. As far as I know,
the only scenario where this could happen is an RF increase.

It could be worked around to some degree in the drivers, but that would
add significant complexity (definitely more than any other
invalidations we introduced) while still not being an ideal solution.

This scenario should be rare enough, and the consequences of not
handling it minor enough (new replicas not being used as coordinators),
that it does not warrant a driver-side solution. Instead, this commit
adds info about this to the documentation, advising users to restart
applications after replica lists are extended.

It is worth noting that if the new tablet feedback protocol extension
is implemented, this problem goes away. See issue #21664.

Closes scylladb/scylladb#23447
---
 docs/cql/ddl.rst        | 1 +
 docs/kb/rf-increase.rst | 9 +++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/docs/cql/ddl.rst b/docs/cql/ddl.rst
index b93f3de00a..abfc60462d 100644
--- a/docs/cql/ddl.rst
+++ b/docs/cql/ddl.rst
@@ -302,6 +302,7 @@ Modifying a keyspace with tablets enabled is possible and doesn't require any sp
 - The ``ALTER`` statement may take longer than the regular query timeout, and even if it times out, it will continue to execute in the background.
 - The replication strategy cannot be modified, as keyspaces with tablets only support ``NetworkTopologyStrategy``.
 - The ``ALTER`` statement will fail if it would make the keyspace :term:`RF-rack-invalid `.
+- After the ``ALTER`` statement that increases the RF finishes, client applications should be restarted. Without a restart, drivers will not know about new replicas, which may cause request imbalance.
 
 .. _drop-keyspace-statement:
 
diff --git a/docs/kb/rf-increase.rst b/docs/kb/rf-increase.rst
index 8b0f4b5396..26f3a8481b 100644
--- a/docs/kb/rf-increase.rst
+++ b/docs/kb/rf-increase.rst
@@ -9,13 +9,16 @@ How to Safely Increase the Replication Factor
 
 **Audience: ScyllaDB administrators**
 
-Issue
------
+Issues
+------
 
 When a Replication Factor (RF) is increased, using the :ref:`ALTER KEYSPACE ` command, the data consistency is effectively dropped
 by the difference of the RF_new value and the RF_old value for all pre-existing data.
 Consistency will only be restored after running a repair.
 
+Another issue occurs in keyspaces with tablets enabled and is driver-related. Due to limitations in the current protocol used to pass tablet data to drivers, drivers will not pick
+up new replicas after replication factor is increased. This will cause them to avoid routing requests to those replicas, causing imbalance.
+
 Resolution
 ----------
 
@@ -27,6 +30,8 @@ As a result, in order to make sure that you can keep on reading the old data wit
 After you run a repair, you can decrease the CL.
 If RF has only been changed in a particular Data Center (DC) only the nodes in that DC have to be repaired.
 
+To resolve the driver-related issue, restart the client applications after the ALTER statement that changes the RF completes successfully.
+
 Example
 =======