A write to a base table can generate one or more writes to a materialized
view. The write to RF base replicas needs to cause writes to RF view
replicas. Our MV implementation, based on Cassandra's implementation,
does this via "pairing": Each one of the base replicas involved in this
write sends each view update to exactly one view replica. The function
get_view_natural_endpoint() tells a base replica which of the view
replicas it should send the update to.
The standard pairing is based on the ring order: The first owner of the
base token sends to the first owner of the view token, the second to the
second, and so on. However, the existing code also uses an optimization
we call self-pairing: If a single node is both a base replica and a view
replica, the pairing is modified so this node sends the update to itself.
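
To illustrate the two pairing schemes, here is a minimal Python sketch. It
is not the actual get_view_natural_endpoint() code: the function name, the
node names, and the assumption that both replica lists arrive in ring order
with equal length are all hypothetical. How the remaining (non-self-paired)
nodes are matched up under self-pairing is also an assumption of this
sketch: shared nodes are dropped from both lists and the rest are paired by
position.

```python
def paired_view_replica(me, base_replicas, view_replicas, self_pairing):
    """Return the view replica that base replica `me` sends its update to.

    Simplified model: both lists are in ring order and have RF entries.
    """
    if me not in base_replicas:
        raise ValueError(f"{me} is not a base replica")
    if self_pairing:
        # A node that is both a base and a view replica pairs with itself.
        if me in view_replicas:
            return me
        # Assumption: self-paired nodes are removed from both lists and
        # the remaining nodes are paired by position.
        shared = set(base_replicas) & set(view_replicas)
        base_replicas = [n for n in base_replicas if n not in shared]
        view_replicas = [n for n in view_replicas if n not in shared]
    # Standard pairing: i-th base replica sends to i-th view replica.
    return view_replicas[base_replicas.index(me)]


base = ["A", "B", "C"]
view = ["B", "D", "E"]
ring_order = {n: paired_view_replica(n, base, view, False) for n in base}
self_paired = {n: paired_view_replica(n, base, view, True) for n in base}
```

In this example, ring-order pairing sends B's update to D even though B is
itself a view replica, while self-pairing keeps B's update local.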
This patch *disables* the self-pairing optimization in keyspaces that
use tablets:
The self-pairing optimization can cause the pairing to change after
token ranges are moved between nodes, so it can break base-view consistency
in some edge cases, leading to "ghost rows". With tablets, these range
movements become even more frequent - they can happen even if the
cluster doesn't grow. This is why we want to solve this problem for tablets.
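
The following sketch shows, under a simplified model, how one range
movement can reshuffle the pairing when self-pairing is in effect. The
helper names and node names are hypothetical, and the rule for pairing the
non-shared nodes (by position, after removing shared nodes) is an
assumption, not a description of the real code.

```python
def pairing(base, view, self_pairing):
    """Map each base replica to its paired view replica (simplified)."""
    if not self_pairing:
        # Ring-order pairing: i-th base replica -> i-th view replica.
        return dict(zip(base, view))
    shared = [n for n in base if n in view]
    result = {n: n for n in shared}  # shared nodes pair with themselves
    base_rest = [n for n in base if n not in shared]
    view_rest = [n for n in view if n not in shared]
    result.update(zip(base_rest, view_rest))
    return result


def changed(before, after):
    """Base replicas whose paired view replica differs after the move."""
    return sorted(n for n in before if before[n] != after[n])


base = ["A", "B", "C"]
view_before = ["D", "E", "F"]
view_after = ["D", "E", "A"]  # one view replica moved from F to A

moved_with_self_pairing = changed(pairing(base, view_before, True),
                                  pairing(base, view_after, True))
moved_with_ring_order = changed(pairing(base, view_before, False),
                                pairing(base, view_after, False))
```

In this model, moving a single view replica changes the partner of all
three base replicas under self-pairing, but only of C under plain ring-order
pairing. A base replica whose partner changes between two updates to the
same row is the kind of situation that can leave a stale view row behind.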
For backward compatibility and to avoid sudden inconsistencies emerging
during upgrades, we decided to continue using the self-pairing optimization
for keyspaces that are *not* using tablets (i.e., using vnodes).
Currently, we don't introduce a "CREATE MATERIALIZED VIEW" option to
override these defaults - i.e., we don't provide a way to disable
self-pairing with vnodes or to enable it with tablets. We could introduce
such a schema flag later, if we ever want to (and I'm not sure we want to).
Note that in the tablets case, this change can affect when view updates
are synchronous.
For example:
* If we have 3 nodes and RF=3, with the self-pairing optimization each
node is paired with itself, the view update is local, and is
implicitly synchronous (without requiring a "synchronous_updates"
flag).
* In the same setup with tablets, without the self-pairing optimization
(due to this patch), this is not guaranteed. Some view updates may not
be synchronous, i.e., the base write will not wait for the view
write. If the user really wants synchronous updates, they should
be requested explicitly, with the "synchronous_updates" view option.
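
The rule described in the two bullets above can be sketched as follows.
This is only an illustration: the function name and arguments are
hypothetical, and it models just the two conditions mentioned here (a
local, self-paired update, or the "synchronous_updates" view option).

```python
def base_write_waits(me, paired, synchronous_updates=False):
    """True if the base write waits for the view update (simplified)."""
    # A view update sent to the same node is local, hence implicitly
    # synchronous; otherwise it waits only if explicitly requested.
    return paired == me or synchronous_updates


base = ["A", "B", "C"]

# 3 nodes, RF=3, self-pairing: every node is paired with itself.
with_self_pairing = {n: base_write_waits(n, n) for n in base}

# Same setup without self-pairing: the view replica order may differ,
# so a base replica may be paired with a different node.
view = ["B", "C", "A"]
without_self_pairing = {n: base_write_waits(n, v)
                        for n, v in zip(base, view)}

# Explicitly requesting synchronous updates restores the waiting.
explicit = {n: base_write_waits(n, v, synchronous_updates=True)
            for n, v in zip(base, view)}
```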
Fixes #16260.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#16272