Commit Graph

515 Commits

Author SHA1 Message Date
Kefu Chai
a439ebcfce treewide: include fmt/ranges.h and/or fmt/std.h
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:56:16 +08:00
Kamil Braun
9c2a836607 Revert "Merge 'Drain view_builder in generic drain' from ScyllaDB"
This reverts commit 298a7fcbf2, reversing
changes made to 5cf53e670d.

The change made CI flaky.

Fixes: scylladb/scylladb#18278
2024-04-18 11:50:41 +02:00
Pavel Emelyanov
90593f4e82 view_builder: Generalize mark_as_built(view_ptr) method
Marking is performed in two places and they can be generalized

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-05 19:56:12 +03:00
Pavel Emelyanov
3c3f2cd337 view_builder: Move mark_existing_views_as_built from storage service
Now it's in the correct component

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-05 19:56:11 +03:00
Wojciech Mitros
9789a3dc7c mv: keep semaphore units alive until the end of a remote view update
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the semaphore transferring so that in case
of both local and remote endpoints, both view updates share the
units, causing them to be released only after the update that
takes longer finishes.

Fixes #17890

Closes scylladb/scylladb#17891
2024-03-25 19:43:58 +02:00
Avi Kivity
72bbe75d5b Merge 'Fix node replace with tablets for RF=N' from Tomasz Grabiec
This PR fixes a problem with replacing a node with tablets when
RF=N. Currently, this will fail because tablet replica allocation for
rebuild will not be able to find a viable destination, as the replacing node
is not considered to be a candidate. It cannot be a candidate because
replace rolls back on failure and we cannot roll back after tablets
were migrated.

The solution taken here is to not drain tablet replicas from replaced
node during topology request but leave it to happen later after the
replaced node is in left state and replacing node is in normal state.

The replacing node waits for this draining to be complete on boot
before the node is considered booted.

Fixes https://github.com/scylladb/scylladb/issues/17025

Nodes in the left state will be kept in tablet replica sets for a while after node
replace is done, until the new replica is rebuilt. So we need to know
about those node's location (dc, rack) for two reasons:

 1) algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first.

 2) tablet scheduler needs to identify each node's location in order to make decisions about new replica placement.

It's ok to not know the IP, and we don't keep it. Those nodes will not
be present in the IP-based replica sets, e.g. those returned by
get_natural_endpoints(), only in host_id-based replica
sets. storage_proxy request coordination is not affected.

Nodes in the left state are still not present in token ring, and not
considered to be members of the ring (datacanter endpoints excludes them).

In the future we could make the change even more transparent by only
loading locator::node* for those nodes and keeping node* in tablet replica sets.

Currently left nodes are never removed from topology, so will
accumulate in memory. We could garbage-collect them from topology
coordinator if a left node is absent in any replica set. That means we
need a new state - left_for_real.

Closes scylladb/scylladb#17388

* github.com:scylladb/scylladb:
  test: py: Add test for view replica pairing after replace
  raft, api: Add RESTful API to query current leader of a raft group
  test: test_tablets_removenode: Verify replacing when there is no spare node
  doc: topology-on-raft: Document replace behavior with tablets
  tablets, raft topology: Rebuild tablets after replacing node is normal
  tablets: load_balancer: Access node attributes via node struct
  tablets: load_balancer: Extract ensure_node()
  mv: Switch to using host_id-based replica set
  effective_replication_map: Introduce host_id-based get_replicas()
  raft topology: Keep nodes in the left state to topology
  tablets: Introduce read_required_hosts()
2024-03-18 16:16:08 +02:00
Wojciech Mitros
efcb718e0a mv: adjust memory tracking of single view updates within a batch
Currently, when dividing memory tracked for a batch of updates
we do not take into account the overhead that we have for processing
every update. This patch adds the overhead for single updates
and joins the memory calculation path for batches and their parts
so that both use the same overhead.

Fixes #17854

Closes scylladb/scylladb#17855
2024-03-18 14:31:54 +02:00
Tomasz Grabiec
9b656ec2aa mv: Switch to using host_id-based replica set
This is necessary to not break replica pairing between base and
view. After replacing a node, tablet replica set contains for a while
the replaced node which is in the left state. This node is not
returned by the IP-based get_natural_endpoints() so the replica
indexes would shift, changing the pairing with the view.

The host_id-based replica set always has stable indexes for replicas.
2024-03-15 11:05:29 +01:00
Pavel Emelyanov
3a734facc7 view_builder: Complete build step early if reader produces nothing
Builder works in "steps". Each step runs for a given base table, when a
new view is created it either initiates a step or appends to currently
running step.

Running a step means reading mutations from local sstables reader and
applying them to all views that has jumped into this step so far. When a
view is added to the step it remembers the current token value the step
is on. When step receives end-of-stream it rewinds to minimal-token.
Rewinding is done by closing current reader and creating a new one. Each
time token is advanced, all the views that meet the new token value for
the second time (i.e. -- scan full round) are marked as built and are
removed from step. When no views are left on step, it finishes.

The above machinery can break when rewinding the end-of-stream reader.
The trick is that a running step silently assumes that if the reader
once produced some token (and there can be a view that remembered this
token as its starting one), then after rewinding the reader would
generate the same token or greater. With tablets, however, that's not
the case. When a node is decommissioned tablets are cleaned and all
sstables are removed. Rewinding a reader after it makes empty reader
that produces no tokens from now on. Respectively, any build steps that
had captured tokens prior to cleanup would get stuck forever.

The fix is to check if the mutation consumer stepped at least one step
forward after rewind, and if no -- complete all the attached views.

fixes: #17293

Similar thing should happen if the base table is truncated with views
being built from it. Testing it steps on compaction assertion elsewhere
and needs more research.

refs: #17543

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#17548
2024-03-12 14:58:47 +02:00
Nadav Har'El
19bcea6216 materialized views: fix rare failure caused by empty update
This one-line patch fixes a failure in the dtest

        lwt_schema_modification_test.py::TestLWTSchemaModification
        ::test_table_alter_delete

Where an update sometimes failed due to an internal server error, and the
log had the mysterious warning message:

        "std::logic_error (Empty materialized view updated)"

We've also seen this log-message in the past in another user's log, and
never understood what it meant.

It turns out that the error message was generated (and warning printed)
while building view updates for a base-table mutation, and noticing that
the base mutation contains an *empty* row - a row with no cells or
tombstone or anything whatsoever. This case was deemed (8 years ago,
in d5a61a8c48) unexpected and nonsensical,
and we threw an exception. But this case actually *can* happen - here is
how it happened in test_table_alter_delete - which is a test involving
a strange combination of materialized views, LWT and schema changes:

 1. A table has a materialized view, and also a regular column "int_col".
 2. A background thread repeatedly drops and re-creates this column
    int_col.
 3. Another thread deletes rows with LWT ("IF EXISTS").
 4. These LWT operations each reads the existing row, and because of
    repeated drop-and-recreate of the "int_col" column, sometimes this
    read notices that one node has a value for int_col and the other
    doesn't, and creates a read-repair mutation setting int_col (the
    difference between the two reads includes just in this column).
 5. The node missing "int_col" receives this mutation which sets only
    int_col. It upgrade()s this mutation to its most recent schema,
    which doesn't have int_col, so it removes this column from the
    mutation row - and is left with a completely empty mutation row.
    This completely empty row is not useful, but upgrade() doesn't
    remove it.
 6. The view-update generation code sees this empty base-mutation row
    and fails it with this std::logic_error.
 7. The node which sent the read-repair mutation sees that the read
    repair failed, so it fails the read and therefore fails the LWT
    delete operation.
    It is this LWT operation which failed in the test, and caused
    the whole test to fail.

The fix is trivial: an empty base-table row mutation should simply be
*ignored* when generating view updates - it shouldn't cause any error.

Before this patch, test_table_alter_delete used to fail in roughly
20% of the runs on my laptop. After this patch, I ran it 100 times
without a single failure.

Fixes #15228
Fixes #17549

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17607
2024-03-07 12:00:43 +02:00
Avi Kivity
51df8b9173 interval: rename nonwrapping_interval to interval
Our interval template started life as `range`, and was supported
wrapping to follow Cassandra's convention of wrapping around the
maximum token.

We later recognized that an interval type should usually be non-wrapping
and split it into wrapping_range and nonwrapping_range, with `range`
aliasing wrapping_range to preserve compatibility.

Even later, we realized the name was already taken by C++ ranges and
so renamed it to `interval`. Given that intervals are usually non-wrapping,
the default `interval` type is non-wrapping.

We can now simplify it further, recognizing that everyone assumes
that an interval is non-wrapping and so doesn't need the
nonwrapping_interval_designation. We just rename nonwrapping_interval
to `interval` and remove the type alias.
2024-02-21 19:43:17 +02:00
Avi Kivity
605bf6e221 range.hh: retire
range.hh was deprecated in bd794629f9 (2020) since its names
conflict with the C++ library concept of an iterator range. The name
::range also mapped to the dangerous wrapping_interval rather than
nonwrapping_interval.

Complete the deprecation by removing range.hh and replacing all the
aliases by the names they point to from the interval library. Note
this now exposes uses of wrapping intervals as they are now explicit.

The unit tests are renamed and range.hh is deleted.

Closes scylladb/scylladb#17428
2024-02-21 00:24:25 +02:00
Botond Dénes
7f17d3bb0e replica/database: keyspace: add uses_tablets()
Mirroring table::uses_tablets(), provides a convenient and -- more
importabtly -- easily discoverable way to determine whether the keyspace
uses tablets or not.
This information is of course already available via the abstract
replication strategy, but as seen in a few examples, this is not easily
discoverable and sometimes people resorted to enumerating the keyspace's
tables to be able to invoke table::uses_tablets().
2024-02-15 01:51:26 -05:00
Nadav Har'El
21e7deafeb alternator, mv: fix case of two new key columns in GSI
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.

We made an exception for this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!

We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.

The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.

Fixes #17119

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17172
2024-02-12 13:17:29 +02:00
Nadav Har'El
14315fcbc3 mv: fix missing view deletions in some cases of range tombstones
For efficiency, if a base-table update generates many view updates that
go the same partition, they are collected as one mutation. If this
mutation grows too big it can lead to memory exhaustion, so since
commit 7d214800d0 we split the output
mutation to mutations no longer than 100 rows (max_rows_for_view_updates)
each.

This patch fixes a bug where this split was done incorrectly when
the update involved range tombstones, a bug which was discovered by
a user in a real use case (#17117).

Range tombstones are read in two parts, a beginning and an end, and the
code could split the processing between these two parts and the result
that some of the range tombstones in update could be missed - and the
view could miss some deletions that happened in the base table.

This patch fixes the code in two places to avoid breaking up the
processing between range tombstones:

1. The counter "_op_count" that decides where to break the output mutation
   should only be incremented when adding rows to this output mutation.
   The existing code strangely incrmented it on every read (!?) which
   resulted in the counter being incremented on every *input* fragment,
   and in particular could reach the limit 100 between two range
   tombstone pieces.

2. Moreover, the length of output was checked in the wrong place...
   The existing code could get to 100 rows, not check at that point,
   read the next input - half a range tombstone - and only *then*
   check that we reached 100 rows and stop. The fix is to calculate
   the number of rows in the right place - exactly when it's needed,
   not before the step.

The first change needs more justification: The old code, that incremented
_op_count on every input fragment and not just output fragments did not
fit the stated goal of its introduction - to avoid large allocations.
In one test it resulted in breaking up the output mutation to chunks of
25 rows instead of the intended 100 rows. But, maybe there was another
goal, to stop the iteration after 100 *input* rows and avoid the possibility
of stalls if there are no output rows? It turns out the answer is no -
we don't need this _op_count increment to avoid stalls: The function
build_some() uses `co_await on_results()` to run one step of processing
one input fragment - and `co_await` always checks for preemption.
I verfied that indeed no stalls happen by using the existing test
test_long_skipped_view_update_delete_with_timestamp. It generates a
very long base update where all the view updates go to the same partition,
but all but the last few updates don't generate any view updates.
I confirmed that the fixed code loops over all these input rows without
increasing _op_count and without generating any view update yet, but it
does NOT stall.

This patch also includes two tests reproducing this bug and confirming
its fixed, and also two additional tests for breaking up long deletions
that I wanted to make sure doesn't fail after this patch (it doesn't).

By the way, this fix would have also fixed issue #12297 - which we
fixed a year ago in a different way. That issue happend when the code
went through 100 input rows without generating *any* output rows,
and incorrectly concluding that there's no view update to send.
With this fix, the code no longer stops generating the view
update just because it saw 100 input rows - it would have waited
until it generated 100 output rows in the view update (or the
input is really done).

Fixes #17117

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17164
2024-02-06 14:57:33 +02:00
Avi Kivity
7cb1c10fed treewide: replace seastar::future::get0() with seastar::future::get()
get0() dates back from the days where Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.

Replace with seastar::future::get(), which does the same thing.
2024-02-02 22:12:57 +08:00
Nadav Har'El
1bcaeb89c7 view: revert cleanup filter that doesn't work with tablets
This patch reverts commit 10f8f13b90 from
November 2022. That commit added to the "view update generator", the code
which builds view updates for staging sstables, a filter that ignores
ranges that do not belong to this node. However,

1. I believe this filter was never necessary, because the view update
   code already silently ignores base updates which do not belong to
   this replica (see get_view_natural_endpoint()). After all, the view
   update needs to know that this replica is the Nth owner of the base
   update to send its update to the Nth view replica, but if no such
   N exists, no view update is sent.

2. The code introduced for that filter used a per-keyspace replication
   map, which was ok for vnodes but no longer works for tablets, and
   causes the operation using it to fail.

3. The filter was used every time the "view update generator" was used,
   regardless of whether any cleanup is necessary or not, so every
   such operation would fail with tablets. So for example the dtest
   test_mvs_populating_from_existing_data fails with tablets:
     * This test has view building in parallel with automatic tablet
       movement.
     * Tablet movement is streaming.
     * When streaming happens before view building has finished, the
       streamed sstables get "view update generator" run on them.
       This causes the problematic code to be called.

Before this patch, the dtest test_mvs_populating_from_existing_data
fails when tablets are enabled. After this patch, it passes.

Fixes #16598

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-01-14 13:24:44 +02:00
Nadav Har'El
0fe40f729e mv: sleep a bit before view-update-generator restart
The "view update generator" is responsible for generating view updates
for staging sstables (such as coming from repair). If the processing
fails, the code retries - immediately. If there is some persistent bug,
such as issue #16598, we will have a tight loop of error messages,
potentially a gigabyte of identical messages every second.

In this patch we simply add a sleep of one second after view update
generation fails before retrying. We can still get many identical
error messages if there is some bug, but not more than one per second.

Refs #16598.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-01-14 13:13:52 +02:00
Kefu Chai
be364d30fd db: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16664
2024-01-09 11:44:19 +02:00
Avi Kivity
6394854f04 Merge 'Some cleanups in tests for tablets + MV ' from Nadav Har'El
This small series improves two things in the multi-node tests for tablet supports in materialized views:

1. The test for Alternator LSI, which "sometimes" could reproduce the bug by creating 10-node cluster with a random tablet distribution, is replaced by a reliable 2-node cluster which controls the tablet distribution. The new test also confirms that tablets are actually enabled in Alternator (reviewers of the original test noted it would be easy to pass the test if tablets were accidentally not enabled... :-)).
2. Simplify the tablet lookup code in the test to not go through a "table id", and lookup the table's (or view's) name directly (requires a full-table of the tablets table, but that's entirely reasonable in a test).

The third patch in this series also fixes a comment typo discovered in a previous review.

Closes scylladb/scylladb#16440

* github.com:scylladb/scylladb:
  materialized views: fix typo in comment
  test_mv_tablets: simplify lookup of tablets
  alternator, tablets: improve Alternator LSI tablets test
2023-12-27 20:18:14 +02:00
Benny Halevy
060b16f987 view: apply_to_remote_endpoints: fix use-after-free
b815aa021c added a yield before
the trace point, causing the moved `frozen_mutation_and_schema`
(and `inet_address_vector_topology_change`) to drop out of scope
and be destroyed, as the rvalue-referenced objects aren't moved
onto the coroutine frame.

This change passes them by value rather than by rvalue-reference
so they will be stored in the coroutine frame.

Fixes #16540

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#16541
2023-12-24 21:43:48 +02:00
Nadav Har'El
6640278aa7 materialized views: fix typo in comment
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-12-24 10:12:44 +02:00
Nadav Har'El
b815aa021c mv, test: fix delay_before_remote_view_update injection point
The "delay_before_remote_view_update" is a recently-added injection
point which should add a delay before remove view updates, but NOT
force the writer to wait for it (whether the writer waits for it or
not depends on whether the view is configured as synchronous or not).

Unfortunately, the delay was added at the WRONG place, which caused
it to sometimes be done even on asynchronous views, breaking (with
false-negative) the tests that need this delay to reproduce bugs of
missing synchronous updates (Refs #16371).

The fix here is even simpler then the (wrong) old code - we just add
the sleep to the existing function apply_to_remote_endpoints() instead
of making the caller even more complex.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-12-21 11:44:50 +02:00
Nadav Har'El
37b5c03865 mv: coroutinize wait code for remote view updates
In the previous patch we added a delay injection point (for testing)
in the view update code. Because the code was using continuation style,
this resulted in increased indentation and ugly repetition of captures.

So in this patch we coroutinize the code that waits for remote view
updates, making it simpler, shorter, and less indented.

Note that this function still uses continuations in one place:
The remote view update is still composed of two steps that need
to happen one after another, but we don't necessarily need to wait
for them to happen. This is easiest to do with chaining continuations,
and then either waiting or not waiting for the resulting future.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-12-17 20:15:08 +02:00
Nadav Har'El
bf6848d277 mv, test: add injection point to delay remove view update
It's difficult to write a test (as we plan to do in to in the next patch)
that verifies that synchronous view updates are indeed synchronous, i.e.,
that write with CL=QUORUM on the base-table write returns only after
CL=QUORUM was also achieved in the view table. The difficulty is that in a
fast test machine, even if the synchronous-view-update is completely buggy,
it's likely that by the time the test reads from the view, all view updates
will have been completed anyway.

So in this patch we introduce an injection point, for testing, named
"delay_before_remote_view_update", which adds a delay before the base
replica sends its update to the remote view replica (in case the view
replica is indeed remote). As usual, this injection point isn't
configurable - when enabled it adds a fixed (0.5 second) delay, on all
view updates on all tables.

The existing code used continuation-style Seastar programming, and the
addition of the injection point in this patch made it even uglier, so
in the next patch we will coroutine-ize this code.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2023-12-17 20:15:08 +02:00
Petr Gusev
7b55ccbd8e token_metadata: drop the template
Replace token_metadata2 ->token_metadata,
make token_metadata back non-template.

No behavior changes, just compilation fixes.
2023-12-12 23:19:54 +04:00
Petr Gusev
e50dbef3e2 database: get_token_metadata -> new token_metadata
database::get_token_metadata() is switched to token_metadata2.

get_all_ips method is added to the host_id-based token_metadata, since
its convenient and will be used in several places. It returns all current
nodes converted to inet_address by means of the topology
contained within token_metadata.

hint_sender::can_send: if the node has already left the
cluster we may not find its host_id. This case is handled
in the same way as if it's not a normal token owner - we
simply send a hint to all replicas.
2023-12-12 23:19:53 +04:00
Petr Gusev
5a1418fdba token_metadata: get_endpoint_for_host_id -> get_endpoint_for_host_id_if_known
This commit fixes an inconsistency in method names:
get_host_id and get_host_id_if_known are
(internal_error, returns null), but there was only
one method for the opposite conversion - get_endpoint_for_host_id,
and it returns null. In this commit we change it to on_internal_error
if it can't find the argument and add another method
get_endpoint_for_host_id_if_known which returns null in this case.

We can't use get_endpoint_for_host_id/get_host_id
in host_id_or_endpoint::resolve since it's called
from storage_service::parse_node_list
-> token_metadata::parse_host_id_and_endpoint,
and exceptions are caught and handled in
`storage_service::parse_node_list`.
2023-12-11 12:51:34 +04:00
Petr Gusev
63f64f3303 token_metadata: make it a template with NodeId=inet_address/host_id
NodeId is used in all internal token_metadata data structures, that
previously used inet_address. We choose topology::key_kind based
on the value of the template parameter.

generic_token_metadata::update_topology overload with host_id
parameter is added to make update_topology_change_info work,
it now uses NodeId as a parameter type.

topology::remove_endpoint(host_id) is added to make
generic_token_metadata::remove_endpoint(NodeId) work.

pending_endpoints_for and endpoints_for_reading are just removed - they
are not used and not implemented. The declarations were left by mistake
from a refactoring in which these methods were moved to erm.

generic_token_metadata_base is extracted to contain declarations, common
to both token_metadata versions.

Templates are explicitly instantiated inside token_metadata.cc, since
implementation part is also a template and it's not exposed to the header.

There are no other behavioral changes in this commit, just syntax
fixes to make token_metadata a template.
2023-12-11 12:51:34 +04:00
Nadav Har'El
300e549267 tablets, mv: disable self-pairing when tablets are used
A write to a base table can generate one or more writes to a materialized
view. The write to RF base replicas need to cause writes to RF view
replicas. Our MV implementation, based on Cassandra's implementation,
does this via "pairing": Each one of the base replicas involved in this
write sends each view update to exactly one view replica. The function
get_view_natural_endpoint() tells a base replica which of the view
replicas it should send the update to.

The standard pairing is based on the ring order: The first owner of the
base token sends to the first owner of the view token, the second to the
second, and so on. However, the existing code also uses an optimization
we call self-pairing: If a single node is both a base replica and a base
replica, the pairing is modified so this node sends the update to itself.

This patch *disables* the self-pairing optimization in keyspaces that
use tablets:

The self-pairing optimization can cause the pairing to change after
token ranges are moved between nodes, so it can break base-view consistency
in some edge cases, leading to "ghost rows". With tablets, these range
movements become even more frequent - they can happen even if the
cluster doesn't grow.  This is why we want to solve this problem for tablets.

For backward compatibility and to avoid sudden inconsistencies emerging
during upgrades, we decided to continue using the self-pairing optimization
for keyspaces that are *not* using tablets (i.e., using vnoodes).

Currently, we don't introduce a "CREATE MATERIALIZED VIEW" option to
override these defaults - i.e., we don't provide a way to disable
self-pairing with vnodes or to enable them with tablets. We could introduce
such a schema flag later, if we ever want to (and I'm not sure we want to).

It's important to note, that in some cases, this change has implications
on when view updates become synchronous, in the tablets case.
For example:

  * If we have 3 nodes and RF=3, with the self-pairing optimization each
    node is paired with itself, the view update is local, and is
    implicitly synchronous (without requiring a "synchronous_updates"
    flag).
  * In the same setup with tablets, without the self-pairing optimization
    (due to this patch), this is not guaranteed. Some view updates may not
    be synchronous, i.e., the base write will not wait for the view
    write. If the user really wants synchronous updates, they should
    be requested explicitly, with the "synchronous_updates" view option.

Fixes #16260.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#16272
2023-12-06 17:11:17 +02:00
Benny Halevy
63b556123b db/view: use locator::topology rather than fb_utilities
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-12-05 08:55:46 +02:00
Nadav Har'El
4505a86f46 tablets, mv: fix base-view pairing to consider base replication map
In the view update code, the function get_view_natural_endpoint()
determines which view replica this base replica should send an update
to. It currently gets the *view* table's replication map (i.e., the map
from view tokens to lists of replicas holding the token), but assumes
that this is also the *base* table's replication map.

This assumption was true with vnodes, but is no longer true with
tablets - the base table's replication map can be completely different
from the view table's. By looking at the wrong mapping,
get_view_natural_endpoint() can believe that this node isn't really
a base-replica and drop the view update. Alternatively, it can think
it is a base replica - but use the wrong base-view pairing and create
base-view inconsistencies.

This patch solves this bug - get_view_natural_endpoint() now gets two
separate replication maps - the base's and the view's. The callers
need to remember what the base table was (in some cases they didn't
care at the point of the call), and pass it to the function call.

This patch also includes a simple test that reproduces the bug, and
confirms it is fixed: The test has a 6-node cluster using tablets
and a base table with RF=1, and writes one row to it. Before this
patch, the code usually gets confused, thinking the base replica
isn't a replica and loses the view update. With this patch, the
view update works.

Fixes #16227.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#16228
2023-12-04 16:38:54 +02:00
Yaniv Kaul
2b73793a39 Update db/view/view.cc 2023-12-03 10:07:45 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Nadav Har'El
62f89d49e5 tablets, mv: fix on_internal_error on write to base table
This situation before this patch is that when tablets are enabled for
a keyspace, we can create a materialized view but later any write to
the base table fails with an on_internal_error(), saying that:

     "Tried to obtain per-keyspace effective replication map of test
      but it's per-table."

Indeed, with tablets, the replication is different for each table - it's
not the same for the entire keyspace.

So this patch changes the view update code to take the replication
map from the specific base table, not the keyspace.

This is good enough to get materialized-views reads and writes working
in a simple single-node case, as the included test demonstrates (the
test fails with on_internal_error() before this patch, and passes
afterwards).

But this fix is not perfect - the base-view pairing code really needs
to consider not only the base table's replication map, but also the
view table's replication map - as those can be different. We'll fix
this remaining problem as a followup in a separate patch - it will
require a substantially more elaborate test to reproduce the need
for the different mapping and to verify that fix.

Fixes #16209.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#16211
2023-11-29 15:29:17 +01:00
Pavel Emelyanov
3471f30b58 view_update_generator: Unplug from database later
Patch 967ebacaa4 (view_update_generator: Move abort kicking to
do_abort()) moved unplugging v.u.g from database from .stop() to
.do_abort(). The latter call happens very early on stop -- once scylla
receives SIGINT. However, database may still need v.u.g. plugged to
flush views.

This patch moves unplug to later, namely to .stop() method of v.u.g.
which happens after database is drained and should no longer continue
view updates.

fixes: #16001

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#16091
2023-11-20 11:47:55 +02:00
Benny Halevy
a1acf6854b everywhere: reduce dependencies on i_partitioner.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-05 20:47:44 +02:00
Avi Kivity
ef7db6df99 Merge 'schema_tables: turn view schema fixing code into a sanity check' from Kamil Braun
The purpose of `maybe_fix_legacy_secondary_index_mv_schema` was to deal
with legacy materialized view schemas used for secondary indexes,
schemas which were created before the notion of "computed columns" was
introduced. Back then, secondary index schemas would use a regular
"token" column. Later it became a computed column and old schemas would
be migrated during rolling upgrade.

The migration code was introduced in 2019
(db8d4a0cc6) and then fixed in 2020
(d473bc9b06).

The fix was present in Enterprise 2022.1 and in OSS 4.5. So, assuming
that users don't try crazy things like upgrading from 2021.X to 2023.X
(which we do not support), all clusters will have already executed the
migration code once they upgrade to 2023.X, meaning we can get rid of
it.

The main motivation of this PR is to get rid of the
`db::schema_tables::merge_schema` call in `parse_schema_tables`. In Raft
mode this was the only call to `merge_schema` outside "group 0 code" and
in fact it is unsafe -- it uses locally generated mutations with locally
generated timestamp (`api::new_timestamp()`), so if we actually did it,
we would permanently diverge the group 0 state machine across nodes
(the schema pulling code is disabled in Raft mode). Fortunately, this
should be dead code by now, as explained in the previous paragraph.

The migration code is now turned into a sanity check, if the users
try something crazy, they will get an error instead of silent data
corruption.

Closes scylladb/scylladb#15695

* github.com:scylladb/scylladb:
  view: remove unused `_backing_secondary_index`
  schema_tables: turn view schema fixing code into a sanity check
  schema_tables: make comment more precise
  feature_service: make COMPUTED_COLUMNS feature unconditionally true
2023-10-31 13:23:19 +02:00
Marcin Maliszkiewicz
020a9c931b db: view: run local materialized view mutations on a separate smp service group
When base write triggers mv write and it needs to be send to another
shard it used the same service group and we could end up with a
deadlock.

This fix affects also alternator's secondary indexes.

Testing was done using (yet) not committed framework for easy alternator
performance testing: https://github.com/scylladb/scylladb/pull/13121.
I've changed hardcoded max_nonlocal_requests config in scylla from 5000 to 500 and
then ran:

./build/release/scylla perf-alternator-workloads --workdir /tmp/scylla-workdir/ --smp 2 \
--developer-mode 1 --alternator-port 8000 --alternator-write-isolation forbid --workload write_gsi \
--duration 60 --ring-delay-ms 0 --skip-wait-for-gossip-to-settle 0 --continue-after-error true --concurrency 2000

Without the patch when scylla is overloaded (i.e. number of scheduled futures being close to max_nonlocal_requests) after couple seconds
scylla hangs, cpu usage drops to zero, no progress is made. We can confirm we're hitting this issue by seeing under gdb:

p seastar::get_smp_service_groups_semaphore(2,0)._count
$1 = 0

With the patch I wasn't able to observe the problem, even with 2x
concurrency. I was able to make the process hang with 10x concurrency
but I think it's hitting different limit as there wasn't any depleted
smp service group semaphore and it was happening also on non mv loads.

Fixes https://github.com/scylladb/scylladb/issues/15844

Closes scylladb/scylladb#15845
2023-10-29 18:30:32 +02:00
Kamil Braun
db49ccccb0 view: remove unused _backing_secondary_index
This boolean was only used for a sanity check which was replaced with a
stronger sanity check in the previous commit that doesn't require the
boolean.
2023-10-24 13:33:36 +02:00
Kamil Braun
3976808b12 schema_tables: turn view schema fixing code into a sanity check
The purpose of `maybe_fix_legacy_secondary_index_mv_schema` was to deal
with legacy materialized view schemas used for secondary indexes,
schemas which were created before the notion of "computed columns" was
introduced. Back then, secondary index schemas would use a regular
"token" column. Later it became a computed column and old schemas would
be migrated during rolling upgrade.

The migration code was introduced in 2019
(db8d4a0cc6) and then fixed in 2020
(d473bc9b06).

The fix was present in Enterprise 2022.1 and in OSS 4.5. So, assuming
that users don't try crazy things like upgrading from 2021.X to 2023.X
(which we do not support), all clusters will have already executed the
migration code once they upgrade to 2023.X, meaning we can get rid of
it.

The main motivation of this patch is to get rid of the
`db::schema_tables::merge_schema` call in `parse_schema_tables`. In Raft
mode this was the only call to `merge_schema` outside "group 0 code" and
in fact it is unsafe -- it uses locally generated mutations with locally
generated timestamp (`api::new_timestamp()`), so if we actually did it,
we would permanently diverge the group 0 state machine across nodes
(the schema pulling code is disabled in Raft mode). Fortunately, this
should be dead code by now, as explained in the previous paragraph.

The migration code is now turned into a sanity check, if the users
try something crazy, they will get an error instead of silent data
corruption.
2023-10-24 13:33:35 +02:00
Jan Ciolek
940e44f887 db/view: change log level of failed view updates to WARN
When a remote view update doesn't succeed there's a log message
saying "Error applying view update...".
This message had log level ERROR, but it's not really a hard error.
View updates can fail for a multitude of reasons, even during normal operation.
A failing view update isn't fatal, it will be saved as a view hint a retried later.

Let's change the log level to WARN. It's something that shouldn't happen too much,
but it's not a disaster either.
ERROR log level causes trouble in tests which assume that an ERROR level message
means that the test has failed.

Refs: https://github.com/scylladb/scylladb/issues/15046#issuecomment-1712748784

For local view updates the log level stays at "ERROR", local view updates shouldn't fail.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes scylladb/scylladb#15640
2023-10-11 18:19:23 +03:00
Pavel Emelyanov
becd960ae8 view_update_generator: Add logging to do_abort()
Just tell the logs that the guy is aborting
refs: #10941

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-21 13:34:21 +03:00
Pavel Emelyanov
967ebacaa4 view_update_generator: Move abort kicking to do_abort()
When v.u.g. stops is first aborts the generation background fiber by
requesting abort on the internal abort source and signalling the fiber
in case it's waiting. Right now v.u.g.::stop() is defer-scheduled last
in main(), so this move doesn't change much -- when stop_signal fires,
it will kick the v.u.g.::do_abort() just a bit earlier, there's nothing
that would happen after it before real ::stop() is called that depends
on it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-21 13:32:45 +03:00
Pavel Emelyanov
e34220ebb7 view_update_generator: Add early abort subscription
Subscribe v.u.g. to the main's stop_signal. For now a no-op callback.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-21 13:32:45 +03:00
Benny Halevy
2c54d7a35a view, storage_proxy: carry effective_replication_map along with endpoints
When sending mutation to remote endpoint,
the selected endpoints must be in sync with
the current effective_replication_map.

Currently, the endpoints are sent down the storage_proxy
stack, and later on an effective_replication_map is retrieved
again, and it might not match the target or pending endpoints,
similar to the case seen in https://github.com/scylladb/scylladb/issues/15138

The correct way is to carry the same effective replication map
used to select said endpoints and pass it down the stack.
See also https://github.com/scylladb/scylladb/pull/15141

Fixes scylladb/scylladb#15144
Fixes scylladb/scylladb#14730

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15142
2023-08-29 09:08:42 +03:00
Benny Halevy
8fbcf1ab9f view: start: ignore also abort_requested_exception
We see the abort_requested_exception error from time
to time, instead of sleep_aborted that was expected
and quietly ignored (in debug log level).

Treat abort_requested_exception the same way since
the error is expected on shutdown and to reduce
test flakiness, as seen for example, in
https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3033/artifact/logs-full.release.010/1691896356104_repair_additional_test.py%3A%3ATestRepairAdditional%3A%3Atest_repair_schema/node2.log
```
INFO  2023-08-13 03:12:29,151 [shard 0] compaction_manager - Asked to stop
WARN  2023-08-13 03:12:29,152 [shard 0] gossip - failure_detector_loop: Got error in the loop, live_nodes={}: seastar::sleep_aborted (Sleep is aborted)
INFO  2023-08-13 03:12:29,152 [shard 0] gossip - failure_detector_loop: Finished main loop
WARN  2023-08-13 03:12:29,152 [shard 0] cdc - Aborted update CDC description table with generation (2023/08/13 03:12:17, d74aad4b-6d30-4f22-947b-282a6e7c9892)
INFO  2023-08-13 03:12:29,152 [shard 1] compaction_manager - Asked to stop
INFO  2023-08-13 03:12:29,152 [shard 1] compaction_manager - Stopped
INFO  2023-08-13 03:12:29,153 [shard 0] init - Signal received; shutting down
INFO  2023-08-13 03:12:29,153 [shard 0] init - Shutting down view builder ops
INFO  2023-08-13 03:12:29,153 [shard 0] view - Draining view builder
INFO  2023-08-13 03:12:29,153 [shard 1] view - Draining view builder
INFO  2023-08-13 03:12:29,153 [shard 0] compaction_manager - Stopped
ERROR 2023-08-13 03:12:29,153 [shard 0] view - start failed: seastar::abort_requested_exception (abort requested)
ERROR 2023-08-13 03:12:29,153 [shard 1] view - start failed: seastar::abort_requested_exception (abort requested)
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15029
2023-08-13 18:39:09 +03:00
Botond Dénes
4a02865ea1 Merge 'Prevent invalidation of iterators over database::_column_families' from Aleksandra Martyniuk
Maps related to column families in database are extracted
to a column_families_data class. Access to them is possible only
through methods. All methods which may preempt hold rwlock
in relevant mode, so that the iterators can't become invalid.

Fixes: #13290

Closes #13349

* github.com:scylladb/scylladb:
  replica: make tables_metadata's attributes private
  replica: add methods to get a filtered copy of tables map
  replica: add methods to check if given table exists
  replica: add methods to get table or table id
  replica: api: return table_id instead of const table_id&
  replica: iterate safely over tables related maps
  replica: pass tables_metadata to phased_barrier_top_10_counts
  replica: add methods to safely add and remove table
  replica: wrap column families related maps into tables_metadata
  replica: futurize database::add_column_family and database::remove
2023-07-31 15:31:59 +03:00
Botond Dénes
d66b07823b db/view/view_updating_consumer: account for the size of mutations
All partitions will have a corresponding mutation object in the buffer.
These objects have non-negligible sizes, yet the consumer did not bump
the _buffer_size when a new partition was consumer. This resulted in
empty partitions not moving the _buffer_size at all, and thus they could
accumulate without bounds in the buffer, never triggering a flush just
by themselves. We have recently seen this causing OOM.
This patch fixes that by bumping the _buffer_size with the size of the
freshly created mutation object.
2023-07-26 03:07:25 -04:00
Aleksandra Martyniuk
cdbfa0b2f5 replica: iterate safely over tables related maps
Loops over _column_families and _ks_cf_to_uuid which may preempt
are protected by reader mode of rwlock so that iterators won't
get invalid.
2023-07-25 17:13:04 +02:00