Commit Graph

37776 Commits

Author SHA1 Message Date
Kamil Braun
5779230d28 group0_state_machine: use correct comparison for timeuuids in merger
In d2a4079bbe, `merger` was modified so
that when we merge a command, `last_group0_state_id` is taken to be the
maximum of the merged command's state_id and the current
`last_group0_state_id`. This is necessary for achieving the same
behavior as if the commands were applied individually instead of being
merged -- where we take the maximum state ID from `group0_history` table
which was applied until now (because the table is sorted using the state
IDs and we take the greatest row).

However, a subtle bug was introduced -- the `std::max` function uses the
`utils::UUID` standard comparison operator which is unfortunately not
the same as timeuuid comparison that Scylla performs when sorting the
`group0_history` table. So in rare cases it could return the *smaller*
of the two timeuuids w.r.t. the correct timeuuid ordering. This would
then lead to commands being applied which should have been turned to
no-ops due to the `prev_state_id` check -- and then, for example,
permanent schema desync or worse.

Fix it by using the correct comparison method.

Fixes: #14600
2023-07-11 11:48:02 +02:00
Kamil Braun
5ce802676f utils/UUID: introduce timeuuid_tri_compare for const UUID&
The existing `timeuuid_tri_compare` operates on UUIDs serialized in byte
buffers. Introduce a version which operates directly on the
`utils::UUID` type.

To reuse existing comparison code, we serialize to a buffer before
comparing. But we avoid allocations by using `std::array`. Since the
serialized size needs to be known at compile time for `std::array`, mark
`UUID::serialized_size()` as `constexpr`.
2023-07-11 11:48:02 +02:00
Kamil Braun
668beedadc utils/UUID: introduce timeuuid_tri_compare for const int8_t*
`timeuuid_tri_compare` takes `bytes_view` parameters and converts them
to `const int8_t*` before comparing.

Extract the part that operates on `const int8_t*` to separate function
which we will reuse in a later commit.
2023-07-11 11:48:02 +02:00
Tomasz Grabiec
65a5942ec0 Merge 'Fix bootstrap "wait for UP/NORMAL nodes" to handle ignored nodes, recently replaced nodes, and recently changed IPs' from Kamil Braun
Before this PR, the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec77c6).

Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
  replaced or changed their IPs. These entries are still using status
  `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
  `handle_state_normal` is also called for it). The
  `_gossiper.wait_alive` call would wait for those entries too and
  eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
  may not be populated yet, which is required to do the IP<->Host ID
  translations -- and populating `token_metadata` happens inside
  `handle_state_normal`, so we have a chicken-and-egg problem here.

It turns out that we don't need to calculate `sync_nodes` (and
hence `ignore_nodes`) in order to wait for NORMAL state handlers. We
can wait for handlers to finish for *any* `NORMAL`/`shutdown` entries
appearing in gossiper, even those that correspond to dead/ignored
nodes and obsolete IPs.  `handle_state_normal` is called, and
eventually finishes, for all of them.
`wait_for_normal_state_handled_on_boot` no longer receives a set of
nodes as parameter and is modified appropriately, it's now calculating
the necessary set of nodes on each retry (the set may shrink while
we're waiting, e.g. because an entry corresponding to a node that was
replaced is garbage-collected from gossiper state).

Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chickend-and-egg problem.

This addresses the immediate failure described in #14487, but the test
would still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set.

We need a better way to calculate `sync_nodes` which detects ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88764
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

We also introduce regression tests with necessary extensions
to the test framework.

Fixes #14468
Fixes #14487

Closes #14507

* github.com:scylladb/scylladb:
  test: rename `test_topology_ip.py` to `test_replace.py`
  test: test bootstrap after IP change
  test: scylla_cluster: return the new IP from `change_ip` API
  test: node replace with `ignore_dead_nodes` test
  test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`
  storage_service: remove `get_nodes_to_sync_with`
  storage_service: use `token_metadata` to calculate nodes waited for to be UP
  storage_service: don't calculate `ignore_nodes` before waiting for normal handlers
2023-07-10 00:28:20 +02:00
Kefu Chai
1eb76d93b7 streaming: cast the progress to a float before formatting it
before this change, we format a `long` using `{:f}`. fmtlib would
throw an exception when actually formatting it.

so, let's make the percentage a float before formatting it.

Fixes #14587
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14588
2023-07-10 00:00:40 +03:00
Kefu Chai
894039d444 build: drop the warning on -O0 might fail tests
Michał Chojnowski noted that this is not true. -O0 almost doubles
the run time of `./test.py --mode=debug`. but it does not fail
any of the tests.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14456
2023-07-09 23:23:12 +03:00
Avi Kivity
850d759fd9 Merge 'repair: optimise repair reader with different shard count' from Gusev Petr
Consider a cluster with no data, e.g. in tests. When a new node is bootstrapped with repair we iterate over all (shard, table, range), read data from all the peer nodes for the range, look for any discrepancies and heal them. Even for small num_tokens (16 in the tests) the number of affected ranges (those we need to consider) amounts to total number of tokens in the cluster, which is 32 for the second node and 48 for the third. Multiplying this by the number of shards and the number of tables in each keyspace gives thousands of ranges. For each of them we need to follow some row level repair protocol, which includes several RPC exchanges between the peer nodes and creating some data structures on them. These exchanges are processed sequentially for each shard, there are `parallel_for_each` in code, but they are throttled by the choosen memory constraints and in fact execute sequentially.

When the bootstrapping node (master) reaches a peer node and asks for data in the specific range and master shard, two options exist. If sharder parameters (primarily, `--smp`) are the same on the master and on the peer, we can just read one local shard, this is fast. If, on the other hand, `--smp` is different, we need to do a multishard query. The given range from the master can contain data from different peer shards, so we split this range into a number of subranges such that each of them contain data only from the given master shard (`dht::selective_token_range_sharder`). The number of these subranges can be quite big (300 in the tests). For each of these subranges we do `fast_forward_to` on the `multishard_reader`, and this incurs a lot of overhead, mainly becuse of `smp::submit_to`.

In this series we optimize this case. Instead of splitting the master range and reading only what's needed, we read all the data in the range and then apply the filter by the master shard. We do this if the estimated number of partitions is small (<=100).

This is the logs of starting a second node with `--smp 4`, first node was `--smp 3`:

```
with this patch
    20:58:49.644 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.222.46.3 in scylla-2...
    20:59:22.713 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.222.46.3 in scylla-2, pid 1132859

without this patch
    21:04:06.424 INFO> [debug/topology_custom.test_topology_smp.1] starting server at host 127.181.31.3 in scylla-2...
    21:06:01.287 INFO> [debug/topology_custom.test_topology_smp.1] started server at host 127.181.31.3 in scylla-2, pid 1134140
```

Fixes: #14093

Closes #14178

* github.com:scylladb/scylladb:
  repair_test: add test_reader_with_different_strategies
  repair: extract repair_reader declaration into reader.hh
  repair_meta: get_estimated_partitions fix
  repair_meta: use multishard_filter reader if the number of partitions is small
  repair_meta: delay _repair_reader creation
  database.hh: make_multishard_streaming_reader with range parameter
  database.cc: extract streaming_reader_lifecycle_policy
2023-07-09 23:21:06 +03:00
Aleksandra Martyniuk
61dc98b276 api: prevent non-owner cpu access to shared_ptr
In get_sstables_for_key in api/column_family.cc a set of lw_shared_ptrs
to sstables is passes to reducer of map_reduce0. Reducer then accesses
these shared pointers. As reducer is invoked on the same shard
map_reduce0 is called, we have an illegal access to shared pointer
on non-owner cpu.

A set of shared pointers to sstables is trasnsformed in map function,
which is guaranteed to be invoked on a shard associated with the service.

Fixes: #14515.

Closes #14532
2023-07-09 23:09:59 +03:00
Kefu Chai
7a334c53af cql3: expression: correct format string
fmtlib uses `{}` as the placeholder for the formatted argument, not
`{}}`.

so let's correct it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14586
2023-07-09 22:26:29 +03:00
Kefu Chai
56c3462cba alternator: correct format string
when formatting the error message for `api_error::validation`, we
always include the caller in the error message, but in this case,
forgot to pass the `caller` to `seastar::format()`. if fmtlib
actually formats them, it would throw.

so let's pass `caller` to `seastar::format()`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14589
2023-07-09 22:25:13 +03:00
Michał Chojnowski
c41f0ebd2a test: mutation_test: unflake test_external_memory_usage
The test has about 1/2500000 chance to fail due to a conflict of random
values. And it recently did, just to spite us.

Fight back.

Fixes #14563

Closes #14576
2023-07-08 15:20:25 +03:00
Kefu Chai
27d6ff36df compound_compat: do not format an sstring with {:d}
before this change, we format a sstring with "{:d}", fmtlib would throw
`fmt::format_error` at runtime when formatting it. this is not expected.

so, in this change, we just print the int8_t using `seastar::format()`
in a single pass. and with the format specifier of `#02x` instead of
adding the "0x" prefix manually.

Fixes #14577
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14578
2023-07-08 15:13:11 +03:00
Kefu Chai
26dcfea84a estimated_histogram: do not use dynamic format_string
fmtlib allows us to specify the field width dynamically, so specify
the field width in the same statement formatting the argument improves
the readability. and use the constexpr fmt string allows us to switch
to compile-time formatter supported by fmtlib v8.

this change also use `fmt::print()` to format the argument right to
the output ostream, instead of creating a temporary sstring, and
copy it to the output ostream.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14579
2023-07-08 15:10:41 +03:00
Anna Stuchlik
88e62ec573 doc: improve User Data info in Launch on AWS
Fixes https://github.com/scylladb/scylladb/issues/14565

This commit improves the description of ScyllaDB configuration
via User Data on AWS.
- The info about experimental features and developer mode is removed.
- The description of User Data is fixed.
- The example in User Data is updated.
- The broken link is fixed.

Closes #14569
2023-07-07 16:34:06 +02:00
Kamil Braun
de7f668441 Merge 'raft topology: send cdc generation data in parts' from Mikołaj Grzebieluch
The CDC generation data can be large and not fit in a single command.
This pr splits it into multiple mutations by smartly picking a
`mutation_size_threshold` and sending each mutation as a separate group
0 command.

Commands are sent sequentially to avoid concurrency problems.

Topology snapshots contain only mutation of current CDC generation data
but don't contain any previous or future generations. If a new
generation of data is being broadcasted but hasn't been entirely applied
yet, the applied part won't be sent in a snapshot. New or delayed nodes
can never get the applied part in this scenario.

Send the entire cdc_generations_v3 table in the snapshot to resolve this
problem.

A mechanism to remove old CDC generations will be introduced as a
follow-up.

Closes #13962

* github.com:scylladb/scylladb:
  test: raft topology: test `prepare_and_broadcast_cdc_generation_data`
  service: raft topology: print warning in case of `raft::commit_status_unknown` exception in topology coordinator loop
  raft topology: introduce `prepare_and_broadcast_cdc_generation_data`
  raft: add release_guard
  raft: group0_state_machine::merger take state_id as the maximal value from all merged commands
  raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
  raft topology: make `mutation_size_threshold` depends on `max_command_size`
  raft: reduce max batch size of raft commands and raft entries
  raft: add description argument to add_entry_unguarded
  raft: introduce `write_mutations` command
  raft: refactor `topology_change` applying
2023-07-07 16:31:29 +02:00
Kamil Braun
f9cfd7e4f5 Merge 'raft: do not ping self in direct failure detector' from Konstantin Osipov
Avoid pinging self in direct failure detector, this adds confusing noise and adds constant overhead.
Fixes #14388

Closes #14558

* github.com:scylladb/scylladb:
  direct_fd: do not ping self
  raft: initialize raft_group_registry with host id early
  raft: code cleanup
2023-07-07 14:26:17 +02:00
Mikołaj Grzebieluch
4e3c97d8d4 test: raft topology: test prepare_and_broadcast_cdc_generation_data
This test limits `commitlog_segment_size_in_mb` to 2, thus `max_command_size`
is limited to less than 1 MB. It adds an injection which copies mutations
generated by `get_cdc_generation_mutations` n times, where n is picked that
the memory size of all mutations exceeds `max_command_size`.

This test passes if cdc generation data is committed by raft in multiple commands.
If all the data is committed in a single command, the leader node will loop trying
to send raft command and getting the error:
```
storage_service - raft topology: topology change coordinator fiber got error raft::command_is_too_big_error (Command size {} is greater than the configured limit {})
```
2023-07-07 13:56:35 +02:00
Mikołaj Grzebieluch
8d6c95f9e3 service: raft topology: print warning in case of raft::commit_status_unknown exception in topology coordinator loop
When the topology_cooridnator fiber gets `raft::commit_status_unknown`, it
prints an error. This exception is not an error in this case, and it can be
thrown when the leader has changed. It can happen in `add_entry_unguarded`
while sending a part of the CDC generation data in the `write_mutations` command.

Catch this exception in `topology_coordinator::run` and print a warning.
2023-07-07 13:56:35 +02:00
Mikołaj Grzebieluch
ade15ad74a raft topology: introduce prepare_and_broadcast_cdc_generation_data
Broadcasts all mutations returned from `prepare_new_cdc_generation_data`
except the last one. Each mutation is sent in separate raft command. It takes
`group0_guard`, and if the number of mutations is greater than one, the guard
is dropped, and a new one is created and returned, otherwise the old one will
be returned. Commands are sent in parallel and unguarded (the guard used for
sending the last mutation will guarantee that the term hasn't been changed).
Returns the generation's UUID, guard and last mutation, which will be sent
with additional topology data by the caller.

If we send the last mutation in the `write_mutation` command, we would use a
total of `n + 1` commands instead of `n-1 + 1` (where `n` is the number of
mutations), so it's better to send it in `topology_change` (we need to send
it after all `write_mutations`) with some small metadata.

With the default commitlog segment size, `mutation_size_threshold` will be 4 MB.
In large clusters e.g. 100 nodes, 64 shards per node, 256 vnodes cdc generation
data can reach the size of 30 MB, thus there will be no more than 8 commands.

In a multi-DC cluster with 100ms latencies between DCs, this operation should
take about 200ms since we send the commands concurrently, but even if the commands
were replicated sequentially by Raft, it should take no more than 1.6s, which is
incomparably smaller than bootstrapping operation (bootstrapping is quick if there
is no data in the cluster, but usually if one has 100 nodes they have tons of data,
so indeed streaming/repair will take much longer (hours/days)).

Fixes FIXME in pr #13683.
2023-07-07 13:56:35 +02:00
Mikołaj Grzebieluch
04c38c6185 raft: add release_guard
This function takes guard and calls its destructor. It's used to not call raw destructor.
2023-07-07 13:49:25 +02:00
Mikołaj Grzebieluch
d2a4079bbe raft: group0_state_machine::merger take state_id as the maximal value from all merged commands
If `group0_state_machine` applies all commands individually (without batching),
the resulting current `state_id` -- which will be compared with the
`prev_state_id` of the next command if it is a guarded command -- equals the
maximum of the `next_state_id` of all commands applied up to this point.
That's because the current `state_id` is obtained from the history table by
taking the row with the largest clustering key.

When `group0_state_machine::apply` is called with a batch of commands, the
current `state_id` is loaded from `system.group0_history` to `merger::last_group0_state_id`
only once. When a command is merged, its `next_state_id` overwrites
`last_group0_state_id`, regardless of their order.

Let's consider the following situation:
The leader sends two unguarded `write_mutations` commands concurrently, with
timeuuids T1 and T2, where T1 < T2. Leader waits to apply them and sends guarded
`topology_change` with `prev_state_id` equal T2.
Suppose that the command with timeuuid T2 is committed first, and these commands
are small enough that all of `write_mutations` could be merged into one command.
Some followers can get all of these three commands before its `fsm` polls them.
In this situation, `group0_state_machine::apply` is called with all three of
them and `merger` will merge both `write_mutations` into one command. After that,
`merger::last_group0_state_id` will be equal to T1 (this command was committed
as the second one). When it processes the `topology_change` command, it will
compare its `prev_state_id` and `merger::last_group0_state_id`, resulting in
making this command a no-op (which wouldn't happen if the commands were applied
individually).
Such a scenario results in inconsistent results: one replica applies `topology_change`,
but another makes it a no-op.
2023-07-07 13:49:25 +02:00
Mikołaj Grzebieluch
b2d22d665e raft topology: include entire cdc_generations_v3 table in cdc_generation_mutations snapshot
Topology snapshots contain only mutation of current CDC generation data but don't
contain any previous or future generations. If new a generation of data is being
broadcasted but hasn't been entirely applied yet, the applied part won't be sent
in a snapshot. In this scenario, new or delayed nodes can never get the applied part.

Send entire cdc_generations_v3 table in the snapshot to resolve this problem.

As a follow-up, a mechanism to remove old CDC generations will be introduced.
2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch
dc6017b71b raft topology: make mutation_size_threshold depends on max_command_size
`get_cdc_generation_mutations` splits data to mutations of maximal size
`mutation_size_treshold`. Before this commit it was hardcoded to 2 MB.

Calculate `mutation_size_threshold` to leave space for cdc generation
data and not exceed `max_command_size`.
2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch
6dad582796 raft: reduce max batch size of raft commands and raft entries
For now, `raft_sys_table_storage::_max_mutation_size` equals `max_mutation_size`
(half of the commitlog segment size), so with some additional information, it
can exceed this threshold resulting in throwing an exception when writing
mutation to the commitlog.

A batch of raft commands has the size at most `group0_state_machine::merger::max_command_size`
(half of the commitlog segment size). It doesn't have additional metadata, but
it may have a size of exactly `max_mutation_size`. It shouldn't make any trouble,
but it is prefered to be careful.

Make `raft_sys_table_storage::_max_mutation_size` and
`group0_state_machine::merger::max_command_size` more strict to leave space
for metadata.

Fixed typo "1204" => "1024".
2023-07-07 13:11:52 +02:00
Mikołaj Grzebieluch
760d415781 raft: add description argument to add_entry_unguarded
Provide useful description for `write_mutations` and
`broadcast_tables_query` that is stored in `system.group0_history`.

Reduces scope of issue #13370.
2023-07-07 13:11:44 +02:00
Anna Stuchlik
799ae97b52 doc: add the Rust CDC Connector to the docs
Fixes https://github.com/scylladb/scylladb/issues/13877

This commit adds the information about Rust CDC Connector
to the documentation. All relevant pages are updated:
the ScyllaDB Rust Driver page, and other places in
the docs where Java and Go CDC connectors are mentioned.

In addition, the drivers table is updated to indicate
Rust driver support for CDC.

Closes #14530
2023-07-07 11:13:25 +02:00
Nadav Har'El
edfb89ef65 sstables: stop warning when auto-snapshot leaves non-empty directory
When a table is dropped, we delete its sstables, and finally try to delete
the table's top-level directory with the rmdir system call. When the
auto-snapshot feature is enabled (this is still Scylla's default),
the snapshot will remain in that directory so it won't be empty and will
cannot be removed. Today, this results in a long, ugly and scary warning
in the log:

```
WARN  2023-07-06 20:48:04,995 [shard 0] sstable - Could not remove table directory "/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots": std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [/tmp/scylla-test-198265/data/alternator_alternator_Test_1688665684546/alternator_Test_1688665684546-4238f2201c2511eeb15859c589d9be4d/snapshots]). Ignored.
```

It is bad to log as a warning something which is completely normal - it
happens every time a table is dropped with the perfectly valid (and even
default) auto-snapshot mode. We should only log a warning if the deletion
failed because of some unexpected reason.

And in fact, this is exactly what the code **tried** to do - it does
not log a warning if the rmdir failed with EEXIST. It even had a comment
saying why it was doing this. But the problem is that in Linux, deleting
a non-empty directory does not return EEXIST, it returns ENOTEMPTY...
Posix actually allows both. So we need to check both, and this is the
only change in this patch.

To confirm this that this patch works, edit test/cql-pytest/run.py and
change auto-snapshot from 0 to 1, run test/alternator/run (for example)
and see many "Directory not empty" warnings as above. With this patch,
none of these warnings appear.

Fixes #13538

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14557
2023-07-07 11:08:10 +02:00
Benny Halevy
cd44ad9338 docs: compaction: correct min_sstable_size default value
DEFAULT_MIN_SSTABLE_SIZE is defined as `50L * 1024L * 1024L`
which is 50 MB, not 50 bytes.

Fixes #14413

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14414
2023-07-07 11:08:10 +02:00
Marcin Maliszkiewicz
c5de25be4c locator: use deferred_close in azure and gcp snitches
Close needs to be called even if function throws in the middle.

Closes #14458
2023-07-07 11:08:10 +02:00
Avi Kivity
1f9a999c26 cql3: statement_restrictions: clean up dead code
We have plenty of code marked with #if 0. Once it was an indication
of missing functionality, but the code has evolved so much it's
useless as an indication and only a distraction.

Delete it.

Closes #14511
2023-07-07 11:08:10 +02:00
Gleb Natapov
4f23eec44f Rename experimental raft feature to consistent-topology-changes
Make the name more descriptive

Fixes #14145

Message-Id: <ZKQ2wR3qiVqJpZOW@scylladb.com>
2023-07-07 11:08:10 +02:00
Kamil Braun
3c139265b3 Merge 'doc: remove the dead link to unirestore' from Anna Stuchlik
Fixes https://github.com/scylladb/scylladb/issues/14459

This PR removes the (dead) link to the unirestore tool in a private repository. In addition, it adds minor language improvements.

Closes #14519

* github.com:scylladb/scylladb:
  doc: minor language improvements on the Migration Tools page
  doc: remove the link to the private repository
2023-07-07 11:08:10 +02:00
Nadav Har'El
d6aba8232b alternator: configurable override for DescribeEndpoints
The AWS C++ SDK has a bug (https://github.com/aws/aws-sdk-cpp/issues/2554)
where even if a user specifies a specific enpoint URL, the SDK uses
DescribeEndpoints to try to "refresh" the endpoint. The problem is that
DescribeEndpoints can't return a scheme (http or https) and the SDK
arbitrarily picks https - making it unable to communicate with Alternator
over http. As an example, the new "dynamodb shell" (written in C++)
cannot communicate with Alternator running over http.

This patch adds a configuration option, "alternator_describe_endpoints",
which can be used to override what DescribeEndpoints does:

1. Empty string (the default) leaves the current behavior -
   DescribeEndpoints echos the request's "Host" header.

2. The string "disabled" disables the DescribeEndpoints (it will return
   an UnknownOperationException). This is how DynamoDB Local behaves,
   and the AWS C++ SDK and the Dynamodb Shell work well in this mode.

3. Any other string is a fixed string to be returned by DescribeEndpoints.
   It can be useful in setups that should return a known address.

Note that this patch does not, by default, change the current behaivor
of DescribeEndpoints. But it us the future to override its behavior
in a user experiences problems in the field - without code changes.

Fixes #14410.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #14432
2023-07-07 11:08:10 +02:00
Konstantin Osipov
ff41ea86b6 direct_fd: do not ping self
No need to ping self in direct failure detector. This is confusing
during debugging and adds extra overhead.

Fixes #14388
2023-07-06 21:05:39 +03:00
Konstantin Osipov
50140980ac raft: initialize raft_group_registry with host id early
Earlier, when local query processor wasn't available at
the beginning of system start, we couldn't query our own
host id when initializing the raft group registry. The local
host id is needed by the registry since it is responsible
to route RPC messages to specific raft groups, and needs
to reject messages destined to a different host.

Now that the host id is known early at boot, remove the optional
and pass host id in the constructor. Resolves an earlier fixme.
2023-07-06 20:54:05 +03:00
Konstantin Osipov
d79d05aa46 raft: code cleanup
Rename raft_rpc::_server_id to raft_rpc::_my_id as is already the
name used in raft_group0:
- for consistency
- to reflect which server id it is.
2023-07-06 19:46:24 +03:00
Kamil Braun
0d437a7d63 Merge 'utils: error injection: add inject_with_handler for interactions with injected code' from Mikołaj Grzebieluch
Currently, it is hard for injected code to wait for some events, for example, requests on some REST endpoint.

This PR adds the `inject_with_handler` method that executes injected function and passes `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers associated with the injection on all shards about the received message.

Closes #14357.

Closes #14460

* github.com:scylladb/scylladb:
  tests: introduce InjectionHandler class for communicating with injected code
  api/error_injection: add message_injection endpoint
  tests: utils: error injections: add test for inject_with_handler
  utils: error injection: add inject_with_handler for interactions with injected code
  utils: error injection: create structure for error injections data
2023-07-06 18:16:51 +02:00
Mikołaj Grzebieluch
907c0e8900 tests: introduce InjectionHandler class for communicating with injected code
Add a client for sending empty messages to the injected code from tests.
2023-07-06 12:34:53 +02:00
Mikołaj Grzebieluch
8b1f5ba293 api/error_injection: add message_injection endpoint
Add an endpoint for sending empty messages to the injected code.
2023-07-06 12:34:53 +02:00
Mikołaj Grzebieluch
7e5c42af0a tests: utils: error injections: add test for inject_with_handler
Add a test checking the correctness of the `inject_with_handler` method
in presence of concurrency.
2023-07-06 12:34:53 +02:00
Mikołaj Grzebieluch
086b3369f4 utils: error injection: add inject_with_handler for interactions with injected code
Currently, it is hard for injected code to wait for some events, for example,
requests on some REST endpoint.

This commit adds the `inject_with_handler` method that executes injected function
and passes `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers
associated with the injection on all shards about the received message.
There is a counter of received messages in `received_messages_counter`; it is shared
between the injection_data, which is created once when enabling an injection on
a given shard, and all `injection_handler`s, that are created separately for each
firing of this injection. The `counter` is incremented when receiving a message from
the REST endpoint and the condition variable is signaled.
Each `injection_handler` (separate for each firing) stores its own private counter,
`_read_messages_counter` that private counter is incremented whenever we wait for a
message, and compared to the received counter. We sleep on the condition variable
if not enough messages were received.
2023-07-06 12:32:07 +02:00
Kamil Braun
431a8f8591 test: rename test_topology_ip.py to test_replace.py
No idea why it was named like that before.
2023-07-06 10:24:46 +02:00
Kamil Braun
452d9a3c77 test: test bootstrap after IP change
Regression test for #14468.
2023-07-06 10:24:46 +02:00
Kamil Braun
2032d7dbe4 test: scylla_cluster: return the new IP from change_ip API
Also simplify the API by getting rid of `ActionReturn` and returning
errors through exceptions (which are correctly forwarded to the client
for some time already).
2023-07-06 10:24:46 +02:00
Kamil Braun
00f51ea753 test: node replace with ignore_dead_nodes test
Regression test for #14487 on steroids. It performs 3 consecutive node
replace operations, starting with 3 dead nodes.

In order to have a Raft majority, we have to boot a 7-node cluster, so
we enable this test only in one mode; the choice was between `dev` and
`release`, I picked `dev` because it compiles faster and I develop on
it.
2023-07-06 10:24:46 +02:00
Kamil Braun
9b136ee574 test: scylla_cluster: accept ignore_dead_nodes in ReplaceConfig 2023-07-06 10:24:46 +02:00
Kamil Braun
9b8e5550b1 storage_service: remove get_nodes_to_sync_with
It's no longer used.
2023-07-06 10:24:46 +02:00
Kamil Braun
96278a09d4 storage_service: use token_metadata to calculate nodes waited for to be UP
At bootstrap, after we start gossiping, we calculate a set of nodes
(`sync_nodes`) which we need to "synchronize" with, waiting for them to
be UP before proceeding; these nodes are required for streaming/repair
and CDC generation data write, and generally are supposed to constitute
the current set of cluster members.

In #14468 and #14487 we observed that this set may calculate entries
corresponding to nodes that were just replaced or changed their IPs
(but the old-IP entry is still there). We pass them to
`_gossiper.wait_alive` and the call eventually times out.

We need a better way to calculate `sync_nodes` which detects ignores
obsolete IPs and nodes that are already gone but just weren't
garbage-collected from gossiper state yet.

In fact such a method was already introduced in the past:
ca61d88764
but it wasn't used everywhere. There, we use `token_metadata` in which
collisions between Host IDs and tokens are resolved, so it contains only
entries that correspond to the "real" current set of NORMAL nodes.

We use this method to calculate the set of nodes passed to
`_gossiper.wait_alive`.

Fixes #14468
Fixes #14487
2023-07-06 10:24:46 +02:00
Kamil Braun
bbcf8305bb storage_service: don't calculate ignore_nodes before waiting for normal handlers
Before this commit the `wait_for_normal_state_handled_on_boot` would
wait for a static set of nodes (`sync_nodes`), calculated using the
`get_nodes_to_sync_with` function and `parse_node_list`; the latter was
used to obtain a list of "nodes to ignore" (for replace operation) and
translate them, using `token_metadata`, from IP addresses to Host IDs
and vice versa. `sync_nodes` was also used in `_gossiper.wait_alive` call
which we do after `wait_for_normal_state_handled_on_boot`.

Recently we started doing these calculations and this wait very early in
the boot procedure - immediately after we start gossiping
(50e8ec77c6).

Unfortunately, as always with gossiper, there are complications.
In #14468 and #14487 two problems were detected:
- Gossiper may contain obsolete entries for nodes which were recently
  replaced or changed their IPs. These entries are still using status
  `NORMAL` or `shutdown` (which is treated like `NORMAL`, e.g.
  `handle_state_normal` is also called for it). The
  `_gossiper.wait_alive` call would wait for those entries too and
  eventually time out.
- Furthermore, by the time we call `parse_node_list`, `token_metadata`
  may not be populated yet, which is required to do the IP<->Host ID
  translations -- and populating `token_metadata` happens inside
  `handle_state_normal`, so we have a chicken-and-egg problem here.

The `parse_node_list` problem is solved in this commit. It turns out
that we don't need to calculate `sync_nodes` (and hence `ignore_nodes`)
in order to wait for NORMAL state handlers. We can wait for handlers to
finish for *any* `NORMAL`/`shutdown` entries appearing in gossiper, even
those that correspond to dead/ignored nodes and obsolete IPs.
`handle_state_normal` is called, and eventually finishes, for all of
them. `wait_for_normal_state_handled_on_boot` no longer receives a set
of nodes as parameter and is modified appropriately, it's now
calculating the necessary set of nodes on each retry (the set may shrink
while we're waiting, e.g. because an entry corresponding to a node that
was replaced is garbage-collected from gossiper state).

Thanks to this, we can now put the `sync_nodes` calculation (which is
still necessary for `_gossiper.wait_alive`), and hence the
`parse_node_list` call, *after* we wait for NORMAL state handlers,
solving the chickend-and-egg problem.

This addresses the immediate failure described in #14487, but the test
will still fail. That's because `_gossiper.wait_alive` may still receive
a too large set of nodes -- we may still include obsolete IPs or entries
corresponding to replaced nodes in the `sync_nodes` set. We fix this
in the following commit which will solve both issues.
2023-07-06 10:24:44 +02:00
Tomasz Grabiec
c25201c1a3 Merge 'view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski
View update routines accept `mutation` objects.
But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects.
To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2.

To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_update_consumer does that, but in improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error.

This patch fixes the problem by closing the active range tombstone (and reopening in the same position in the next `mutation` object).

The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic.

Fixes https://github.com/scylladb/scylladb/issues/14503

Closes #14502

* github.com:scylladb/scylladb:
  test: view_build_test: add range tombstones to test_view_update_generator_buffering
  test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations
  view_updating_consumer: make buffer limit a variable
  view: fix range tombstone handling on flushes in view_updating_consumer
2023-07-05 21:21:43 +02:00