Commit Graph

3539 Commits

Author SHA1 Message Date
Tomasz Grabiec
f2ed9fcd7e schema_mutations, migration_manager: Ignore empty partitions in per-table digest
Schema digest is calculated by querying for mutations of all schema
tables, then compacting them so that all tombstones in them are
dropped. However, even if the mutation becomes empty after compaction,
we still feed its partition key. If the same mutations were compacted
prior to the query, because the tombstones expire, we won't get any
mutation at all and won't feed the partition key. So schema digest
will change once an empty partition of some schema table is compacted
away.

Tombstones expire 7 days after schema change which introduces them. If
one of the nodes is restarted after that, it will compute a different
table schema digest on boot. This may cause performance problems. When
sending a request from coordinator to replica, the replica needs
schema_ptr of exact schema version request by the coordinator. If it
doesn't know that version, it will request it from the coordinator and
perform a full schema merge. This adds latency to every such request.
Schema versions which are not referenced are currently kept in cache
for only 1 second, so if request flow has low-enough rate, this
situation results in perpetual schema pulls.

After ae8d2a550d, it is more liekly to
run into this situation, because table creation generates tombstones
for all schema tables relevant to the table, even the ones which
will be otherwise empty for the new table (e.g. computed_columns).

This change inroduces a cluster feature which when enabled will change
digest calculation to be insensitive to expiry by ignoring empty
partitions in digest calculation. When the feature is enabled,
schema_ptrs are reloaded so that the window of discrepancy during
transition is short and no rolling restart is required.

A similar problem was fixed for per-node digest calculation in
18f484cc753d17d1e3658bcb5c73ed8f319d32e8. Per-table digest calculation
was not fixed at that time because we didn't persist enabled features
and they were not enabled early-enough on boot for us to depend on
them in digest calculation. Now they are enabled before non-system
tables are loaded so digest calculation can rely on cluster features.

Fixes #4485.
2023-07-03 23:06:55 +02:00
Tomasz Grabiec
0c86abab4d migration_manager, schema_tables: Implement migration_manager::reload_schema()
Will recreate schema_ptr's from schema tables like during table
alter. Will be needed when digest calculation changes in reaction to
cluster feature at run time.
2023-07-03 20:32:59 +02:00
Kamil Braun
ff386e7a44 service: raft: force initial snapshot transfer in new cluster
When we upgrade a cluster to use Raft, or perform manual Raft recovery
procedure (which also creates a fresh group 0 cluster, using the same
algorithm as during upgrade), we start with a non-empty group 0 state
machine; in particular, the schema tables are non-empty.

In this case we need to ensure that nodes which join group 0 receive the
group 0 state. Right now this is not the case. In previous releases,
where group 0 consisted only of schema, and schema pulls were also done
outside Raft, those nodes received schema through this outside
mechanism. In 91f609d065 we disabled
schema pulls outside Raft; we're also extending group 0 with other
things, like topology-specific state.

To solve this, we force snapshot transfers by setting the initial
snapshot index on the first group 0 server to `1` instead of `0`. During
replication, Raft will see that the joining servers are behind,
triggering snapshot transfer and forcing them to pull group 0 state.

It's unnecessary to do this for cluster which bootstraps with Raft
enabled right away but it also doesn't hurt, so we keep the logic simple
and don't introduce branches based on that.

Extend Raft upgrade tests with a node bootstrap step at the end to
prevent regressions (without this patch, the step would hang - node
would never join, waiting for schema).

Fixes: #14066

Closes #14336
2023-06-29 22:46:42 +02:00
Tomasz Grabiec
a9282103ba Merge 'Call storage_service notifications only after keyspace schema changes are applied on all shards' from Benny Halevy
This series aims at hardening schema merges and preventing inconsistencies across shards by
updating the database shards before calling the notification callback.

As seen in #13137, we don't want to call the notifications on all shards in parallel while the database shards are in flux.

In addition, any error to update the keyspace will cause abort so not to leave the database shards in an inconsistent state .

Other changes optimize this path by:
- updating shard 0 first, to seed the effective_replication_map.
- executing `storage_service::keyspace_changed` only once, on shard 0 to prevent quadratic update of the token_metadata and e_r_m on every keyspace change.

Fixes #13137

Closes #14158

* github.com:scylladb/scylladb:
  migration_manager: propagate listener notification exceptions
  storage_service: keyspace_changed: execute only on shard 0
  database: modify_keyspace_on_all_shards: execute func first on shard 0
  database: modify_keyspace_on_all_shards: call notifiers only after applying func on all shards
  database: add modify_keyspace_on_all_shards
  schema_tables: merge_keyspaces: extract_scylla_specific_keyspace_info for update_keyspace
  database: create_keyspace_on_all_shards
  database: update_keyspace_on_all_shards
  database: drop_keyspace_on_all_shards
2023-06-29 12:17:53 +02:00
Tomasz Grabiec
50e8ec77c6 Merge 'Wait for other nodes to be UP and NORMAL on bootstrap right after enabling gossiping' from Kamil Braun
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.

Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).

There is another problem: the bootstrap procedure is racing with gossiper
marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee
that they are also UP. If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to nodes
still being marked as DOWN, or the CDC generation write failing.

In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we also use it to wait for nodes to be UP.

As explained in commit messages and comments, we only do these
waits outside raft-based-topology mode.

This should improve CI stability.
Fixes: #12972
Refs: #14042

Closes #14354

* github.com:scylladb/scylladb:
  messaging_service: print which connections are dropped due to missing topology info
  storage_service: wait for nodes to be UP on bootstrap
  storage_service: wait for NORMAL state handler before `setup_group0()`
  storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
2023-06-28 20:40:03 +02:00
Kamil Braun
b912eeade5 Merge 'merge raft commands to group0 before applying them whenever possible' from Gleb
Since most group0 commands are just mutations it is easy to combine them
before passing them to a subsystem they destined to since it is more
efficient. The logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.

The patch here does exactly that. It combines commands into a single
command if possible, but it preserves an order between commands, so each
time it encounters a command to a different subsystem it flushes already
combined batch and starts a new one. This extra safety assumes that
there are dependencies between subsystems managed by group0, so the order
matters. It may be not the case now, but we prefer to be on a safe side.

Broadcast table commands are not mutations, so they are never combined.

* 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla:
  test: add test for group0 raft command merging
  service: raft: respect max mutation size limit when persisting raft entries
  group0_state_machine: merge commands before applying them whenever possible
2023-06-28 17:21:07 +02:00
Kamil Braun
51cec2be86 storage_service: wait for nodes to be UP on bootstrap
The bootstrap procedure is racing with gossiper marking nodes as UP.
If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to
nodes still being marked as DOWN, or the CDC generation write failing.

In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we use it.

Refs: #14042

This doesn't completely fix #14042 yet becasue it's specific to
gossiper-based topology mode only. For Raft-based topology, the node
joining procedure will be coordinated by the topology coordinator right
from the start and it will be the coordinator who issues the 'wait for
node to see other live nodes'.
2023-06-28 16:20:29 +02:00
Kamil Braun
5ec5c7704c storage_service: wait for NORMAL state handler before setup_group0()
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.

Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.

This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.

The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).

We do it only in non-raft-topology mode, because with Raft-based
topology, node state changes are propagated to the cluster through
explicit global barriers and we plan to remove node statuses from
gossiper altogether.

Fixes: #12972
2023-06-28 16:19:24 +02:00
Kamil Braun
64c302e777 storage_service: extract gossiper::wait_for_live_nodes_to_show_up()
This piece of `storage_service::wait_for_ring_to_settle()` will be
performed earlier in the boot procedure in follow-up commits.

Make it more generic, to be able to wait for `n` nodes to show up. Here
we wait for `2` nodes - ourselves and at least one other.
2023-06-28 12:36:06 +02:00
Gleb Natapov
945f476363 test: add test for group0 raft command merging
Add a test that submits 3 large commands each one a little bit larger
than 1/3 of maximum mutation size. Check that in the end 2 command were
executed (first 2 were merged and third was executed separately).
2023-06-27 14:59:55 +03:00
Gleb Natapov
8307b09c64 service: raft: respect max mutation size limit when persisting raft entries
The code that preserves raft entries builds one batch statement to store
all of them, but the butch's statement execute() merges all of the
statements into one mutation and passes it to the database. The mutation
can be larger than max mutation size limit and the write will fail. Fix
it by splitting the write to multiple batch statements if needed.
2023-06-27 14:59:55 +03:00
Gleb Natapov
311cfa1be8 group0_state_machine: merge commands before applying them whenever possible
Since most group0 commands are just mutations it is easy to combine them
before passing them to a subsystem they destined to since it is more
efficient. The logic that handles those mutations in a subsystem will
run once for each batch of commands instead of for each individual
command. This is especially useful when a node catches up to a leader and
gets a lot of commands together.

The patch here does exactly that. It combines commands into a single
command if possible, but it preserves an order between commands, so each
time it encounters a command to a different subsystem it flushes already
combined batch and starts a new one. This extra safety assumes that
there are dependencies between subsystems managed by group0, so the order
matters. It may be not the case now, but we prefer to be on a safe side.

Broadcast table commands are not mutations, so they are never combined.

Fixes: #12581
2023-06-27 14:40:46 +03:00
Benny Halevy
825d617a53 migration_manager: propagate listener notification exceptions
1e29b07e40 claimed
to make event notification exception safe,
but swallawing the exceptions isn't safe at all,
as this might leave the node in an inconsistent state
if e.g. storage_service::keyspace_changed fails on any of the
shards.  Propagating the exception here will cause abort,
but it is better than leaving the node up, but in an
inconsistent state.

We keep notifying other listeners even if any of them failed
Based on 1e29b07e40:
```
If one of the listeners throws an exception, we must ensure that other
listeners are still notified.
```

The decision about swallowing exceptions can't be
made in such a generic layer.
Specific notification listeners that may ignore exceptions,
like in transport/evenet_notifier, may decide to swallow their
local exceptions on their own (as done in this patch).

Refs #3389

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 21:08:09 +03:00
Benny Halevy
a690f0e81f storage_service: keyspace_changed: execute only on shard 0
Previously all shards called `update_topology_change_info`
which in turn calls `mutate_token_metadata`, ending up
in quadratic complexity.

Now that the notifications are called after
all database shards are updated, we can apply
the changes on token metadata / effective replication map
only on shard 0 and count on replicate_to_all_cores to
propagate those changes to all other shards.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-06-26 21:08:09 +03:00
Petr Gusev
1e851262f2 storage_proxy: handler responses, use pointers to default constructed values instead of nulls
The current Seastar RPC infrastructure lacks support
for null values in tuples in handler responses.
In this commit we add the make_default_rpc_tuple function,
which solves the problem by returning pointers to
default-constructed values for smart pointer types
rather than nulls.

The problem was introduced in this commit
 2d791a5ed4. The
 function `encode_replica_exception_for_rpc` used
 `default_tuple_maker` callback to create tuples
 containing exceptions. Callers returned pointers
 to default-constructed values in this callback,
 e.g. `foreign_ptr(make_lw_shared<reconcilable_result>())`.
 The commit changed this to just `SourceTuple{}`,
 which means nullptr for pointer types.

Fixes: #14282

Closes #14352
2023-06-26 11:10:38 +03:00
Kamil Braun
e6942d31d3 Merge 'query processor code cleanup' from Gleb
The series contains mostly cleanups for query processor and no functional
change. The last patch is a small cleanup for the storage_proxy.

* 'qp-cleanup' of https://github.com/gleb-cloudius/scylla:
  storage_proxy: remove unused variable
  client_state: co-routinise has_column_family_access function
  query_processor: get rid of internal_state and create individual query_satate for each request
  cql3: move validation::validate_column_family from client_state::has_column_family_access
  client_state: drop unneeded argument from has.*access functions
  cql3: move check for dropping cdc tables from auth to the drop statement code itself
  query_processor: co-routinise execute_prepared_without_checking_exception_message function
  query_processor: co-routinize execute_direct_without_checking_exception_message function
  cql3: remove empty statement::validate functions
  cql3: remove empty function validate_cluster_support
  cql3/statements: fix indentation and spurious white spaces
  query_processor: move statement::validate call into execute_with_params function
  query_processor: co-routinise execute_with_params function
  query_processor: execute statement::validate before each execution of internal query instead of only during prepare
  query_processor: get rid of shared internal_query_state
  query_processor: co-routinize execute_paged_internal function
  query_processor: co_routinize execute_batch_without_checking_exception_message function
  query_processor: co-routinize process_authorized_statement function
2023-06-23 10:32:57 +02:00
Gleb Natapov
94fcba5662 storage_proxy: remove unused variable 2023-06-22 15:26:20 +03:00
Gleb Natapov
caee26ab4f client_state: co-routinise has_column_family_access function 2023-06-22 15:26:20 +03:00
Gleb Natapov
4bad482e4b cql3: move validation::validate_column_family from client_state::has_column_family_access
Checking keyspace/table presence should not be part of authorization code
and it is not done consistently today.  For instance keyspace presence
is not checked in "alter keyspace" during authorization, but during
statement execution. Make it consistent.
2023-06-22 13:57:36 +03:00
Gleb Natapov
31bddb65c7 client_state: drop unneeded argument from has.*access functions
After previous patch we can drop db argument to most of has.*access
functions in the client_state.
2023-06-22 13:57:36 +03:00
Gleb Natapov
06bcce53b5 cql3: move check for dropping cdc tables from auth to the drop statement code itself
Checking if a table is CDC log and cannot be dropped should not be done
as part of authentication (this has nothing to do with auth), but in the
drop statement itself. Throwing unauthorized_exception is wrong as well,
but unfortunately it is enshrined with a test. Not sure if it is a good
idea to change it now.
2023-06-22 13:57:36 +03:00
Botond Dénes
e1c2de4fb8 Merge 'forward_service: fix forgetting case-sensitivity in aggregates ' from Jan Ciołek
There was a bug that caused aggregates to fail when used on column-sensitive columns.

For example:
```cql
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there is no column "somecolumn".

This is because the case-sensitivity got lost on the way.

For non case-sensitive column names we convert them to lowercase, but for case sensitive names we have to preserve the name as originally written.

The problem was in `forward_service` - we took a column name and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column couldn't be found.

To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.

Fixes: https://github.com/scylladb/scylladb/issues/14307

Closes #14340

* github.com:scylladb/scylladb:
  service/forward_service.cc: make case-sensitivity explicit
  cql-pytest/test_aggregate: test case-sensitive column name in aggregate
  forward_service: fix forgetting case-sensitivity in aggregates
2023-06-22 08:25:33 +03:00
Avi Kivity
8576502c48 Merge 'raft topology: ban left nodes from the cluster' from Kamil Braun
Use the new Seastar functionality for storing references to connections to implement banning hosts that have left the cluster (either decommissioned or using removenode) in raft-topology mode. Any attempts at communication from those nodes will be rejected.

This works not only for nodes that restart, but also for nodes that were running behind a network partition and we removed them. Even when the partition resolves, the existing nodes will effectively put a firewall from that node.

Some changes to the decommission algorithm had to be introduced for it to work with node banning. As a side effect a pre-existing problem with decommission was fixed. Read the "introduce `left_token_ring` state" and "prepare decommission path for node banning" commits for details.

Closes #13850

* github.com:scylladb/scylladb:
  test: pylib: increase checking period for `get_alive_endpoints`
  test: add node banning test
  test: pylib: manager_client: `get_cql()` helper
  test: pylib: ScyllaCluster: server pause/unpause API
  raft topology: ban left nodes
  raft topology: skip `left_token_ring` state during `removenode`
  raft topology: prepare decommission path for node banning
  raft topology: introduce `left_token_ring` state
  raft topology: `raft_topology_cmd` implicit constructor
  messaging_service: implement host banning
  messaging_service: exchange host IDs and map them to connections
  messaging_service: store the node's host ID
  messaging_service: don't use parameter defaults in constructor
  main: move messaging_service init after system_keyspace init
2023-06-21 20:16:45 +03:00
Jan Ciołek
16c21d7252 service/forward_service.cc: make case-sensitivity explicit
Make it explicit that the boolean argument determines case-sensitivity. It emphasizes its importance.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-21 16:02:41 +02:00
Jan Ciolek
7fca350075 forward_service: fix forgetting case-sensitivity in aggregates
There was a bug that caused aggregates to fail when
used on column-sensitive columns.

For example:
```
SELECT SUM("SomeColumn") FROM ks.table;
```
would fail, with a message saying that there
is no column "somecolumn".

This is because the case-sensitivity got lost on the way.

For non case-sensitive column names we convert them to lowercase,
but for case sensitive names we have to preserve the name
as originally written.

The problem was in `forward_service` - we took a column name
and created a non case-sensitive `column_identifier` out of it.
This converted the name to lowercase, and later such column
couldn't be found.

To fix it, let's make the `column_identifier` case-sensitive.
It will preserve the name, without converting it to lowercase.

Fixes: https://github.com/scylladb/scylladb/issues/14307

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2023-06-21 14:37:42 +02:00
Avi Kivity
e233f471b8 Merge 'Respect tablet shard assignment' from Tomasz Grabiec
This PR changes the system to respect shard assignment to tablets in tablet metadata (system.tablets):
1. The tablet allocator is changed to distribute tablets evenly across shards taking into account currently allocated tablets in the system. Each tablet has equal weight. vnode load is ignored.
2. CDC subsystem was not adjusted (not supported yet)
3. sstable sharding metadata reflects tablet boundaries
5. resharding is NOT supported yet (the node will abort on boot if there is a need to reshard tablet-based tables)
6. The system is NOT prepared to handle tablet migration / topology changes in a safe way.
7. Sstable cleanup is not wired properly yet

After this PR, dht::shard_of() and schema::get_sharder() are deprecated. One should use table::shard_of() and effective_replication_map::get_sharder() instead.

To make the life easier, support was added to obtain table pointer from the schema pointer:

```
schema_ptr s;
s->table().shard_of(...)
```

Closes #13939

* github.com:scylladb/scylladb:
  locator: network_topology_startegy: Allocate shards to tablets
  locator: Store node shard count in topology
  service: topology: Extract topology updating to a lambda
  test: Move test_tablets under topology_experimental
  sstables: Add trace-level logging related to shard calculation
  schema: Catch incorrect uses of schema::get_sharder()
  dht: Rename dht::shard_of() to dht::static_shard_of()
  treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
  storage_proxy: Avoid multishard reader for tablets
  storage_proxy: Obtain shard from erm in the read path
  db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
  forward_service: Use table sharder
  alternator: Use table sharder
  db: multishard: Obtain sharder from erm
  sstable_directory: Improve trace-level logging
  db: table: Introduce shard_of() helper
  db: Use table sharder in compaction
  sstables: Compute sstable shards using sharder from erm when loading
  sstables: Generate sharding metadata using sharder from erm when writing
  test: partitioner: Test split_range_to_single_shard() on tablet-like sharder
  dht: Make split_range_to_single_shard() prepared for tablet sharder
  sstables: Move compute_shards_for_this_sstable() to load()
  dht: Take sharder externally in splitting functions
  locator: Make sharder accessible through effective_replication_map
  dht: sharder: Document guarantees about mapping stability
  tablets: Implement tablet sharder
  tablets: Include pending replica in get_shard()
  dht: sharder: Introduce next_shard()
  db: token_ring_table: Filter out tablet-based keyspaces
  db: schema: Attach table pointer to schema
  schema_registry: Fix SIGSEGV in learn() when concurrent with get_or_load()
  schema_registry: Make learn(schema_ptr) attach entry to the target schema
  test: lib: cql_test_env: Expose feature_service
  test: Extract throttle object to separate header
2023-06-21 10:20:41 +03:00
Calle Wilund
f18e967939 storage_proxy: Make split_stats resilient to being called from different scheduling group
Fixes #11017

When doing writes, storage proxy creates types deriving from abstract_write_response_handler.
These are created in the various scheduling groups executing the write inducing code. They
pick up a group-local reference to the various metrics used by SP. Normally all code
using (and esp. modifying) these metrics are executed in the same scheduling group.
However, if gossip sees a node go down, it will notify listeners, which eventually
calls get_ep_stat and register_metrics.
This code (before this patch) uses _active_ scheduling group to eventually add
metrics, using a local dict as guard against double regs. If, as described above,
we're called in a different sched group than the original one however, this
can cause double registrations.

Fixed here by keeping a reference to creating scheduling group and using this, not
active one, when/if creating new metrics.

Closes #14294
2023-06-21 10:08:27 +03:00
Tomasz Grabiec
e110167a2a locator: Store node shard count in topology
Will be needed by tablet allocator.
2023-06-21 00:58:25 +02:00
Tomasz Grabiec
dd968e16bf service: topology: Extract topology updating to a lambda
Reduces code duplication.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
21198e8470 treewide: Replace dht::shard_of() uses with table::shard_of() / erm::shard_of()
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
fb0bdcec0c storage_proxy: Avoid multishard reader for tablets
Currently, the coordinator splits the partition range at vnode (or
tablet) boundaries and then tries to merge adjacent ranges which
target the same replica. This is an optimization which makes less
sense with tablets, which are supposed to be of substantial size. If
we don't merge the ranges, then with tablets we can avoid using the
multishard reader on the replica side, since each tablet lives on a
single shard.

The main reason to avoid a multishard reader is avoiding its
complexity, and avoiding adapting it to work with tablet
sharding. Currently, the multishard reader implementation makes
several assumptions about shard assignment which do not hold with
tablets. It assumes that shards are assigned in a round-robin fashion.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
10e05eec66 storage_proxy: Obtain shard from erm in the read path
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
e48ec6fed3 db, storage_proxy: Drop mutation/frozen_mutation ::shard_of()
dht::shard_of() does not use the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
d4497a058e forward_service: Use table sharder
schema::get_sharder() does not return the correct sharder for tablet-based tables.
Code which is supposed to work with all kinds of tables should use erm::get_sharder().
2023-06-21 00:58:24 +02:00
Kamil Braun
643e69af89 Merge 'Cluster features on raft: add storage for supported and enabled features' from Piotr Dulikowski
This PR implements the storage part of the cluster features on raft functionality, as described in the "Cluster features on raft v2" doc. These changes will be useful for later PRs that will implement the remaining parts of the feature.

Two new columns are added to `system.topology`:

- `supported_features set<text>` is a new clustering column which holds the features that given node advertises as supported. It will be first initialized when the node joins the cluster, and then updated every time the node reboots and its supported features set changes.
- `enabled_features set<text>` is a new static column which holds the features that are considered enabled by the cluster. Unlike in the current gossip-based implementation the features will not be enabled implicitly when all nodes support a feature, but rather via an explicit action of the topology coordinator.

These columns are reflected in the `topology_state_machine` structure and are populated when the topology state is loaded. Appropriate methods are added to the `topology_mutation_builder` and `topology_node_mutation_builder` in order to allow setting/modifying those columns.

During startup, nodes update their corresponding `supported_features` column to reflect their current feature set. For now it is done unconditionally, but in the future appropriate checks will be added which will prevent nodes from joining / starting their server for group 0 if they can't guarantee that they support all enabled features.

Closes #14232

* github.com:scylladb/scylladb:
  storage_service: update supported cluster features in group0 on start
  storage_service: add methods for features to topology mutation builder
  storage_service: use explicit ::set overload instead of a template
  storage_service: reimplement mutation builder setters
  storage_service: introduce topology_mutation_builder_base
  topology_state_machine: include information about features
  system_keyspace: introduce deserialize_set_column
  db/system_keyspace: add storage for cluster features managed in group 0
2023-06-20 18:32:00 +02:00
Piotr Dulikowski
3e955945de storage_service: update supported cluster features in group0 on start
Now, when a node starts, it will update its `supported_features` row in
`system.topology` via `update_topology_with_local_metadata`.

At this point, the functionality behind cluster features on raft is
mostly incomplete and the state of the `supported_features` column does
not influence anything so it's safe to update this column
unconditionally. In the future, the node will only join / start group0
server if it is sure that it supports all enabled features and it can
safely update the `supported_features` parameter.
2023-06-20 16:41:08 +02:00
Piotr Dulikowski
707e929831 storage_service: add methods for features to topology mutation builder
The newly added `supported_features` and `enabled_features` columns can
now be modified via topology mutation builders:

- `supported_features` can now be overwritten via a new overload of
  `topology_node_mutation_builder::set`.
- `enabled_features` can now be extended (i.e. more elements can be
  added to it) via `topology_mutation_builder::add_enabled_features`. As
  the set of enabled features only grows, this should be sufficient.
2023-06-20 16:41:08 +02:00
Piotr Dulikowski
2a4462a01f storage_service: use explicit ::set overload instead of a template
The `topology_node_mutation_builder::set` function has an overload which
accepts any type which can be converted to string via `::format`. Its
presence can lead to easy mistakes which can only be detected at runtime
rather at compile time. A concrete example: I wrote a function that
accepts an std::set<S> where S is convertible to sstring; it turns out
that std::string_view is not std::convertible_to sstring and overload
resolution falled back to the catch-all overload.

This commit gets rid of the catch-all overload and replaces it with
explicit ones. Fortunately, it was used for only two enums, so it wasn't
much work.
2023-06-20 16:41:08 +02:00
Piotr Dulikowski
a8aaeabfac storage_service: reimplement mutation builder setters
As promised in the previous commit which introduced
topology_mutation_builder_base, this commit adjusts existing setters of
topology mutation builder and topology node mutation builder to use
helper methods defined in the base class.

Note that the `::set` method for the unordered set of tokens now does
not delete the column in case an empty value is set, instead it just
writes an empty set. This semantic is arguably more clear given that we
have an explicit `::del` method and it shouldn't affect the existing
implementation - we never intentionally insert an empty set of tokens.
2023-06-20 16:41:08 +02:00
Piotr Dulikowski
ee12192125 storage_service: introduce topology_mutation_builder_base
Introduces `topology_mutation_builder_base` which will be a base class
for both topology mutation builder and topology node mutation builder.
Its purpose is to abstract away some detail about setting/deleting/etc.
column in the mutation, the actual topology (node) mutation builder will
only have to care about converting types and/or allowing only particular
columns to be set. The class is using CRTP: derived classes provide
access to the row being modified, schema and the timestamp.

For the sake of commit diff readability, this commt only introduces this
class and changes the builders to derive from it but no setter
implementations are modified - this will be done in the next commit.
2023-06-20 16:41:08 +02:00
Piotr Dulikowski
bc84d59665 topology_state_machine: include information about features
Now, the newly added `supported_features` and `enabled_features` columns
are reflected in the `topology_state_machine` structure.
2023-06-20 16:41:05 +02:00
Kamil Braun
63229e48e8 raft topology: ban left nodes 2023-06-20 13:03:46 +02:00
Kamil Braun
737c1b4ae6 raft topology: skip left_token_ring state during removenode
The "tell the node to shut down" RPC would fail every time in the
removenode path (since the node is dead), which is kind of awkward.

Besides, for removenode we don't really need the `left_token_ring`
state, we don't need to coordinate with the node - writes destined for
it are failing anyway (since it's dead) and we can ban the node
immediately.

Remove the node from group 0 while in `write_both_read_new` transition
state (even when we implement abort, in this state it's too late to
abort, we're committed to removing the node - so it's fine to remove it
from group 0 at this point).
2023-06-20 13:03:46 +02:00
Kamil Braun
977680773b raft topology: prepare decommission path for node banning
Currently the decommissioned node waits until it observes that it was
moved to the `left` state, then proceeds to leave group 0 and shut down.

Unfortunately, this strategy won't work once we introduce banning nodes
that are in `left` state - there is no guarantee that the
decommissioning node will observe that it entered `left` state. The
replication of Raft commands races with the ban propagating through the
cluster.

We also can't make the node leave as soon as it observes the
`left_token_ring` state, which would defeat the purpose of
`left_token_ring` - allowing all nodes to observe that the node has left
the token ring before it shuts down.

We could introduce yet another state between `left_token_ring` and
`left`, which the node waits for before shutting down; the coordinator
would request a barrier from the node before moving to `left` state.

The alternative - which we chose here - is to have the coordinator
explicitly tell the node to shutdown while we're in `left_token_ring`
through a direct RPC. We introduce
`raft_topology_cmd::command::shutdown` and send it to the node while in
`left_token_ring` state, after we requested a cluster barrier.

We don't require the RPC to succeed; we need to allow it to fail to
preserve availability. This is because an earlier incarnation of the
coordinator may have requested the node to shut down already, so the
new coordinator will fail the RPC as the node is already dead. This also
improves availability in general - if the node dies while we're in
`left_token_ring`, we can proceed.

We don't lose safety from that, since we'll ban the node (later commit).
We only lose a bit of user experience if there's a failure at this
decommission step - the decommissioning node may hang, never receiving
the RPC (it will be necessary to shut it down manually).

Another complication arising from banning the node is that it won't be
able to leave group 0 on its own; by the time it tries that, it may have
already been banned by the cluster (the coordinator moves the node to
`left` state after telling it to shut down). So we get rid of the
`leave_group0` step from `raft_decommission()` (which simplifies the
function too), putting a `remove_from_raft_config` inside the
coordinator code instead - after we told the node to shut down.
(Removing the node from configuration is also another reason why we need
to allow the above RPC to fail; the node won't be able to handle the
request once it's outside the configuration, because it handles all
coordinator requests by starting a read barrier.)

Finally, a complication arises when the coordinator is the
decommissioning node. The node would shut down in the middle of handling
the `left_token_ring` state, leading to harmless but awkward errors even
though there were no node/network failures (the original coordinator
would fail the `left_token_ring` state logic; a new coordinator would take
over and do it again, this time succeeding). We fix that by checking if
we're the decommissioning node at the beginning of `left_token_ring`
state handler, and if so, stepping down from leadership by becoming a
nonvoter first.
2023-06-20 13:03:46 +02:00
Kamil Braun
b8ddfd9ef9 raft topology: introduce left_token_ring state
We want for the decommissioning node to wait before shutting down until
every node learns that it left the token ring. Otherwise some nodes may
still try coordinating writes to that nodes after it already shut down,
leading to unnecessary failures on the data path(e.g. for CL=ALL writes).

Before this change, a node would shut down immediately after observing
that it was in `left` state; some other nodes may still see it in
`decommissioning` state and the topology transition state as
`write_both_read_new`, so they'd try to write to that node.

After this change, the node first enters the `left_token_ring` state
before entering `left`, while the topology transition state is removed
(so we've finished the token ring change - the node no longer has tokens
in the ring, but it's still part of the topology). There we perform a
read barrier, allowing all nodes to observe that the decommissioning
node has indeed left the token ring. Only after that barrier succeeds we
allow the node to shut down.
2023-06-20 13:03:46 +02:00
Kamil Braun
c94c07804d raft topology: raft_topology_cmd implicit constructor
Saves some redundant typing when passing `raft_topology_cmd` parameters,
so we can change this:
```
raft_topology_cmd{raft_topology_cmd::command::fence_old_reads}
```
into this:
```
raft_topology_cmd::command::fence_old_reads
```
2023-06-20 13:03:46 +02:00
Kamil Braun
8b152361f4 Merge 'raft topology: fixes after #13884' from Gusev Petr
This PR fixes some problems found after the PR was merged:
  * missed `node_to_work_on` assignment in `handle_topology_transition`;
  * change error reporting in `update_fence_version` from `on_internal_error` to regular exceptions, since that exceptions can happen during normal operation.
  * `update_fence_version` has beed moved  after `group0_service.setup_group0_if_exist` in `main.cc`, otherwise we use uninitialized `token_metadata::version` and get an error.

Fixes: #14303

Closes #14292

* github.com:scylladb/scylladb:
  main.cc: move update_fence_version after group0_service.setup_group0_if_exist
  shared_token_metadata: update_fence_version: on_internal_error -> throw
  storage_service: handle_topology_transition: fix missed node assignment
2023-06-20 13:02:17 +02:00
Kamil Braun
732feca115 storage_proxy: query_partition_key_range_concurrent: don't access empty range
`query_partition_range_concurrent` implements an optimization when
querying a token range that intersects multiple vnodes. Instead of
sending a query for each vnode separately, it sometimes sends a single
query to cover multiple vnodes - if the intersection of replica sets for
those vnodes is large enough to satisfy the CL and good enough in terms
of the heat metric. To check the latter condition, the code would take
the smallest heat metric of the intersected replica set and compare them
to smallest heat metrics of replica sets calculated separately for each
vnode.

Unfortunately, there was an edge case that the code didn't handle: the
intersected replica set might be empty and the code would access an
empty range.

This was catched by an assertion added in
8db1d75c6c by the dtest
`test_query_dc_with_rf_0_does_not_crash_db`.

The fix is simple: check if the intersected set is empty - if so, don't
calculate the heat metrics because we can decide early that the
optimization doesn't apply.

Also change the `assert` to `on_internal_error`.

Fixes #14284

Closes #14300
2023-06-20 07:56:40 +03:00
Kamil Braun
aa2ccb3ac4 Merge 'raft topology: wait_for_peers_to_enter_synchronize_state doesn't need to resolve all IPs' from Mikołaj Grzebieluch
Another node can stop after it joined the group0 but before it
advertised itself in gossip. `get_inet_addrs` will try to resolve all
IPs and `wait_for_peers_to_enter_synchronize_state` will loop
indefinitely.

But `wait_for_peers_to_enter_synchronize_state` can return early if one
of the nodes confirms that the upgrade procedure has finished. For that,
it doesn't need the IPs of all group 0 members - only the IP of some
nodes which can do the confirmation.

This pr restructures the code so that IPs of nodes are resolved inside
the `max_concurrent_for_each` that
`wait_for_peers_to_enter_synchronize_state` performs. Then, even if some
IPs won't be resolved, but one of the nodes confirms a successful
upgrade, we can continue.

Fixes #13543

Closes #14046

* github.com:scylladb/scylladb:
  raft topology: test: check if aborted node replacing blocks bootstrap
  raft topology: `wait_for_peers_to_enter_synchronize_state` doesn't need to resolve all IPs
2023-06-19 12:31:27 +02:00
Petr Gusev
1770feebda storage_service: handle_topology_transition: fix missed node assignment
This defect remained after the refactoring of
exec_global_command in #13884.
2023-06-19 11:26:57 +04:00