Commit Graph

652 Commits

Author SHA1 Message Date
Benny Halevy
a1acf6854b everywhere: reduce dependencies on i_partitioner.hh
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-05 20:47:44 +02:00
Benny Halevy
6de1cc2993 locator: resolve the dependency of token_metadata.hh on token_range_splitter.hh
define token_metadata_ptr in token_metadata_fwd.hh
So that the declaration of `make_splitter` can be moved
to token_range_splitter.hh, where it belongs,
and so token_metadata.hh won't have to include it.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-05 20:01:29 +02:00
Tomasz Grabiec
d5539e080d tablets: Implement cleanup step
This change adds a stub for tablet cleanup on the replica side and wires
it into the tablet migration process.

The handling on replica side is incomplete because it doesn't remove
the actual data yet. It only flushes the memtables, so that all data
is in sstables and none requires a memtable flush.

This patch is necessary to make decommission work. Otherwise, a
memtable flush would happen when the decommissioned node is put in the
drained state (as in nodetool drain) and it would fail on missing host
id mapping (node is no longer in topology), which is examined by the
tablet sharder when producing sstable sharding metadata. Leading to
abort due to failed memtable flush.
2023-09-14 12:45:10 +02:00
Tomasz Grabiec
6a62aca3a9 locator: Introduce tablet_metadata_guard
Will be used to synchronize long-running tablet operations with
topology coordinator.

It blocks barriers like erm_ptr, but refreshes if change is
irrelevant, so behaves as if the erm_ptr's scope was narrowed down to
a single tablet.
2023-09-14 12:45:10 +02:00
Tomasz Grabiec
532ec84210 locator, replica: Add a way to wait for table's effective_replication_map change 2023-09-14 12:08:54 +02:00
Benny Halevy
7119c1d8cc token_metadata: update_topology: make endpoint_dc_rack arg optional
It's better to pass a disengaged optional when
the caller doesn't have the information rather than
passing the default dc_rack location so the latter
will never implicitly override a known endpoint dc/rack location.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15300
2023-09-11 16:16:19 +02:00
Botond Dénes
b062b245ad Merge 'Don't cache dc:rack on system keyspace local cache' from Pavel Emelyanov
The local node's dc:rack pair is cached on system keyspace on start. However, most of other code don't need it as they get dc:rack from topology or directly from snitch. There are few places left that still mess with sysks cache, but they are easy to patch. So after this patch all the core code uses two sources of dc:rack -- topology / snitch -- instead of three.

Closes #15280

* github.com:scylladb/scylladb:
  system_keyspace: Don't require snitch argument on start
  system_keyspace: Don't cache local dc:rack pair
  system_keyspace: Save local info with explicit location
  storage_service: Get endpoint location from snitch, not system keyspace
  snitch: Introduce and use get_location() method
  repair: Local location variables instead of system keyspace's one
  repair: Use full endpoint location instead of datacenter part
2023-09-11 10:26:26 +03:00
Benny Halevy
574c7e349a locator: topology: is_configured_this_node: delete spurious semicolumn
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-06 12:24:09 +03:00
Benny Halevy
115462be17 locator: topology: is_configured_this_node: compare host_id first
Since 5d1f60439a we have
this node's host_id in topology config, so it can be used
to determine this node when adding it.

Prepare for extending the token_metadata interface
to provide host_id in update_topology.

We would like to compare the host_id first to be able to distinguish
this node from a node we're replacing that may have the same ip address
(but different host_id).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-06 12:24:09 +03:00
Pavel Emelyanov
d2bd203cba snitch: Introduce and use get_location() method
There are some places out there that generate locator::endpoint_dc_rack
pair out of snitch's get_datacenter() and get_rack() calls. Generalize
those with snitch's new method. It will also be used by next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-05 12:52:30 +03:00
Benny Halevy
5afc242814 token_metadata: get_endpoint_to_host_id_map_for_reading: just inform that normal node has null host_id
It is too early to require that all nodes in normal state
have a non-null host_id.

The assertion was added in 44c14f3e2b
but unfortunately there are several call sites where
we add the node as normal, but without a host_id
and we patch it in later on.

In the future we should be able to require that
once we identify nodes by host_id over gossiper
and in token_metadata.

Fixes scylladb/scylladb#15181

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #15184
2023-08-28 21:40:55 +03:00
Botond Dénes
e7af2a7de8 Merge 'token_metadata::get_endpoint_to_host_id_map_for_reading: restrict to token owners' from Benny Halevy
And verify the they returned host_id isn't null.
Call on_internal_error_noexcept in that case
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.

Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793

Closes #14844

* github.com:scylladb/scylladb:
  api: storage_service: improve description of /storage_service/host_id
  token_metadata: get_endpoint_to_host_id_map_for_reading: restrict to token owners
2023-08-23 13:55:14 +03:00
Kamil Braun
cdc3cd2b79 Merge 'raft: add fencing tests' from Petr Gusev
In this PR a simple test for fencing is added. It exercises the data
plane, meaning if it somehow happens that the node has a stale topology
version, then requests from this node will get an error 'stale
topology'. The test just decrements the node version manually through
CQL, so it's quite artificial. To test a more real-world scenario we
need to allow the topology change fiber to sometimes skip unavailable
nodes. Now the algorithm fails and retries indefinitely in this case.

The PR also adds some logs, and removes one seemingly redundant topology
version increment, see the commit messages for details.

Closes #14901

* github.com:scylladb/scylladb:
  test_fencing: add test_fence_hints
  test.py: output the skipped tests
  test.py: add skip_mode decorator and fixture
  test.py: add mode fixture
  hints: add debug log for dropped hints
  hints: send_one_hint: extend the scope of file_send_gate holder
  pylib: add ScyllaMetrics
  hints manager: add send_errors counter
  token_metadata: add debug logs
  fencing: add simple data plane test
  random_tables.py: add counter column type
  raft topology: don't increment version when transitioning to node_state::normal
2023-08-22 16:28:21 +02:00
Petr Gusev
fa25e6d63e token_metadata: add debug logs
We log the new version when the new token
metadata is set.

Also, the log for fence_version is moved
in shared_token_metadata from storage_service
for uniformity.
2023-08-22 14:31:04 +04:00
Benny Halevy
44c14f3e2b token_metadata: get_endpoint_to_host_id_map_for_reading: restrict to token owners
And verify the they returned host_id isn't null.
Call on_internal_error_noexcept in that case
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.

Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-08-21 09:16:42 +03:00
Tomasz Grabiec
bd8bb5d4b1 Merge 'Wire tablet into compaction group' from Raphael "Raph" Carvalho
Compaction group is the data plane for tablets, so this integration
allows each tablet to have its own storage (memtable + sstables).
A crucial step for dynamic tablets, where each tablet can be worked
on independently.

There are still some inefficiencies to be worked on, but as it is,
it already unlocks further development.

```
INFO  2023-07-27 22:43:38,331 [shard 0] init - loading tablet metadata
INFO  2023-07-27 22:43:38,333 [shard 0] init - loading non-system sstables
INFO  2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 0 present for ks.cf
INFO  2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 2 present for ks.cf
INFO  2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 4 present for ks.cf
INFO  2023-07-27 22:43:38,354 [shard 0] table - Tablet with id 6 present for ks.cf
INFO  2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 1 present for ks.cf
INFO  2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 3 present for ks.cf
INFO  2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 5 present for ks.cf
INFO  2023-07-27 22:43:38,428 [shard 1] table - Tablet with id 7 present for ks.cf
```

Closes #14863

* github.com:scylladb/scylladb:
  Kill scylla option to configure number of compaction groups
  replica: Wire tablet into compaction group
  token_metadata: Add this_host_id to topology config
  replica: Switch to chunked_vector for storing compaction groups
  replica: Generate group_id for compaction_group on demand
2023-08-18 15:17:17 +02:00
Raphael S. Carvalho
5d1f60439a token_metadata: Add this_host_id to topology config
The motivation is that token_metadata::get_my_id() is not available
early in the bootstrap process, as raft topology is pulled later
than new tables are registered and created, and this node is added
to topology even later.

To allow creation of compaction groups to retrieve "my id" from
token metadata early, initialization will now feed local id
into topology config which is immutable for each node anyway.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-08-16 18:23:44 -03:00
Raphael S. Carvalho
9400b79658 gce_snitch: Fix use-after-move in load_config()
The use-after-move is not very harmful as it's only used when
handling exception. So user would be left with a bogus message.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #15054
2023-08-15 10:23:57 +03:00
Benny Halevy
949ea43034 topology: unindex_node: erase dc from datacenters when empty
In branch 5.2 we erase `dc` from `_datacenters` if there are
no more endpoints listed in `_dc_endpoints[dc]`.

This was lost unintentionally in f3d5df5448
and this commit restores that behavior, and fixes test_remove_endpoint.

Fixes scylladb/scylladb#14896

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14897
2023-08-02 09:08:24 +03:00
Avi Kivity
dac93b2096 Merge 'Concurrent tablet migration and balancing' from Tomasz Grabiec
This change makes tablet load balancing more efficient by performing
migrations independently for different tablets, and making new load
balancing plans concurrently with active migrations.

The migration track is interrupted by pending topology change operations.

The coordinator executes the load balancer on edges of tablet state
machine transitions. This allows new migrations to be started as soon
as tablets finish streaming.

The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way algorithm is implemented.

Overload of shards is limited by the fact that load balancer algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.

Closes #14851

* github.com:scylladb/scylladb:
  tablets: load_balancer: Remove double logging
  tests: tablets: Check that load balancing is interrupted by topology change
  tests: tablets: Add test for load balancing with active migrations
  tablets: Balance tablets concurrently with active migrations
  storage_service, tablets: Extract generate_migration_updates()
  storage_service, tablets: Move get_leaving_replica() to tablets.cc
  locator: tablets: Move std::hash definition earlier
  storage_service: Advance tablets independently
  topology_coordinator: Fix missed notification on abort
  tablets: Add formatter for tablet_migration_info
2023-07-31 16:44:33 +03:00
Benny Halevy
d903d03bf8 locator: topology: node::state: make fine grained
Currently the node::state is coarse grained
so one cannot distinguish between e.g. a leaving
node due to decommission (where the node is used
for reading) vs. due to remove node (where the
node is not used for reading).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-07-31 10:33:48 +03:00
Tomasz Grabiec
8fdbc42e71 tests: tablets: Add test for load balancing with active migrations 2023-07-31 01:45:23 +02:00
Tomasz Grabiec
fe181b3bac tablets: Balance tablets concurrently with active migrations
After this change, the load balancer can make progress with active
migrations. If the algorithm is called with active tablet migrations
in tablet metadata, those are treated by load balancer as if they were
already completed. This allows the algorithm to incrementally make
decision which when executed with active migrations will produce the
desired result.

Overload of shards is limited by the fact that the algorithm tracks
streaming concurrency on both source and target shards of active
migrations and takes concurrency limit into account when producing new
migrations.

The coordinator executes the load balancer on edges of tablet state
machine stransitions. This allows new migrations to be started as soon
as tablets finish streaming.

The load balancer is also continuously invoked as long as it produces
a non-empty plan. This is in order to saturate the cluster with
streaming. A single make_plan() call is still not saturating, due
to the way algorithm is implemented.
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
fbc6076e6a storage_service, tablets: Move get_leaving_replica() to tablets.cc
For better encapsulation of tablet-specific code.
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
18a59ab5ff locator: tablets: Move std::hash definition earlier
Will be needed in order to define a struct which has
unordered_set<tablet_replica> as a field.
2023-07-31 01:45:23 +02:00
Tomasz Grabiec
6f4a35f9ae service: tablet_allocator: Introduce tablet load balancer
Will be invoked by the topology coordinator later to decide
which tablets to migrate.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
d59b8d316c tablets: Introduce tablet_map::for_each_tablet() 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
0e3eac29d0 topology: Introduce get_node() 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
f2fdf37415 token_metadata: Add non-const getter of tablet_metadata
Needed for tests.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
e3a8bb7ec9 tablets: Introduce global_tablet_id
Identifies tablet in the scope of the whole cluster. Not to be
confused with tablet replicas, which all share global_tablet_id.

Will be needed by load balancer and tablet migration algorithm to
identify tablets globally.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
8cf92d4c86 tablets: Turn tablet_id into a struct
The IDL compiler cannot deal with enum classes like this.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
91dee5c872 tablets: effective_replication_map: Take transition stage into account when computing replicas 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
dc2ec3f81c tablets: Store "stage" in transition info
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.

See the "Tablet migration" section of topology-over-raft.md
2023-07-25 21:08:02 +02:00
Tomasz Grabiec
7851694eaa locator: erm: Make get_endpoints_for_reading() always return read replicas
Just a simplification.

Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with exception:
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid, first node inserts
its tokens as normal without going through bootstrap procedure.
2023-07-25 21:08:01 +02:00
Kefu Chai
3129ae3c8c treewide: compare signed and unsigned using std::cmp_*()
when comparing signed and unsigned numbers, the compiler promotes
the signed number to coomon type -- in this case, the unsigned type,
so they can be compared. but sometimes, it matters. and after the
promotion, the comparison yields the wrong result. this can be
manifested using a short sample like:

```
int main(int argc, char **argv) {
    int x = -1;
    unsigned y = 2;
    fmt::print("{}\n", x < y);
    return 0;
}
```

this error can be identified by `-Werror=sign-compare`, but before
enabling this compiling option. let's use `std::cmp_*()` to compare
them.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 10:27:18 +08:00
Patryk Jędrzejczak
7ae7be0911 locator: remove this_host_id from topology::config
The `locator::topology::config::this_host_id` field is redundant
in all places that use `locator::topology::config`, so we can
safely remove it.

Closes #14638

Closes #14723
2023-07-17 14:57:36 +02:00
Petr Gusev
3737bf8fa2 topology.cc: unindex_node: _dc_racks removal fix
The eps reference was reused to manipulate
the racks dictionary. This resulted in
assigning a set of nodes from the racks
dictionary to an element of the _dc_endpoints dictionary.

The problem was demonstrated by the dtest
test_decommission_last_node_in_rack
(scylladb/scylla-dtest#3299).
The test set up four nodes, three on one rack
and one on another, all within a single data
center (dc). It then switched to a
'network_topology_strategy' for one keyspace
and tried to decommission the single node
on the second rack. This decomission command
with error message 'zero replica after the removal.'
This happened because unindex_node assigned
the empty list from the second rack
as a value for the single dc in
_dc_endpoints dictionary. As a result,
we got empty nodes list for single dc in
natural_endpoints_tracker::_all_endpoints,
node_count == 0 in data_center_endpoints,
_rf_left == 0, so
network_topology_strategy::calculate_natural_endpoints
rejected all the endpoints and returned an empty
endpoint_set. In
repair_service::do_decommission_removenode_with_repair
this caused the 'zero replica after the removal' error.

With this fix the test passes both with
--consistent-cluster-management option and
without it.

The specific unit test for this problem was added.

Fixes: #14184

Closes #14673
2023-07-13 11:16:01 +03:00
Avi Kivity
d645e7a515 Update seastar submodule
locator/*_snitch.cc updated for http::reply losing the _status_code
member without a deprecation notice.

* seastar 99d28ff057...2b7a341210 (23):
  > Merge 'Prefault memory when --lock-memory 1 is specified' from Avi Kivity
Fixes #8828.
  > reactor: use structured binding when appropriate
  > Simplify payload length and mask parsing.
  > memcached: do not used deprecated API
  > build: serialize calls to openssl certificate generation
  > reactor: epoll backend: initialize _highres_timer_pending
  > shared_ptr: deprecate lw_shared_ptr operator=(T&&)
  > tests: fail spawn_test if output is empty
  > Support specifying the "build root" in configure
  > Merge 'Cleanup RPC request/response frames maintenance' from Pavel Emelyanov
  > build: correct the syntax error in comment
  > util: print_safe: fix hex print functions
  > Add code examples for handling exceptions
  > smp: warn if --memory parameter is not supported
  > Merge 'gate: track holders' from Benny Halevy
  > file: call lambda with std::invoke()
  > deleter: Delete move and copy constructors
  > file: fix the indent
  > file: call close() without the syscall thread
  > reactor: use s/::free()/::io_uring_free_probe()/
  > Merge 'seastar-json2code: generate better-formatted code' from Kefu Chai
  > reactor: Don't re-evaliate local reactor for thread_pool
  > Merge 'Improve http::reply re-allocations and copying in client' from Pavel Emelyanov

Closes #14602
2023-07-10 16:07:12 +03:00
Marcin Maliszkiewicz
c5de25be4c locator: use deferred_close in azure and gcp snitches
Close needs to be called even if function throws in the middle.

Closes #14458
2023-07-07 11:08:10 +02:00
Tomasz Grabiec
ebdebb982b locator: network_topology_startegy: Allocate shards to tablets
Uses a simple algorihtm for allocating shards which chooses
least-loaded shard on a given node, encapsulated in load_sketch.

Takes load due to current tablet allocation into account.

Each tablet, new or allocated for other tables, is assumed to have an
equal load weight.
2023-06-21 00:58:25 +02:00
Tomasz Grabiec
e110167a2a locator: Store node shard count in topology
Will be needed by tablet allocator.
2023-06-21 00:58:25 +02:00
Tomasz Grabiec
353ce1a6d1 locator: Make sharder accessible through effective_replication_map
For tablets, sharding depends on replication map, so the scope of the
sharder should be effective_replicaion_map rather than the schema
object.

Existing users will be transitioned incrementally in later patches.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
22ab100b41 tablets: Implement tablet sharder 2023-06-21 00:58:24 +02:00
Tomasz Grabiec
e44e6033d8 tablets: Include pending replica in get_shard()
We need to move get_shard() from tablet_info to tablet_map in order to
have access to transition_info.
2023-06-21 00:58:24 +02:00
Petr Gusev
246eaec14e shared_token_metadata: update_fence_version: on_internal_error -> throw
on_internal_error is wrong for fence_version
condition violation, since in case of
topology change coordinator migrating to another
node we can have raft_topology_cmd::command::fence
command from the old coordinator running in
parallel with the fence command (or topology version
upgrading raft command) from the new one.
The comment near the raft_topology_cmd::command::fence
handling describes this situation, assuming an exception
is thrown in this case.
2023-06-20 13:39:17 +04:00
Petr Gusev
f6b019c229 raft topology: add fence_version
It's stored outside of topology table,
since it's updated not through RAFT, but
with a new 'fence' raft command.
The current value is cached in shared_token_metadata.
An initial fence version is loaded in main
during storage_service initialisation.
2023-06-15 15:48:00 +04:00
Petr Gusev
4f99302c2b raft_topology: add barrier_and_drain cmd
We use utils::phased_barrier. The new phase
is started each time the version is updated.
We track all instances of token_metadata,
when an instance is destroyed the
corresponding phased_barrier::operation is
released.
2023-06-15 15:48:00 +04:00
Petr Gusev
253d8a8c65 token_metadata: add topology version
It's stored in as a static column in topology table,
will be updated at various steps of the topology
change state machine.

The initial value is 1, zero means that topology
versions are not yet supported, will be
used in RPC handling.
2023-06-15 15:48:00 +04:00
Avi Kivity
26c8470f65 treewide: use #include <seastar/...> for seastar headers
We treat Seastar as an external library, so fix the few places
that didn't do so to use angle brackets.

Closes #14037
2023-06-06 08:36:09 +03:00
Petr Gusev
819d710753 vnode_erm: optimize get_range_addresses
In get_range_addresses we are iterating
over vnode tokens, don't need to do
binary search for them in tmptr->first_token,
they can be directly used as keys
for _replication_map.
2023-05-24 12:16:37 +04:00