Since a given tablet belongs to a single shard on both repair master and repair
followers, row level repair code needs to be changed to work on a single
shard for a given tablet. In order to tell the repair followers which
shard to work on, a dst_cpu_id value is passed over rpc from the repair
master.
Tablet streaming involves asynchronous RPCs to other replicas which transfer writes. We want side-effects from streaming only within the migration stage in which the streaming was started. This is currently not guaranteed on failure. When streaming master fails (e.g. due to RPC failing), it can be that some streaming work is still alive somewhere (e.g. RPC on wire) and will have side-effects at some point later.
This PR implements tracking of all operations involved in streaming which may have side-effects, which allows the topology change coordinator to fence them and wait for them to complete if they were already admitted.
The tracking and fencing is implemented by using global "sessions", created for streaming of a single tablet. Session is globally identified by UUID. The identifier is assigned by the topology change coordinator, and stored in system.tablets. Sessions are created and closed based on group0 state (tablet metadata) by the barrier command sent to each replica, which we already do on transitions between stages. Also, each barrier waits for sessions which have been closed to be drained.
The barrier is blocked only if there is some session with work which was left behind by unsuccessful streaming. In which case it should not be blocked for long, because streaming process checks often if the guard was left behind and stops if it was.
This mechanism of tracking is fault-tolerant: session id is stored in group0, so coordinator can make progress on failover. The barriers guarantee that session exists on all replicas, and that it will be closed on all replicas.
Closesscylladb/scylladb#15847
* github.com:scylladb/scylladb:
test: tablets: Add test for failed streaming being fenced away
error_injection: Introduce poll_for_message()
error_injection: Make is_enabled() public
api: Add API to kill connection to a particular host
range_streamer: Do not block topology change barriers around streaming
range_streamer, tablets: Do not keep token metadata around streaming
tablets: Fail gracefully when migrating tablet has no pending replica
storage_service, api: Add API to disable tablet balancing
storage_service, api: Add API to migrate a tablet
storage_service, raft topology: Run streaming under session topology guard
storage_service, tablets: Use session to guard tablet streaming
tablets: Add per-tablet session id field to tablet metadata
service: range_streamer: Propagate topology_guard to receivers
streaming: Always close the rpc::sink
storage_service: Introduce concept of a topology_guard
storage_service: Introduce session concept
tablets: Fix topology_metadata_guard holding on to the old erm
docs: Document the topology_guard mechanism
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Recent seastar update included RPC metrics (scylladb/seastar#1753). The
reported metrics groups together sockets based on their "metrics_domain"
configuration option. This patch makes use of this domain to make scylla
metrics sane.
The domain as this patch defines it includes two strings:
First, the datacenter the server lives in. This is because grouping
metrics for connections to different datacenters makes little sense for
several reasons. For example -- packet delays _will_ differ for local-DC
vs cross-DC traffic and mixing those latencies together is pointless.
Another example -- the amount of traffic may also differ for local- vs
cross-DC connections e.g. because of different usage of enryption and/or
compression.
Second, each verb-idx gets its own domain. That's to be able to analyze
e.g. query-related traffic from gossiper one. For that the existing
isolation cookie is taken as is.
Note, that the metrics is _not_ per-server node. So e.g. two gossiper
connections to two different nodes (in one DC) will belong to the same
domain and thus their stats will be summed when reported.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#15785
to have feature parity with `configure.py`. we won't need this
once we migrate to C++20 modules. but before that day comes, we
need to stick with C++ headers.
we generate a rule for each .hh files to create a corresponding
.cc and then compile it, in order to verify the self-containness of
that header. so the number of rule is quite large, to avoid the
unnecessary overhead. the check-header target is enabled only if
`Scylla_CHECK_HEADERS` option is enabled.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15913
It is possible the sender and receiver of streaming nodes have different
views on if a table is dropped or not.
For example:
- n1, n2 and n3 in the cluster
- n4 started to join the cluster and stream data from n1, n2, n3
- a table was dropped
- n4 failed to write data from n2 to sstable because a table was dropped
- n4 ended the streaming
- n2 checked if the table was present and would ignore the error if the table was dropped
- however n2 found the table was still present and was not dropped
- n2 marked the streaming as failed
This will fail the streaming when a table is dropped. We want streaming to
ignore such dropped tables.
In this patch, a status code is sent back to the sender to notify the
table is dropped so the sender could ignore the dropped table.
Fixes#15370Closesscylladb/scylladb#15912
messaging_service.cc depends on idl, but many source files in
scylla-main do no depend on idl, so let's
* move "message/*" into its own directory and add an inter-library
dependency between it and the "idl" library.
* rename the target of "message" under test/manual to "message_test"
to avoid the name collision
this should address the compilation failure of
```
FAILED: CMakeFiles/scylla-main.dir/message/messaging_service.cc.o
/usr/bin/clang++ -DBOOST_NO_CXX98_FUNCTION_BASE -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_BROKEN_SOURCE_LOCATION -DSEASTAR_DEBUG -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/cmake/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/cmake/seastar/gen/include -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-mismatched-tags -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-ignored-qualifiers -march=westmere -Og -g -gz -std=gnu++20 -fvisibility=hidden -U_FORTIFY_SOURCE -Wno-error=unused-result "-Wno-error=#warnings" -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -MD -MT CMakeFiles/scylla-main.dir/message/messaging_service.cc.o -MF CMakeFiles/scylla-main.dir/message/messaging_service.cc.o.d -o CMakeFiles/scylla-main.dir/message/messaging_service.cc.o -c /home/kefu/dev/scylladb/message/messaging_service.cc
/home/kefu/dev/scylladb/message/messaging_service.cc:81:10: fatal error: 'idl/join_node.dist.hh' file not found
^~~~~~~~~~~~~~~~~~~~~~~
```
where the compiler failed to find the included `idl/join_node.dist.hh`,
which is exposed by the idl library as part of its public interface.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#15657
The `join_node_request` and `join_node_response` RPCs are added:
- `join_node_request` is sent from the joining node to any node in the
cluster. It contains some initial parameters that will be verified by
the receiving node, or the topology coordinator - notably, it contains
a list of cluster features supported by the joining node.
- `join_node_response` is sent from the topology coordinator to the
joining node to tell it about the the outcome of the verification.
Move node_ops related classes to node_ops/ so that they
are consistently grouped and could be access from
many modules.
Closes#15351
* github.com:scylladb/scylladb:
node_ops: extract classes related to node operations
node_ops: repair: move node_ops_id to node_ops directory
This change adds a stub for tablet cleanup on the replica side and wires
it into the tablet migration process.
The handling on replica side is incomplete because it doesn't remove
the actual data yet. It only flushes the memtables, so that all data
is in sstables and none requires a memtable flush.
This patch is necessary to make decommission work. Otherwise, a
memtable flush would happen when the decommissioned node is put in the
drained state (as in nodetool drain) and it would fail on missing host
id mapping (node is no longer in topology), which is examined by the
tablet sharder when producing sstable sharding metadata. Leading to
abort due to failed memtable flush.
Node operations will be integrated with task manager and so node_ops
directory needs to be created. To have an access to node ops related
classes from task manager and preserve consistent naming, move
the classes to node_ops/node_ops_data.cc.
Use an abort_source in group0_state_machine
to abort an ongoing transfer_snapshot operation
on group0_state_machine::abort()
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Using a centrally generated compaction-time, generated on the repair
master and propagated to all repair followers. For repair it is
imperative that all participants use the exact same compaction time,
otherwise there can be artificial differences between participants,
generating unnecessary repair activity.
If a repair follower doesn't get a compaction-time from the repair
master, it uses a locally generated one. This is no worse than the
previous state of each node being on some undefined state of compaction.
This is to make m.s. initialization more solid and simplify sys.ks.::setup()
Closes#14832
* github.com:scylladb/scylladb:
system_keyspace: Remove unused snitch arg from setup()
messaging_service: Setup preferred IPs from config
Population of messageing service preferred IPs cache happens inside
system keyspace setup() call and it needs m.s. per ce and additionally
snitch. Moving preferred ip cache to initial configuration keeps m.s.
start more self-contained and keeps system_keyspace::setup() simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When messaging_service shuts down it first sets _shutting_down to true
and proceeds with stopping clients and servers. Stopping clients, in
turn, is calling client.stop() on each.
Setting _shutting_down is used in two places.
First, when a client is stopped it may happen that it's in the middle of
some operation, which may result in call to remove_error_rpc_client()
and not to call .stop() for the second time it just does nothing if the
shutdown flag is set (see 357c91a076).
Second, get_rpc_client() asserts that this flag is not set, so once
shutdown started it can make sure that it will call .stop() on _all_
clients and no new ones would appear in parallel.
However, after shutdown() is complete the _clients vector of maps
remains intact even though all clients from it are stopped. This is not
very debugging-friendly, the clients are better be removed on shutdown.
fixes: #14624
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#14632
Fixes#14299
failure_detector can try sending messages to TLS endpoints before start_listen
has been called (why?). Need TLS initialized before this. So do on service creation.
Closes#14493
`handle_state_normal` may drop connections to the handled node. This
causes spurious failures if there's an ongoing concurrent operation.
This problem was already solved twice in the past in different contexts:
first in 53636167ca, then in
79ee38181c.
Time to fix it for the third time. Now we do this right after enabling
gossiping, so hopefully it's the last time.
This time it's causing snapshot transfer failures in group 0. Although
the transfer is retried and eventually succeeds, the failed transfer is
wasted work and causes an annoying ERROR message in the log which
dtests, SCT, and I don't like.
The fix is done by moving the `wait_for_normal_state_handled_on_boot()`
call before `setup_group0()`. But for the wait to work correctly we must
first ensure that gossiper sees an alive node, so we precede it with
`wait_for_live_node_to_show_up()` (before this commit, the call site of
`wait_for_normal_state_handled_on_boot` was already after this wait).
There is another problem: the bootstrap procedure is racing with gossiper
marking nodes as UP, and waiting for other nodes to be NORMAL doesn't guarantee
that they are also UP. If gossiper is quick enough, everything will be fine.
If not, problems may arise such as streaming or repair failing due to nodes
still being marked as DOWN, or the CDC generation write failing.
In general, we need all NORMAL nodes to be up for bootstrap to proceed.
One exception is replace where we ignore the replaced node. The
`sync_nodes` set constructed for `wait_for_normal_state_handled_on_boot`
takes this into account, so we also use it to wait for nodes to be UP.
As explained in commit messages and comments, we only do these
waits outside raft-based-topology mode.
This should improve CI stability.
Fixes: #12972
Refs: #14042Closes#14354
* github.com:scylladb/scylladb:
messaging_service: print which connections are dropped due to missing topology info
storage_service: wait for nodes to be UP on bootstrap
storage_service: wait for NORMAL state handler before `setup_group0()`
storage_service: extract `gossiper::wait_for_live_nodes_to_show_up()`
This connection dropping caused us to spend a lot of time debugging.
Those debugging sessions would be shorter if Scylla logs indicated that
connections are being dropped and why.
Connection drops for a given node are a one-time event - we only do it
if we establish a connection to a node without topology info, which
should only happen before we handle the node's NORMAL status for the
first time. So it's a rare thing and we can log it on INFO level without
worrying about log spam.
Calling `ban_host` causes the following:
- all connections from that host are dropped,
- any further attempts to connect will be rejected (the connection will
be immediately dropped) when receiving the `CLIENT_ID` verb.
When a node first establishes a connection to another node, it always
sending a `CLIENT_ID` one-way RPC first. The message contains some
metadata such as `broadcast_address`.
Include the `host_id` of the sender in that RPC. On the receiving side,
store a mapping from that `host_id` to the connection that was just
opened.
This mapping will be used later when we ban nodes that we remove from
the cluster.
The RPC server now has a lighter .shutdown() method that just does what
m.s. shutdown() needs, so call it. On stop call regular stop to finalize
the stopping process
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it do_with_servers() and make it accept method to call and message
to print. This gives the ability to reuse this helper in next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some assorted cleanups here: consolidation of schema agreement waiting
into a single place and removing unused code from the gossiper.
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/1458/
Reviewed-by: Konstantin Osipov <kostja@scylladb.com>
* gleb/gossiper-cleanups of github.com:scylladb/scylla-dev:
storage_service: avoid unneeded copies in on_change
storage_service: remove check that is always true
storage_service: rename handle_state_removing to handle_state_removed
storage_service: avoid string copy
storage_service: delete code that handled REMOVING_TOKENS state
gossiper: remove code related to advertising REMOVING_TOKEN state
migration_manager: add wait_for_schema_agreement() function
On connection setup, the isolation cookie of the connection is matched
to the appropriate scheduling group. This is achieved by iterating over
the known statement tenant connection types as well as the system
connections and choosing the one with a matching name.
If a match is not found, it is assumed that the cluster is upgraded and
the remote node has a scheduling group the local one doesn't have. To
avoid demoting a scheduling group of unknown importance, in this case the
default scheduling group is chosen.
This is problematic when upgrading an OSS cluster to an enterprise
version, as the scheduling groups of the enterprise service-levels will
match none of the statement tenants and will hence fall-back to the
default scheduling group. As a consequence, while the cluster is mixed,
user workload on old (OSS) nodes, will be executed under the system
scheduling group and concurrency semaphore. Not only does this mean that
user workloads are directly competing for resources with system ones,
but the two workloads are now sharing the semaphore too, reducing the
available throughput. This usually manifests in queries timing out on
the old (OSS) nodes in the cluster.
This patch proposes to fix this, by recognizing that the unknown
scheduling group is in fact a tenant this node doesn't know yet, and
matching it with the default statement tenant.
With this, order should be restored, with service-level connections
being recognized as user connections and being executed in the statement
scheduling group and the statement (user) concurrency semaphore.
We have a set amount of connection types for each tenant. The amount of
these connection types can change. Although currently these are
hardcoded in a single place, soon (in the next patch) there will be yet
another place where these will be used. To avoid duplicating these
names, making future changes error prone, centralize them in a const
array, generalizing the concept of a tenant connection type.
Make sure that the int64_t generation we get over rpc
fits in the int32_t generation_type we keep locally.
Restrict this assertion to non-release builds.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The legacy failure_detector is now unused and can be removed.
TODO: integare direct_failure_detector with failure_detector api.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
the goal of this change is to reduce the dependency on
`operator<<(ostream&, const gms::inet_address&)`.
this is not an exhaustive search-and-replace change, as in some
caller sites we have other dependencies to yet-converted ostream
printer, we cannot fix them all, this change only updates some
caller of `operator<<(ostream&, const gms::inet_address&)`.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Empty for now. Will be used later by the topology coordinator to
communicate with other nodes to instruct them to start streaming,
or start to fence read/writes.
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.
Closes#12858
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step forward towards librarization or modularization of the
source base.
Closes#12788
We already check is remote's node topology is missing before creating a
connection, but local node topology can be missing too when we will use
raft to manage it. Raft needs to be able to create connections before
topology is knows.
Message-Id: <20221228144944.3299711-7-gleb@scylladb.com>