Now the user can do
CREATE KEYSPACE ... WITH TABLETS = { 'enabled': false }
to turn tablets off. It will be useful in the future to opt-out keyspace
from tablets when they will be turned on by default based on cluster
features only.
Also one can do just
CREATE KEYSPACE ... WITH TABLETS = { 'enabled': true }
and let Scylla select the initial tablets value by its own
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch changes the syntax of enabling tablets from
CREATE KEYSPACE ... WITH REPLICATION = { ..., 'initial_tablets': <int> }
to be
CREATE KEYSPACE ... WITH TABLETS = { 'initial': <int> }
and updates all tests accordingly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If user configured zero initial tablets (spoiler: or this value was set
automagically when enabling tablets begind the scenes) we still need
some value to start with and this patch calculates one.
The math is based on topology and RF so that all shards are covered:
initial_tablets = max(nr_shards_in(dc) / RF_in(dc) for dc in datacenters)
The estimation is done when a table is created, not when the keyspace is
created. For that, the keyspace is configured with zero initial tabled,
and table-creation time zero is converted into auto-estimated value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For correctness sstable cleanup has to run between (some) topology
changes. Sometimes even a failed topology change may require running
the cleanup. The series introduces automatic sstable cleanup step to the
topology change coordinator. Unlike other operations it is not represented
as a global transition state, but done by each node independently which
allows cleanup to run without locking the topology state machine so
tablet code can run in parallel with the cleanup.
It is done by having a cleanup state flag for each node in the
topology. The flag is a tri state: "clean" - the node is clean, "needed"
- cleanup is needed (but not running), "running" - cleanup is running. No
topology operation can proceed if there is a node in "running" state, but
some operation can proceed even if there are nodes in "needed" state. If
the coordinator needs to perform a topology operation that cannot run while
there are nodes that need cleanup the coordinator will start one
automatically and continue only after cleanup completes. There is also a
possibility to kick cleanup manually through the new RAFT API call.
* 'cleanup-needed-v8' of https://github.com/gleb-cloudius/scylla:
test: add test for automatic cleanup procedure
test: add test for topology requests queue management
storage_service: topology coordinator: add error injection point to be able to pause the topology coordinator
storage_service: topology coordinator: add logging to removenode and decommission
storage_service: topology_coordinator: introduce cleanup REST API integrated with the topology coordinator
storage_service: topology coordinator: manage cluster cleanup as part of the topology management
storage_service: topology coordinator: provide a version of get_excluded_nodes that does not need node_to_work_on as a parameter
test: use servers_see_each_other when needed
test: add servers_see_each_other helper
storage_service: topology coordinator: make topology coordinator lifecycle subscriber
system_keyspace: raft topology: load ignore nodes parameter together with removenode topology request
storage_service: topology coordinator: introduce sstable cleanup fiber
storage_proxy: allow to wait for all ongoing writes
storage_service: topology coordinator: mark nodes as needing cleanup when required
storage_service: add mark_nodes_as_cleanup_needed function
vnode_effective_replication_map: add get_all_pending_nodes() function
vnode_effective_replication_map: pre calculate dirty endpoints during topology change
raft topology: add cleanup state to the topology state machine
The test runs two bootstraps and checks that there is no cleanup
in between. Then it runs a decommission and checks that cleanup runs
automatically and then it runs one more decommission and checks that no
cleanup runs again. Second part checks manual cleanup triggering. It
adds a node, triggers cleanup through the REST API, checks that is runs,
decommissions a node and check that the cleanup did not run again.
This test creates a 5 node cluster with 2 down nodes (A and B). After
that it creates a queue of 3 topology operation: bootstrap, removenode
A and removenode B with ignore_nodes=A. Check that all operation
manage to complete. Then it downs one node and creates a queue with
two requests: bootstrap and decommission. Since none can proceed both
should be canceled.
Introduce new REST API "/storage_service/cleanup_all"
that, when triggered, instructs the topology coordinator to initiate
cluster wide cleanup on all dirty nodes. It is done by introducing new
global command "global_topology_request::cleanup".
Sometimes it is unsafe to start a new topology operation before cleanup
runs on dirty nodes. This patch detects the situation when the topology
operation to be executed cannot be run safely until all dirty nodes do
cleanup and initiates the cleanup automatically. It also waits for
cleanup to complete before proceeding with the topology operation.
There can be a situation that nodes that needs cleanup dies and will
never clear the flag. In this case if a topology operation that wants to
run next does not have this node in its ignore node list it may stuck
forever. To fix this the patch also introduces the "liveness aware"
request queue management: we do not simple choose _a_ request to run next,
but go over the queue and find requests that can proceed considering
the nodes liveness situation. If there are multiple requests eligible to
run the patch introduces the order based on the operation type: replace,
join, remove, leave, rebuild. The order is such so to not trigger cleanup
needlessly.
* seastar 0ffed835...8b9ae36b (4):
> net/posix: Track ap-server ports conflict
Fixes#16720
> include/seastar/core: do not include unused header
> build: expose flag like -std=c++20 via seastar.pc
> src: include used headers for C++ modules build
Closesscylladb/scylladb#16769
In the next patch we want to abort topology operations if there is no
enough live nodes to perform them. This will break tests that do a
topology operation right after restarting a node since a topology
coordinator may still not see the restarted node as alive. Fix all those
tests to wait between restart and a topology operation until UP state
propagates.
We want to change the coordinator to consider nodes liveness when
processing the topology operation queue. If there is no enough live
nodes to process any of the ops we want to cancel them. For that to work
we need to be able to kick the coordinator if liveness situation
changes.
Introduce a fiber that waits on a topology event and when it sees that
the node it runs on needs to perform sstable cleanup it initiates one
for each non tablet, non local table and resets "cleanup" flag back to
"clean" in the topology.
We want to be able to wait for all writes started through the storage
proxy before a fence is advanced. Add phased_barrier that is entered
on each local write operation before checking the fence to do so. A
write will be either tracked by the phased_barrier or fenced. This will
be needed to wait for all non fenced local writes to complete before
starting a cleanup.
A cleanup needs to run when a node loses an ownership of a range (during
bootstrap) or if a range movement to an normal node failed (removenode,
decommission failure). Mark all dirty node as "cleanup needed" in those cases.
The function creates a mutation that sets cleanup to "needed" for each
normal node that, according to the erm, has data it does not own after
successful or unsuccessful topology operation.
Add a function that returns all nodes that have vnode been moved to them
during a topology change operation. Needed to know which nodes need to
do cleanup in case of failed topology change operation.
Some topology change operations causes some nodes loose ranges. This
information is needed to know which nodes need to do cleanup after
topology operation completes. Pre calculate it during erm creation.
The patch adds cleanup state to the persistent and in memory state and
handles the loading. The state can be "clean" which means no cleanup
needed, "needed" which means the node is dirty and needs to run cleanup
at some point, "running" which means that cleanup is running by the node
right now and when it will be completed the state will be reset to "clean".
The loop in `id2ip` lambda makes problems if we are applying an old raft
log that contains long-gone nodes. In this case, we may never receive
the `IP` for a node and stuck in the loop forever. In this series we
replace the loop with an if - we just don't update the `host_id <-> ip`
mapping in the `token_metadata.topology` if we don't have an `IP` yet.
The PR moves `host_id -> IP` resolution to the data plane, now it
happens each time the IP-based methods of `erm` are called. We need this
because IPs may not be known at the time the erm is built. The overhead
of `raft_address_map` lookup is added to each data plane request, but it
should be negligible. In this PR `erm/resolve_endpoints` continues to
treat missing IP for `host_id` as `internal_error`, but we plan to relax
this in the follow-up (see this PR first comment).
Closesscylladb/scylladb#16639
* github.com:scylladb/scylladb:
raft ips: rename gossiper_state_change_subscriber_proxy -> raft_ip_address_updater
gossiper_state_change_subscriber_proxy: call sync_raft_topology_nodes
storage_service: topology_state_load: remove IP waiting loop
storage_service: sync_raft_topology_nodes: add target_node parameter
storage_service: sync_raft_topology_nodes: move loops to the end
storage_service: sync_raft_topology_nodes: rename extract process_left_node and process_transition_node
storage_service: sync_raft_topology_nodes: rename add_normal_node -> process_normal_node
storage_service: sync_raft_topology_nodes: move update_topology up
storage_service: topology_state_load: remove clone_async/clear_gently overhead
storage_service: fix indentation
storage_service: extract sync_raft_topology_nodes
storage_service: topology_state_load: move remove_endpoint into mutate_token_metadata
address_map: move gossiper subscription logic into storage_service
topology_coordinator: exec_global_command: small refactor, use contains + reformat
storage_service: wait_for_ip for new nodes
storage_service.idl.hh: fix raft_topology_cmd.command declaration
erm: for_each_natural_endpoint_until: use is_vnode == true
erm: switch the internal data structures to host_id-s
erm: has_pending_ranges: switch to host_id
When a node changes its IP we need to store the mapping in
system.peers and update token_metadata.topology and erm
in-memory data structures.
The test_change_ip was improved to verify this new
behaviour. Before this patch the test didn't check
that IPs used for data requests are updated on
IP change. In this commit we add the read/write check.
It fails on insert with 'node unavailable'
error without the fix.
The loop makes problems if we are applying an old
raft log that contains long-gone nodes. In this case, we may
never receive the IP for a node and stuck in the loop forever.
The idea of the patch is to replace the loop with an
if - we just don't update the host_id <-> ip mapping
in the token_metadata.topology if we don't have an IP yet.
When we get the mapping later, we'll call
sync_raft_topology_nodes again from
gossiper_state_change_subscriber_proxy.
If it's set, instead of going over all the nodes in raft topology,
the function will update only the specified node. This parameter
will be used in the next commit, in the call to sync_raft_topology_nodes
from gossiper_state_change_subscriber_proxy.
In the following commits we need part of the
topology_state_load logic to be applied
from gossiper_state_change_subscriber_proxy.
In this commit we extract this logic into a
new function sync_raft_topology_nodes.
In the next commit we extract the loops by nodes into
a new function, in this commit we just move them
closer to each other.
Now the remove_endpoint function might be called under
token_metadata_lock (mutate_token_metadata takes it).
It's not a problem since gossiper event handlers in
raft_topology mode doesn't modify token_metadata so
we won't get a deadlock.
We are going to remove the IP waiting loop from topology_state_load
in subsequent commits. An IP for a given host_id may change
after this function has been called by raft. This means we need
to subscribe to the gossiper notifications and call it later
with a new id<->ip mapping.
In this preparatory commit we move the existing address_map
update logic into storage_service so that in later commits
we can enhance it with topology_state_load call.
When a new node joins the cluster we need to be sure that it's IP
is known to all other nodes. In this patch we do this by waiting
for the IP to appear in raft_address_map.
A new raft_topology_cmd::command::wait_for_ip command is added.
It's run on all nodes of the cluster before we put the topology
into transition state. This applies both to new and replacing nodes.
It's important to run wait_for_ip before moving to
topology::transition_state::join_group0 since in this state
node IPs are already used to populate pending nodes in erm.
In a longevity test reported in scylladb/scylladb#16668 we observed that
NORMAL state is not being properly handled for a node that replaced
another node. Either handle_state_normal is not being called, or it is
but getting stuck in the middle. Which is the case couldn't be
determined from the logs, and attempts at creating a local reproducer
failed.
Thus the plan is to continue debugging using the longevity test, but we need
more logs. To check whether `handle_state_normal` was called and which branches
were taken, include some INFO level logs there. Also, detect deadlocks inside
`gossiper::lock_endpoint` by reporting an error message if `lock_endpoint`
waits for the lock for too long.
Ref: scylladb/scylladb#16668Closesscylladb/scylladb#16733
* github.com:scylladb/scylladb:
gossiper: report error when waiting too long for endpoint lock
gossiper: store source_location instead of string in endpoint_permit
storage_service: more verbose logging in handle_state_normal
Compilation fails with recent boost versions (>=1.79.0) due to an
ambiguity with the align_up function call. Fix that by adding type
inference to the function call.
Fixes#16746
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#16747
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we
* define a formatter for `db::consistency_level`
* drop its `operator<<`, as it is not used anymore
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16755
This change is intended to remove the dependency to
operator<<(std::ostream&, const std::unordered_set<T>&)
from auth_resource_test.cc.
It prepares the test for removal of the templated helpers
from utils/to_string.hh, which is one of goals of the
referenced issue that is linked below.
Refs: #13245
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closesscylladb/scylladb#16754
This is an optimisation - for_each_natural_endpoint_until is
called only for vnode tokens, we don't need to run the
binary search for it in tm.first_token.
Also the function is made private since it's only used
in erm itself.