mirror of
https://github.com/versity/scoutfs.git
synced 2026-01-09 05:13:18 +00:00
Use quorum slots and background election work
Previously the quorum configuration specified the number of votes needed to elect the leader. This was an excessive amount of freedom in the configuration of the cluster and created all sorts of problems that had to be designed around. Most acutely, it required a probabilistic mechanism for mounts to persistently record that they're starting a server so that future servers could find and possibly fence them. They would write to a lot of quorum blocks and trust that it was unlikely that future servers would overwrite all of their written blocks. Overwriting was always possible, which would be bad enough, but it also required so much IO that we had to use long election timeouts to avoid spurious fencing. These longer timeouts had already gone wrong on some storage configurations, leading to hung mounts.

To fix this and other problems we see coming, like live membership changes, we now specifically configure the number and identity of the mounts that participate in quorum voting. With specific identities, mounts have a corresponding specific block they can write to and which future servers can read to see if they're still running.

We change the quorum config in the super block from a single quorum_count to an array of quorum slots which specify the address of the mount assigned to each slot. The mount argument to specify a quorum voter changes from "server_addr=$addr" to "quorum_slot_nr=$nr", which specifies the mount's slot. The slot's address is used for UDP election messages and TCP server connections.

Now that we have configured unique IP addresses for all the quorum members, we can use UDP messages to send and receive the vote messages in the raft protocol to elect a leader. The quorum code no longer has to read and write disk block votes and is a more reasonable core loop that either waits for received network messages or for timeouts to advance the raft election state machine.

The quorum blocks are now used by slots to store their persistent raft term and to set their leader state. We have event fields in the block to record the timestamp of the most recent interesting events that happened to the slot.

Now that raft doesn't use IO, we can leave the quorum election work running in the background. The raft work in the quorum members is always running, so we can use a much more typical raft implementation with heartbeats. Critically, this decouples the client and election life cycles. Quorum is always running and is responsible for starting and stopping the server. The client repeatedly tries to connect to a server; it has nothing to do with deciding to participate in quorum.

Finally, we add a quorum/status sysfs file which shows the state of the quorum raft protocol in a member mount and the last messages that were sent to or received from the other members.

Signed-off-by: Zach Brown <zab@versity.com>
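The election described above hinges on two small calculations: a randomized election timeout so that members' timeouts rarely expire at the same instant and split the vote, and a strict majority over the configured slots. A minimal userspace sketch, not the kernel implementation (`elect_timeout_ms()` and `votes_needed()` are hypothetical helper names; the constants mirror the values added to format.h in this commit):

```c
#include <assert.h>
#include <stdlib.h>

/* constants mirrored from the new format.h values in this commit */
#define SCOUTFS_QUORUM_ELECT_MIN_MS 250
#define SCOUTFS_QUORUM_ELECT_VAR_MS 100

/* hypothetical helper: a randomized timeout in
 * [ELECT_MIN_MS, ELECT_MIN_MS + ELECT_VAR_MS) makes it unlikely that
 * members stand as candidates simultaneously and split the vote */
static unsigned int elect_timeout_ms(void)
{
	return SCOUTFS_QUORUM_ELECT_MIN_MS +
	       (rand() % SCOUTFS_QUORUM_ELECT_VAR_MS);
}

/* hypothetical helper: a candidate needs votes from a strict majority
 * of the configured quorum slots before it becomes leader */
static int votes_needed(int nr_slots)
{
	return nr_slots / 2 + 1;
}
```

With three configured slots a candidate needs two votes (its own plus one), so a single slot can be down and elections still complete.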
@@ -34,13 +34,10 @@
/*
 * The client is responsible for maintaining a connection to the server.
 * This includes managing quorum elections that determine which client
 * should run the server that all the clients connect to.
 */

#define CLIENT_CONNECT_DELAY_MS (MSEC_PER_SEC / 10)
#define CLIENT_CONNECT_TIMEOUT_MS (1 * MSEC_PER_SEC)
#define CLIENT_QUORUM_TIMEOUT_MS (5 * MSEC_PER_SEC)

struct client_info {
	struct super_block *sb;
@@ -303,27 +300,17 @@ out:
 * to the server.  It's queued on mount and disconnect and we requeue
 * the work if the work fails and we're not shutting down.
 *
 * In the typical case a mount reads the super blocks and finds the
 * address of the currently running server and connects to it.
 * Non-quorum member clients who can't connect will keep trying
 * alternating reading the address and getting connect timeouts.
 *
 * Quorum members will try to elect a leader if they can't connect to
 * the server.  When then can't connect and are able to elect a leader
 * then a new server is started.  The new server will write its address
 * in the super and everyone will be able to connect.
 * We ask quorum for an address to try and connect to.  If there isn't
 * one, or it fails, we back off a bit before trying again.
 *
 * There's a tricky bit of coordination required to safely unmount.
 * Clients need to tell the server that they won't be coming back with a
 * farewell request.  Once a client receives its farewell response it
 * can exit.  But a majority of quorum members need to stick around to
 * elect a server to process all their farewell requests.  This is
 * coordinated by having the greeting tell the server that a client is a
 * quorum member.  The server then holds on to farewell requests from
 * members until only requests from the final quorum remain.  These
 * farewell responses are only sent after updating an unmount barrier in
 * the super to indicate to the final quorum that they can safely exit
 * without having received a farewell response over the network.
 * farewell request.  Once the server processes a farewell request from
 * the client it can forget the client.  If the connection is broken
 * before the client gets the farewell response it doesn't want to
 * reconnect to send it again.. instead the client can read the metadata
 * device to check for the lack of an item which indicates that the
 * server has processed its farewell.
 */
static void scoutfs_client_connect_worker(struct work_struct *work)
{
@@ -333,11 +320,9 @@ static void scoutfs_client_connect_worker(struct work_struct *work)
	struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
	struct scoutfs_super_block *super = NULL;
	struct mount_options *opts = &sbi->opts;
	const bool am_quorum = opts->server_addr.sin_addr.s_addr != 0;
	const bool am_quorum = opts->quorum_slot_nr >= 0;
	struct scoutfs_net_greeting greet;
	struct sockaddr_in sin;
	ktime_t timeout_abs;
	u64 elected_term;
	int ret;

	super = kmalloc(sizeof(struct scoutfs_super_block), GFP_NOFS);
@@ -359,36 +344,14 @@ static void scoutfs_client_connect_worker(struct work_struct *work)
		goto out;
	}

	/* try to connect to the super's server address */
	scoutfs_addr_to_sin(&sin, &super->server_addr);
	if (sin.sin_addr.s_addr != 0 && sin.sin_port != 0)
		ret = scoutfs_net_connect(sb, client->conn, &sin,
					  CLIENT_CONNECT_TIMEOUT_MS);
	else
		ret = -ENOTCONN;

	if (ret < 0) {
		/* non-quorum members will delay then retry connect */
		if (!am_quorum)
			goto out;

		/* quorum members try to elect a leader */
		/* make sure local server isn't writing super during votes */
		scoutfs_server_stop(sb);

		timeout_abs = ktime_add_ms(ktime_get(),
					   CLIENT_QUORUM_TIMEOUT_MS);

		ret = scoutfs_quorum_election(sb, timeout_abs,
					le64_to_cpu(super->quorum_server_term),
					&elected_term);
		/* start the server if we were asked to */
		if (elected_term > 0)
			ret = scoutfs_server_start(sb, &opts->server_addr,
						   elected_term);
	ret = -ENOTCONN;
	ret = scoutfs_quorum_server_sin(sb, &sin);
	if (ret < 0)
		goto out;

	ret = scoutfs_net_connect(sb, client->conn, &sin,
				  CLIENT_CONNECT_TIMEOUT_MS);
	if (ret < 0)
		goto out;
	}

	/* send a greeting to verify endpoints of each connection */
	greet.fsid = super->hdr.fsid;

@@ -139,18 +139,21 @@
	EXPAND_COUNTER(net_recv_invalid_message) \
	EXPAND_COUNTER(net_recv_messages) \
	EXPAND_COUNTER(net_unknown_request) \
	EXPAND_COUNTER(quorum_cycle) \
	EXPAND_COUNTER(quorum_elected_leader) \
	EXPAND_COUNTER(quorum_election_timeout) \
	EXPAND_COUNTER(quorum_failure) \
	EXPAND_COUNTER(quorum_read_block) \
	EXPAND_COUNTER(quorum_read_block_error) \
	EXPAND_COUNTER(quorum_elected) \
	EXPAND_COUNTER(quorum_fence_error) \
	EXPAND_COUNTER(quorum_fence_leader) \
	EXPAND_COUNTER(quorum_read_invalid_block) \
	EXPAND_COUNTER(quorum_saw_super_leader) \
	EXPAND_COUNTER(quorum_timedout) \
	EXPAND_COUNTER(quorum_write_block) \
	EXPAND_COUNTER(quorum_write_block_error) \
	EXPAND_COUNTER(quorum_fenced) \
	EXPAND_COUNTER(quorum_recv_error) \
	EXPAND_COUNTER(quorum_recv_heartbeat) \
	EXPAND_COUNTER(quorum_recv_invalid) \
	EXPAND_COUNTER(quorum_recv_resignation) \
	EXPAND_COUNTER(quorum_recv_vote) \
	EXPAND_COUNTER(quorum_send_heartbeat) \
	EXPAND_COUNTER(quorum_send_resignation) \
	EXPAND_COUNTER(quorum_send_request) \
	EXPAND_COUNTER(quorum_send_vote) \
	EXPAND_COUNTER(quorum_server_shutdown) \
	EXPAND_COUNTER(quorum_term_follower) \
	EXPAND_COUNTER(server_commit_hold) \
	EXPAND_COUNTER(server_commit_queue) \
	EXPAND_COUNTER(server_commit_worker) \

@@ -14,6 +14,7 @@
#define SCOUTFS_BLOCK_MAGIC_SRCH_BLOCK 0x897e4a7d
#define SCOUTFS_BLOCK_MAGIC_SRCH_PARENT 0xb23a2a05
#define SCOUTFS_BLOCK_MAGIC_ALLOC_LIST 0x8a93ac83
#define SCOUTFS_BLOCK_MAGIC_QUORUM 0xbc310868

/*
 * The super block, quorum block, and file data allocation granularity
@@ -54,15 +55,19 @@
#define SCOUTFS_SUPER_BLKNO ((64ULL * 1024) >> SCOUTFS_BLOCK_SM_SHIFT)

/*
 * A reasonably large region of aligned quorum blocks follow the super
 * block.  Each voting cycle reads the entire region so we don't want it
 * to be too enormous.  256K seems like a reasonably chunky single IO.
 * The number of blocks in the region also determines the number of
 * mounts that have a reasonable probability of not overwriting each
 * other's random block locations.
 * A small number of quorum blocks follow the super block, enough of
 * them to match the starting offset of the super block so the region is
 * aligned to the power of two that contains it.
 */
#define SCOUTFS_QUORUM_BLKNO ((256ULL * 1024) >> SCOUTFS_BLOCK_SM_SHIFT)
#define SCOUTFS_QUORUM_BLOCKS ((256ULL * 1024) >> SCOUTFS_BLOCK_SM_SHIFT)
#define SCOUTFS_QUORUM_BLKNO (SCOUTFS_SUPER_BLKNO + 1)
#define SCOUTFS_QUORUM_BLOCKS (SCOUTFS_SUPER_BLKNO - 1)

/*
 * Free metadata blocks start after the quorum blocks
 */
#define SCOUTFS_META_DEV_START_BLKNO \
	((SCOUTFS_QUORUM_BLKNO + SCOUTFS_QUORUM_BLOCKS) >> \
	 SCOUTFS_BLOCK_SM_LG_SHIFT)

/*
 * Start data on the data device aligned as well.

@@ -537,49 +542,77 @@ struct scoutfs_xattr {

#define SCOUTFS_UUID_BYTES 16

/*
 * Mounts read all the quorum blocks and write to one random quorum
 * block during a cycle.  The min cycle time limits the per-mount iop
 * load during elections.  The random cycle delay makes it less likely
 * that mounts will read and write at the same time and miss each
 * other's writes.  An election only completes if a quorum of mounts
 * vote for a leader before any of their elections timeout.  This is
 * made less likely by the probability that mounts will overwrite each
 * others random block locations.  The max quorum count limits that
 * probability.  9 mounts only have a 55% chance of writing to unique 4k
 * blocks in a 256k region.  The election timeout is set to include
 * enough cycles to usually complete the election.  Once a leader is
 * elected it spends a number of cycles writing out blocks with itself
 * logged as a leader.  This reduces the possibility that servers
 * will have their log entries overwritten and not be fenced.
 */
#define SCOUTFS_QUORUM_MAX_COUNT 9
#define SCOUTFS_QUORUM_CYCLE_LO_MS 10
#define SCOUTFS_QUORUM_CYCLE_HI_MS 20
#define SCOUTFS_QUORUM_TERM_LO_MS 250
#define SCOUTFS_QUORUM_TERM_HI_MS 500
#define SCOUTFS_QUORUM_ELECTED_LOG_CYCLES 10
#define SCOUTFS_QUORUM_MAX_SLOTS 15

struct scoutfs_quorum_block {
/*
 * To elect a leader, members race to have their variable election
 * timeouts expire.  If they're first to send a vote request with a
 * greater term to a majority of waiting members they'll be elected with
 * a majority.  If the timeouts are too close, the vote may be split and
 * everyone will wait for another cycle of variable timeouts to expire.
 *
 * These determine how long it will take to elect a leader once there's
 * no evidence of a server (no leader quorum blocks on mount; heartbeat
 * timeout expired.)
 */
#define SCOUTFS_QUORUM_ELECT_MIN_MS 250
#define SCOUTFS_QUORUM_ELECT_VAR_MS 100

/*
 * Once a leader is elected they send out heartbeats at regular
 * intervals to force members to wait the much longer heartbeat timeout.
 * Once heartbeat timeout expires without receiving a heartbeat they'll
 * switch over to performing elections.
 *
 * These determine how long it could take members to notice that a
 * leader has gone silent and start to elect a new leader.
 */
#define SCOUTFS_QUORUM_HB_IVAL_MS 100
#define SCOUTFS_QUORUM_HB_TIMEO_MS (5 * MSEC_PER_SEC)

struct scoutfs_quorum_message {
	__le64 fsid;
	__le64 blkno;
	__le64 version;
	__le64 term;
	__le64 write_nr;
	__le64 voter_rid;
	__le64 vote_for_rid;
	__u8 type;
	__u8 from;
	__u8 __pad[2];
	__le32 crc;
	__u8 log_nr;
	__u8 __pad[3];
	struct scoutfs_quorum_log {
		__le64 term;
		__le64 rid;
		struct scoutfs_inet_addr addr;
	} log[0];
};

#define SCOUTFS_QUORUM_LOG_MAX \
	((SCOUTFS_BLOCK_SM_SIZE - sizeof(struct scoutfs_quorum_block)) / \
	 sizeof(struct scoutfs_quorum_log))
/* a candidate requests a vote */
#define SCOUTFS_QUORUM_MSG_REQUEST_VOTE 0
/* followers send votes to candidates */
#define SCOUTFS_QUORUM_MSG_VOTE 1
/* elected leaders broadcast heartbeats to delay elections */
#define SCOUTFS_QUORUM_MSG_HEARTBEAT 2
/* leaders broadcast as they leave to break heartbeat timeout */
#define SCOUTFS_QUORUM_MSG_RESIGNATION 3
#define SCOUTFS_QUORUM_MSG_INVALID 4

/*
 * The version is currently always 0, but will be used by mounts to
 * discover that membership has changed.
 */
struct scoutfs_quorum_config {
	__le64 version;
	struct scoutfs_quorum_slot {
		struct scoutfs_inet_addr addr;
	} slots[SCOUTFS_QUORUM_MAX_SLOTS];
};

struct scoutfs_quorum_block {
	struct scoutfs_block_header hdr;
	__le64 term;
	__le64 random_write_mark;
	__le64 flags;
	struct scoutfs_quorum_block_event {
		__le64 rid;
		struct scoutfs_timespec ts;
	} write, update_term, set_leader, clear_leader, fenced;
};

#define SCOUTFS_QUORUM_BLOCK_LEADER (1 << 0)

#define SCOUTFS_FLAG_IS_META_BDEV 0x01

@@ -597,12 +630,8 @@ struct scoutfs_super_block {
	__le64 total_data_blocks;
	__le64 first_data_blkno;
	__le64 last_data_blkno;
	__le64 quorum_fenced_term;
	__le64 quorum_server_term;
	__le64 unmount_barrier;
	__u8 quorum_count;
	__u8 __pad[7];
	struct scoutfs_inet_addr server_addr;
	struct scoutfs_quorum_config qconf;
	struct scoutfs_alloc_root meta_alloc[2];
	struct scoutfs_alloc_root data_alloc;
	struct scoutfs_alloc_list_head server_meta_avail[2];

@@ -28,7 +28,7 @@
#include "super.h"

static const match_table_t tokens = {
	{Opt_server_addr, "server_addr=%s"},
	{Opt_quorum_slot_nr, "quorum_slot_nr=%s"},
	{Opt_metadev_path, "metadev_path=%s"},
	{Opt_err, NULL}
};
@@ -43,46 +43,6 @@ u32 scoutfs_option_u32(struct super_block *sb, int token)
	return 0;
}

/* The caller's string is null terminted and can be clobbered */
static int parse_ipv4(struct super_block *sb, char *str,
		      struct sockaddr_in *sin)
{
	unsigned long port = 0;
	__be32 addr;
	char *c;
	int ret;

	/* null term port, if specified */
	c = strchr(str, ':');
	if (c)
		*c = '\0';

	/* parse addr */
	addr = in_aton(str);
	if (ipv4_is_multicast(addr) || ipv4_is_lbcast(addr) ||
	    ipv4_is_zeronet(addr) ||
	    ipv4_is_local_multicast(addr)) {
		scoutfs_err(sb, "invalid unicast ipv4 address: %s", str);
		return -EINVAL;
	}

	/* parse port, if specified */
	if (c) {
		c++;
		ret = kstrtoul(c, 0, &port);
		if (ret != 0 || port == 0 || port >= U16_MAX) {
			scoutfs_err(sb, "invalid port in ipv4 address: %s", c);
			return -EINVAL;
		}
	}

	sin->sin_family = AF_INET;
	sin->sin_addr.s_addr = addr;
	sin->sin_port = cpu_to_be16(port);

	return 0;
}

static int parse_bdev_path(struct super_block *sb, substring_t *substr,
			   char **bdev_path_ret)
{
@@ -132,14 +92,15 @@ out:
int scoutfs_parse_options(struct super_block *sb, char *options,
			  struct mount_options *parsed)
{
	char ipstr[INET_ADDRSTRLEN + 1];
	substring_t args[MAX_OPT_ARGS];
	int nr;
	int token;
	char *p;
	int ret;

	/* Set defaults */
	memset(parsed, 0, sizeof(*parsed));
	parsed->quorum_slot_nr = -1;

	while ((p = strsep(&options, ",")) != NULL) {
		if (!*p)
@@ -147,12 +108,23 @@ int scoutfs_parse_options(struct super_block *sb, char *options,

		token = match_token(p, tokens, args);
		switch (token) {
		case Opt_server_addr:
		case Opt_quorum_slot_nr:
			match_strlcpy(ipstr, args, ARRAY_SIZE(ipstr));
			ret = parse_ipv4(sb, ipstr, &parsed->server_addr);
			if (ret < 0)
			if (parsed->quorum_slot_nr != -1) {
				scoutfs_err(sb, "multiple quorum_slot_nr options provided, only provide one.");
				return -EINVAL;
			}

			ret = match_int(args, &nr);
			if (ret < 0 || nr < 0 ||
			    nr >= SCOUTFS_QUORUM_MAX_SLOTS) {
				scoutfs_err(sb, "invalid quorum_slot_nr option, must be between 0 and %u",
					    SCOUTFS_QUORUM_MAX_SLOTS - 1);
				if (ret == 0)
					ret = -EINVAL;
				return ret;
			}
			parsed->quorum_slot_nr = nr;
			break;
		case Opt_metadev_path:

@@ -6,13 +6,13 @@
#include "format.h"

enum scoutfs_mount_options {
	Opt_server_addr,
	Opt_quorum_slot_nr,
	Opt_metadev_path,
	Opt_err,
};

struct mount_options {
	struct sockaddr_in server_addr;
	int quorum_slot_nr;
	char *metadev_path;
};

kmod/src/quorum.c: 1597 lines changed (diff suppressed because it is too large)
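The suppressed quorum.c rewrite implements the core loop the commit message describes: wait for a received network message or a timeout, then advance the raft election state machine. A minimal userspace sketch of that state machine (hypothetical names and simplified rules; the real kernel code differs) could look like:

```c
#include <assert.h>
#include <stdint.h>

enum role { ROLE_FOLLOWER, ROLE_CANDIDATE, ROLE_LEADER };
enum { MSG_REQUEST_VOTE, MSG_VOTE, MSG_HEARTBEAT, MSG_RESIGNATION };

struct quorum_state {
	enum role role;
	uint64_t term;
	int votes;	/* votes received this term, ourself included */
	int nr_slots;	/* configured quorum slots */
};

/* election timeout fired without hearing from a leader: stand as a
 * candidate for a greater term and vote for ourselves */
static void on_timeout(struct quorum_state *st)
{
	if (st->role != ROLE_LEADER) {
		st->role = ROLE_CANDIDATE;
		st->term++;
		st->votes = 1;
	}
}

/* a received message advances the state machine */
static void on_message(struct quorum_state *st, int type, uint64_t term)
{
	if (term > st->term) {		/* a newer term always wins */
		st->term = term;
		st->role = ROLE_FOLLOWER;
		st->votes = 0;
	}
	if (type == MSG_VOTE && st->role == ROLE_CANDIDATE &&
	    term == st->term && ++st->votes > st->nr_slots / 2)
		st->role = ROLE_LEADER;
	else if (type == MSG_HEARTBEAT && term == st->term &&
		 st->role != ROLE_LEADER)
		st->role = ROLE_FOLLOWER;
}
```

The point of the sketch is the shape of the loop: no IO is needed to make progress, so the work can run in the background for the life of the mount, with heartbeats suppressing elections while a leader is healthy.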
@@ -1,10 +1,15 @@
#ifndef _SCOUTFS_QUORUM_H_
#define _SCOUTFS_QUORUM_H_

int scoutfs_quorum_election(struct super_block *sb, ktime_t timeout_abs,
			    u64 prev_term, u64 *elected_term);
void scoutfs_quorum_clear_leader(struct super_block *sb);
int scoutfs_quorum_server_sin(struct super_block *sb, struct sockaddr_in *sin);
void scoutfs_quorum_server_shutdown(struct super_block *sb);

u8 scoutfs_quorum_votes_needed(struct super_block *sb);
void scoutfs_quorum_slot_sin(struct scoutfs_super_block *super, int i,
			     struct sockaddr_in *sin);

int scoutfs_quorum_setup(struct super_block *sb);
void scoutfs_quorum_shutdown(struct super_block *sb);
void scoutfs_quorum_destroy(struct super_block *sb);

#endif

@@ -1797,118 +1797,69 @@ TRACE_EVENT(scoutfs_lock_message,
		  __entry->old_mode, __entry->new_mode)
);

DECLARE_EVENT_CLASS(scoutfs_quorum_message_class,
	TP_PROTO(struct super_block *sb, u64 term, u8 type, int nr),

TRACE_EVENT(scoutfs_quorum_election,
	TP_PROTO(struct super_block *sb, u64 prev_term),

	TP_ARGS(sb, prev_term),
	TP_ARGS(sb, term, type, nr),

	TP_STRUCT__entry(
		SCSB_TRACE_FIELDS
		__field(__u64, prev_term)
	),

	TP_fast_assign(
		SCSB_TRACE_ASSIGN(sb);
		__entry->prev_term = prev_term;
	),

	TP_printk(SCSBF" prev_term %llu",
		  SCSB_TRACE_ARGS, __entry->prev_term)
);

TRACE_EVENT(scoutfs_quorum_election_ret,
	TP_PROTO(struct super_block *sb, int ret, u64 elected_term),

	TP_ARGS(sb, ret, elected_term),

	TP_STRUCT__entry(
		SCSB_TRACE_FIELDS
		__field(int, ret)
		__field(__u64, elected_term)
	),

	TP_fast_assign(
		SCSB_TRACE_ASSIGN(sb);
		__entry->ret = ret;
		__entry->elected_term = elected_term;
	),

	TP_printk(SCSBF" ret %d elected_term %llu",
		  SCSB_TRACE_ARGS, __entry->ret, __entry->elected_term)
);

TRACE_EVENT(scoutfs_quorum_election_vote,
	TP_PROTO(struct super_block *sb, int role, u64 term, u64 vote_for_rid,
		 int votes, int log_cycles, int quorum_count),

	TP_ARGS(sb, role, term, vote_for_rid, votes, log_cycles, quorum_count),

	TP_STRUCT__entry(
		SCSB_TRACE_FIELDS
		__field(int, role)
		__field(__u64, term)
		__field(__u64, vote_for_rid)
		__field(int, votes)
		__field(int, log_cycles)
		__field(int, quorum_count)
		__field(__u8, type)
		__field(int, nr)
	),

	TP_fast_assign(
		SCSB_TRACE_ASSIGN(sb);
		__entry->role = role;
		__entry->term = term;
		__entry->vote_for_rid = vote_for_rid;
		__entry->votes = votes;
		__entry->log_cycles = log_cycles;
		__entry->quorum_count = quorum_count;
		__entry->type = type;
		__entry->nr = nr;
	),

	TP_printk(SCSBF" role %d term %llu vote_for_rid %016llx votes %d log_cycles %d quorum_count %d",
		  SCSB_TRACE_ARGS, __entry->role, __entry->term,
		  __entry->vote_for_rid, __entry->votes, __entry->log_cycles,
		  __entry->quorum_count)
	TP_printk(SCSBF" term %llu type %u nr %d",
		  SCSB_TRACE_ARGS, __entry->term, __entry->type, __entry->nr)
);
DEFINE_EVENT(scoutfs_quorum_message_class, scoutfs_quorum_send_message,
	TP_PROTO(struct super_block *sb, u64 term, u8 type, int nr),
	TP_ARGS(sb, term, type, nr)
);
DEFINE_EVENT(scoutfs_quorum_message_class, scoutfs_quorum_recv_message,
	TP_PROTO(struct super_block *sb, u64 term, u8 type, int nr),
	TP_ARGS(sb, term, type, nr)
);

DECLARE_EVENT_CLASS(scoutfs_quorum_block_class,
	TP_PROTO(struct super_block *sb, struct scoutfs_quorum_block *blk),
TRACE_EVENT(scoutfs_quorum_loop,
	TP_PROTO(struct super_block *sb, int role, u64 term, int vote_for,
		 unsigned long vote_bits, struct timespec64 timeout),

	TP_ARGS(sb, blk),
	TP_ARGS(sb, role, term, vote_for, vote_bits, timeout),

	TP_STRUCT__entry(
		SCSB_TRACE_FIELDS
		__field(__u64, blkno)
		__field(__u64, term)
		__field(__u64, write_nr)
		__field(__u64, voter_rid)
		__field(__u64, vote_for_rid)
		__field(__u32, crc)
		__field(__u8, log_nr)
		__field(int, role)
		__field(int, vote_for)
		__field(unsigned long, vote_bits)
		__field(unsigned long, vote_count)
		__field(unsigned long long, timeout_sec)
		__field(int, timeout_nsec)
	),

	TP_fast_assign(
		SCSB_TRACE_ASSIGN(sb);
		__entry->blkno = le64_to_cpu(blk->blkno);
		__entry->term = le64_to_cpu(blk->term);
		__entry->write_nr = le64_to_cpu(blk->write_nr);
		__entry->voter_rid = le64_to_cpu(blk->voter_rid);
		__entry->vote_for_rid = le64_to_cpu(blk->vote_for_rid);
		__entry->crc = le32_to_cpu(blk->crc);
		__entry->log_nr = blk->log_nr;
		__entry->term = term;
		__entry->role = role;
		__entry->vote_for = vote_for;
		__entry->vote_bits = vote_bits;
		__entry->vote_count = hweight_long(vote_bits);
		__entry->timeout_sec = timeout.tv_sec;
		__entry->timeout_nsec = timeout.tv_nsec;
	),

	TP_printk(SCSBF" blkno %llu term %llu write_nr %llu voter_rid %016llx vote_for_rid %016llx crc 0x%08x log_nr %u",
		  SCSB_TRACE_ARGS, __entry->blkno, __entry->term,
		  __entry->write_nr, __entry->voter_rid, __entry->vote_for_rid,
		  __entry->crc, __entry->log_nr)
);
DEFINE_EVENT(scoutfs_quorum_block_class, scoutfs_quorum_read_block,
	TP_PROTO(struct super_block *sb, struct scoutfs_quorum_block *blk),
	TP_ARGS(sb, blk)
);
DEFINE_EVENT(scoutfs_quorum_block_class, scoutfs_quorum_write_block,
	TP_PROTO(struct super_block *sb, struct scoutfs_quorum_block *blk),
	TP_ARGS(sb, blk)
	TP_printk(SCSBF" term %llu role %d vote_for %d vote_bits 0x%lx vote_count %lu timeout %llu.%u",
		  SCSB_TRACE_ARGS, __entry->term, __entry->role,
		  __entry->vote_for, __entry->vote_bits, __entry->vote_count,
		  __entry->timeout_sec, __entry->timeout_nsec)
);

/*

@@ -59,7 +59,6 @@ struct server_info {
|
||||
int err;
|
||||
bool shutting_down;
|
||||
struct completion start_comp;
|
||||
struct sockaddr_in listen_sin;
|
||||
u64 term;
|
||||
struct scoutfs_net_connection *conn;
|
||||
|
||||
@@ -1362,7 +1361,7 @@ static void farewell_worker(struct work_struct *work)
|
||||
/* send as many responses as we can to maintain quorum */
|
||||
while ((fw = list_first_entry_or_null(&reqs, struct farewell_request,
|
||||
entry)) &&
|
||||
(nr_mounted > super->quorum_count ||
|
||||
(nr_mounted > scoutfs_quorum_votes_needed(sb) ||
|
||||
nr_unmounting >= nr_mounted)) {
|
||||
|
||||
list_move_tail(&fw->entry, &send);
|
||||
@@ -1544,18 +1543,17 @@ static void scoutfs_server_worker(struct work_struct *work)
|
||||
struct super_block *sb = server->sb;
|
||||
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
|
||||
struct scoutfs_super_block *super = &sbi->super;
|
||||
struct mount_options *opts = &sbi->opts;
|
||||
struct scoutfs_net_connection *conn = NULL;
|
||||
DECLARE_WAIT_QUEUE_HEAD(waitq);
|
||||
struct sockaddr_in sin;
|
||||
LIST_HEAD(conn_list);
|
||||
u64 max_vers;
|
||||
int ret;
|
||||
int err;
|
||||
|
||||
trace_scoutfs_server_work_enter(sb, 0, 0);
|
||||
|
||||
sin = server->listen_sin;
|
||||
|
||||
scoutfs_quorum_slot_sin(super, opts->quorum_slot_nr, &sin);
|
||||
scoutfs_info(sb, "server setting up at "SIN_FMT, SIN_ARG(&sin));
|
||||
|
||||
conn = scoutfs_net_alloc_conn(sb, server_notify_up, server_notify_down,
|
||||
@@ -1575,9 +1573,6 @@ static void scoutfs_server_worker(struct work_struct *work)
|
||||
goto out;
|
||||
}
|
||||
|
||||
if (ret)
|
||||
goto out;
|
||||
|
||||
/* start up the server subsystems before accepting */
|
||||
ret = scoutfs_read_super(sb, super);
|
||||
if (ret < 0)
|
||||
@@ -1617,19 +1612,6 @@ static void scoutfs_server_worker(struct work_struct *work)
|
||||
if (ret)
|
||||
goto shutdown;
|
||||
|
||||
/*
|
||||
* Write our address in the super before it's possible for net
|
||||
* processing to start writing the super as part of
|
||||
* transactions. In theory clients could be trying to connect
|
||||
* to our address without having seen it in the super (maybe
|
||||
* they saw it a long time ago).
|
||||
*/
|
||||
scoutfs_addr_from_sin(&super->server_addr, &sin);
|
||||
super->quorum_server_term = cpu_to_le64(server->term);
|
||||
ret = scoutfs_write_super(sb, super);
|
||||
if (ret < 0)
|
||||
goto shutdown;
|
||||
|
||||
/* start accepting connections and processing work */
|
||||
server->conn = conn;
|
||||
scoutfs_net_listen(sb, conn);
|
||||
@@ -1656,30 +1638,14 @@ shutdown:
|
||||
scoutfs_lock_server_destroy(sb);
|
||||
|
||||
out:
|
||||
scoutfs_quorum_clear_leader(sb);
|
||||
scoutfs_net_free_conn(sb, conn);
|
||||
|
||||
/* let quorum know that we've shutdown */
|
||||
scoutfs_quorum_server_shutdown(sb);
|
||||
|
||||
scoutfs_info(sb, "server stopped at "SIN_FMT, SIN_ARG(&sin));
|
||||
trace_scoutfs_server_work_exit(sb, 0, ret);
|
||||
|
||||
/*
|
||||
* Always try to clear our presence in the super so that we're
|
||||
* not fenced. We do this last because other mounts will try to
|
||||
* reach quorum the moment they see zero here. The later we do
|
||||
* this the longer we have to finish shutdown while clients
|
||||
* timeout.
|
||||
*/
|
||||
err = scoutfs_read_super(sb, super);
|
||||
if (err == 0) {
|
||||
super->quorum_fenced_term = cpu_to_le64(server->term);
|
||||
memset(&super->server_addr, 0, sizeof(super->server_addr));
|
||||
err = scoutfs_write_super(sb, super);
|
||||
}
|
||||
if (err < 0) {
|
||||
scoutfs_err(sb, "failed to clear election term %llu at "SIN_FMT", this mount could be fenced",
|
||||
server->term, SIN_ARG(&sin));
|
||||
}
|
||||
|
||||
server->err = ret;
|
||||
complete(&server->start_comp);
|
||||
}
|
||||
@@ -1689,14 +1655,12 @@ out:
 * the super block's fence_term has been set to the new server's term so
 * that it won't be fenced.
 */
int scoutfs_server_start(struct super_block *sb, struct sockaddr_in *sin,
			 u64 term)
int scoutfs_server_start(struct super_block *sb, u64 term)
{
	DECLARE_SERVER_INFO(sb, server);

	server->err = 0;
	server->shutting_down = false;
	server->listen_sin = *sin;
	server->term = term;
	init_completion(&server->start_comp);

@@ -69,8 +69,7 @@ int scoutfs_server_apply_commit(struct super_block *sb, int err);

struct sockaddr_in;
struct scoutfs_quorum_elected_info;
int scoutfs_server_start(struct super_block *sb, struct sockaddr_in *sin,
			 u64 term);
int scoutfs_server_start(struct super_block *sb, u64 term);
void scoutfs_server_abort(struct super_block *sb);
void scoutfs_server_stop(struct super_block *sb);

@@ -176,7 +176,8 @@ static int scoutfs_show_options(struct seq_file *seq, struct dentry *root)
	struct super_block *sb = root->d_sb;
	struct mount_options *opts = &SCOUTFS_SB(sb)->opts;

	seq_printf(seq, ",server_addr="SIN_FMT, SIN_ARG(&opts->server_addr));
	if (opts->quorum_slot_nr >= 0)
		seq_printf(seq, ",quorum_slot_nr=%d", opts->quorum_slot_nr);
	seq_printf(seq, ",metadev_path=%s", opts->metadev_path);

	return 0;
@@ -192,20 +193,19 @@ static ssize_t metadev_path_show(struct kobject *kobj,
}
SCOUTFS_ATTR_RO(metadev_path);

static ssize_t server_addr_show(struct kobject *kobj,
static ssize_t quorum_server_nr_show(struct kobject *kobj,
				struct kobj_attribute *attr, char *buf)
{
	struct super_block *sb = SCOUTFS_SYSFS_ATTRS_SB(kobj);
	struct mount_options *opts = &SCOUTFS_SB(sb)->opts;

	return snprintf(buf, PAGE_SIZE, SIN_FMT"\n",
			SIN_ARG(&opts->server_addr));
	return snprintf(buf, PAGE_SIZE, "%d\n", opts->quorum_slot_nr);
}
SCOUTFS_ATTR_RO(server_addr);
SCOUTFS_ATTR_RO(quorum_server_nr);

static struct attribute *mount_options_attrs[] = {
	SCOUTFS_ATTR_PTR(metadev_path),
	SCOUTFS_ATTR_PTR(server_addr),
	SCOUTFS_ATTR_PTR(quorum_server_nr),
	NULL,
};

@@ -257,15 +257,12 @@ static void scoutfs_put_super(struct super_block *sb)
	scoutfs_item_destroy(sb);
	scoutfs_forest_destroy(sb);

	/* the server locks the listen address and compacts */
	scoutfs_quorum_destroy(sb);
	scoutfs_lock_shutdown(sb);
	scoutfs_server_destroy(sb);
	scoutfs_net_destroy(sb);
	scoutfs_lock_destroy(sb);

	/* server clears quorum leader flag during shutdown */
	scoutfs_quorum_destroy(sb);

	scoutfs_block_destroy(sb);
	scoutfs_destroy_triggers(sb);
	scoutfs_options_destroy(sb);
@@ -390,17 +387,8 @@ static int scoutfs_read_super_from_bdev(struct super_block *sb,

	/* XXX do we want more rigorous invalid super checking? */

	if (super->quorum_count == 0 ||
	    super->quorum_count > SCOUTFS_QUORUM_MAX_COUNT) {
		scoutfs_err(sb, "super block has invalid quorum count %u, must be > 0 and <= %u",
			    super->quorum_count, SCOUTFS_QUORUM_MAX_COUNT);
		ret = -EINVAL;
		goto out;
	}

	if (invalid_blkno_limits(sb, "meta",
				 (SCOUTFS_QUORUM_BLKNO + SCOUTFS_QUORUM_BLOCKS)
					<< SCOUTFS_BLOCK_SM_LG_SHIFT,
				 SCOUTFS_META_DEV_START_BLKNO,
				 super->first_meta_blkno,
				 super->last_meta_blkno, sbi->meta_bdev,
				 SCOUTFS_BLOCK_LG_SHIFT) ||
@@ -605,8 +593,8 @@ static int scoutfs_fill_super(struct super_block *sb, void *data, int silent)
	scoutfs_setup_trans(sb) ?:
	scoutfs_lock_setup(sb) ?:
	scoutfs_net_setup(sb) ?:
	scoutfs_quorum_setup(sb) ?:
	scoutfs_server_setup(sb) ?:
	scoutfs_quorum_setup(sb) ?:
	scoutfs_client_setup(sb) ?:
	scoutfs_lock_rid(sb, SCOUTFS_LOCK_WRITE, 0, sbi->rid,
			 &sbi->rid_lock) ?: