Previously the quorum configuration specified the number of votes needed
to elect the leader. This gave the cluster configuration an excessive
amount of freedom, which created all sorts of problems that had to be
designed around.

Most acutely, it required a probabilistic mechanism for mounts to
persistently record that they're starting a server so that future
servers could find and possibly fence them. They would write to a lot
of quorum blocks and trust that it was unlikely that future servers
would overwrite all of their written blocks. Overwriting was always
possible, which would have been bad enough, but the scheme also required
so much IO that we had to use long election timeouts to avoid spurious
fencing. Those longer timeouts had already gone wrong on some storage
configurations, leading to hung mounts.

To fix this, and other problems we see coming like live membership
changes, we now explicitly configure the number and identity of the
mounts which will participate in quorum voting. With specific
identities, mounts now have a corresponding specific block that they can
write to and that future servers can read to see if they're still
running.

We change the quorum config in the super block from a single
quorum_count to an array of quorum slots, each of which specifies the
address of the mount assigned to that slot.

The mount argument that specifies a quorum voter changes from
"server_addr=$addr" to "quorum_slot_nr=$nr", which gives the mount's
slot. The slot's address is used for UDP election messages and TCP
server connections.

Now that unique IP addresses are configured for all the quorum members,
we can use UDP messages to send and receive the vote messages of the
raft protocol that elects a leader. The quorum code no longer has to
read and write disk block votes; its core loop is now more reasonable,
either waiting for received network messages or for timeouts that
advance the raft election state machine.

The quorum blocks are now used by slots to store their persistent raft
term and to set their leader state. We have event fields in the block
that record the timestamps of the most recent interesting events that
happened to the slot.

Now that raft doesn't perform IO, we can leave the quorum election work
running in the background. The raft work in the quorum members is
always running, so we can use a much more typical raft implementation
with heartbeats.

Critically, this decouples the client and election life cycles. Quorum
is always running and is responsible for starting and stopping the
server. The client repeatedly tries to connect to a server; it has
nothing to do with deciding to participate in quorum.

Finally, we add a quorum/status sysfs file which shows the state of the
quorum raft protocol in a member mount, along with the last messages
that were sent to or received from the other members.
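To make the slot-based config concrete, here is a minimal sketch of what
it could look like in the super block; every name, type, and slot limit
below is an assumption made for illustration, not the actual scoutfs
on-disk format:

    /*
     * Minimal sketch, not the real scoutfs layout: the single
     * quorum_count in the super block is replaced by an array of
     * slots, each naming the address of the mount assigned to it.
     */
    #define SKETCH_QUORUM_MAX_SLOTS	15

    struct sketch_quorum_slot {
    	__le32	ipv4_addr;	/* address assigned to this slot */
    	__le16	port;		/* UDP votes and TCP server connections */
    	__u8	configured;	/* non-zero if the slot is in use */
    };

    struct sketch_super_block {
    	/* ... existing super block fields ... */
    	struct sketch_quorum_slot quorum_slots[SKETCH_QUORUM_MAX_SLOTS];
    };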
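Using the new mount option might look like this; the device and mount
point are hypothetical, only the option names come from this change:

    # before: a voter was identified only by the address it announced
    mount -t scoutfs -o server_addr=10.0.0.1 /dev/mapper/shared /mnt/fs

    # after: a voter claims a configured slot; the slot's address in
    # the super block is used for its UDP votes and TCP server
    mount -t scoutfs -o quorum_slot_nr=0 /dev/mapper/shared /mnt/fs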
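The per-slot quorum block similarly shifts from holding votes to holding
raft state; again a hedged sketch with invented names and an invented
event set:

    /*
     * Sketch of a per-slot quorum block: the slot's persistent raft
     * term, its leader state, and timestamps of recent events.  The
     * fields and sizes are illustrative only.
     */
    #define SKETCH_QUORUM_NR_EVENTS	8

    struct sketch_quorum_block_event {
    	__le64	sec;		/* when this event last happened */
    	__le32	nsec;
    };

    struct sketch_quorum_block {
    	__le64	term;		/* persistent raft term for the slot */
    	__u8	is_leader;	/* set while this slot runs the server */
    	struct sketch_quorum_block_event events[SKETCH_QUORUM_NR_EVENTS];
    };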
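And the message-driven core loop could have roughly this shape, with
every sketch_*() helper invented for the illustration:

    /*
     * Rough shape of the core loop: wait for a received UDP message
     * or for the next raft timeout, then advance the election state
     * machine.  All sketch_*() helpers are hypothetical.
     */
    static void sketch_quorum_loop(struct sketch_quorum *quor)
    {
    	while (!sketch_shutting_down(quor)) {
    		/* next heartbeat or randomized election timeout */
    		long tmo = sketch_next_timeout(quor);

    		wait_event_timeout(quor->waitq,
    				   sketch_msg_pending(quor), tmo);

    		if (sketch_msg_pending(quor))
    			sketch_process_msg(quor); /* votes, heartbeats */
    		else
    			sketch_timeout_expired(quor); /* e.g. request votes */
    	}
    }

Signed-off-by: Zach Brown <zab@versity.com>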