Previously the quorum configuration specified the number of votes needed
to elect the leader. This gave the cluster configuration an excessive
amount of freedom, which created all sorts of problems that had to be
designed around.
Most acutely, though, it required a probabilistic mechanism for mounts
to persistently record that they're starting a server so that future
servers could find and possibly fence them. They would write to a lot
of quorum blocks and trust that it was unlikely that future servers
would overwrite all of their written blocks. Overwriting was always
possible, which would be bad enough, but it also required so much IO
that we had to use long election timeouts to avoid spurious fencing.
These longer timeouts had already gone wrong on some storage
configurations, leading to hung mounts.
To fix this and other problems we see coming, like live membership
changes, we now specifically configure the number and identity of mounts
which will be participating in quorum voting. With specific identities,
mounts now have a corresponding specific block they can write to and
which future servers can read from to see if they're still running.
We change the quorum config in the super block from a single
quorum_count to an array of quorum slots, each of which specifies the
address of the mount assigned to that slot. The mount argument to
specify a quorum voter changes from "server_addr=$addr" to
"quorum_slot_nr=$nr", which specifies the mount's slot. The slot's
address is used for udp election messages and tcp server connections.
Now that unique IP addresses are specifically configured for all the
quorum members, we can use UDP messages to send and receive the vote
messages in the raft protocol to elect a leader. The quorum code no
longer has to read and write disk block votes; its core loop is now much
more reasonable, simply waiting for received network messages or
timeouts to advance the raft election state machine.
The quorum blocks are now used by the slots to store their persistent
raft term and to record their leader state. Event fields in the block
record the timestamp of the most recent interesting events that happened
to the slot.
Now that raft doesn't use IO, we can leave the quorum election work
running in the background. The raft work in the quorum members is
always running so we can use a much more typical raft implementation
with heartbeats. Critically, this decouples the client and election
life cycles. Quorum is always running and is responsible for starting
and stopping the server. The client repeatedly tries to connect to a
server; it has nothing to do with deciding to participate in quorum.
Finally, we add a quorum/status sysfs file which shows the state of the
quorum raft protocol in a member mount and has the last messages that
were sent to or received from the other members.
Signed-off-by: Zach Brown <zab@versity.com>
Prefer named to anonymous enums. This helps readability a little.
Use enum as param type if possible (a couple spots).
Remove unused enum in lock_server.c.
Define enum spbm_flags using shift notation for consistency.
Rename get_file_block()'s "gfb" parameter to "flags" for consistency.
Signed-off-by: Andy Grover <agrover@versity.com>
Require that a second path to the metadata bdev be given via a mount
option.
Verify that the meta sb matches the sb also written to the data sb.
Change code as needed in super.c to allow both to be read. Remove the
check for overlapping meta and data blknos, since they are now on
entirely separate bdevs.
Use meta_bdev for superblock, quorum, and block.c reads and writes.
Signed-off-by: Andy Grover <agrover@versity.com>
It used to take significant effort to create very tall btrees because
they only stored small references to large LSM segments. Now they store
all file system metadata and we can easily create sufficiently large
btrees for testing. We don't need the tiny btree option.
Signed-off-by: Zach Brown <zab@versity.com>
Add a server_addr mount option that takes an ipv4 address. This will be
used by the upcoming changes to quorum voting to indicate that a mount
should participate in voting and to specify the address that its server
should listen on.
Signed-off-by: Zach Brown <zab@versity.com>
Currently all mounts try to get a dlm lock which gives them exclusive
access to become the server for the filesystem. That isn't going to
work if we're moving to locking provided by the server.
This uses quorum election to determine who should run the server. We
switch from long-running server work that blocked trying to get a lock to
calls which start and stop the server.
Signed-off-by: Zach Brown <zab@versity.com>
Each mount is now given a specified unique name. This can be used to
identify a reconnecting mount, indicating that an old instance with the
same unique name can no longer exist and doesn't need to be fenced.
Signed-off-by: Zach Brown <zab@versity.com>
Add a tunable option to force using tiny btree blocks on an active
mount. This lets us quickly exercise large btrees.
Signed-off-by: Zach Brown <zab@versity.com>
To actually use it, we first have to copy symbols over from the dlm build
into the scoutfs source directory. Make that happen automatically for us in
the Makefile.
The only users of locking at the moment are mount, unmount and xattr
read/write. Adding more locking calls should be straightforward.
The LVB based server ip communication didn't work out, and LVBs as they
are written don't make sense in a range locking world. So instead, we
record the server ip address in the superblock. This is protected by the
listen lock, which also arbitrates which node will be the manifest
server.
We take and drop the dlm lock on each lock/unlock call. Lock caching will
come in a future patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>