Commit Graph

1398 Commits

Author SHA1 Message Date
Zach Brown
3a03a6a20c Add SUBTREE btree walk flag to restrict join/merge
Add a BTW_SUBTREE flag to btree_walk() to restrict splitting or joining
of the root block.  When clients are merging into the root built from a
reference to the last parent in the fs tree we want to be careful that
we maintain a single root block that can be spliced back into the fs
tree.  We specifically check that the root block remains within the
split/join thresholds.  If it falls out of compliance we return an error
so that it can be spliced back into the fs tree and then split/joined
with its siblings.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
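
A minimal sketch of what this restriction could look like inside the
walk, assuming hypothetical names (bt, root_level, needs_split(),
needs_join(), and the errno); the commit only tells us that a flag
gates root split/join and that an error comes back when the root falls
outside the thresholds:

    /* hypothetical sketch: with BTW_SUBTREE the walk refuses to
     * split or join the root block and hands the problem back to
     * the caller, who splices the subtree into the fs tree first */
    if ((flags & BTW_SUBTREE) && bt->level == root_level &&
        (needs_split(bt) || needs_join(bt)))
            return -ERANGE; /* errno chosen for illustration */
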
Zach Brown
b6d0a45f6d Add btree_{get,set}_parent
Add calls for working with subtrees built around references to blocks in
the last level of parents.  This will let the server farm out btree
merging work where concurrency is built around safely working with all
the items and leaves that fall under a given parent block.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
d7f8896fac Add scoutfs_btree_parent_range
Add a btree helper for finding the range of keys which are found in
leaves referenced by the last parent block when searching for a given
key.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
65c39e5f97 Item seq is max of trans and lock write_seq
Rename the item version to seq and set it to the max of the transaction
seq and the lock's write_seq.  This lets btree item merging choose a seq
at which all dirty items written in future commits must have greater
seqs.  It can drop the seqs from items written to the fs tree during
btree merging, knowing that there aren't any older items out in
transactions that could be mistaken for newer items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
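
As a rough sketch, the seq stamped on a dirty item would then be
computed along these lines; current_trans_seq() is an assumed helper,
not a name from the commit:

    /* hypothetical sketch: a dirty item's seq must not be older than
     * either the open transaction or the covering write lock */
    static u64 item_dirty_seq(struct super_block *sb,
                              struct scoutfs_lock *lock)
    {
            return max(current_trans_seq(sb), lock->write_seq);
    }
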
Zach Brown
3c69861c03 Use core seq for lock write_seq
Rename the write_version lock field to write_seq and get it from the
core seq in the super block.

We're doing this to create a relationship between a client transaction's
seq and a lock's write_seq.  New transactions will have a greater seq
than all previously granted write locks and new write locks will have a
greater seq than all open transactions.  This will be used to resolve
ambiguities in item merging as transaction seqs are written out of order
and write locks span transactions.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:24:23 -07:00
Zach Brown
05ae756b74 Get trans seq from core seq
Get the next seq for a client transaction from the core seq in the super
block.  Remove its specific next_trans_seq field.

While making this change we switch to only using le64 in the network
message payloads; the rest of the processing now uses natural u64s.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-01 13:46:19 -07:00
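
A small sketch of the boundary convention, with the struct and
function names assumed; only the wire format carries le64 and the
conversion happens once at the message boundary:

    struct scoutfs_net_trans_seq {
            __le64 seq;
    };

    static u64 recv_trans_seq(struct scoutfs_net_trans_seq *pl)
    {
            return le64_to_cpu(pl->seq);     /* wire -> natural u64 */
    }

    static void send_trans_seq(struct scoutfs_net_trans_seq *pl, u64 seq)
    {
            pl->seq = cpu_to_le64(seq);      /* natural u64 -> wire */
    }
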
Zach Brown
9051ceb6fc Add core seq to the super block
Add a new seq field to the super block which will be the source of all
incremented seqs throughout the system.  We give out incremented seqs to
callers with an atomic64_t in memory which is synced back to the super
block as we commit transactions in the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-01 13:33:30 -07:00
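
A sketch of that mechanism under assumed field names; seqs are handed
out from an in-memory atomic64_t and the latest value is written back
to the super block as the server commits:

    static u64 next_core_seq(struct server_info *server)
    {
            /* strictly increasing seqs handed out from memory */
            return atomic64_inc_return(&server->core_seq);
    }

    static void sync_core_seq(struct server_info *server,
                              struct scoutfs_super_block *super)
    {
            /* persist the highest seq given out so far */
            super->seq = cpu_to_le64(atomic64_read(&server->core_seq));
    }
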
Zach Brown
bad1c602f9 server hold_commit returns void
When we moved to the current allocator we fixed up the server commit
path to initialize the pair of allocators as a commit is finished rather
than before it starts.  This removed all the error cases from
hold_commit.  Remove the error handling from hold_commit calls to make
the system just a bit simpler.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-01 13:32:26 -07:00
Zach Brown
cee6ad34d3 Merge pull request #42 from versity/zab/fencing_and_reclaiming
Zab/fencing and reclaiming
2021-06-01 11:12:51 -07:00
Zach Brown
38a4a56741 Stop writing to other quorum slot blocks
The core quorum work loop assumes that it has exclusive access to its
slot's quorum block.  It uniquely marks blocks it writes and verifies
the marks on read to discover if another mount has written to its slot
under the assumption that this must be a configuration error that put
two mounts in the same slot.

But the design of the leader bit in the block violates the invariant
that a slot's block is only written by its own mount.  As the server
comes up and fences previous leaders it writes to their blocks to clear
their leader bits.

The final hole in the design is that because we're fencing mounts, not
slots, each slot can have two mounts in play.  An active mount can be
using the slot while there is still a persistent record of a previous,
crashed mount in the slot that needs to be fenced.

All this comes together to have the server fence an old mount in a slot
while a new mount is coming up.  The new mount sees the mark change and
freaks out and stops participating in quorum.

The fix is to rework the quorum blocks so that each slot only writes to
its own block.  Instead of the server writing to each fenced mount's
slot, it writes a fence event to its block once all previous mounts have
been fenced.  We add a bit of bookkeeping so that the server can
discover when all block leader fence operations have completed.  Each
event gets its own term so we can compare events to discover live
servers.

We get rid of the write marks and instead have an event that is written
as a quorum agent starts up and is then checked on every read to make
sure it still matches.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-31 13:10:45 -07:00
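
A sketch of the read-side check in the new scheme, with all the
structure and field names assumed; an agent records a unique event in
its own block at startup and verifies it on every later read:

    /* hypothetical sketch: a mismatch in the recorded startup event
     * means some other mount wrote to our slot's block */
    static int check_our_quorum_block(struct quorum_info *qinf,
                                      struct quorum_block *blk)
    {
            if (memcmp(&blk->update_event, &qinf->our_event,
                       sizeof(qinf->our_event)) != 0)
                    return -EIO; /* errno chosen for illustration */

            return 0;
    }
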
Zach Brown
76076011a2 Add scoutfs-fenced man page
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
bdc0282fa7 Describe fencing in the scoutfs.5 man page
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
1199bac91d Fix quorum server shutdown
If the server shuts down it calls into quorum to tell it that the
server has exited.  This stops quorum from sending heartbeats that
suppress other leader elections.

The function that did this got the logic wrong.  It was setting the bit
instead of clearing it, having been initially written to set a bit when
the server exited.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
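
The shape of the fix, with the flag and helper names assumed; the
commit says the function set the bit where it needed to clear it:

    static void scoutfs_quorum_server_shutdown(struct super_block *sb)
    {
            struct quorum_info *qinf = get_quorum_info(sb); /* assumed */

            /* was set_bit(), which kept heartbeats going after the
             * server exited and suppressed new leader elections */
            clear_bit(QINF_FLAG_SERVER, &qinf->flags);
    }
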
Zach Brown
1e460e5cb0 Add scoutfs-fenced and its run scripts to spec
Install the scoutfs-fenced daemon and its run scripts in the rpm spec
file.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
877e30d60f Add client address to mounted_client item
Add the peername of the client's connected socket to its mounted_client
item as it mounts.  If the client doesn't recover then fencing can use
the IP to find the host to fence.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
a972e42fba Update dmesg filters for fencing and reclaim
Add regexes for the messages that come from fencing and reclaiming
resources from fenced mounts.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
0706669047 Clean up quorum block read error messages
The error messages from reading quorum blocks were confusing.  The mark
was being checked when the block had already seen an error, and we got
multiple messages for some errors.

This cleans it up a bit so we only get one error message for each error
source and each message contains relevant context.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
76cef6fdfc Let _recov_next_pending iterate over rids
Currently the server's recovery timeout work synchronously reclaims
resources for each client whose recovery timed out.
scoutfs_recov_next_pending() can always return the head of the pending
list because its caller will always remove it from the list as it
iterates.

As we move to real fencing the server will be creating fence requests
for all the timed out clients concurrently.  It will need to iterate
over all the rids for clients in recovery.

So we sort recovery's pending list by rid and change _recov_next_pending
to return the next pending rid after a rid argument.  This lets the
server iterate over all the pending rids at once.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
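
A sketch of the resulting iteration pattern, with the structures
assumed; the pending list is sorted by rid and the call returns the
first pending rid greater than its argument, or 0 when none remain:

    struct pending_recov {
            struct list_head entry;
            u64 rid;
    };

    u64 scoutfs_recov_next_pending(struct recov_info *recinf, u64 rid)
    {
            struct pending_recov *pend;
            u64 next = 0;

            spin_lock(&recinf->lock);
            list_for_each_entry(pend, &recinf->pending, entry) {
                    if (pend->rid > rid) {
                            next = pend->rid;
                            break;
                    }
            }
            spin_unlock(&recinf->lock);

            return next;
    }

    /* the server visits every timed out client without removal */
    while ((rid = scoutfs_recov_next_pending(recinf, rid)) != 0)
            create_fence_request(rid); /* assumed helper */
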
Zach Brown
aad2d3db59 Add stage_tmpfile to .gitignore
We missed adding this new binary to .gitignore.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
933fc687c3 omap remove_rid might not find entry
Client recovery in the server doesn't add the omap rid for all the
clients that it's waiting for.  It only adds the rid as they connect.  A
client whose recovery timeout expires and is evicted will try to have
its omap rid removed without being added.

Today this triggers a warning and returns an error, a holdover from a
time when the omap rid lifecycle was more rigid.  Now that it's being
called by the server's reclaim_rid, along with a bunch of other
functions that succeed if called for non-existent clients, let's have
the omap remove_rid do the same.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
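
A sketch of the relaxed behaviour, with the lookup details assumed;
removal now quietly succeeds when the rid was never added:

    static int omap_remove_rid(struct omap_info *ominf, u64 rid)
    {
            struct omap_rid_entry *entry;

            spin_lock(&ominf->lock);
            entry = omap_find_rid(ominf, rid); /* assumed lookup */
            if (entry) {
                    list_del(&entry->entry);
                    kfree(entry);
            }
            spin_unlock(&ominf->lock);

            /* succeed either way, like reclaim_rid's other callees */
            return 0;
    }
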
Zach Brown
6663034295 Run the fence agent in the background of tests
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
ab5466a771 Protect server shutting down with smp barriers
I saw a confusing hang that looked like a lack of ordering between
a waker setting shutting_down and a wait event testing it after
being woken up.  Let's see if more barriers help.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
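
The ordering in question is the classic flag/wakeup pairing; a sketch
under assumed names:

    /* waker: make the flag visible before waking the waiter */
    server->shutting_down = true;
    smp_wmb();
    wake_up(&server->waitq);

    /* waiter: pair a read barrier with the waker's write barrier
     * before testing the flag */
    wait_event(server->waitq,
               ({ smp_rmb(); server->shutting_down; }));
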
Zach Brown
f3764b873b Save previous connected client address
Our connection state spans sockets that can disconnect and reconnect.
While sockets are connected we store the socket's remote address in the
connection's peername and we clear it as sockets disconnect.

Fencing wants to know the last connected address of the mount.  It's a
bit of metadata we know about the mount that can be used to find it and
fence it.  As we store the peer address we also stash it away as the
last known peer address for the socket.  Fencing can then use that
instead of the current socket peer address which is guaranteed to be
uninitialized because there's no socket connected.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
9ebc9d0f66 Manage client reconnect delay
The client currently always queues immediate connect work when its
notify_down is called.  It was assuming that notify_down is only called
from a healthy established connection.  But it's also called for
unsuccessful connect attempts that might not have timed out, say when
the host is up but the port isn't listening.

This results in spamming connection attempts at an old stale leader's
address until a new server is elected, fences the previous leader, and
updates their quorum block.

The fix is to explicitly manage the connection work queueing delay.  We
only set it to immediately queue on mount and when we see a greeting
reply from the server.  We always set it to a longer timeout as we start
a connection attempt.  This means we'll always have a long reconnect
delay unless we really connected to a server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
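
A sketch of the delay policy with the fields and constant assumed;
every attempt arms a long delay for the next one, and only mount and a
real greeting reply arm an immediate attempt:

    static void queue_connect_work(struct client_info *client)
    {
            queue_delayed_work(client->wq, &client->connect_dwork,
                               client->connect_delay);

            /* assume this attempt fails until a greeting proves
             * we reached a real server */
            client->connect_delay = CONNECT_DELAY_LONG; /* assumed */
    }

    static void saw_greeting_reply(struct client_info *client)
    {
            /* a live server answered; reconnect immediately */
            client->connect_delay = 0;
    }
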
Zach Brown
8b78f701a1 Add fence-and-reclaim test
Add a test which exercises the various reasons for fencing mounts and
checks that we reclaim the resources that they had.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
1f1f40f079 Add fence agent that processes fence requests
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
943351944a Call fencing from the server
The server is responsible for calling the fencing subsystem.  It is the
source of fencing requests as it decides that previous mounts are
unresponsive.  It is responsible for reclaiming resources for fenced
mounts and freeing their associated fence request.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
b060eb4f5d Add fencing subsystem
Add the subsystem which tracks pending fence requests and exposes them
to userspace for processing.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:25 -07:00
Zach Brown
2dde729791 Add sysfs create attr w/ parent
Add sysfs attribute creation that can provide the parent dir kobject
instead of always creating the sysfs object dir off of the main
per-mount dir.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:19 -07:00
Zach Brown
ccb7c0bf4b Add rw sysfs attr wrapper
Add a wrapper around __ATTR_RW so that callers can add attributes with a
_store function.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:07 -07:00
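
Such a wrapper might look like the following sketch; __ATTR_RW is the
kernel macro named in the commit, while the SCOUTFS_ATTR_RW name and
the kobj_attribute choice are assumptions:

    /* declare a read-write attribute whose handlers follow the
     * name_show()/name_store() convention that __ATTR_RW expects */
    #define SCOUTFS_ATTR_RW(_name) \
            static struct kobj_attribute scoutfs_attr_##_name = \
                    __ATTR_RW(_name)

A caller would define example_show() and example_store() and then
declare the attribute with SCOUTFS_ATTR_RW(example).
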
Zach Brown
e9d04dcf8d Add forced unmount support
Add super_ops->umount_begin so that we can implement a forced unmount
which tries to avoid issuing any more network or storage ops.  It can
return errors and lose unsynchronized data.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:02:20 -07:00
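
A sketch of the hook with the per-super flag assumed; umount_begin is
a real super_operations method with this signature:

    static void scoutfs_umount_begin(struct super_block *sb)
    {
            struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb); /* assumed */

            /* later network and storage ops test this and error out */
            sbi->forced_unmount = true;
    }

    static const struct super_operations scoutfs_super_ops = {
            .umount_begin   = scoutfs_umount_begin,
            /* ... */
    };
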
Zach Brown
5dceac32db Merge pull request #40 from versity/zab/data_alloc_zones
Zab/data alloc zones
2021-05-24 13:00:48 -07:00
Zach Brown
ef440ead28 Add -z to run-test for data-alloc-zone-blocks
Add an option to run-tests which gets passed through to the
data-alloc-zone-blocks argument for mkfs.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
d0b04e790c Add data-alloc-zone-blocks argument to mkfs
Add an argument to mkfs which sets the data_alloc_zone_blocks volume
option.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
54644a5074 Add data_alloc_zone_blocks volume option
Add the data_alloc_zone_blocks volume option.  This changes the
behaviour of the server to try and give mounts free data extents which
fall in exclusive fixed-size zones.

We add the field to the scoutfs_volume_options struct and add it to the
set_volopt server handler, which enforces constraints on the size of the
zones.

We then add fields to the log_trees struct which records the size of the
zones and sets bits for the zones that contain free extents in the
data_avail allocator root.  The get_log_trees handler is changed to read
all the zone bitmaps from all the items, pass those bitmaps in to
_alloc_move to direct data allocations, and finally update the bitmaps
in the log_trees items to cover the newly allocated extents.  The
log_trees data_alloc_zone fields are cleared as the mount's logs are
reclaimed to indicate that the mount is no longer writing to the zone.

The policy mechanism of finding free extents based on the bitmaps is
implemented down in _data_alloc_move().

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
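
A sketch of the zone bookkeeping, with the log_trees field names
assumed; a free extent's blkno maps to a fixed-size zone whose bit is
set in the mount's log_trees item:

    static void mark_extent_zone(struct scoutfs_log_trees *lt, u64 blkno)
    {
            u64 zone_blocks = le64_to_cpu(lt->data_alloc_zone_blocks);
            u64 zone;

            if (zone_blocks == 0)
                    return;

            zone = div64_u64(blkno, zone_blocks);
            set_bit_le(zone, lt->data_alloc_zones); /* assumed bitmap */
    }
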
Zach Brown
52c2a465db Add zone awareness to scoutfs_alloc_move()
Add parameters so that scoutfs_alloc_move() can first search for source
extents in specified zones.  It uses relatively cheap searches through
the order items to find extents that intersect with the regions
described by the zone bitmaps.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
bc4975fad4 Add scoutfs_alloc_extents_cb()
Add an allocator call that invokes a callback for each of the extents
stored in btree items in an allocator root.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
9de3ae6dcb Index free extents by order of length
Allocators store free extents in two items, one sorted by their blkno
position and the other by their precise length.

The length index makes it easy to search for precise extent lengths, but
it makes it hard to search for a large extent within a given blkno
region.  Skipping in the blkno dimension has to be done for every
precise length value.

We don't need that level of precision.  If we index the extents by a
coarser order of the length then we have a fixed number of orders in
which we have to skip in the blkno dimension when searching within a
specific region.

This changes the length item to be stored at the log base 8 order of
the length of the extents.  This groups extents into orders that are
close to the human-friendly base 10 orders of magnitude.

With this change the order field in the key no longer stores the
precise extent length.  To preserve the length of the extent we need to
use another field.  The only remaining 64-bit field is the first, which
has a higher comparison priority than the type.  So we use the highest
comparison priority zone field to differentiate the position and order
indexes and can now use all three 64-bit fields in the key.

Finally, we have to be careful when constructing a key to use _next when
searching for a large extent.  Previously keys were relying on the magic
property that building a key from an extent length of 0 ended up at the
key value -0 = 0.  That only worked because we never stored zero length
extents.  We now store zero length orders so we can't use the negative
trick anymore.  We explicitly treat 0 length extents carefully when
building keys and we subtract the order from U64_MAX to store the orders
from largest to smallest.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:25:56 -07:00
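
A sketch of the order math described above, with the key field name
assumed; length maps to a log base 8 order, and the order is
subtracted from U64_MAX so larger extents sort first:

    static u64 extent_len_order(u64 len)
    {
            u64 order = 0;

            /* log base 8: three bits of length per order; len == 0
             * stays at order 0 instead of relying on -0 tricks */
            while (len >>= 3)
                    order++;

            return order;
    }

    static void init_order_key(struct scoutfs_key *key, u64 len)
    {
            key->sko_order = cpu_to_le64(U64_MAX - extent_len_order(len));
    }
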
Zach Brown
0aa6005c99 Add volume options super, server, and sysfs
Introduce global volume options.  They're stored in the superblock and
can be seen in sysfs files that use network commands to get and
set the options on the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-19 14:15:06 -07:00
Zach Brown
973dc4fd1c Merge pull request #38 from versity/zab/read_xattr_deadlocks
Zab/read xattr deadlocks
2021-05-03 09:44:57 -07:00
Zach Brown
a5ca5ee36d Put back-to-back invalidated locks back on list
A lock that is undergoing invalidation is put on a list of locks in the
super block.  Invalidation requests put locks on the list.  While locks
are invalidated they're temporarily put on a private list.

To support a request arriving while the lock is being processed we
carefully manage the invalidation fields in the lock between the
invalidation worker and the incoming request.  The worker correctly
noticed that a new invalidation request had arrived but it left the lock
on its private list instead of putting it back on the invalidation list
for further processing.  The lock was unreachable, wouldn't get
invalidated, and caused everyone trying to use the lock to block
indefinitely.

When the worker sees another request arrive for an invalidating lock it
needs to move the lock from the private list back to the invalidation
list.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-30 10:00:07 -07:00
Zach Brown
603af327ac Ignore I_FREEING in all inode hash lookups
Previously we added an ilookup variant that ignored I_FREEING inodes
to avoid a deadlock between lock invalidation (lock->I_FREEING) and
eviction (I_FREEING->lock).

Now we're seeing similar deadlocks between eviction (I_FREEING->lock)
and fh_to_dentry's iget (lock->I_FREEING).

I think it's reasonable to ignore all inodes with I_FREEING set when
we're using our _test callback in ilookup or iget.  We can remove the
_nofreeing ilookup variant and move its I_FREEING test into the
iget_test callback provided to both ilookup and iget.

Callers will get the same result; it will just happen without waiting
for a previously I_FREEING inode to leave.  They'll get NULL from
ilookup instead of waiting.  They'll allocate and start to initialize a
newer instance of the inode and insert it alongside the previous
instance.

We don't have inode number re-use so we don't have the problem where a
newly allocated inode number is relying on inode cache serialization to
not find a previously allocated inode that is being evicted.

This change does allow for concurrent iget of an inode number that is
being deleted on a local node.  This could happen in fh_to_dentry with a
raw inode number.  But this was already a problem between mounts because
they don't have a shared inode cache to serialize them.  Once we fix
that between nodes, we fix it on a single node as well.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-28 12:22:10 -07:00
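
A sketch of the shared test callback, with scoutfs_ino() and the
argument type assumed; both ilookup5 and iget5_locked take a test
callback of this shape:

    static int scoutfs_iget_test(struct inode *inode, void *arg)
    {
            u64 *ino = arg;

            /* treat inodes being torn down as absent so callers
             * allocate and insert a fresh instance instead of
             * waiting for eviction to finish */
            if (inode->i_state & I_FREEING)
                    return 0;

            return scoutfs_ino(inode) == *ino;
    }
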
Zach Brown
ca320d02cb Get i_mutex before cluster lock in file aio_read
The vfs often calls filesystem methods with i_mutex held.  This creates
a natural ordering of i_mutex outside of cluster locks.  The file
aio_read method acquired i_mutex after its cluster lock, creating a
deadlock with other vfs methods like setattr.

The acquisition of i_mutex after the cluster lock was due to using the
pattern where we use the per-task lock to discover if we're the first
user of the lock in a call chain.  Readpage has to do this, but file
aio_read doesn't; it should never be called recursively.  So we can
acquire the i_mutex outside of the cluster lock and warn if we are ever
called recursively.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-28 12:11:06 -07:00
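
A sketch of the resulting ordering, with the cluster lock helpers
assumed; i_mutex is taken first, matching the order the vfs imposes in
paths like setattr:

    static ssize_t scoutfs_file_aio_read(struct kiocb *iocb,
                                         const struct iovec *iov,
                                         unsigned long nr_segs,
                                         loff_t pos)
    {
            struct inode *inode = file_inode(iocb->ki_filp);
            struct scoutfs_lock *lock = NULL;
            ssize_t ret;

            /* i_mutex before the cluster lock; aio_read is never
             * recursive so no per-task lock tracking is needed */
            mutex_lock(&inode->i_mutex);

            ret = scoutfs_lock_inode(inode, SCOUTFS_LOCK_READ, &lock);
            if (ret == 0) {
                    ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
                    scoutfs_unlock_inode(inode, lock);
            }

            mutex_unlock(&inode->i_mutex);

            return ret;
    }
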
Zach Brown
5231cf4034 Add export-lookup-evict-race test
Add a test that creates races between fh_to_dentry and eviction
triggered by lock invalidation.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-28 12:11:06 -07:00
Andy Grover
f631058265 Merge pull request #37 from versity/zab/test_mkdir_rename_unlink
Add mkdir-rename-rmdir test
2021-04-27 13:21:27 -07:00
Zach Brown
1b4e60cae4 Add mkdir-rename-rmdir test
Add a test which performs mkdir, two renames of the dir, and rmdir on
all possible combinations of mounts.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-27 12:01:43 -07:00
Andy Grover
6eeaab3322 Merge pull request #35 from versity/zab/invalidate_already_pending
Handle back to back invalidation requests
2021-04-23 16:40:45 -07:00
Andy Grover
ac68d14b8d Merge pull request #36 from versity/zab/move_blocks_next_einval
Fix accidental EINVAL in move_blocks
2021-04-23 14:39:29 -07:00
Zach Brown
ecfc8a0d0e Merge pull request #33 from versity/zab/open_ino_map
Zab/open ino map
2021-04-23 10:55:11 -07:00
Zach Brown
63148d426e Fix accidental EINVAL in move_blocks
When move blocks is staging it requires an overlapping offline extent to
cover the entire region to move.

It performs the stage by modifying one extent at a time.  If there are
fragmented source extents it will modify each of them in turn within
the region.

When looking for the extent to match the source extent it searched from
the iblock of the start of the whole operation, not the start of the
source extent it's matching.  This meant that it would find the first
extent it had just modified, which would no longer be offline, and
would return -EINVAL.

The fix is to have it search from the logical start of the extent it's
trying to match, not the start of the region.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-23 10:39:34 -07:00