Commit Graph

1423 Commits

Author SHA1 Message Date
Zach Brown
24d682bf81 Add orphan-inodes test
Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00
Zach Brown
2957f3e301 Avoid warnings when evict has signals pending
Killing a task can end up in evict, where pending signals break out of
acquiring the locks needed to perform final inode deletion.  This isn't
necessarily fatal.  The orphan task will come around and delete the
inode when it is truly no longer referenced.

So let's silence the error and keep track of how many times it happens.
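
A minimal sketch of that idea in userspace C, with made-up names
standing in for the real kernel evict path and counters:

    #include <stdbool.h>

    struct counters { unsigned long evict_signal_skipped; };
    static struct counters counters;

    bool signal_is_pending(void);        /* signal_pending(current) stand-in */
    bool try_lock_for_deletion(void);    /* cluster lock acquisition stand-in */

    static void evict_sketch(void)
    {
        if (!try_lock_for_deletion()) {
            if (signal_is_pending()) {
                /* not fatal: the orphan scanner deletes the inode later */
                counters.evict_signal_skipped++;
                return;
            }
            return;   /* other lock failures are still reported elsewhere */
        }
        /* ... perform final inode item deletion ... */
    }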

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00
Zach Brown
07210b5734 Reliably delete orphaned inodes
Orphaned items haven't been deleted for quite a while -- the call to the
orphan inode scanner has been commented out for ages.  The deletion of
the orphan item didn't take rid zone locking into account as we moved
deletion from being strictly local to being performed by whoever last
used the inode.

This reworks orphan item management and brings back orphan inode
scanning to correctly delete orphaned inodes.

We get rid of the rid zone that was always _WRITE locked by each mount.
That made it impossible for other mounts to get a _WRITE lock to delete
orphan items.  Instead we rename it to the orphan zone and have orphan
item callers get _WRITE_ONLY locks inside their inode locks.  Now all
nodes can create and delete orphan items as they have _WRITE locks on
the associated inodes.

Then we refresh the orphan inode scanning function.  It now runs
regularly in the background of all mounts.  It avoids creating cluster
lock contention by finding candidates with unlocked forest hint reads
and by testing inode caches locally and via the open map before properly
locking and trying to delete the inode's items.
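
As a rough sketch of that ordering (every name below is a hypothetical
stand-in, not the real scoutfs API):

    #include <stdbool.h>
    #include <stdint.h>

    bool next_orphan_hint(uint64_t *ino);     /* unlocked forest hint read */
    bool inode_cached_locally(uint64_t ino);  /* local inode cache test */
    bool open_map_in_use(uint64_t ino);       /* open map test */
    void lock_and_delete_inode(uint64_t ino); /* inode lock + orphan _WRITE_ONLY */

    static void orphan_scan_pass(void)
    {
        uint64_t ino;

        while (next_orphan_hint(&ino)) {
            /* cheap local tests first to avoid cluster lock contention */
            if (inode_cached_locally(ino) || open_map_in_use(ino))
                continue;
            /* only now take the locks and try to delete the inode's items */
            lock_and_delete_inode(ino);
        }
    }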

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:52:46 -07:00
Zach Brown
0374661a92 Merge pull request #43 from versity/zab/btree_merging
Zab/btree merging
2021-06-22 13:16:30 -07:00
Zach Brown
28759f3269 Rotate srch files as log trees items are reclaimed
The log merging work deletes log trees items once their item roots are
merged back into the fs root.  Those deleted items could still have
populated srch files that would be lost.  We force rotation of the srch
files in the items as they're reclaimed to turn them into rotated srch
files that can be compacted.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:37:45 -07:00
Zach Brown
5c3fdb48af Fix btree join item movement
Refilling a btree block by moving items from its siblings as it falls
under the join threshold had some pretty serious mistakes.  It used the
target block's total item count instead of the sibling's when deciding
how many items to move.  It didn't take item moving overruns into
account when deciding to compact so it could run out of contiguous free
space as it moved the last item.  And once it compacted it returned
without moving because the return was meant to be in the error case.

This is all fixed by correctly examining the sibling block to determine
if we should join a block up to 75% full or move a big chunk over,
compacting if the free space doesn't have room for an excessive worst
case overrun, and fixing the compaction error checking return typo.
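
A simplified sketch of that decision, assuming the 75% join threshold
from the description above; the helper names and the half-block move
size are placeholders, not the real btree code:

    #include <stdbool.h>

    struct blk { unsigned int total_bytes; unsigned int contig_free; };

    unsigned int worst_case_move_overrun(void);           /* assumed helper */
    void compact(struct blk *b);
    void move_items(struct blk *dst, struct blk *src, unsigned int bytes);
    void join_blocks(struct blk *dst, struct blk *src);

    static void refill_from_sibling(struct blk *b, struct blk *sib,
                                    unsigned int block_size)
    {
        /* decide based on the sibling's totals, not the target's */
        if (b->total_bytes + sib->total_bytes <= (block_size * 3) / 4) {
            join_blocks(b, sib);
            return;
        }

        unsigned int bytes = sib->total_bytes / 2;

        /* compact first if contiguous free space can't absorb a
         * worst-case overrun from the final item moved */
        if (b->contig_free < bytes + worst_case_move_overrun())
            compact(b);

        move_items(b, sib, bytes);    /* keep moving after compaction */
    }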

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
a7828a6410 Add log merge item allocators to alloc detail
The alloc iterator needs to find and include the totals of the avail and
freed allocator list heads in the log merge items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
a1d46e1a92 Fix mkfs btree item offset calculation
mkfs was miscalculating the offset of the start of the free region in
the center of blocks as it populated blocks with items.  It was using
the length of the free region as its offset in the block.  To find
the offset of the end of the free region in the block it has to be
taken relative to the end of the item array.
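
A small arithmetic illustration of the fix; the layout here is
simplified and the real block has headers and different field names:

    #include <stdio.h>

    int main(void)
    {
        unsigned int block_size = 64 * 1024;     /* example block size */
        unsigned int item_array_end = 100 * 24;  /* items packed at the front */
        unsigned int total_val_bytes = 4000;     /* values packed at the end */

        unsigned int free_len = block_size - item_array_end - total_val_bytes;

        unsigned int wrong_off = free_len;       /* the bug: length as offset */
        /* fixed: offsets of the free region are taken relative to the
         * end of the item array, not used as raw lengths */
        unsigned int free_end = item_array_end + free_len;

        printf("free_len %u wrong_off %u free_end %u\n",
               free_len, wrong_off, free_end);
        return 0;
    }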

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
d67db6662b Fix item cache val_len alignment math
Some item_val_len() callers were applying alignment twice, which isn't
needed.

And additions to erased_bytes as value lengths change didn't take
alignment into account.  They could end up double counting: a val_len
change within the alignment would be counted, then counted again when
the full aligned item is later deleted.  Additions to erased_bytes based
on val_len should always take alignment into account.
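
A sketch of that rule in plain C; ALIGN8 and the field names are
illustrative, not the real item cache code:

    #include <stdio.h>

    #define ALIGN8(x) (((x) + 7u) & ~7u)

    static unsigned long erased_bytes;

    static void account_val_len_change(unsigned int old_len, unsigned int new_len)
    {
        /* compare aligned lengths so a change within the same 8-byte slot
         * adds nothing, and isn't counted again when the whole aligned
         * item is erased later */
        if (ALIGN8(new_len) < ALIGN8(old_len))
            erased_bytes += ALIGN8(old_len) - ALIGN8(new_len);
    }

    int main(void)
    {
        account_val_len_change(13, 10);  /* both round to 16: adds 0 */
        account_val_len_change(24, 10);  /* 24 -> 16: adds 8 */
        printf("erased_bytes = %lu\n", erased_bytes);
        return 0;
    }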

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
c5c050bef0 Item cache might free null page on alloc error
The item cache allocates a page and a little tracking struct for each
cached page.  If the page allocation fails it might try to free a null
page pointer, which isn't allowed.
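
A minimal userspace sketch of the corrected error path; the names are
placeholders for the real item cache structures:

    #include <stdlib.h>

    struct cached_page { void *page; };

    static struct cached_page *alloc_cached_page(void)
    {
        struct cached_page *cp = calloc(1, sizeof(*cp));
        if (!cp)
            return NULL;
        cp->page = malloc(4096);
        if (!cp->page) {
            /* free only what was allocated; never pass a NULL page
             * pointer to a page-freeing call */
            free(cp);
            return NULL;
        }
        return cp;
    }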

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
96d286d6e5 Zero btree item padding as items are created
Item creation, which fills out a new item at the end of the array of
item structs at the start of the block, didn't explicitly zero the item
struct padding.  The padding would only have been zero if the memory was
already zero, which is likely for new blocks, but isn't necessarily true
if the memory had previously been used by deleted values.
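
As a sketch, with a made-up struct standing in for the real on-disk
item header:

    #include <string.h>
    #include <stdint.h>

    struct item_hdr {
        uint64_t key_off;
        uint16_t val_len;
        /* 6 bytes of implicit padding up to the 8-byte struct size */
    };

    static void init_item_hdr(struct item_hdr *hdr, uint64_t key_off,
                              uint16_t val_len)
    {
        /* zero the whole struct so the padding is zero even if the
         * memory previously held deleted values */
        memset(hdr, 0, sizeof(*hdr));
        hdr->key_off = key_off;
        hdr->val_len = val_len;
    }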

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
9febc6b5dc Update btree block validator for 8byte alignment
The change to aligning values didn't update the btree block verifier's
total length calculation, and while we're in there we can also check
that values are correctly aligned.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
045b3ca8d4 Expand unused btree verifying walker
Previously we had an unused function that could be flipped on to verify
btree blocks during traversal.  This refactors the block verifier a bit
to be called by a verifying walker.  This will let callers walk paths to
leaves to verify the tree around operations, rather than verification
being performed during the next walk.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
ff882a4c4f Add btree total_above_join_low_water() test
Take the condition used to decide if a btree block needs to be joined
and put it in total_above_join_low_water() so that btree_merging will be
able to call it to see if the leaf block it's merging into needs to be
joined.
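
A plausible shape for such a predicate; the actual low-water fraction
lives in the btree code, so the quarter-block figure here is only an
assumed placeholder:

    #include <stdbool.h>

    static inline bool total_above_join_low_water(unsigned int total_bytes,
                                                  unsigned int block_size)
    {
        /* assumed placeholder threshold, not the real constant */
        return total_bytes > block_size / 4;
    }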

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
3d1a0f06c0 Add scoutfs_btree_free_blocks
Add a btree function for freeing all the blocks in a btree without
having to cow the blocks to track which refs have been freed.  We use a
key from the caller to track which portions of the tree have been freed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
3488b4e6e0 Add scoutfs print support for log merge items
Add support for printing all the items in the log_merge tree that the
server uses to track log merging.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
c482204fcf Clean up btree root printing in superblock
Over time the printing of the btree roots embedded in the super block
has gotten a little out of hand.  Add a helper macro for the printf
format and args and re-order them to match their order in the
superblock.
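
An illustrative helper macro in that spirit; the real scoutfs macro and
the fields of the on-disk btree root differ:

    #include <stdio.h>
    #include <stdint.h>

    struct btree_root { uint64_t blkno; uint64_t seq; uint8_t height; };

    #define BTROOT_FMT      "blkno %llu seq %llu height %u"
    #define BTROOT_ARGS(r)  (unsigned long long)(r).blkno, \
                            (unsigned long long)(r).seq,   \
                            (unsigned)(r).height

    /* usage: printf("fs_root " BTROOT_FMT "\n", BTROOT_ARGS(super.fs_root)); */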

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
9711fef122 Update for core, trans, and item seq use
We now have a core seq number in the super that is advanced for multiple
users.  The client transaction seq comes from the core seq so we
remove the trans_seq from the super.  The item version is also converted
to use a seq that's derived from the core seq.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
91acf92666 Add client btree merge processing
Add the client work which is regularly scheduled to ask the server for
log merging work to do.  The relatively simple client work gets a
request from the server, finds the log roots to merge given the request
seq, performs the merge with a btree call and callbacks, and commits the
result to the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
9c2122f7de Add server btree merge processing
This adds the server processing side of the btree merge functionality.
The client isn't yet sending the log_merge messages so no merging will
be performed.

The bulk of the work happens as the server processes a get_log_merge
message to build a merge request for the client.  It starts a log merge
if one isn't in flight.  If one is in flight it checks to see if it
should be spliced and maybe finished.  In the common case it finds the
next range to be merged and sends the request to the client to process.

The commit_log_merge handler is the completion side of that request.  If
the request failed then we unwind its resources based on the stored
request item.  If it succeeds we record it in an item for get_log_merge
processing to splice eventually.

Then we modify two existing server code paths.

First, get_log_tree doesn't just create or use a single existing log
btree for a client mount.  If the existing log btree is large enough it
sets its finalized flag and advances the nr to use a new log btree.
That makes the old finalized log btree available for merging.

Then we need to be a bit more careful when reclaiming the open log btree
for a client.  We can't use next to find the only open log btree; instead
we use prev to find the last one and make sure that it isn't already
finalized.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
4d3ea3b59b Add format support for log btree merging
Add the format specification for the upcoming btree merging.  Log btrees
gain a finalized field, we add the super btree root and all the items
that the server will use to coordinate merging amongst clients, and we
add the two client net messages which the server will implement.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
298a6a8865 Add server get_stable_trans_seq()
Extract part of the get_last_seq handler into a call that finds the last
stable client transaction seq.  Log merging needs this to determine a
cutoff for stable items in log btrees.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
082924df1a Add scoutfs_key_is_ones()
Add a quick inline for testing that a key is all ones.
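
A guess at the shape of such an inline, treating the key as raw bytes
since the real struct scoutfs_key layout isn't shown here:

    #include <stdbool.h>
    #include <stddef.h>

    struct key { unsigned char bytes[32]; };   /* illustrative size */

    static inline bool key_is_ones(const struct key *key)
    {
        for (size_t i = 0; i < sizeof(key->bytes); i++)
            if (key->bytes[i] != 0xff)
                return false;
        return true;
    }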

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
d8478ed6f1 Add scoutfs_btree_rebalance()
Add a btree call that just dirties the path down to a leaf block, joining
and splitting along the way so that the blocks in the path satisfy the
balance constraints.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
0538c882bc Add btree_merge()
Add a btree function for merging the items in a range from a number of
read-only input btrees into a destination btree.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
3a03a6a20c Add SUBTREE btree walk flag to restrict join/merge
Add a BTW_SUBTREE flag to btree_walk() to restrict splitting or joining
of the root block.  When clients are merging into the root built from a
reference to the last parent in the fs tree we want to be careful that
we maintain a single root block that can be spliced back into the fs
tree.  We specifically check that the root block remains within the
split/join thresholds.  If it falls out of compliance we return an error
so that it can be spliced back into the fs tree and then split/joined
with its siblings.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
b6d0a45f6d Add btree_{get,set}_parent
Add calls for working with subtrees built around references to blocks in
the last level of parents.  This will let the server farm out btree
merging work where concurrency is built around safely working with all
the items and leaves that fall under a given parent block.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
d7f8896fac Add scoutfs_btree_parent_range
Add a btree helper for finding the range of keys which are found in
leaves referenced by the last parent block when searching for a given
key.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
65c39e5f97 Item seq is max of trans and lock write_seq
Rename the item version to seq and set it to the max of the transaction
seq and the lock's write_seq.  This lets btree item merging choose a seq
such that all dirty items written in future commits must have greater
seqs.  It can drop the seqs from items written to the fs tree during
btree merging, knowing that there aren't any older items out in
transactions that could be mistaken for newer items.
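
The rule reduces to a one-liner; the types here are stand-ins, not the
real item and lock structs:

    #include <stdint.h>

    static inline uint64_t item_seq(uint64_t trans_seq, uint64_t lock_write_seq)
    {
        /* max of the transaction seq and the lock's write_seq */
        return trans_seq > lock_write_seq ? trans_seq : lock_write_seq;
    }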

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
3c69861c03 Use core seq for lock write_seq
Rename the write_version lock field to write_seq and get it from the
core seq in the super block.

We're doing this to create a relationship between a client transaction's
seq and a lock's write_seq.  New transactions will have a greater seq
than all previously granted write locks and new write locks will have a
greater seq than all open transactions.  This will be used to resolve
ambiguities in item merging as transaction seqs are written out of order
and write locks span transactions.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:24:23 -07:00
Zach Brown
05ae756b74 Get trans seq from core seq
Get the next seq for a client transaction from the core seq in the super
block.  Remove its specific next_trans_seq field.

While making this change we switch to only using le64 in the network
message payloads, the rest of the processing now uses natural u64s.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-01 13:46:19 -07:00
Zach Brown
9051ceb6fc Add core seq to the super block
Add a new seq field to the super block which will be the source of all
incremented seqs throughout the system.  We give out incremented seqs to
callers with an atomic64_t in memory which is synced back to the super
block as we commit transactions in the server.
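
A simplified sketch of that flow; the kernel code would use an
atomic64_t where this uses a plain counter, and the surrounding names
are hypothetical:

    #include <stdint.h>

    struct super_seq { uint64_t core_seq; };   /* le64 on disk */
    struct seq_state { uint64_t next; };       /* atomic64_t in the kernel */

    static uint64_t get_next_seq(struct seq_state *st)
    {
        return ++st->next;            /* atomic64_inc_return() in the kernel */
    }

    static void server_commit(struct super_seq *super, struct seq_state *st)
    {
        super->core_seq = st->next;   /* sync the in-memory seq back to the
                                       * super as the server commits */
    }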

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-01 13:33:30 -07:00
Zach Brown
bad1c602f9 server hold_commit returns void
When we moved to the current allocator we fixed up the server commit
path to initialize the pair of allocators as a commit is finished rather
than before it starts.  This removed all the error cases from
hold_commit.  Remove the error handling from hold_commit calls to make
the system just a bit simpler.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-01 13:32:26 -07:00
Zach Brown
cee6ad34d3 Merge pull request #42 from versity/zab/fencing_and_reclaiming
Zab/fencing and reclaiming
2021-06-01 11:12:51 -07:00
Zach Brown
38a4a56741 Stop writing to other quorum slot blocks
The core quorum work loop assumes that it has exclusive access to its
slot's quorum block.  It uniquely marks blocks it writes and verifies
the marks on read to discover if another mount has written to its slot
under the assumption that this must be a configuration error that put
two mounts in the same slot.

But the design of the leader bit in the block violates the invariant
that each slot only writes to its own block.  As the server comes up and
fences previous leaders it writes to their blocks to clear their leader
bits.

The final hole in the design is that because we're fencing mounts, not
slots, each slot can have two mounts in play.  An active mount can be
using the slot while there is still a persistent record of a previous
mount that crashed in the slot and needs to be fenced.

All this comes together to have the server fence an old mount in a slot
while a new mount is coming up.  The new mount sees the mark change and
freaks out and stops participating in quorum.

The fix is to rework the quorum blocks so that each slot only writes to
its own block.  Instead of the server writing to each fenced mount's
slot, it writes a fence event to its block once all previous mounts have
been fenced.  We add a bit of bookkeeping so that the server can
discover when all block leader fence operations have completed.  Each
event gets its own term so we can compare events to discover live
servers.

We get rid of the write marks and instead have an event that is written
as a quorum agent starts up and is then checked on every read to make
sure it still matches.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-31 13:10:45 -07:00
Zach Brown
76076011a2 Add scoutfs-fenced man page
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
bdc0282fa7 Describe fencing in the scoutfs.5 man page
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
1199bac91d Fix quorum server shutdown
If the server shuts down it calls into quorum to tell it that the
server has exited.  This stops quorum from sending heartbeats that
suppress other leader elections.

The function that did this got the logic wrong.  It was setting the bit
instead of clearing it, having been initially written to set a bit when
the server exited.
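
The fix amounts to flipping a bit operation; a tiny illustration with a
placeholder flag word and bit name:

    #define QUORUM_FLAG_SERVER_UP  (1u << 0)

    static void quorum_server_exited(unsigned int *flags)
    {
        /* was: *flags |= QUORUM_FLAG_SERVER_UP; (leftover from the
         * original "set a bit on exit" design) */
        *flags &= ~QUORUM_FLAG_SERVER_UP;   /* clear it so heartbeats stop
                                             * suppressing other elections */
    }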

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
1e460e5cb0 Add scoutfs-fenced and its run scripts to spec
Install the scoutfs-fenced daemon and its run scripts in the rpm spec
file.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
877e30d60f Add client address to mounted_client item
Add the peername of the client's connected socket to its mounted_client
item as it mounts.  If the client doesn't recover then fencing can use
the IP to find the host to fence.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
a972e42fba Update dmesg filters for fencing and reclaim
Add regexes for the messages that come from fencing and reclaiming
resources from fenced mounts.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
0706669047 Clean up quorum block read error messages
The error messages from reading quorum blocks were confusing.  The mark
was being checked when the block had already seen an error, and we got
multiple messages for some errors.

This cleans it up a bit so we only get one error message for each error
source and each message contains relevant context.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
76cef6fdfc Let _recov_next_pending iterate over rids
Currently the server's recovery timeout work synchronously reclaims
resources for each client whose recovery timed out.
scoutfs_recov_next_pending() can always return the head of the pending
list because its caller will always remove it from the list as it
iterates.

As we move to real fencing the server will be creating fence requests
for all the timed out clients concurrently.  It will need to iterate
over all the rids for clients in recovery.

So we sort recovery's pending list by rid and change _recov_next_pending
to return the next pending rid after a rid argument.  This lets the
server iterate over all the pending rids at once.
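
A sketch of that iteration over a list kept sorted by rid; the types and
names are illustrative, not the real recovery structures:

    #include <stdint.h>
    #include <stddef.h>

    struct pending {
        uint64_t rid;
        struct pending *next;    /* list kept sorted by ascending rid */
    };

    static struct pending *recov_next_pending(struct pending *head,
                                              uint64_t after_rid)
    {
        for (struct pending *p = head; p; p = p->next)
            if (p->rid > after_rid)
                return p;
        return NULL;
    }

    /* callers can walk every pending rid without removing entries:
     *   for (p = recov_next_pending(head, 0); p;
     *        p = recov_next_pending(head, p->rid))
     *           create_fence_request(p->rid);
     */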

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
aad2d3db59 Add stage_tmpfile to .gitignore
We missed adding this new binary to .gitignore.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
933fc687c3 omap remove_rid might not find entry
Client recovery in the server doesn't add the omap rid for all the
clients that it's waiting for.  It only adds rids as clients connect.  A
client whose recovery timeout expires and is evicted will try to have
its omap rid removed without it ever having been added.

Today this triggers a warning and returns an error from a time when the
omap rid lifecycle was more rigid.  Now that it's being called by the
server's reclaim_rid, along with a bunch of other functions that succeed
if called for non-existent clients, let's have the omap remove_rid do
the same.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
6663034295 Run the fence agent in the background of tests
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
ab5466a771 Protect server shutting down with smp barriers
I saw a confusing hang that looked like a lack of ordering between
a waker setting shutting_down and a wait event testing it after
being woken up.  Let's see if more barriers help.
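
A userspace analogue of the ordering being enforced, using C11 atomics
in place of the kernel's smp barriers around shutting_down:

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool shutting_down;

    static void shut_down_server(void)
    {
        /* publish the flag before the wake so the woken waiter sees it */
        atomic_store_explicit(&shutting_down, true, memory_order_release);
        /* ... wake up the wait queue here ... */
    }

    static bool should_exit(void)
    {
        /* read with acquire semantics in the wait condition after waking */
        return atomic_load_explicit(&shutting_down, memory_order_acquire);
    }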

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
f3764b873b Save previous connected client address
Our connection state spans sockets that can disconnect and reconnect.
While sockets are connected we store the socket's remote address in the
connection's peername and we clear it as sockets disconnect.

Fencing wants to know the last connected address of the mount.  It's a
bit of metadata we know about the mount that can be used to find it and
fence it.  As we store the peer address we also stash it away as the
last known peer address for the socket.  Fencing can then use that
instead of the current socket peer address, which will have been cleared
because there's no socket connected.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
9ebc9d0f66 Manage client reconnect delay
The client currently always queues immediate connect work when its
notify_down is called.  It was assuming that notify_down is only called
from a healthy established connection.  But it's also called for
unsuccessful connect attempts that might not have timed out.  Say the
host is up but the port isn't listening.

This results in spamming connection attempts while an old stale leader
block persists, until a new server is elected, fences the previous
leader, and updates their quorum block.

The fix is to explicitly manage the connection work queueing delay.  We
only set it to immediately queue on mount and when we see a greeting
reply from the server.  We always set it to a longer timeout as we start
a connection attempt.  This means we'll always have a long reconnect
delay unless we really connected to a server.
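
A rough sketch of that delay policy; the timeout value and the function
names are placeholders:

    #define RECONNECT_DELAY_LONG_MS 5000u   /* assumed value */

    static unsigned int connect_delay_ms;

    static void on_mount(void)          { connect_delay_ms = 0; }
    static void on_greeting_reply(void) { connect_delay_ms = 0; }

    static unsigned int start_connect_attempt(void)
    {
        unsigned int delay = connect_delay_ms;

        /* always arm the long delay for the next attempt; only mount or
         * a real greeting reply resets it to immediate */
        connect_delay_ms = RECONNECT_DELAY_LONG_MS;
        return delay;
    }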

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
8b78f701a1 Add fence-and-reclaim test
Add a test which exercises the various reasons for fencing mounts and
checks that we reclaim the resources that they had.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00