Compare commits

...

118 Commits

Author SHA1 Message Date
Zach Brown
96f2ad29dc Add inode crtime creation time
Add an inode creation time field.  It's created for all new inodes.
It's visible to stat_more.  setattr_more can set it during
restore.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-08 11:00:30 -07:00
Zach Brown
53f11f5479 Merge pull request #46 from versity/zab/orphan_deletion_and_enospc
Zab/orphan deletion and enospc
2021-07-08 10:52:53 -07:00
Zach Brown
b4ede2ac6a Allow omap responses to disconnected originators
The omap message lifecycle is a little different from the server's usual
handling, which sends a response from the request handler.  The response
is sent long after the initial receive handler has stopped pinning the
connection to the client.  It's fine for the response to be dropped.

The main server request handler handled this case but other response
senders didn't.  Put this error handling in the server response sender
itself so that all callers are covered.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-08 09:36:07 -07:00
Zach Brown
cbe8d77f78 Prevent duplicate inode item deletion
We hide I_FREEING inodes from inode lookup to avoid inversions with
cluster locking.  This can result in duplicate inode structs for a
given inode number.  They can both race to try and delete the same items
for their shared inode number.  This leads to error messages from
evict_inode and could lead to corruption if they, for example, both try
to free the same data extents.

This adds very basic serialization so only one instance can try to
delete items at a time.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
5f682dabb5 Item cache invalidation uses seqs to avoid readers
The item cache has to be careful not to insert stale read items when
previously dirty items have been written and invalidated while a read
was in flight.

This was previously done by recording the possible range of items that a
reader could see based on the key range of its lock.  This is
disastrous when a workload operates entirely within one lock.  I ran
into this when testing a small number of files with massive amounts of
xattrs.  While any reader is in flight, none of the pages can be
invalidated because they all intersect with the one lock that covers all
the items in use.

The fix is to more naturally reflect the problem by tracking the
greatest item seq in pages and the earliest seq that any readers
can't see.  This lets invalidation skip only pages with items
that weren't visible to the earliest reader.

This more naturally reflects that the problem is due to the age of the
items, not their position in the key space.  Now only a few of the most
recently modified pages could be skipped and they'll be at the end
of the LRU and won't typically be visited.  As an added benefit it's
now much cheaper to add, delete, and test the active readers.

This fix took rm -rf of a full system's worth of xattrs from minutes of
constantly spinning and skipping all pages in the LRU down to seconds of
doing real removal work.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
120c2d342a Add create_xattr_loop test tool
Add a quick tool that creates xattrs in a tight loop.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
84454b38c5 Add mkfs -A for small device sizes
Normally mkfs would fail if we specify meta or data devices that are too
small.  We'd like to use small devices for test scenarios, though, so
add an option to allow specifying sizes smaller than the minimum
required sizes.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
29cfa81574 Remove unused leftovers from quorum changes
These forward declarations were for interfaces that have since been
removed or changed and are no longer needed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
73bf916182 Return ENOSPC as space gets low
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free.  This adds support for
returning ENOSPC to client posix allocators as free space gets low.

For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space.  The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks.  In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing).  When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.

Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.

For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.

The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.

We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when enospc is
going to be returned for metadata allocations.

We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.

And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
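The commit above describes the holder-side decision in prose; below is a minimal sketch of that flow under invented names (fake_trans_info, fake_hold_trans, the low-water fields) since the real scoutfs symbols aren't shown in this log.

#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical sketch only: a holder entering a transaction says whether
 * it will allocate new space, and gets -ENOSPC when its allocator is low
 * and the server has set the low flag.  Names are illustrative.
 */
struct fake_trans_info {
	bool server_set_low_flag;	/* server dipped into reserved blocks */
	unsigned long long avail_blocks;
	unsigned long long low_water;
};

static int fake_hold_trans(struct fake_trans_info *ti, bool allocating)
{
	if (allocating && ti->server_set_low_flag &&
	    ti->avail_blocks <= ti->low_water)
		return -ENOSPC;

	/* otherwise wait for the current transaction to cycle and join it */
	return 0;
}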
Zach Brown
9db3b475c0 Stop log merge work earlier during unmount
The forest log merge work calls into the client to send commit requests
to the server.  The forest is usually destroyed relatively late in the
teardown sequence, so its log merge work can still be running after the
client is destroyed.

Adding a _forest_stop call lets us stop the log merging work
before the client is destroyed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00
Zach Brown
24d682bf81 Add orphan-inodes test
Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00
Zach Brown
2957f3e301 Avoid warnings when evict has signals pending
Killing a task can end up in evict and break out of acquiring the locks
to perform final inode deletion.  This isn't necessarily fatal.  The
orphan task will come around and will delete the inode when it is truly
no longer referenced.

So let's silence the error and keep track of how many times it happens.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00
Zach Brown
07210b5734 Reliably delete orphaned inodes
Orphaned items haven't been deleted for quite a while -- the call to the
orphan inode scanner has been commented out for ages.  The deletion of
the orphan item didn't take rid zone locking into account as we moved
deletion from being strictly local to being performed by whoever last
used the inode.

This reworks orphan item management and brings back orphan inode
scanning to correctly delete orphaned inodes.

We get rid of the rid zone that was always _WRITE locked by each mount.
That made it impossible for other mounts to get a _WRITE lock to delete
orphan items.  Instead we rename it to the orphan zone and have orphan
item callers get _WRITE_ONLY locks inside their inode locks.  Now all
nodes can create and delete orphan items as they have _WRITE locks on
the associated inodes.

Then we refresh the orphan inode scanning function.  It now runs
regularly in the background of all mounts.  It avoids creating cluster
lock contention by finding candidates with unlocked forest hint reads
and by testing inode caches locally and via the open map before properly
locking and trying to delete the inode's items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:52:46 -07:00
Zach Brown
0374661a92 Merge pull request #43 from versity/zab/btree_merging
Zab/btree merging
2021-06-22 13:16:30 -07:00
Zach Brown
28759f3269 Rotate srch files as log trees items are reclaimed
The log merging work deletes log trees items once their item roots are
merged back into the fs root.  Those deleted items could still have
populated srch files that would be lost.  We force rotation of the srch
files in the items as they're reclaimed to turn them into rotated srch
files that can be compacted.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:37:45 -07:00
Zach Brown
5c3fdb48af Fix btree join item movement
Refilling a btree block by moving items from its siblings as it falls
under the join threshold had some pretty serious mistakes.  It used the
target block's total item count instead of the sibling's when deciding
how many items to move.  It didn't take item moving overruns into
account when deciding to compact so it could run out of contiguous free
space as it moved the last item.  And once it compacted it returned
without moving because the return was meant to be in the error case.

This is all fixed by correctly examining the sibling block to determine
if we should join a block up to 75% full or move a big chunk over,
compacting if the free space doesn't have room for an excessive worst
case overrun, and fixing the compaction error checking return typo.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
a7828a6410 Add log merge item allocators to alloc detail
The alloc iterator needs to find and include the totals of the avail and
freed allocator list heads in the log merge items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
a1d46e1a92 Fix mkfs btree item offset calculation
mkfs was miscalculating the offset of the start of the free region in
the center of blocks as it populated blocks with items.  It was using
the length of the free region as its offset in the block.  To find
the offset of the end of the free region in the block it has to be
taken relative to the end of the item array.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
d67db6662b Fix item cache val_len alignment math
Some item_val_len() callers were applying alignment twice, which isn't
needed.

And additions to erased_bytes as value lengths change  didn't take
alignment into account.  They could end up double counting if val_len
changes within the alignment are then accounted for again as the full
item and alignment is later deleted.  Additions to erased_bytes based on
val_len should always take alignment into account.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
c5c050bef0 Item cache might free null page on alloc error
The item cache allocates a page and a little tracking struct for each
cached page.  If the page allocation fails it might try to free a null
page pointer, which isn't allowed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
96d286d6e5 Zero btree item padding as items are created
Item creation, which fills out a new item at the end of the array of
item structs at the start of the block, didn't explicitly zero the item
struct padding.  It would only have been zero if the memory was
already zero, which is likely for new blocks, but isn't necessarily true
if the memory had previously been used by deleted values.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
9febc6b5dc Update btree block validator for 8byte alignment
The change to aligning values didn't update the btree block verifier's
total length calculation, and while we're in there we can also check
that values are correctly aligned.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
045b3ca8d4 Expand unused btree verifying walker
Previously we had an unused function that could be flipped on to verify
btree blocks during traversal.   This refactors the block verifier a bit
to be called by a verifying walker.  This will let callers walk paths to
leaves to verify the tree around operations, rather than verification
being performed during the next walk.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
ff882a4c4f Add btree total_above_join_low_water() test
Take the condition used to decide if a btree block needs to be joined
and put it in total_above_join_low_water() so that btree_merging will be
able to call it to see if the leaf block it's merging into needs to be
joined.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
3d1a0f06c0 Add scoutfs_btree_free_blocks
Add a btree function for freeing all the blocks in a btree without
having to cow the blocks to track which refs have been freed.  We use a
key from the caller to track which portions of the tree have been freed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
3488b4e6e0 Add scoutfs print support for log merge items
Add support for printing all the items in the log_merge tree that the
server uses to track log merging.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
c482204fcf Clean up btree root printing in superblock
Over time the printing of the btree roots embedded in the super block
has gotten a little out of hand.  Add a helper macro for the printf
format and args and re-order them to match their order in the
superblock.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
9711fef122 Update for core, trans, and item seq use
We now have a core seq number in the super that is advanced for multiple
users.    The client transaction seq comes from the core seq so we
remove the trans_seq from the super.  The item version is also converted
to use a seq that's derived from the core seq.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
91acf92666 Add client btree merge processing
Add the client work which is regularly scheduled to ask the server for
log merging work to do.  The relatively simple client work gets a
request from the server, finds the log roots to merge given the request
seq, performs the merge with a btree call and callbacks, and commits the
result to the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
9c2122f7de Add server btree merge processing
This adds the server processing side of the btree merge functionality.
The client isn't yet sending the log_merge messages so no merging will
be performed.

The bulk of the work happens as the server processes a get_log_merge
message to build a merge request for the client.  It starts a log merge
if one isn't in flight.  If one is in flight it checks to see if it
should be spliced and maybe finished.  In the common case it finds the
next range to be merged and sends the request to the client to process.

The commit_log_merge handler is the completion side of that request.  If
the request failed then we unwind its resources based on the stored
request item.  If it succeeds we record it in an item for get_
processing to splice eventually.

Then we modify two existing server code paths.

First, get_log_tree doesn't just create or use a single existing log
btree for a client mount.  If the existing log btree is large enough it
sets its finalized flag and advances the nr to use a new log btree.
That makes the old finalized log btree available for merging.

Then we need to be a bit more careful when reclaiming the open log btree
for a client.  We can't use next to find the only open log btree; instead
we use prev to find the last one and make sure that it isn't already
finalized.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
4d3ea3b59b Add format support for log btree merging
Add the format specification for the upcoming btree merging.  Log btrees
gain a finalized field, we add the super btree root and all the items
that the server will use to coordinate merging amongst clients, and we
add the two client net messages which the server will implement.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
298a6a8865 Add server get_stable_trans_seq()
Extract part of the get_last_seq handler into a call that finds the last
stable client transaction seq.  Log merging needs this to determine a
cutoff for stable items in log btrees.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
082924df1a Add scoutfs_key_is_ones()
Add a quick inline for testing that a key is all ones.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
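As a quick illustration of the helper described above, a standalone sketch follows; the byte-array key struct is a stand-in, not the real struct scoutfs_key.

#include <stdbool.h>
#include <stddef.h>

/* stand-in key layout; the real struct scoutfs_key has named fields */
struct fake_key {
	unsigned char bytes[24];
};

/* true only when every byte of the key is 0xff */
static inline bool fake_key_is_ones(const struct fake_key *key)
{
	size_t i;

	for (i = 0; i < sizeof(key->bytes); i++) {
		if (key->bytes[i] != 0xff)
			return false;
	}
	return true;
}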
Zach Brown
d8478ed6f1 Add scoutfs_btree_rebalance()
Add a btree call that just dirties a leaf block, joining and splitting
along the way so that the blocks in the path satisfy the balance
constraints.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
0538c882bc Add btree_merge()
Add a btree function for merging the items in a range from a number of
read-only input btrees into a destination btree.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
3a03a6a20c Add SUBTREE btree walk flag to restrict join/merge
Add a BTW_SUBTREE flag to btree_walk() to restrict splitting or joining
of the root block.   When clients are merging into the root built from a
reference to the last parent in the fs tree we want to be careful that
we maintain a single root block that can be spliced back into the fs
tree.   We specifically check that the root block remains within the
split/join thresholds.  If it falls out of compliance we return an error
so that it can be spliced back into the fs tree and then split/joined
with its siblings.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
b6d0a45f6d Add btree_{get,set}_parent
Add calls for working with subtrees built around references to blocks in
the last level of parents.  This will let the server farm out btree
merging work where concurrency is built around safely working with all
the items and leaves that fall under a given parent block.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
d7f8896fac Add scoutfs_btree_parent_range
Add a btree helper for finding the range of keys which are found in
leaves referenced by the last parent block when searching for a given
key.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
65c39e5f97 Item seq is max of trans and lock write_seq
Rename the item version to seq and set it to the max of the transaction
seq and the lock's write_seq.  This lets btree item merging choose a seq
such that all dirty items written in future commits must have greater
seqs.  It can drop the seqs from items written to the fs tree during
btree merging knowing that there aren't any older items out in
transactions that could be mistaken for newer items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
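The rule above fits in a line of code; a sketch with placeholder names, since the real assignment lives in scoutfs's item dirtying path:

/*
 * Sketch: a dirtied item's seq is the greater of the current transaction
 * seq and the covering lock's write_seq, so items dirtied in future
 * commits are guaranteed to carry greater seqs.  Placeholder names.
 */
static inline unsigned long long dirty_item_seq(unsigned long long trans_seq,
						unsigned long long lock_write_seq)
{
	return trans_seq > lock_write_seq ? trans_seq : lock_write_seq;
}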
Zach Brown
3c69861c03 Use core seq for lock write_seq
Rename the write_version lock field to write_seq and get it from the
core seq in the super block.

We're doing this to create a relationship between a client transaction's
seq and a lock's write_seq.  New transactions will have a greater seq
than all previously granted write locks and new write locks will have a
greater seq than all open transactions.  This will be used to resolve
ambiguities in item merging as transaction seqs are written out of order
and write locks span transactions.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:24:23 -07:00
Zach Brown
05ae756b74 Get trans seq from core seq
Get the next seq for a client transaction from the core seq in the super
block.  Remove its specific next_trans_seq field.

While making this change we switch to only using le64 in the network
message payloads, the rest of the processing now uses natural u64s.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-01 13:46:19 -07:00
Zach Brown
9051ceb6fc Add core seq to the super block
Add a new seq field to the super block which will be the source of all
incremented seqs throughout the system.  We give out incremented seqs to
callers with an atomic64_t in memory which is synced back to the super
block as we commit transactions in the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-01 13:33:30 -07:00
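A kernel-style sketch of the pattern the message describes, handing out seqs from an in-memory atomic64_t that the server copies back into the super block as it commits; the function and field names are placeholders.

#include <linux/atomic.h>
#include <linux/types.h>
#include <asm/byteorder.h>

static atomic64_t core_seq;	/* seeded from the super block at startup */

/* hand out the next incremented seq to a caller */
static u64 next_core_seq(void)
{
	return atomic64_inc_return(&core_seq);
}

/* copy the latest value back into the dirty super as the server commits */
static void sync_core_seq_to_super(__le64 *super_seq_field)
{
	*super_seq_field = cpu_to_le64(atomic64_read(&core_seq));
}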
Zach Brown
bad1c602f9 server hold_commit returns void
When we moved to the current allocator we fixed up the server commit
path to initialize the pair of allocators as a commit is finished rather
than before it starts.  This removed all the error cases from
hold_commit.  Remove the error handling from hold_commit calls to make
the system just a bit simpler.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-01 13:32:26 -07:00
Zach Brown
cee6ad34d3 Merge pull request #42 from versity/zab/fencing_and_reclaiming
Zab/fencing and reclaiming
2021-06-01 11:12:51 -07:00
Zach Brown
38a4a56741 Stop writing to other quorum slot blocks
The core quorum work loop assumes that it has exclusive access to its
slot's quorum block.  It uniquely marks blocks it writes and verifies
the marks on read to discover if another mount has written to its slot
under the assumption that this must be a configuration error that put
two mounts in the same slot.

But the design of the leader bit in the block violates the invariant
that a slot's block is only written by its own mount.   As the server
comes up and fences previous leaders it writes to their blocks to clear
their leader bits.

The final hole in the design is that because we're fencing mounts, not
slots, each slot can have two mounts in play.  An active mount can be
using the slot and there can still be a persistent record of a previous
mount in the slot that crashed that needs to be fenced.

All this comes together to have the server fence an old mount in a slot
while a new mount is coming up.  The new mount sees the mark change and
freaks out and stops participating in quorum.

The fix is to rework the quorum blocks so that each slot only writes to
its own block.  Instead of the server writing to each fenced mount's
slot, it writes a fence event to its block once all previous mounts have
been fenced.  We add a bit of bookkeeping so that the server can
discover when all block leader fence operations have completed.  Each
event gets its own term so we can compare events to discover live
servers.

We get rid of the write marks and instead have an event that is written
as a quorum agent starts up and is then checked on every read to make
sure it still matches.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-31 13:10:45 -07:00
Zach Brown
76076011a2 Add scoutfs-fenced man page
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
bdc0282fa7 Describe fencing in the scoutfs.5 man page
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
1199bac91d Fix quorum server shutdown
If the server shuts down it calls into quorum to tell it that the
server has exited.  This stops quorum from sending heartbeats that
suppress other leader elections.

The function that did this got the logic wrong.  It was setting the bit
instead of clearing it, having been initially written to set a bit when
the server exited.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
1e460e5cb0 Add scoutfs-fenced and its run scripts to spec
Install the scoutfs-fenced daemon and its run scripts in the rpm spec
file.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
877e30d60f Add client address to mounted_client item
Add the peername of the client's connected socket to its mounted_client
item as it mounts.  If the client doesn't recover then fencing can use
the IP to find the host to fence.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
a972e42fba Update dmesg filters for fencing and reclaim
Add regexes for the messages that come from fencing and reclaiming
resources from fenced mounts.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
0706669047 Clean up quorum block read error messages
The error messages from reading quorum blocks were confusing.  The mark
was being checked when the block had already seen an error, and we got
multiple messages for some errors.

This cleans it up a bit so we only get one error message for each error
source and each message contains relevant context.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
76cef6fdfc Let _recov_next_pending iterate over rids
Currently the server's recovery timeout work synchronously reclaims
resources for each client whose recovery timed out.
scoutfs_recov_next_pending() can always return the head of the pending
list because its caller will always remove it from the list as it
iterates.

As we move to real fencing the server will be creating fence requests
for all the timed out clients concurrently.  It will need to iterate
over all the rids for clients in recovery.

So we sort recovery's pending list by rid and change _recov_next_pending
to return the next pending rid after a rid argument.  This lets the
server iterate over all the pending rids at once.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
aad2d3db59 Add stage_tmpfile to .gitignore
We missed adding this newly added binary to .gitignore.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
933fc687c3 omap remove_rid might not find entry
Client recovery in the server doesn't add the omap rid for all the
clients that it's waiting for.  It only adds the rid as they connect.  A
client whose recovery timeout expires and is evicted will try to have
its omap rid removed without being added.

Today this triggers a warning and returns an error from a time when the
omap rid lifecycle was more rigid.  Now that it's being called by the
server's reclaim_rid, along with a bunch of other functions that succeed
if called for non-existent clients, let's have the omap remove_rid do
the same.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
6663034295 Run the fence agent in the background of tests
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
ab5466a771 Protect server shutting down with smp barriers
I saw a confusing hang that looked like a lack of ordering between
a waker setting shutting_down and a wait event testing it after
being woken up.  Let's see if more barriers help.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
f3764b873b Save previous connected client address
Our connection state spans sockets that can disconnect and reconnect.
While sockets are connected we store the socket's remote address in the
connection's peername and we clear it as sockets disconnect.

Fencing wants to know the last connected address of the mount.  It's a
bit of metadata we know about the mount that can be used to find it and
fence it.  As we store the peer address we also stash it away as the
last known peer address for the socket.  Fencing can then use that
instead of the current socket peer address which is guaranteed to be
uninitialized because there's no socket connected.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
9ebc9d0f66 Manage client reconnect delay
The client currently always queues immediate connect work when its
notify_down is called.  It was assuming that notify_down is only called
from a healthy established connection.   But it's also called for
unsuccessful connect attempts that might not have timed out.  Say the
host is up but the port isn't listening.

This results in spamming connection attempts while an old stale leader
block remains, until a new server is elected, fences the previous leader,
and updates their quorum block.

The fix is to explicitly manage the connection work queueing delay.  We
only set it to immediately queue on mount and when we see a greeting
reply from the server.  We always set it to a longer timeout as we start
a connection attempt.  This means we'll always have a long reconnect
delay unless we really connected to a server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
8b78f701a1 Add fence-and-reclaim test
Add a test which exercises the various reasons for fencing mounts and
checks that we reclaim the resources that they had.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
1f1f40f079 Add fence agent that processes fence requests
Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
943351944a Call fencing from the server
The server is responsible for calling the fencing subsystem.  It is the
source of fencing requests as it decides that previous mounts are
unresponsive.  It is responsible for reclaiming resources for fenced
mounts and freeing their associated fence request.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
b060eb4f5d Add fencing subsystem
Add the subsystem which tracks pending fence requests and exposes them
to userspace for processing.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:25 -07:00
Zach Brown
2dde729791 Add sysfs create attr w/ parent
Add sysfs attribute creation that can provide the parent dir kobject
instead of always creating the sysfs object dir off of the main
per-mount dir.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:19 -07:00
Zach Brown
ccb7c0bf4b Add rw sysfs attr wrapper
Add a wrapper around __ATTR_RW so that callers can add attributes with a
_store function.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:07 -07:00
Zach Brown
e9d04dcf8d Add forced unmount support
Add super_ops->umount_begin so that we can implement a forced unmount
which tries to avoid issuing any more network or storage ops.  It can
return errors and lose unsynchronized data.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:02:20 -07:00
Zach Brown
5dceac32db Merge pull request #40 from versity/zab/data_alloc_zones
Zab/data alloc zones
2021-05-24 13:00:48 -07:00
Zach Brown
ef440ead28 Add -z to run-test for data-alloc-zone-blocks
Add an option to run-tests which gets passed through to the
data-alloc-zone-blocks argument for mkfs.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
d0b04e790c Add data-alloc-zone-blocks argument to mkfs
Add an argument to mkfs which sets the data_alloc_zone_blocks volume
option.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
54644a5074 Add data_alloc_zone_blocks volume option
Add the data_alloc_zone_blocks volume option.  This changes the
behaviour of the server to try and give mounts free data extents which
fall in exclusive fixed-size zones.

We add the field to the scoutfs_volume_options struct and add it to the
set_volopt server handler which enforces constraints on the size of the
zones.

We then add fields to the log_trees struct which records the size of the
zones and sets bits for the zones that contain free extents in the
data_avail allocator root.  The get_log_trees handler is changed to read
all the zone bitmaps from all the items, pass those bitmaps in to
_alloc_move to direct data allocations, and finally update the bitmaps
in the log_trees items to cover the newly allocated extents.  The
log_trees data_alloc_zone fields are cleared as the mount's logs are
reclaimed to indicate that the mount is no longer writing to the zone.

The policy mechanism of finding free extents based on the bitmaps is
implemented down in _data_alloc_move().

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
52c2a465db Add zone awareness to scoutfs_alloc_move()
Add parameters so that scoutfs_alloc_move() can first search for source
extents in specified zones.  It uses relatively cheap searches through
the order items to find extents that intersect with the regions
described by the zone bitmaps.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
bc4975fad4 Add scoutfs_alloc_extents_cb()
Add an allocator call for getting a callback for all the extents in
btree items in an allocator root.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
9de3ae6dcb Index free extents by order of length
Allocators store free extents in two items, one sorted by their blkno
position and the other by their precise length.

The length index makes it easy to search for precise extent lengths, but
it makes it hard to search for a large extent within a given blkno
region.  Skipping in the blkno dimension has to be done for every
precise length value.

We don't need that level of precision.  If we index the extents by a
coarser order of the length then we have a fixed number of orders in
which we have to skip in the blkno dimension when searching within a
specific region.

This changes the length item to be stored at the log(8) order of the
length of the extents.  This groups extents into orders that are close
to the human-friendly base 10 orders of magnitude.

With this change the order field in the key no longer stores the precise
extent length.  To preserve the length of the extent we need to use
another field.  The only 64bit field remaining is the first, which has a
higher comparison priority than the type.  So we use the highest
comparison priority zone field to differentiate the position and order
indexes and can now use all three 64bit fields in the key.

Finally, we have to be careful when constructing a key to use _next when
searching for a large extent.  Previously keys were relying on the magic
property that building a key from an extent length of 0 ended up at the
key value -0 = 0.  That only worked because we never stored zero length
extents.  We now store zero length orders so we can't use the negative
trick anymore.  We explicitly treat 0 length extents carefully when
building keys and we subtract the order from U64_MAX to store the orders
from largest to smallest.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:25:56 -07:00
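The order arithmetic this message describes appears in the alloc.c hunk at the bottom of this diff; here is a small standalone restatement with a worked value, using __builtin_clzll in place of the kernel's fls64.

#include <stdint.h>
#include <stdio.h>

/* floor(log_8(len)): 1..7 -> 0, 8..63 -> 1, 64..511 -> 2, ... */
static uint64_t free_extent_order(uint64_t len)
{
	int fls = 64 - __builtin_clzll(len | 1);	/* fls64(len | 1) */

	return (uint64_t)(fls - 1) / 3;
}

/* smallest non-zero length that maps to the same order */
static uint64_t smallest_order_length(uint64_t len)
{
	return 1ULL << (free_extent_order(len) * 3);
}

int main(void)
{
	/* a 100 block extent lands in order 2 alongside lengths 64..511 */
	printf("order(100) = %llu, smallest length in order = %llu\n",
	       (unsigned long long)free_extent_order(100),
	       (unsigned long long)smallest_order_length(100));
	return 0;
}

The order index key then stores U64_MAX minus this order so that larger extents sort first, as described above.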
Zach Brown
0aa6005c99 Add volume options super, server, and sysfs
Introduce global volume options.  They're stored in the superblock and
can be seen in sysfs files that use network commands to get and
set the options on the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-19 14:15:06 -07:00
Zach Brown
973dc4fd1c Merge pull request #38 from versity/zab/read_xattr_deadlocks
Zab/read xattr deadlocks
2021-05-03 09:44:57 -07:00
Zach Brown
a5ca5ee36d Put back-to-back invalidated locks back on list
A lock that is undergoing invalidation is put on a list of locks in the
super block.  Invalidation requests put locks on the list.  While locks
are invalidated they're temporarily put on a private list.

To support a request arriving while the lock is being processed we
carefully manage the invalidation fields in the lock between the
invalidation worker and the incoming request.  The worker correctly
noticed that a new invalidation request had arrived but it left the lock
on its private list instead of putting it back on the invalidation list
for further processing.  The lock was unreachable, wouldn't get
invalidated, and caused everyone trying to use the lock to block
indefinitely.

When the worker sees another request arrive for an invalidating lock it
needs to move the lock from the private list back to the invalidation
list.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-30 10:00:07 -07:00
Zach Brown
603af327ac Ignore I_FREEING in all inode hash lookups
Previously we added an ilookup variant that ignored I_FREEING inodes
to avoid a deadlock between lock invalidation (lock->I_FREEING) and
eviction (I_FREEING->lock).

Now we're seeing similar deadlocks between eviction (I_FREEING->lock)
and fh_to_dentry's iget (lock->I_FREEING).

I think it's reasonable to ignore all inodes with I_FREEING set when
we're using our _test callback in ilookup or iget.  We can remove the
_nofreeing ilookup variant and move its I_FREEING test into the
iget_test callback provided to both ilookup and iget.

Callers will get the same result, it will just happen without waiting
for a previously I_FREEING inode to leave.  They'll get NULL from ilookup
instead of waiting.  They'll allocate and start to initialize a newer
instance of the inode and insert it alongside the previous instance.

We don't have inode number re-use so we don't have the problem where a
newly allocated inode number is relying on inode cache serialization to
not find a previously allocated inode that is being evicted.

This change does allow for concurrent iget of an inode number that is
being deleted on a local node.  This could happen in fh_to_dentry with a
raw inode number.  But this was already a problem between mounts because
they don't have a shared inode cache to serialize them.  Once we fix
that between nodes, we fix it on a single node as well.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-28 12:22:10 -07:00
Zach Brown
ca320d02cb Get i_mutex before cluster lock in file aio_read
The vfs often calls filesystem methods with i_mutex held.  This creates
a natural ordering of i_mutex outside of cluster locks.  The file
aio_read method acquired i_mutex after its cluster lock, creating a
deadlock with other vfs methods like setattr.

The acquisition of i_mutex after the cluster lock was due to using the
pattern where we use the per-task lock to discover if we're the first
user of the lock in a call chain.  Readpage has to do this, but file
aio_read doesn't.  It should never be called recursively.  So we can
acquire the i_mutex outside of the cluster lock and warn if we ever are
called recursively.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-28 12:11:06 -07:00
Zach Brown
5231cf4034 Add export-lookup-evict-race test
Add a test that creates races between fh_to_dentry and eviction
triggered by lock invalidation.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-28 12:11:06 -07:00
Andy Grover
f631058265 Merge pull request #37 from versity/zab/test_mkdir_rename_unlink
Add mkdir-rename-rmdir test
2021-04-27 13:21:27 -07:00
Zach Brown
1b4e60cae4 Add mkdir-rename-rmdir test
Add a test which performs mkdir, two renames of the dir, and rmdir on
all possible combinations of mounts.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-27 12:01:43 -07:00
Andy Grover
6eeaab3322 Merge pull request #35 from versity/zab/invalidate_already_pending
Handle back to back invalidation requests
2021-04-23 16:40:45 -07:00
Andy Grover
ac68d14b8d Merge pull request #36 from versity/zab/move_blocks_next_einval
Fix accidental EINVAL in move_blocks
2021-04-23 14:39:29 -07:00
Zach Brown
ecfc8a0d0e Merge pull request #33 from versity/zab/open_ino_map
Zab/open ino map
2021-04-23 10:55:11 -07:00
Zach Brown
63148d426e Fix accidental EINVAL in move_blocks
When move blocks is staging it requires an overlapping offline extent to
cover the entire region to move.

It performs the stage by modifying one extent at a time.  If there are
fragmented source extents it will modify each of them in turn across the
region.

When looking for the extent to match the source extent it looked from
the iblock of the start of the whole operation, not the start of the
source extent it's matching.  This meant that it would find the first
extent it had just modified, which would no longer be offline, and
would return -EINVAL.

The fix is to have it search from the logical start of the extent it's
trying to match, not the start of the region.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-23 10:39:34 -07:00
Zach Brown
a27c54568c Handle back to back invalidation requests
The client's incoming lock invalidation request handler triggers a
BUG_ON if it gets a request for a lock that is already processing a
previous invalidation request.  The server is supposed to only send
one request at a time.

The problem is that the batched invalidation request handling will send
responses outside of spinlock coverage before reacquiring the lock and
finishing processing once the response send has been successful.

This gives a window for another invalidation request to arrive after the
response was sent but before the invalidation finished processing.  This
triggers the bug.

The fix is to mark the lock such that we can recognize a valid second
request arriving after we send the response but before we finish
processing.  If it arrives we'll continue invalidation processing with
the arguments from the new request.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-22 17:00:50 -07:00
Zach Brown
dfc2f7a4e8 Remove unused scoutfs_free_unused_locks nr arg
The nr argument wasn't used.  It always tries to free as many as the
shrinker call will let it.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
94dd86f762 Process lock invalidation after shutdown
Lock teardown during unmount involves first calling shutdown and then
destroy.  The shutdown call is meant to ensure that it's safe to tear
down the client network connections.  Once shutdown returns locking is
promising that it won't call into the client to send new lock requests.

The current shutdown implementation is very heavy handed and shuts down
everything.  This creates a deadlock.  After calling lock shutdown, the
client will send its farewell and wait for a response.  The server might
not send the farewell response until other mounts have unmounted if our
client is in the server's mount.  In this case we stil have to be
processing lock invalidation requests to allow other unmounting clients
to make forward progress.

This is reasonably easy and safe to do.  We only use the shutdown flag
to stop lock calls that would change lock state and send requests.  We
don't have it stop incoming request processing in the work queueing
functions.  It's safe to keep processing incoming requests between
_shutdown and _destroy because the requests already come in through the
client.  As the client shuts down it will stop calling us.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
841d22e26e Disable task reclaim flags for block cache vmalloc
Even though we can pass in gfp flags to vmalloc it eventually calls pte
alloc functions which ignore the caller's flags and use user gfp flags.
This risks reclaim re-entering fs paths during allocations in the block
cache.  These allocs that allowed reclaim deep in the fs were causing
lockdep to add RECLAIM dependencies between locks and holler about
deadlocks.

We apply the same pattern that xfs does for disabling reclaim while
allocating vmalloced block payloads.  Setting PF_MEMALLOC_NOIO causes
reclaim in that task to clear __GFP_IO and __GFP_FS, regardless of the
individual allocation flags in the task, preventing recursion.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
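The xfs-style pattern mentioned above looks roughly like the sketch below; whether the block cache uses the memalloc_noio_save() helpers or sets the task flag directly isn't shown in this log, so treat the wrapper as an assumption.

#include <linux/sched.h>	/* memalloc_noio_save/restore */
#include <linux/vmalloc.h>

/*
 * While PF_MEMALLOC_NOIO is set, reclaim entered from this task drops
 * __GFP_IO and __GFP_FS, so the vmalloc of a block payload can't recurse
 * back into fs paths.  Illustrative wrapper, not the scoutfs function.
 */
static void *alloc_block_payload(size_t size)
{
	unsigned int noio_flags;
	void *data;

	noio_flags = memalloc_noio_save();
	data = vmalloc(size);
	memalloc_noio_restore(noio_flags);

	return data;
}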
Zach Brown
ba8bf13ae1 Update dmesg whitelist for recovery
The shared recovery layer outputs different messages than when it ran
only for lock_recovery in the lock server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
2949b6063f Clear lock invalidate_pending during destroy
Locks have a bunch of state that reflects concurrent processing.
Testing that state determines when it's safe to free a lock because
nothing is going on.

During unmount we abruptly stop processing locks.  Unmount will send a
farewell to the server which will remove all the state associated with
the client that's unmounting for all its locks, regardless of the state
the locks were in.

The client unmount path has to clean up the interrupted lock state and
free it, carefully avoiding assertions that would otherwise indicate
that we're freeing used locks.  The move to async lock invalidation
forgot to clean up the invalidation state.  Previously a synchronous
work function would set and clear invalidate_pending while it was
running.  Once we finished waiting for it invalidate_pending would be
clear.  The move to async invalidation work meant that we can still have
invalidate_pending with no work executing.  Lock destruction removed
locks from the invalidation list but forgot to clear the
invalidate_pending flag.

This triggered assertions during unmount that were otherwise harmless.
There was no other use of the lock; we just forgot to clean up the lock
state.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
1e88aa6c0f Shutdown data after trans
The data_info struct holds the data allocator that is filled by
transactions as they commit.  We have to free it after we've shutdown
transactions.  It's more like the forest in this regard so we move its
destruction down by the forest to group similar behaviour.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
d9aea98220 Shutdown locking before transactions
Shutting down the lock client waits for invalidation work and prevents
future work from being queued.  We're currently shutting down the
subsystems that lock calls before lock itself, leading to crashes if we
happen to have invalidations executing as we unmount.

Shutting down locking before its dependencies fixes this.  This was hit
in testing during the inode deletion fixes because they created the
perfect race by acquiring locks during unmount, so that the server was
very likely to send invalidations to one mount on behalf of
another as they both unmounted.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
04f4b8bcb3 Perform final transaction write before shutdown
Shutting down the transaction during unmount relied on the vfs unmount
path to perform a sync of any remaining dirty transaction.  There are
ways that we can dirty a transaction during unmount after it calls
the fs sync, so we try to write any remaining dirty transaction before
shutting down.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
fead263af3 Remove unused sb_info shutdown
We're no longer using the shutdown field in our sb info struct.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
4389c73c14 Fix deadlock between lock invalidate and evict
We've had a long-standing deadlock between lock invalidation and
eviction.  Invalidating a lock wants to lookup inodes and drop their
resources while blocking locks.  Eviction wants to get a lock to perform
final deletion while the inode has I_FREEING set, which blocks lookups.

We only saw this deadlock a handful of times in all of the time we've
run the code, but it's much more common now that we're acquiring locks
in iput to test that nlink is zero instead of only when nlink is
zero.  I see unmount hang regularly when testing final inode deletion.

This adds a lookup variant for invalidation which will refuse to
return freeing inodes so they won't be waited on.  Once they're freeing
they can't be seen by future lock users so they don't need to be
invalidated.  This keeps the lock invalidation promise and avoids
sleeping on freeing inodes which creates the deadlock.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
dba88705f7 Fix t_umount mount point number
t_umount had a typo that had it try to unmount a mount based on a
caller's variable, which accidentally happened to work for its only
caller.  Future callers would not have been so lucky.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
715c29aad3 Proactively drop dentry/inode caches outside locks
Previously we wouldn't try and remove cached dentries and inodes as
lock revocation removed cluster lock coverage.  The next time
we tried to use the cached dentries or inodes we'd acquire
a lock and refresh them.

But now cached inodes prevent final inode deletion.  If they linger
outside cluster locking then any final deletion will need to be deferred
until all its cached inodes are naturally dropped at some point in the
future across the cluster.  It might take refreshing the dentries or for
memory pressure to push out the old cached inodes.

This tries to proactively drop cached dentries and inodes as we lose
cluster lock coverage if they're not actively referenced.  We need to be
careful not to perform final inode deletion during lock invalidation
because it will deadlock, so we defer an iput which could delete during
evict out to async work.

Now deletion can be done synchronously in the task that is performing
the unlink because previous use of the inode on remote mounts hasn't
left unused cached inodes sitting around.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
b244b2d59c Add inode-deletion test
Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
22371fe5bd Fully destroy inodes after all mounts evict
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount.  This can delete inodes from
under other mounts which have opened the inode before it was unlinked on
another mount.

We fix this by adding cached inode tracking.  Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.

This makes the two fast paths of opening and closing linked files and of
deleting a file that was unlinked locally pay only a moderate cost:
either maintaining the bitmap locally or getting the open map once
per lock group.  Removing many files in a group will only lock and get
the open map once per group.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
c6fd807638 Use recov to manage lock recovery
Now that we have the recov layer we can have the lock server use it to
track lock recovery.  The lock server no longer needs its own recovery
tracking structures and can instead call recov.  We add a call for the
server to kick lock processing once lock recovery finishes.  We
can get rid of the persistent lock_client items now that the server is
driving recovery from the mounted_client items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Zach Brown
592f472a1c Use recov in server to recover client greetings
The server starts recovery when it finds mounted client items as it
starts up.  The clients are done recovering once they send their
greeting.  If they don't recover in time then they'll be fenced.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Zach Brown
a65775588f Add server recovery helpers
Add a little set of functions to help the server track which clients are
waiting to recover which state.  The open map messages need to wait for
recovery so we're moving recovery out of being only in the lock server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Zach Brown
da1af9b841 Add scoutfs inode ino lock coverage
Add lock coverage which tracks if the inode has been refreshed and is
covered by the inode group cluster lock.  This will be used by
drop_inode and evict_inode to discover that the inode is current and
doesn't need to be refreshed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Zach Brown
accd680a7e Fix block setup always returning 0
Another case of returning 0 instead of ret.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Andy Grover
cbb031bb5d Merge pull request #32 from versity/zab/block_rhashtable_insert_fixes
Zab/block rhashtable insert fixes
2021-04-13 10:42:17 -07:00
Zach Brown
c3290771a0 Block cache use rht _lookup_ insert for EEXIST
The sneaky rhashtable_insert_fast() can't return -EEXIST despite the
last line of the function *REALLY* making it look like it can.  It just
inserts new objects at the head of the bucket lists without comparing
the insertion with existing objects.

The block cache was relying on insertion to resolve duplicate racing
allocated blocks.  Because it couldn't return -EEXIST we could get
duplicate cached blocks present in the hash table.

rhashtable_lookup_insert_fast() fixes this by actually comparing the
inserted object's key with the objects found in the insertion bucket.  A
racing allocator trying to insert a duplicate cached block will get an
error, drop their allocated block, and retry their lookup.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 09:24:23 -07:00
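A sketch of the racing-insert pattern described above, with a placeholder block struct around the real rhashtable_lookup_insert_fast() call; the actual block cache structures differ.

#include <linux/rhashtable.h>
#include <linux/slab.h>
#include <linux/stddef.h>

struct cached_block {
	struct rhash_head hash_head;
	u64 blkno;
	/* ... page payload, refcount, lru linkage, etc ... */
};

static const struct rhashtable_params block_ht_params = {
	.head_offset	= offsetof(struct cached_block, hash_head),
	.key_offset	= offsetof(struct cached_block, blkno),
	.key_len	= sizeof(u64),
};

/*
 * Insertion compares keys, so a racing allocator that loses gets -EEXIST,
 * frees its block, and retries its lookup to find the winner's block.
 */
static struct cached_block *insert_new_block(struct rhashtable *ht,
					     struct cached_block *new)
{
	if (rhashtable_lookup_insert_fast(ht, &new->hash_head,
					  block_ht_params)) {
		kfree(new);	/* placeholder free; caller retries lookup */
		return NULL;
	}

	return new;
}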
Zach Brown
cf3cb3f197 Wait for rhashtable to rehash on insert EBUSY
The rhashtable can return EBUSY if you insert fast enough to trigger an
expansion of the next table size that is waiting to be rehashed in an
rcu callback.  If we get EBUSY from rhashtable_insert we call
synchronize_rcu to wait for the rehash to complete before trying again.

This was hit in testing restores of a very large namespace and took a
few hours to hit.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 09:24:23 -07:00
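The retry described above can be sketched as a small wrapper; the wrapper name is a placeholder, while rhashtable_lookup_insert_fast() and synchronize_rcu() are the real interfaces the message refers to.

#include <linux/rhashtable.h>
#include <linux/rcupdate.h>
#include <linux/errno.h>

/*
 * -EBUSY means a table expansion is waiting on its RCU callback; wait
 * for a grace period so the pending rehash can run, then try again.
 */
static int insert_waiting_for_rehash(struct rhashtable *ht,
				     struct rhash_head *head,
				     const struct rhashtable_params params)
{
	int ret;

	do {
		ret = rhashtable_lookup_insert_fast(ht, head, params);
		if (ret == -EBUSY)
			synchronize_rcu();
	} while (ret == -EBUSY);

	return ret;
}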
Andy Grover
cb4ed98b3c Merge pull request #31 from versity/zab/block_shrink_wait_for_rebalance
Block cache shrink restart waits for rcu callbacks
2021-04-08 09:03:12 -07:00
Zach Brown
9ee7f7b9dc Block cache shrink restart waits for rcu callbacks
We're seeing cpu livelocks in block shrinking where counters show that a
single block cache shrink call is only getting EAGAIN from repeated
rhashtable walk attempts.  It occurred to me that the running task might
be preventing an RCU grace period from ending by never blocking.

The hope of this commit is that by waiting for rcu callbacks to run
we'll ensure that any pending rebalance callback runs before we retry
the rhashtable walk again.  I haven't been able to reproduce this easily
so this is a stab in the dark.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-07 12:50:50 -07:00
Zach Brown
300791ecfa Merge pull request #29 from agrover/cleanup
Cleanup
2021-04-07 12:27:00 -07:00
Andy Grover
4630b77b45 cleanup: Use flexible array members instead of 0-length arrays
See Documentation/process/deprecated.rst:217, items[] now preferred over
items[0].

Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-07 10:14:47 -07:00
Andy Grover
bdc43ca634 cleanup: Fix ESTALE handling in forest_read_items
Kinda weird to goto back to the out label and then out the bottom. Just
return -EIO, like forest_next_hint() does.

Don't call client_get_roots() right before retry, since is the first thing
retry does.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-07 10:14:04 -07:00
Andy Grover
6406f05350 cleanup: Remove struct net_lock_grant_response
We're not using the roots member of this struct, so we can just
use struct scoutfs_net_lock directly.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-07 10:13:56 -07:00
Andy Grover
820b7295f0 cleanup: Unused LIST_HEADs
Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-05 16:23:41 -07:00
Zach Brown
b3611103ee Merge pull request #26 from agrover/tmpfile
Support O_TMPFILE and allow MOVE_BLOCKS into released extents
2021-04-05 15:23:41 -07:00
Andy Grover
0deb232d3f Support O_TMPFILE and allow MOVE_BLOCKS into released extents
Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.

Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.

RH-compat: tmpfile support is actually backported by RH into the 3.10 kernel.
We need to use some of their kabi-maintaining wrappers to use it:
use a struct inode_operations_wrapper instead of base struct
inode_operations, set S_IOPS_WRAPPER flag in i_flags. This lets
RH's modified vfs_tmpfile() find our tmpfile fn pointer.

Add a test that tests both creating tmpfiles as well as moving their
contents into a destination file via MOVE_BLOCKS.

xfstests common/004 now runs because tmpfile is supported.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-05 14:23:44 -07:00
Andy Grover
1366e254f9 Merge pull request #30 from versity/zab/srch_block_ref_leak
Zab/srch block ref leak
2021-04-01 16:50:34 -07:00
90 changed files with 9345 additions and 1326 deletions

View File

@@ -18,6 +18,7 @@ scoutfs-y += \
dir.o \
export.o \
ext.o \
fence.o \
file.o \
forest.o \
inode.o \
@@ -27,9 +28,11 @@ scoutfs-y += \
lock_server.o \
msg.o \
net.o \
omap.o \
options.o \
per_task.o \
quorum.o \
recov.o \
scoutfs_trace.o \
server.o \
sort_priv.o \
@@ -40,6 +43,7 @@ scoutfs-y += \
trans.o \
triggers.o \
tseq.o \
volopt.o \
xattr.o
#

View File

@@ -29,8 +29,8 @@
* The core allocator uses extent items in btrees rooted in the super.
* Each free extent is stored in two items. The first item is indexed
* by block location and is used to merge adjacent extents when freeing.
* The second item is indexed by length and is used to find large
* extents to allocate from.
* The second item is indexed by the order of the length and is used to
* find large extents to allocate from.
*
* Free extent always consumes the front of the largest extent. This
* attempts to discourage fragmentation by giving smaller freed extents
@@ -67,25 +67,52 @@
*/
/*
* Free extents don't have flags and are stored in two indexes sorted by
* block location and by length, largest first. The block location key
* is set to the final block in the extent so that we can find
* intersections by calling _next() iterators starting with the block
* we're searching for.
* Return the order of the length of a free extent, which we define as
* floor(log_8(len)): 0..7 = 0, 8..63 = 1, etc.
*/
static void init_ext_key(struct scoutfs_key *key, int type, u64 start, u64 len)
static u64 free_extent_order(u64 len)
{
return (fls64(len | 1) - 1) / 3;
}
/*
* The smallest (non-zero) length that will be mapped to the same order
* as the given length.
*/
static u64 smallest_order_length(u64 len)
{
return 1ULL << (free_extent_order(len) * 3);
}
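/*
 * Editor's note, illustrative values only (not part of this change):
 * free_extent_order(1..7) == 0, free_extent_order(8..63) == 1, and
 * free_extent_order(64..511) == 2.  For example, len == 100 gives
 * fls64(100) == 7, order (7 - 1) / 3 == 2, and so
 * smallest_order_length(100) == 1 << (2 * 3) == 64.
 */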
/*
* Free extents don't have flags and are stored in two indexes sorted by
* block location and by length order, largest first. The location key
* field is set to the final block in the extent so that we can find
* intersections by calling _next() with the start of the range we're
* searching for.
*
* We never store 0 length extents but we do build keys for searching
* the order index from 0,0 without having to map it to a real extent.
*/
static void init_ext_key(struct scoutfs_key *key, int zone, u64 start, u64 len)
{
*key = (struct scoutfs_key) {
.sk_zone = SCOUTFS_FREE_EXTENT_ZONE,
.sk_type = type,
.sk_zone = zone,
};
if (type == SCOUTFS_FREE_EXTENT_BLKNO_TYPE) {
if (len == 0) {
/* we only use 0 len extents for magic 0,0 order lookups */
WARN_ON_ONCE(zone != SCOUTFS_FREE_EXTENT_ORDER_ZONE || start != 0);
return;
}
if (zone == SCOUTFS_FREE_EXTENT_BLKNO_ZONE) {
key->skfb_end = cpu_to_le64(start + len - 1);
key->skfb_len = cpu_to_le64(len);
} else if (type == SCOUTFS_FREE_EXTENT_LEN_TYPE) {
key->skfl_neglen = cpu_to_le64(-len);
key->skfl_blkno = cpu_to_le64(start);
} else if (zone == SCOUTFS_FREE_EXTENT_ORDER_ZONE) {
key->skfo_revord = cpu_to_le64(U64_MAX - free_extent_order(len));
key->skfo_end = cpu_to_le64(start + len - 1);
key->skfo_len = cpu_to_le64(len);
} else {
BUG();
}
@@ -93,23 +120,27 @@ static void init_ext_key(struct scoutfs_key *key, int type, u64 start, u64 len)
static void ext_from_key(struct scoutfs_extent *ext, struct scoutfs_key *key)
{
if (key->sk_type == SCOUTFS_FREE_EXTENT_BLKNO_TYPE) {
if (key->sk_zone == SCOUTFS_FREE_EXTENT_BLKNO_ZONE) {
ext->start = le64_to_cpu(key->skfb_end) -
le64_to_cpu(key->skfb_len) + 1;
ext->len = le64_to_cpu(key->skfb_len);
} else {
ext->start = le64_to_cpu(key->skfl_blkno);
ext->len = -le64_to_cpu(key->skfl_neglen);
ext->start = le64_to_cpu(key->skfo_end) -
le64_to_cpu(key->skfo_len) + 1;
ext->len = le64_to_cpu(key->skfo_len);
}
ext->map = 0;
ext->flags = 0;
/* we never store 0 length extents */
WARN_ON_ONCE(ext->len == 0);
}
struct alloc_ext_args {
struct scoutfs_alloc *alloc;
struct scoutfs_block_writer *wri;
struct scoutfs_alloc_root *root;
int type;
int zone;
};
static int alloc_ext_next(struct super_block *sb, void *arg,
@@ -120,13 +151,13 @@ static int alloc_ext_next(struct super_block *sb, void *arg,
struct scoutfs_key key;
int ret;
init_ext_key(&key, args->type, start, len);
init_ext_key(&key, args->zone, start, len);
ret = scoutfs_btree_next(sb, &args->root->root, &key, &iref);
if (ret == 0) {
if (iref.val_len != 0)
ret = -EIO;
else if (iref.key->sk_type != args->type)
else if (iref.key->sk_zone != args->zone)
ret = -ENOENT;
else
ext_from_key(ext, iref.key);
@@ -139,19 +170,19 @@ static int alloc_ext_next(struct super_block *sb, void *arg,
return ret;
}
static int other_type(int type)
static int other_zone(int zone)
{
if (type == SCOUTFS_FREE_EXTENT_BLKNO_TYPE)
return SCOUTFS_FREE_EXTENT_LEN_TYPE;
else if (type == SCOUTFS_FREE_EXTENT_LEN_TYPE)
return SCOUTFS_FREE_EXTENT_BLKNO_TYPE;
if (zone == SCOUTFS_FREE_EXTENT_BLKNO_ZONE)
return SCOUTFS_FREE_EXTENT_ORDER_ZONE;
else if (zone == SCOUTFS_FREE_EXTENT_ORDER_ZONE)
return SCOUTFS_FREE_EXTENT_BLKNO_ZONE;
else
BUG();
}
/*
* Insert an extent along with its matching item which is indexed by
* opposite of its len or blkno. If we succeed we update the root's
* opposite of its order or blkno. If we succeed we update the root's
* record of the total length of all the stored extents.
*/
static int alloc_ext_insert(struct super_block *sb, void *arg,
@@ -167,8 +198,8 @@ static int alloc_ext_insert(struct super_block *sb, void *arg,
if (WARN_ON_ONCE(map || flags))
return -EINVAL;
init_ext_key(&key, args->type, start, len);
init_ext_key(&other, other_type(args->type), start, len);
init_ext_key(&key, args->zone, start, len);
init_ext_key(&other, other_zone(args->zone), start, len);
ret = scoutfs_btree_insert(sb, args->alloc, args->wri,
&args->root->root, &key, NULL, 0);
@@ -196,8 +227,8 @@ static int alloc_ext_remove(struct super_block *sb, void *arg,
int ret;
int err;
init_ext_key(&key, args->type, start, len);
init_ext_key(&other, other_type(args->type), start, len);
init_ext_key(&key, args->zone, start, len);
init_ext_key(&other, other_zone(args->zone), start, len);
ret = scoutfs_btree_delete(sb, args->alloc, args->wri,
&args->root->root, &key);
@@ -619,7 +650,7 @@ int scoutfs_dalloc_return_cached(struct super_block *sb,
.alloc = alloc,
.wri = wri,
.root = &dalloc->root,
.type = SCOUTFS_FREE_EXTENT_BLKNO_TYPE,
.zone = SCOUTFS_FREE_EXTENT_BLKNO_ZONE,
};
int ret = 0;
@@ -645,6 +676,14 @@ int scoutfs_dalloc_return_cached(struct super_block *sb,
*
* Unlike meta allocations, the caller is expected to serialize
* allocations from the root.
*
* ENOBUFS is returned if the data allocator ran out of space and we can
* probably refill it from the server. The caller is expected to back
* out, commit the transaction, and try again.
*
* ENOSPC is returned if the data allocator ran out of space but we have
* a flag from the server telling us that there's no more space
* available. This is a hard error and should be returned.
*/
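/*
 * Editor's sketch (not part of this change) of the expected caller loop,
 * in pseudocode with hypothetical helper names:
 *
 *	for (;;) {
 *		err = <allocate data extents for this write>;
 *		if (err != -ENOBUFS)
 *			break;
 *		<back out partial work>;
 *		<commit the transaction so the server can refill the allocator>;
 *	}
 *	if (err == -ENOSPC)
 *		<fail the write; the volume really is out of space>;
 */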
int scoutfs_alloc_data(struct super_block *sb, struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
@@ -655,7 +694,7 @@ int scoutfs_alloc_data(struct super_block *sb, struct scoutfs_alloc *alloc,
.alloc = alloc,
.wri = wri,
.root = &dalloc->root,
.type = SCOUTFS_FREE_EXTENT_LEN_TYPE,
.zone = SCOUTFS_FREE_EXTENT_ORDER_ZONE,
};
struct scoutfs_extent ext;
u64 len;
@@ -693,13 +732,13 @@ int scoutfs_alloc_data(struct super_block *sb, struct scoutfs_alloc *alloc,
ret = 0;
out:
if (ret < 0) {
/*
* Special retval meaning there wasn't space to alloc from
* this txn. Doesn't mean filesystem is completely full.
* Maybe upper layers want to try again.
*/
if (ret == -ENOENT)
ret = -ENOBUFS;
if (ret == -ENOENT) {
if (le32_to_cpu(dalloc->root.flags) & SCOUTFS_ALLOC_FLAG_LOW)
ret = -ENOSPC;
else
ret = -ENOBUFS;
}
*blkno_ret = 0;
*count_ret = 0;
} else {
@@ -728,7 +767,7 @@ int scoutfs_free_data(struct super_block *sb, struct scoutfs_alloc *alloc,
.alloc = alloc,
.wri = wri,
.root = root,
.type = SCOUTFS_FREE_EXTENT_BLKNO_TYPE,
.zone = SCOUTFS_FREE_EXTENT_BLKNO_ZONE,
};
int ret;
@@ -741,6 +780,95 @@ int scoutfs_free_data(struct super_block *sb, struct scoutfs_alloc *alloc,
return ret;
}
/*
* Return the first zone bit that the extent intersects with.
*/
static int first_extent_zone(struct scoutfs_extent *ext, __le64 *zones, u64 zone_blocks)
{
int first;
int last;
int nr;
first = div64_u64(ext->start, zone_blocks);
last = div64_u64(ext->start + ext->len - 1, zone_blocks);
nr = find_next_bit_le(zones, SCOUTFS_DATA_ALLOC_MAX_ZONES, first);
if (nr <= last)
return nr;
return SCOUTFS_DATA_ALLOC_MAX_ZONES;
}
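/*
 * Editor's example (not part of this change): with zone_blocks == 1024,
 * an extent starting at block 2500 with length 2000 covers blocks
 * 2500..4499 and so spans zones 2..4; first == 2, last == 4, and the
 * first bit set in that range, if any, is returned.
 */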
/*
* Find an extent in specific zones to satisfy an allocation. We use
* the order items to search for the largest extent that intersects with
* the zones whose bits are set in the caller's bitmap.
*/
static int find_zone_extent(struct super_block *sb, struct scoutfs_alloc_root *root,
__le64 *zones, u64 zone_blocks,
struct scoutfs_extent *found_ret, u64 count,
struct scoutfs_extent *ext_ret)
{
struct alloc_ext_args args = {
.root = root,
.zone = SCOUTFS_FREE_EXTENT_ORDER_ZONE,
};
struct scoutfs_extent found;
struct scoutfs_extent ext;
u64 start;
u64 len;
int nr;
int ret;
/* don't bother when there are no bits set */
if (find_next_bit_le(zones, SCOUTFS_DATA_ALLOC_MAX_ZONES, 0) ==
SCOUTFS_DATA_ALLOC_MAX_ZONES)
return -ENOENT;
/* start searching for largest extent from the first zone */
len = smallest_order_length(SCOUTFS_BLOCK_SM_MAX);
nr = 0;
for (;;) {
/* search for extents in the next zone at our order */
nr = find_next_bit_le(zones, SCOUTFS_DATA_ALLOC_MAX_ZONES, nr);
if (nr >= SCOUTFS_DATA_ALLOC_MAX_ZONES) {
/* wrap down to next smaller order if we run out of bits */
len >>= 3;
if (len == 0) {
ret = -ENOENT;
break;
}
nr = find_next_bit_le(zones, SCOUTFS_DATA_ALLOC_MAX_ZONES, 0);
}
start = (u64)nr * zone_blocks;
ret = scoutfs_ext_next(sb, &alloc_ext_ops, &args, start, len, &found);
if (ret < 0)
break;
/* see if the next extent intersects any zones */
nr = first_extent_zone(&found, zones, zone_blocks);
if (nr < SCOUTFS_DATA_ALLOC_MAX_ZONES) {
start = (u64)nr * zone_blocks;
ext.start = max(start, found.start);
ext.len = min(count, found.start + found.len - ext.start);
*found_ret = found;
*ext_ret = ext;
ret = 0;
break;
}
/* continue searching past extent */
nr = div64_u64(found.start + found.len - 1, zone_blocks) + 1;
len = smallest_order_length(found.len);
}
return ret;
}
/*
* Move extent items adding up to the requested total length from the
@@ -751,6 +879,11 @@ int scoutfs_free_data(struct super_block *sb, struct scoutfs_alloc *alloc,
* -ENOENT is returned if we run out of extents in the source tree
* before moving the total.
*
* The caller can specify that extents in the source tree should first
* be found based on their zone bitmaps. We'll first try to find
* extents in the exclusive zones, then vacant zones, and then we'll
* fall back to normal allocation that ignores zones.
*
* This first pass is not optimal because it performs full btree walks
* per extent. We could optimize this with more clever btree item
* manipulation functions which can iterate through src and dst blocks
@@ -759,32 +892,77 @@ int scoutfs_free_data(struct super_block *sb, struct scoutfs_alloc *alloc,
int scoutfs_alloc_move(struct super_block *sb, struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
struct scoutfs_alloc_root *dst,
struct scoutfs_alloc_root *src, u64 total)
struct scoutfs_alloc_root *src, u64 total,
__le64 *exclusive, __le64 *vacant, u64 zone_blocks)
{
struct alloc_ext_args args = {
.alloc = alloc,
.wri = wri,
};
struct scoutfs_extent found;
struct scoutfs_extent ext;
u64 moved = 0;
u64 count;
int ret = 0;
int err;
if (zone_blocks == 0) {
exclusive = NULL;
vacant = NULL;
}
while (moved < total) {
args.root = src;
args.type = SCOUTFS_FREE_EXTENT_LEN_TYPE;
ret = scoutfs_ext_alloc(sb, &alloc_ext_ops, &args,
0, 0, total - moved, &ext);
count = total - moved;
if (exclusive) {
/* first try to find extents in our exclusive zones */
ret = find_zone_extent(sb, src, exclusive, zone_blocks,
&found, count, &ext);
if (ret == -ENOENT) {
exclusive = NULL;
continue;
}
} else if (vacant) {
/* then try to find extents in vacant zones */
ret = find_zone_extent(sb, src, vacant, zone_blocks,
&found, count, &ext);
if (ret == -ENOENT) {
vacant = NULL;
continue;
}
} else {
/* otherwise fall back to finding extents anywhere */
args.root = src;
args.zone = SCOUTFS_FREE_EXTENT_ORDER_ZONE;
ret = scoutfs_ext_next(sb, &alloc_ext_ops, &args, 0, 0, &found);
if (ret == 0) {
ext.start = found.start;
ext.len = min(count, found.len);
}
}
if (ret < 0)
break;
/* searching set start/len, finish initializing alloced extent */
ext.map = found.map ? ext.start - found.start + found.map : 0;
ext.flags = found.flags;
/* remove the allocation from the found extent */
args.root = src;
args.zone = SCOUTFS_FREE_EXTENT_BLKNO_ZONE;
ret = scoutfs_ext_remove(sb, &alloc_ext_ops, &args, ext.start, ext.len);
if (ret < 0)
break;
/* insert the allocated extent into the dest */
args.root = dst;
args.type = SCOUTFS_FREE_EXTENT_BLKNO_TYPE;
args.zone = SCOUTFS_FREE_EXTENT_BLKNO_ZONE;
ret = scoutfs_ext_insert(sb, &alloc_ext_ops, &args, ext.start,
ext.len, ext.map, ext.flags);
if (ret < 0) {
/* and put it back in src if insertion failed */
args.root = src;
args.type = SCOUTFS_FREE_EXTENT_BLKNO_TYPE;
args.zone = SCOUTFS_FREE_EXTENT_BLKNO_ZONE;
err = scoutfs_ext_insert(sb, &alloc_ext_ops, &args,
ext.start, ext.len, ext.map,
ext.flags);
@@ -852,7 +1030,7 @@ out:
* a list block and all the btree blocks that store extent items.
*
* At most, an extent operation can dirty down three paths of the tree
* to modify a blkno item and two distant len items. We can grow and
* to modify a blkno item and two distant order items. We can grow and
* split the root, and then those three paths could share blocks but each
* modify two leaf blocks.
*/
@@ -901,7 +1079,7 @@ int scoutfs_alloc_fill_list(struct super_block *sb,
.alloc = alloc,
.wri = wri,
.root = root,
.type = SCOUTFS_FREE_EXTENT_LEN_TYPE,
.zone = SCOUTFS_FREE_EXTENT_ORDER_ZONE,
};
struct scoutfs_alloc_list_block *lblk;
struct scoutfs_block *bl = NULL;
@@ -958,7 +1136,7 @@ int scoutfs_alloc_empty_list(struct super_block *sb,
.alloc = alloc,
.wri = wri,
.root = root,
.type = SCOUTFS_FREE_EXTENT_BLKNO_TYPE,
.zone = SCOUTFS_FREE_EXTENT_BLKNO_ZONE,
};
struct scoutfs_alloc_list_block *lblk = NULL;
struct scoutfs_block *bl = NULL;
@@ -1091,6 +1269,20 @@ bool scoutfs_alloc_meta_low(struct super_block *sb,
return lo;
}
bool scoutfs_alloc_test_flag(struct super_block *sb,
struct scoutfs_alloc *alloc, u32 flag)
{
unsigned int seq;
bool set;
do {
seq = read_seqbegin(&alloc->seqlock);
set = !!(le32_to_cpu(alloc->avail.flags) & flag);
} while (read_seqretry(&alloc->seqlock, seq));
return set;
}
/*
* Call the callers callback for every persistent allocator structure
* we can find.
@@ -1102,9 +1294,15 @@ int scoutfs_alloc_foreach(struct super_block *sb,
struct scoutfs_block_ref refs[2] = {{0,}};
struct scoutfs_super_block *super = NULL;
struct scoutfs_srch_compact *sc;
struct scoutfs_log_merge_request *lmreq;
struct scoutfs_log_merge_complete *lmcomp;
struct scoutfs_log_trees lt;
SCOUTFS_BTREE_ITEM_REF(iref);
struct scoutfs_key key;
int expected;
u64 avail_tot;
u64 freed_tot;
u64 id;
int ret;
super = kmalloc(sizeof(struct scoutfs_super_block), GFP_NOFS);
@@ -1211,6 +1409,57 @@ retry:
scoutfs_key_inc(&key);
}
/* log merge allocators */
memset(&key, 0, sizeof(key));
key.sk_zone = SCOUTFS_LOG_MERGE_REQUEST_ZONE;
expected = sizeof(*lmreq);
id = 0;
avail_tot = 0;
freed_tot = 0;
for (;;) {
ret = scoutfs_btree_next(sb, &super->log_merge, &key, &iref);
if (ret == 0) {
if (iref.key->sk_zone != key.sk_zone) {
ret = -ENOENT;
} else if (iref.val_len == expected) {
key = *iref.key;
if (key.sk_zone == SCOUTFS_LOG_MERGE_REQUEST_ZONE) {
lmreq = iref.val;
id = le64_to_cpu(lmreq->rid);
avail_tot = le64_to_cpu(lmreq->meta_avail.total_nr);
freed_tot = le64_to_cpu(lmreq->meta_freed.total_nr);
} else {
lmcomp = iref.val;
id = le64_to_cpu(lmcomp->rid);
avail_tot = le64_to_cpu(lmcomp->meta_avail.total_nr);
freed_tot = le64_to_cpu(lmcomp->meta_freed.total_nr);
}
} else {
ret = -EIO;
}
scoutfs_btree_put_iref(&iref);
}
if (ret == -ENOENT) {
if (key.sk_zone == SCOUTFS_LOG_MERGE_REQUEST_ZONE) {
memset(&key, 0, sizeof(key));
key.sk_zone = SCOUTFS_LOG_MERGE_COMPLETE_ZONE;
expected = sizeof(*lmcomp);
continue;
}
break;
}
if (ret < 0)
goto out;
ret = cb(sb, arg, SCOUTFS_ALLOC_OWNER_LOG_MERGE, id, true, true, avail_tot) ?:
cb(sb, arg, SCOUTFS_ALLOC_OWNER_LOG_MERGE, id, true, false, freed_tot);
if (ret < 0)
goto out;
scoutfs_key_inc(&key);
}
ret = 0;
out:
if (ret == -ESTALE) {
@@ -1227,3 +1476,63 @@ out:
kfree(sc);
return ret;
}
struct foreach_cb_args {
scoutfs_alloc_extent_cb_t cb;
void *cb_arg;
};
static int alloc_btree_extent_item_cb(struct super_block *sb, struct scoutfs_key *key,
void *val, int val_len, void *arg)
{
struct foreach_cb_args *cba = arg;
struct scoutfs_extent ext;
if (key->sk_zone != SCOUTFS_FREE_EXTENT_BLKNO_ZONE)
return -ENOENT;
ext_from_key(&ext, key);
cba->cb(sb, cba->cb_arg, &ext);
return 0;
}
/*
* Call the caller's callback on each extent stored in the allocator's
* btree. The callback sees extents called in order by starting blkno.
*/
int scoutfs_alloc_extents_cb(struct super_block *sb, struct scoutfs_alloc_root *root,
scoutfs_alloc_extent_cb_t cb, void *cb_arg)
{
struct foreach_cb_args cba = {
.cb = cb,
.cb_arg = cb_arg,
};
struct scoutfs_key start;
struct scoutfs_key end;
struct scoutfs_key key;
int ret;
init_ext_key(&key, SCOUTFS_FREE_EXTENT_BLKNO_ZONE, 0, 1);
for (;;) {
/* will stop at order items before getting stuck in final block */
BUILD_BUG_ON(SCOUTFS_FREE_EXTENT_BLKNO_ZONE > SCOUTFS_FREE_EXTENT_ORDER_ZONE);
init_ext_key(&start, SCOUTFS_FREE_EXTENT_BLKNO_ZONE, 0, 1);
init_ext_key(&end, SCOUTFS_FREE_EXTENT_ORDER_ZONE, 0, 1);
ret = scoutfs_btree_read_items(sb, &root->root, &key, &start, &end,
alloc_btree_extent_item_cb, &cba);
if (ret < 0 || end.sk_zone != SCOUTFS_FREE_EXTENT_BLKNO_ZONE) {
if (ret == -ENOENT)
ret = 0;
break;
}
key = end;
scoutfs_key_inc(&key);
}
return ret;
}

View File

@@ -38,6 +38,10 @@
#define SCOUTFS_ALLOC_DATA_LG_THRESH \
(8ULL * 1024 * 1024 >> SCOUTFS_BLOCK_SM_SHIFT)
/* the client will force commits if data allocators get too low */
#define SCOUTFS_ALLOC_DATA_REFILL_THRESH \
((256ULL * 1024 * 1024) >> SCOUTFS_BLOCK_SM_SHIFT)
/*
* Fill client alloc roots to the target when they fall below the lo
* threshold.
@@ -55,15 +59,16 @@
#define SCOUTFS_SERVER_DATA_FILL_LO \
(1ULL * 1024 * 1024 * 1024 >> SCOUTFS_BLOCK_SM_SHIFT)
/*
* Each of the server meta_alloc roots will try to keep a minimum amount
* of free blocks. The server will swap roots when its current avail
* falls below the threshold while the freed root is still above it. It
* must have room for all the largest allocation attempted in a
* transaction on the server.
* Log merge meta allocations are only used for one request and will
* never use more than the dirty limit.
*/
#define SCOUTFS_SERVER_META_ALLOC_MIN \
(SCOUTFS_SERVER_META_FILL_TARGET * 2)
#define SCOUTFS_LOG_MERGE_DIRTY_BYTE_LIMIT (64ULL * 1024 * 1024)
/* a few extra blocks for alloc blocks */
#define SCOUTFS_SERVER_MERGE_FILL_TARGET \
((SCOUTFS_LOG_MERGE_DIRTY_BYTE_LIMIT >> SCOUTFS_BLOCK_LG_SHIFT) + 4)
#define SCOUTFS_SERVER_MERGE_FILL_LO SCOUTFS_SERVER_MERGE_FILL_TARGET
/*
* A run-time use of a pair of persistent avail/freed roots as a
@@ -125,7 +130,8 @@ int scoutfs_free_data(struct super_block *sb, struct scoutfs_alloc *alloc,
int scoutfs_alloc_move(struct super_block *sb, struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
struct scoutfs_alloc_root *dst,
struct scoutfs_alloc_root *src, u64 total);
struct scoutfs_alloc_root *src, u64 total,
__le64 *exclusive, __le64 *vacant, u64 zone_blocks);
int scoutfs_alloc_fill_list(struct super_block *sb,
struct scoutfs_alloc *alloc,
@@ -146,6 +152,8 @@ int scoutfs_alloc_splice_list(struct super_block *sb,
bool scoutfs_alloc_meta_low(struct super_block *sb,
struct scoutfs_alloc *alloc, u32 nr);
bool scoutfs_alloc_test_flag(struct super_block *sb,
struct scoutfs_alloc *alloc, u32 flag);
typedef int (*scoutfs_alloc_foreach_cb_t)(struct super_block *sb, void *arg,
int owner, u64 id,
@@ -153,4 +161,9 @@ typedef int (*scoutfs_alloc_foreach_cb_t)(struct super_block *sb, void *arg,
int scoutfs_alloc_foreach(struct super_block *sb,
scoutfs_alloc_foreach_cb_t cb, void *arg);
typedef void (*scoutfs_alloc_extent_cb_t)(struct super_block *sb, void *cb_arg,
struct scoutfs_extent *ext);
int scoutfs_alloc_extents_cb(struct super_block *sb, struct scoutfs_alloc_root *root,
scoutfs_alloc_extent_cb_t cb, void *cb_arg);
#endif

View File

@@ -128,6 +128,7 @@ static __le32 block_calc_crc(struct scoutfs_block_header *hdr, u32 size)
static struct block_private *block_alloc(struct super_block *sb, u64 blkno)
{
struct block_private *bp;
unsigned int noio_flags;
/*
* If we had multiple blocks per page we'd need to be a little
@@ -147,8 +148,19 @@ static struct block_private *block_alloc(struct super_block *sb, u64 blkno)
set_bit(BLOCK_BIT_PAGE_ALLOC, &bp->bits);
bp->bl.data = page_address(bp->page);
} else {
bp->virt = __vmalloc(SCOUTFS_BLOCK_LG_SIZE,
GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);
/*
* __vmalloc doesn't pass the gfp flags down to pte
* allocs, they're done with user alloc flags.
* Unfortunately, some lockdep doesn't know that
* PF_NOMEMALLOC prevents __GFP_FS reclaim and generates
* spurious reclaim-on dependencies and warnings.
*/
lockdep_off();
noio_flags = memalloc_noio_save();
bp->virt = __vmalloc(SCOUTFS_BLOCK_LG_SIZE, GFP_NOFS | __GFP_HIGHMEM, PAGE_KERNEL);
memalloc_noio_restore(noio_flags);
lockdep_on();
if (!bp->virt) {
kfree(bp);
bp = NULL;
@@ -188,7 +200,9 @@ static void block_free(struct super_block *sb, struct block_private *bp)
else
BUG();
WARN_ON_ONCE(!list_empty(&bp->dirty_entry));
/* ok to tear down dirty blocks when forcing unmount */
WARN_ON_ONCE(!scoutfs_forcing_unmount(sb) && !list_empty(&bp->dirty_entry));
WARN_ON_ONCE(atomic_read(&bp->refcount));
WARN_ON_ONCE(atomic_read(&bp->io_count));
kfree(bp);
@@ -286,10 +300,16 @@ static int block_insert(struct super_block *sb, struct block_private *bp)
WARN_ON_ONCE(atomic_read(&bp->refcount) & BLOCK_REF_INSERTED);
retry:
atomic_add(BLOCK_REF_INSERTED, &bp->refcount);
ret = rhashtable_insert_fast(&binf->ht, &bp->ht_head, block_ht_params);
ret = rhashtable_lookup_insert_fast(&binf->ht, &bp->ht_head, block_ht_params);
if (ret < 0) {
atomic_sub(BLOCK_REF_INSERTED, &bp->refcount);
if (ret == -EBUSY) {
/* wait for pending rebalance to finish */
synchronize_rcu();
goto retry;
}
} else {
atomic_inc(&binf->total_inserted);
TRACE_BLOCK(insert, bp);
@@ -467,6 +487,9 @@ static int block_submit_bio(struct super_block *sb, struct block_private *bp,
sector_t sector;
int ret = 0;
if (scoutfs_forcing_unmount(sb))
return -EIO;
sector = bp->bl.blkno << (SCOUTFS_BLOCK_LG_SHIFT - 9);
WARN_ON_ONCE(bp->bl.blkno == U64_MAX);
@@ -1074,10 +1097,11 @@ restart:
if (bp == NULL)
break;
if (bp == ERR_PTR(-EAGAIN)) {
/* hard reset to not hold rcu grace period across retries */
/* hard exit to wait for rcu rebalance to finish */
rhashtable_walk_stop(&iter);
rhashtable_walk_exit(&iter);
scoutfs_inc_counter(sb, block_cache_shrink_restart);
synchronize_rcu();
goto restart;
}
@@ -1129,7 +1153,7 @@ static void sm_block_bio_end_io(struct bio *bio, int err)
* only layer that sees the full block buffer so we pass the calculated
* crc to the caller for them to check in their context.
*/
static int sm_block_io(struct block_device *bdev, int rw, u64 blkno,
static int sm_block_io(struct super_block *sb, struct block_device *bdev, int rw, u64 blkno,
struct scoutfs_block_header *hdr, size_t len,
__le32 *blk_crc)
{
@@ -1141,6 +1165,9 @@ static int sm_block_io(struct block_device *bdev, int rw, u64 blkno,
BUILD_BUG_ON(PAGE_SIZE < SCOUTFS_BLOCK_SM_SIZE);
if (scoutfs_forcing_unmount(sb))
return -EIO;
if (WARN_ON_ONCE(len > SCOUTFS_BLOCK_SM_SIZE) ||
WARN_ON_ONCE(!(rw & WRITE) && !blk_crc))
return -EINVAL;
@@ -1193,14 +1220,14 @@ int scoutfs_block_read_sm(struct super_block *sb,
struct scoutfs_block_header *hdr, size_t len,
__le32 *blk_crc)
{
return sm_block_io(bdev, READ, blkno, hdr, len, blk_crc);
return sm_block_io(sb, bdev, READ, blkno, hdr, len, blk_crc);
}
int scoutfs_block_write_sm(struct super_block *sb,
struct block_device *bdev, u64 blkno,
struct scoutfs_block_header *hdr, size_t len)
{
return sm_block_io(bdev, WRITE, blkno, hdr, len, NULL);
return sm_block_io(sb, bdev, WRITE, blkno, hdr, len, NULL);
}
int scoutfs_block_setup(struct super_block *sb)
@@ -1238,7 +1265,7 @@ out:
if (ret)
scoutfs_block_destroy(sb);
return 0;
return ret;
}
void scoutfs_block_destroy(struct super_block *sb)

View File

@@ -83,6 +83,10 @@ enum btree_walk_flags {
BTW_ALLOC = (1 << 3), /* allocate a new block for 0 ref, requires dirty */
BTW_INSERT = (1 << 4), /* walking to insert, try splitting */
BTW_DELETE = (1 << 5), /* walking to delete, try joining */
BTW_PAR_RNG = (1 << 6), /* return range through final parent */
BTW_GET_PAR = (1 << 7), /* get reference to final parent */
BTW_SET_PAR = (1 << 8), /* override reference to final parent */
BTW_SUBTREE = (1 << 9), /* root is parent subtree, return -ERANGE if split/join */
};
/* total length of the value payload */
@@ -104,16 +108,22 @@ static inline unsigned int item_bytes(struct scoutfs_btree_item *item)
}
/*
* Join blocks when they both are 1/4 full. This puts some distance
* between the join threshold and the full threshold for splitting.
* Blocks that just split or joined need to undergo a reasonable amount
* of item modification before they'll split or join again.
* Refill blocks from their siblings when they're under 1/4 full. This
* puts some distance between the join threshold and the full threshold
* for splitting. Blocks that just split or joined need to undergo a
* reasonable amount of item modification before they'll split or join
* again.
*/
static unsigned int join_low_watermark(void)
{
return (SCOUTFS_BLOCK_LG_SIZE - sizeof(struct scoutfs_btree_block)) / 4;
}
static bool total_above_join_low_water(struct scoutfs_btree_block *bt)
{
return le16_to_cpu(bt->total_item_bytes) >= join_low_watermark();
}
/*
* return the integer percentages of total space the block could have
* consumed by items that is currently consumed.
@@ -512,6 +522,7 @@ static void create_item(struct scoutfs_btree_block *bt,
item->val_off = insert_value(bt, ptr_off(bt, item), val, val_len);
item->val_len = cpu_to_le16(val_len);
memset(item->__pad, 0, sizeof(item->__pad));
le16_add_cpu(&bt->total_item_bytes, item_bytes(item));
}
@@ -805,12 +816,13 @@ static int try_join(struct super_block *sb,
struct scoutfs_btree_block *sib;
struct scoutfs_block *sib_bl;
struct scoutfs_block_ref *ref;
const unsigned int lwm = join_low_watermark();
unsigned int sib_tot;
bool move_right;
int to_move;
int ret;
if (le16_to_cpu(bt->total_item_bytes) >= join_low_watermark())
if (total_above_join_low_water(bt))
return 0;
scoutfs_inc_counter(sb, btree_join);
@@ -830,18 +842,23 @@ static int try_join(struct super_block *sb,
return ret;
sib = sib_bl->data;
sib_tot = le16_to_cpu(bt->total_item_bytes);
if (sib_tot < join_low_watermark())
/* combine if resulting block would be up to 75% full, move big chunk otherwise */
sib_tot = le16_to_cpu(sib->total_item_bytes);
if (sib_tot <= lwm * 2)
to_move = sib_tot;
else
to_move = sib_tot - join_low_watermark();
to_move = lwm;
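/*
 * Editor's note (not part of this change): with the low watermark L,
 * this block holds fewer than L bytes of items, so merging an entire
 * sibling of at most 2L bytes leaves the result under 3L, i.e. under
 * ~75% of the 4L capacity; a fuller sibling only donates L bytes so
 * both blocks end up at or above the watermark.
 */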
if (le16_to_cpu(bt->mid_free_len) < to_move) {
/* compact to make room for over-estimate of worst case move overrun */
if (le16_to_cpu(bt->mid_free_len) <
(to_move + item_len_bytes(SCOUTFS_BTREE_MAX_VAL_LEN))) {
ret = compact_values(sb, bt);
if (ret < 0)
if (ret < 0) {
scoutfs_block_put(sb, sib_bl);
return ret;
return ret;
}
}
move_items(bt, sib, move_right, to_move);
/* update our parent's item */
@@ -904,20 +921,21 @@ static bool bad_avl_node_off(__le16 node_off, int nr)
* - call after leaf modification
* - padding is zero
*/
static void verify_btree_block(struct super_block *sb,
__attribute__((unused))
static void verify_btree_block(struct super_block *sb, char *str,
struct scoutfs_btree_block *bt, int level,
struct scoutfs_key *start,
bool last_ref, struct scoutfs_key *start,
struct scoutfs_key *end)
{
__le16 *buckets = leaf_item_hash_buckets(bt);
struct scoutfs_btree_item *item;
struct scoutfs_avl_node *node;
char *reason = NULL;
int first_val = 0;
int hashed = 0;
int end_off;
int tot = 0;
int i = 0;
int j = 0;
int nr;
if (bt->level != level) {
@@ -956,8 +974,9 @@ static void verify_btree_block(struct super_block *sb,
goto out;
}
for (j = 0; j < sizeof(item->__pad); j++) {
WARN_ON_ONCE(item->__pad[j] != 0);
if (memchr_inv(item->__pad, '\0', sizeof(item->__pad))) {
reason = "item struct __pad isn't zero";
goto out;
}
if (scoutfs_key_compare(&item->key, start) < 0 ||
@@ -972,19 +991,29 @@ static void verify_btree_block(struct super_block *sb,
goto out;
}
if (level > 0 && le16_to_cpu(item->val_len) !=
sizeof(struct scoutfs_block_ref)) {
reason = "parent item val not sizeof ref";
goto out;
}
if (le16_to_cpu(item->val_len) > SCOUTFS_BTREE_MAX_VAL_LEN) {
reason = "bad item val len";
goto out;
}
if (le16_to_cpu(item->val_off) % SCOUTFS_BTREE_VALUE_ALIGN) {
reason = "item value not aligned";
goto out;
}
if (((int)le16_to_cpu(item->val_off) +
le16_to_cpu(item->val_len)) > end_off) {
reason = "item value outside valid";
goto out;
}
tot += sizeof(struct scoutfs_btree_item) +
le16_to_cpu(item->val_len);
tot += item_len_bytes(le16_to_cpu(item->val_len));
if (item->val_len != 0) {
first_val = min_t(int, first_val,
@@ -992,6 +1021,15 @@ static void verify_btree_block(struct super_block *sb,
}
}
if (last_ref && level > 0 &&
(node = scoutfs_avl_last(&bt->item_root)) != NULL) {
item = node_item(node);
if (scoutfs_key_compare(&item->key, end) != 0) {
reason = "final ref item key not range end";
goto out;
}
}
for (i = 0; level == 0 && i < SCOUTFS_BTREE_LEAF_ITEM_HASH_NR; i++) {
if (buckets[i] == 0)
continue;
@@ -1024,17 +1062,18 @@ out:
if (!reason)
return;
printk("found btree block inconsistency: %s\n", reason);
printk("start "SK_FMT" end "SK_FMT"\n", SK_ARG(start), SK_ARG(end));
printk("verifying btree %s: %s\n", str, reason);
printk("args: level %u last_ref %u start "SK_FMT" end "SK_FMT"\n",
level, last_ref, SK_ARG(start), SK_ARG(end));
printk("calced: i %u tot %u hashed %u fv %u\n",
i, tot, hashed, first_val);
printk("hdr: crc %x magic %x fsid %llx seq %llx blkno %llu\n",
printk("bt hdr: crc %x magic %x fsid %llx seq %llx blkno %llu\n",
le32_to_cpu(bt->hdr.crc), le32_to_cpu(bt->hdr.magic),
le64_to_cpu(bt->hdr.fsid), le64_to_cpu(bt->hdr.seq),
le64_to_cpu(bt->hdr.blkno));
printk("item_root: node %u\n", le16_to_cpu(bt->item_root.node));
printk("nr %u tib %u mfl %u lvl %u\n",
printk("bt: nr %u tib %u mfl %u lvl %u\n",
le16_to_cpu(bt->nr_items), le16_to_cpu(bt->total_item_bytes),
le16_to_cpu(bt->mid_free_len), bt->level);
@@ -1051,6 +1090,92 @@ out:
BUG();
}
/*
* Walk from the root to the leaf, verifying the blocks traversed.
*/
__attribute__((unused))
static void verify_btree_walk(struct super_block *sb, char *str,
struct scoutfs_btree_root *root,
struct scoutfs_key *key)
{
struct scoutfs_avl_node *next_node;
struct scoutfs_avl_node *node;
struct scoutfs_btree_item *item;
struct scoutfs_btree_item *prev;
struct scoutfs_block *bl = NULL;
struct scoutfs_btree_block *bt;
struct scoutfs_block_ref ref;
struct scoutfs_key start;
struct scoutfs_key end;
bool last_ref;
int level;
int ret;
if (root->height == 0 && root->ref.blkno != 0) {
WARN_ONCE(1, "invalid btree root height %u blkno %llu seq %016llx\n",
root->height, le64_to_cpu(root->ref.blkno),
le64_to_cpu(root->ref.seq));
return;
}
if (root->height == 0)
return;
scoutfs_key_set_zeros(&start);
scoutfs_key_set_ones(&end);
level = root->height;
ref = root->ref;
/* first parent last ref isn't all ones in subtrees */
last_ref = false;
while(level-- > 0) {
scoutfs_block_put(sb, bl);
bl = NULL;
ret = get_ref_block(sb, NULL, NULL, 0, &ref, &bl);
if (ret) {
printk("verifying btree %s: read error %d\n",
str, ret);
break;
}
bt = bl->data;
verify_btree_block(sb, str, bt, level, last_ref, &start, &end);
if (level == 0)
break;
node = scoutfs_avl_search(&bt->item_root, cmp_key_item, key,
NULL, NULL, &next_node, NULL);
item = node_item(node ?: next_node);
if (item == NULL) {
printk("verifying btree %s: no ref item\n", str);
printk("root: height %u blkno %llu seq %016llx\n",
root->height, le64_to_cpu(root->ref.blkno),
le64_to_cpu(root->ref.seq));
printk("walk level %u start "SK_FMT" end "SK_FMT"\n",
level, SK_ARG(&start), SK_ARG(&end));
printk("block: level %u blkno %llu seq %016llx\n",
bt->level, le64_to_cpu(bt->hdr.blkno),
le64_to_cpu(bt->hdr.seq));
printk("key: "SK_FMT"\n", SK_ARG(key));
BUG();
}
if ((prev = prev_item(bt, item))) {
start = *item_key(prev);
scoutfs_key_inc(&start);
}
end = *item_key(item);
memcpy(&ref, item_val(bt, item), sizeof(ref));
last_ref = !next_item(bt, item);
}
scoutfs_block_put(sb, bl);
}
struct btree_walk_key_range {
struct scoutfs_key start;
struct scoutfs_key end;
@@ -1082,7 +1207,8 @@ static int btree_walk(struct super_block *sb,
int flags, struct scoutfs_key *key,
unsigned int val_len,
struct scoutfs_block **bl_ret,
struct btree_walk_key_range *kr)
struct btree_walk_key_range *kr,
struct scoutfs_btree_root *par_root)
{
struct scoutfs_block *par_bl = NULL;
struct scoutfs_block *bl = NULL;
@@ -1098,9 +1224,15 @@ static int btree_walk(struct super_block *sb,
unsigned int nr;
int ret;
if (WARN_ON_ONCE((flags & BTW_DIRTY) && (!alloc || !wri)))
if (WARN_ON_ONCE((flags & BTW_DIRTY) && (!alloc || !wri)) ||
WARN_ON_ONCE((flags & BTW_PAR_RNG) && !kr) ||
WARN_ON_ONCE((flags & (BTW_GET_PAR|BTW_SET_PAR)) && !par_root))
return -EINVAL;
/* all ops come through walk and walk calls all reads */
if (scoutfs_forcing_unmount(sb))
return -EIO;
scoutfs_inc_counter(sb, btree_walk);
restart:
@@ -1121,7 +1253,14 @@ restart:
ret = 0;
if (!root->height) {
if (!(flags & BTW_INSERT)) {
if (flags & BTW_GET_PAR) {
memset(par_root, 0, sizeof(*par_root));
*root = *par_root;
ret = 0;
} else if (flags & BTW_SET_PAR) {
*root = *par_root;
ret = 0;
} else if (!(flags & BTW_INSERT)) {
ret = -ENOENT;
} else {
ret = get_ref_block(sb, alloc, wri, BTW_ALLOC | BTW_DIRTY, &root->ref, &bl);
@@ -1140,14 +1279,40 @@ restart:
trace_scoutfs_btree_walk(sb, root, key, flags, level, ref);
/* par range set by ref to last parent block */
if (level < 2 && (flags & BTW_PAR_RNG)) {
ret = 0;
break;
}
if (level < 2 && (flags & BTW_GET_PAR)) {
par_root->ref = *ref;
par_root->height = level + 1;
ret = 0;
break;
}
if (level < 2 && (flags & BTW_SET_PAR)) {
if (ref == &root->ref) {
/* single parent block is replaced, can shrink/grow */
*root = *par_root;
} else {
/* subtree replacing one of parents must match height */
if (par_root->height != level + 1) {
ret = -EINVAL;
break;
}
*ref = par_root->ref;
}
ret = 0;
break;
}
ret = get_ref_block(sb, alloc, wri, flags, ref, &bl);
if (ret)
break;
bt = bl->data;
if (0 && kr)
verify_btree_block(sb, bt, level, &kr->start, &kr->end);
/* XXX more aggressive block verification, before ref updates? */
if (bt->level != level) {
scoutfs_corruption(sb, SC_BTREE_BLOCK_LEVEL,
@@ -1163,6 +1328,17 @@ restart:
break;
}
/*
* join/split won't check subtree parent root, let
* caller know when it needs to be split/join.
*/
if ((flags & BTW_SUBTREE) && level == 1 &&
(!total_above_join_low_water(bt) ||
!mid_free_item_room(bt, sizeof(struct scoutfs_block_ref)))) {
ret = -ERANGE;
break;
}
/*
* Splitting and joining can add or remove parents or
* change the parent item we use to reach the child
@@ -1288,7 +1464,7 @@ int scoutfs_btree_lookup(struct super_block *sb,
if (WARN_ON_ONCE(iref->key))
return -EINVAL;
ret = btree_walk(sb, NULL, NULL, root, 0, key, 0, &bl, NULL);
ret = btree_walk(sb, NULL, NULL, root, 0, key, 0, &bl, NULL, NULL);
if (ret == 0) {
bt = bl->data;
@@ -1340,7 +1516,7 @@ int scoutfs_btree_insert(struct super_block *sb,
return -EINVAL;
ret = btree_walk(sb, alloc, wri, root, BTW_DIRTY | BTW_INSERT, key,
val_len, &bl, NULL);
val_len, &bl, NULL, NULL);
if (ret == 0) {
bt = bl->data;
@@ -1402,7 +1578,7 @@ int scoutfs_btree_update(struct super_block *sb,
return -EINVAL;
ret = btree_walk(sb, alloc, wri, root, BTW_DIRTY | BTW_INSERT, key,
val_len, &bl, NULL);
val_len, &bl, NULL, NULL);
if (ret == 0) {
bt = bl->data;
@@ -1444,7 +1620,7 @@ int scoutfs_btree_force(struct super_block *sb,
return -EINVAL;
ret = btree_walk(sb, alloc, wri, root, BTW_DIRTY | BTW_INSERT, key,
val_len, &bl, NULL);
val_len, &bl, NULL, NULL);
if (ret == 0) {
bt = bl->data;
@@ -1482,7 +1658,7 @@ int scoutfs_btree_delete(struct super_block *sb,
scoutfs_inc_counter(sb, btree_delete);
ret = btree_walk(sb, alloc, wri, root, BTW_DELETE | BTW_DIRTY, key,
0, &bl, NULL);
0, &bl, NULL, NULL);
if (ret == 0) {
bt = bl->data;
@@ -1546,7 +1722,7 @@ static int btree_iter(struct super_block *sb,struct scoutfs_btree_root *root,
for (;;) {
ret = btree_walk(sb, NULL, NULL, root, flags, &walk_key,
0, &bl, &kr);
0, &bl, &kr, NULL);
if (ret < 0)
break;
bt = bl->data;
@@ -1619,7 +1795,8 @@ int scoutfs_btree_dirty(struct super_block *sb,
scoutfs_inc_counter(sb, btree_dirty);
ret = btree_walk(sb, alloc, wri, root, BTW_DIRTY, key, 0, &bl, NULL);
ret = btree_walk(sb, alloc, wri, root, BTW_DIRTY, key, 0, &bl,
NULL, NULL);
if (ret == 0) {
bt = bl->data;
@@ -1655,7 +1832,7 @@ int scoutfs_btree_read_items(struct super_block *sb,
struct scoutfs_block *bl;
int ret;
ret = btree_walk(sb, NULL, NULL, root, 0, key, 0, &bl, &kr);
ret = btree_walk(sb, NULL, NULL, root, 0, key, 0, &bl, &kr, NULL);
if (ret < 0)
goto out;
bt = bl->data;
@@ -1710,7 +1887,7 @@ int scoutfs_btree_insert_list(struct super_block *sb,
while (lst) {
ret = btree_walk(sb, alloc, wri, root, BTW_DIRTY | BTW_INSERT,
&lst->key, lst->val_len, &bl, &kr);
&lst->key, lst->val_len, &bl, &kr, NULL);
if (ret < 0)
goto out;
bt = bl->data;
@@ -1738,3 +1915,542 @@ int scoutfs_btree_insert_list(struct super_block *sb,
out:
return ret;
}
/*
* Descend towards the leaf that would contain the key. As we arrive at
* the last parent block, set start and end to the range of keys that
* could be found through traversal of that last parent.
*
* If the tree is too short for parent blocks then the max key range
* is returned.
*/
int scoutfs_btree_parent_range(struct super_block *sb,
struct scoutfs_btree_root *root,
struct scoutfs_key *key,
struct scoutfs_key *start,
struct scoutfs_key *end)
{
struct btree_walk_key_range kr;
int ret;
ret = btree_walk(sb, NULL, NULL, root, BTW_PAR_RNG, key, 0, NULL,
&kr, NULL);
if (ret == -ENOENT)
ret = 0;
*start = kr.start;
*end = kr.end;
return ret;
}
/*
* Initialize the caller's root as a subtree whose ref points to the
* last parent found as we traverse towards the leaf containing the key.
* If the tree is too small to have multiple blocks at the final parent
* level then the caller's root will be initialized to equal the full input
* root. If the tree is empty then the par root will also be empty.
*/
int scoutfs_btree_get_parent(struct super_block *sb,
struct scoutfs_btree_root *root,
struct scoutfs_key *key,
struct scoutfs_btree_root *par_root)
{
return btree_walk(sb, NULL, NULL, root, BTW_GET_PAR, key, 0, NULL,
NULL, par_root);
}
/*
* Dirty a path towards the leaf block containing the key. As we reach
* the reference to the final parent block, override it with the ref in
* the caller's root. If the tree only has a single block at the final
* parent level, or a single leaf block, then the entire tree is
* replaced with the caller's root.
*
* This manages allocs and frees while dirtying blocks in the path to
* the ref, but it doesn't account for allocating the blocks that are
* referenced by the ref nor freeing blocks referenced by the old ref
* that's overwritten. Keeping allocators in sync with the result of
* the ref override is the responsibility of the caller.
*/
int scoutfs_btree_set_parent(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
struct scoutfs_btree_root *root,
struct scoutfs_key *key,
struct scoutfs_btree_root *par_root)
{
trace_scoutfs_btree_set_parent(sb, root, key, par_root);
return btree_walk(sb, alloc, wri, root, BTW_DIRTY | BTW_SET_PAR,
key, 0, NULL, NULL, par_root);
}
/*
* Descend to the leaf, making sure that all the blocks conform to the
* balance constraints. Blocks below the low threshold will be joined.
* This is called to split blocks that were too large for insertions,
* but those insertions were in a distant context and we don't bother
* communicating the val_len back here. We just try to insert a max
* value.
*
* This always dirties all the way to the leaf. It could be made more
* efficient with more btree walk flags to walk and check for blocks
* that need balancing, and then walks that don't dirty unless they need
* to join/split.
*/
int scoutfs_btree_rebalance(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
struct scoutfs_btree_root *root,
struct scoutfs_key *key)
{
return btree_walk(sb, alloc, wri, root,
BTW_DIRTY | BTW_INSERT | BTW_DELETE,
key, SCOUTFS_BTREE_MAX_VAL_LEN, NULL, NULL, NULL);
}
struct merge_pos {
struct rb_node node;
struct scoutfs_btree_root *root;
struct scoutfs_key key;
unsigned int val_len;
u8 val[SCOUTFS_BTREE_MAX_VAL_LEN];
};
/*
* Find the next item in the mpos's root after its key and make sure
* that it's in its sorted position in the rbtree. We're responsible
* for freeing the mpos if we don't put it back in the pos_root. This
* happens naturally when its item_root has no more items to
* merge.
*/
static int reset_mpos(struct super_block *sb, struct rb_root *pos_root,
struct merge_pos *mpos, struct scoutfs_key *end,
scoutfs_btree_merge_cmp_t merge_cmp)
{
SCOUTFS_BTREE_ITEM_REF(iref);
struct merge_pos *walk;
struct rb_node *parent;
struct rb_node **node;
int key_cmp;
int val_cmp;
int ret;
restart:
if (!RB_EMPTY_NODE(&mpos->node)) {
rb_erase(&mpos->node, pos_root);
RB_CLEAR_NODE(&mpos->node);
}
/* find the next item in the root within end */
ret = scoutfs_btree_next(sb, mpos->root, &mpos->key, &iref);
if (ret == 0) {
if (scoutfs_key_compare(iref.key, end) > 0) {
ret = -ENOENT;
} else {
mpos->key = *iref.key;
mpos->val_len = iref.val_len;
memcpy(mpos->val, iref.val, iref.val_len);
}
scoutfs_btree_put_iref(&iref);
}
if (ret < 0) {
kfree(mpos);
if (ret == -ENOENT)
ret = 0;
goto out;
}
rewalk:
/* sort merge items by key then oldest to newest */
node = &pos_root->rb_node;
parent = NULL;
while (*node) {
parent = *node;
walk = container_of(*node, struct merge_pos, node);
key_cmp = scoutfs_key_compare(&mpos->key, &walk->key);
val_cmp = merge_cmp(mpos->val, mpos->val_len,
walk->val, walk->val_len);
/* drop old versions of logged keys as we discover them */
if (key_cmp == 0) {
scoutfs_inc_counter(sb, btree_merge_drop_old);
if (val_cmp < 0) {
scoutfs_key_inc(&mpos->key);
goto restart;
} else {
BUG_ON(val_cmp == 0);
rb_erase(&walk->node, pos_root);
kfree(walk);
goto rewalk;
}
}
if ((key_cmp ?: val_cmp) < 0)
node = &(*node)->rb_left;
else
node = &(*node)->rb_right;
}
rb_link_node(&mpos->node, parent, node);
rb_insert_color(&mpos->node, pos_root);
ret = 0;
out:
return ret;
}
static struct merge_pos *first_mpos(struct rb_root *root)
{
struct rb_node *node = rb_first(root);
if (node)
return container_of(node, struct merge_pos, node);
return NULL;
}
/*
* Merge items from a number of read-only input roots into a writable
* destination root. The order of the input roots doesn't matter; the
* items are merged in sorted key order.
*
* The merge_cmp callback determines the order that the input items are
* merged in. The is_del callback determines if a merging item should
* be removed from the destination.
*
* subtree indicates that the destination root is in fact one of many
* parent blocks and shouldn't be split or allowed to fall below the
* join low water mark.
*
* drop_val indicates the initial length of the value that should be
* dropped when merging items into destination items.
*
* -ERANGE is returned if the merge doesn't fully exhaust the range, due
* to allocators running low or needing to join/split the parent.
* *next_ret is set to the next key which hasn't been merged so that the
* caller can retry with a new allocator and subtree.
*/
int scoutfs_btree_merge(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
struct scoutfs_key *start,
struct scoutfs_key *end,
struct scoutfs_key *next_ret,
struct scoutfs_btree_root *root,
struct list_head *inputs,
scoutfs_btree_merge_cmp_t merge_cmp,
scoutfs_btree_merge_is_del_t merge_is_del, bool subtree,
int drop_val, int dirty_limit, int alloc_low)
{
struct scoutfs_btree_root_head *rhead;
struct rb_root pos_root = RB_ROOT;
struct scoutfs_btree_item *item;
struct scoutfs_btree_block *bt;
struct scoutfs_block *bl = NULL;
struct btree_walk_key_range kr;
struct scoutfs_avl_node *par;
struct merge_pos *mpos;
struct merge_pos *tmp;
int walk_val_len;
int walk_flags;
bool is_del;
int cmp;
int ret;
trace_scoutfs_btree_merge(sb, root, start, end);
scoutfs_inc_counter(sb, btree_merge);
list_for_each_entry(rhead, inputs, head) {
mpos = kmalloc(sizeof(*mpos), GFP_NOFS);
if (!mpos) {
ret = -ENOMEM;
goto out;
}
RB_CLEAR_NODE(&mpos->node);
mpos->key = *start;
mpos->root = &rhead->root;
ret = reset_mpos(sb, &pos_root, mpos, end, merge_cmp);
if (ret < 0)
goto out;
}
walk_flags = BTW_DIRTY;
if (subtree)
walk_flags |= BTW_SUBTREE;
walk_val_len = 0;
while ((mpos = first_mpos(&pos_root))) {
if (scoutfs_block_writer_dirty_bytes(sb, wri) >= dirty_limit) {
scoutfs_inc_counter(sb, btree_merge_dirty_limit);
ret = -ERANGE;
*next_ret = mpos->key;
goto out;
}
if (scoutfs_alloc_meta_low(sb, alloc, alloc_low)) {
scoutfs_inc_counter(sb, btree_merge_alloc_low);
ret = -ERANGE;
*next_ret = mpos->key;
goto out;
}
scoutfs_block_put(sb, bl);
bl = NULL;
ret = btree_walk(sb, alloc, wri, root, walk_flags,
&mpos->key, walk_val_len, &bl, &kr, NULL);
if (ret < 0) {
if (ret == -ERANGE)
*next_ret = mpos->key;
goto out;
}
bt = bl->data;
scoutfs_inc_counter(sb, btree_merge_walk);
for (; mpos; mpos = first_mpos(&pos_root)) {
/* val must have at least what we need to drop */
if (mpos->val_len < drop_val) {
ret = -EIO;
goto out;
}
/* walk to new leaf if we exceed parent ref key */
if (scoutfs_key_compare(&mpos->key, &kr.end) > 0)
break;
/* see if there's an existing item */
item = leaf_item_hash_search(sb, bt, &mpos->key);
is_del = merge_is_del(mpos->val, mpos->val_len);
trace_scoutfs_btree_merge_items(sb, mpos->root,
&mpos->key, mpos->val_len,
item ? root : NULL,
item ? item_key(item) : NULL,
item ? item_val_len(item) : 0, is_del);
/* rewalk and split if ins/update needs room */
if (!is_del && !mid_free_item_room(bt, mpos->val_len)) {
walk_flags |= BTW_INSERT;
walk_val_len = mpos->val_len;
break;
}
/* insert missing non-deletion merge items */
if (!item && !is_del) {
scoutfs_avl_search(&bt->item_root,
cmp_key_item, &mpos->key,
&cmp, &par, NULL, NULL);
create_item(bt, &mpos->key,
mpos->val + drop_val,
mpos->val_len - drop_val, par, cmp);
scoutfs_inc_counter(sb, btree_merge_insert);
}
/* update existing items */
if (item && !is_del) {
update_item_value(bt, item,
mpos->val + drop_val,
mpos->val_len - drop_val);
scoutfs_inc_counter(sb, btree_merge_update);
}
/* delete if merge item was deletion */
if (item && is_del) {
/* rewalk and join if non-root falls under low water mark */
if (root->ref.blkno != bt->hdr.blkno &&
!total_above_join_low_water(bt)) {
walk_flags |= BTW_DELETE;
break;
}
delete_item(bt, item, NULL);
scoutfs_inc_counter(sb, btree_merge_delete);
}
/* reset walk args now that we're not split/join */
walk_flags &= ~(BTW_INSERT | BTW_DELETE);
walk_val_len = 0;
/* finished with this merge item */
scoutfs_key_inc(&mpos->key);
ret = reset_mpos(sb, &pos_root, mpos, end, merge_cmp);
if (ret < 0)
goto out;
mpos = NULL;
}
}
ret = 0;
out:
scoutfs_block_put(sb, bl);
rbtree_postorder_for_each_entry_safe(mpos, tmp, &pos_root, node) {
kfree(mpos);
}
return ret;
}
/*
* Free all the blocks referenced by a btree. The btree is only read;
* this does not update the blocks as it frees them. The caller ensures
* that these btrees aren't being modified.
*
* The caller's key tracks which blocks have been freed. It must be
* initialized to zeros before the first call to start freeing blocks.
* Once a block is freed the key is updated such that the freed block
* will not be read again.
*
* Returns 0 when progress has been made successfully, which includes
* partial progress. The key is set to all ones once we've freed all
* the blocks.
*
* This works by descending to the last parent block and freeing all its
* leaf blocks without reading them. As it descends it remembers the
* number of parent blocks which were traversed through their final
* child ref. If we free all the leaf blocks then all these parent
* blocks are no longer needed and can be freed. The caller's key is
* updated to past the subtree that we just freed and we retry the
* descent from the root through the next set of parents to the next set
* of leaf blocks to free.
*/
int scoutfs_btree_free_blocks(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
struct scoutfs_key *key,
struct scoutfs_btree_root *root, int alloc_low)
{
u64 blknos[SCOUTFS_BTREE_MAX_HEIGHT];
struct scoutfs_block *bl = NULL;
struct scoutfs_btree_item *item;
struct scoutfs_btree_block *bt;
struct scoutfs_block_ref ref;
struct scoutfs_avl_node *node;
struct scoutfs_avl_node *next;
struct scoutfs_key par_next;
int nr_par;
int level;
int ret;
int i;
if (WARN_ON_ONCE(root->height > ARRAY_SIZE(blknos)))
return -EIO; /* XXX corruption */
if (root->height == 0) {
scoutfs_key_set_ones(key);
return 0;
}
if (scoutfs_key_is_ones(key))
return 0;
/* just free a single leaf block */
if (root->height == 1) {
ret = scoutfs_free_meta(sb, alloc, wri,
le64_to_cpu(root->ref.blkno));
if (ret == 0) {
trace_scoutfs_btree_free_blocks_single(sb, root,
le64_to_cpu(root->ref.blkno));
scoutfs_key_set_ones(key);
}
goto out;
}
for (;;) {
/* start the walk at the root block */
level = root->height - 1;
ref = root->ref;
scoutfs_key_set_ones(&par_next);
nr_par = 0;
/* read blocks until we read the last parent */
for (;;) {
scoutfs_block_put(sb, bl);
bl = NULL;
ret = get_ref_block(sb, alloc, wri, 0, &ref, &bl);
if (ret < 0)
goto out;
bt = bl->data;
node = scoutfs_avl_search(&bt->item_root, cmp_key_item,
key, NULL, NULL, &next, NULL);
if (node == NULL)
node = next;
/* should never descend into parent with no more refs */
if (WARN_ON_ONCE(node == NULL)) {
ret = -EIO;
goto out;
}
/* we'll free refs in the last parent */
if (level == 1)
break;
item = node_item(node);
next = scoutfs_avl_next(&bt->item_root, node);
if (next) {
/* didn't take last ref, still need parents */
nr_par = 0;
par_next = *item_key(item);
scoutfs_key_inc(&par_next);
} else {
/* final ref, could free after all leaves */
blknos[nr_par++] = le64_to_cpu(bt->hdr.blkno);
}
memcpy(&ref, item_val(bt, item), sizeof(ref));
level--;
}
/* free all leaf block refs in last parent */
while (node) {
/* make sure we can always free parents after leaves */
if (scoutfs_alloc_meta_low(sb, alloc,
alloc_low + nr_par + 1)) {
ret = 0;
goto out;
}
item = node_item(node);
memcpy(&ref, item_val(bt, item), sizeof(ref));
trace_scoutfs_btree_free_blocks_leaf(sb, root,
le64_to_cpu(ref.blkno));
ret = scoutfs_free_meta(sb, alloc, wri,
le64_to_cpu(ref.blkno));
if (ret < 0)
goto out;
node = scoutfs_avl_next(&bt->item_root, node);
if (node) {
/* done with keys in child we just freed */
*key = *item_key(item);
scoutfs_key_inc(key);
}
}
/* now that leaves are freed, free any empty parents */
for (i = 0; i < nr_par; i++) {
trace_scoutfs_btree_free_blocks_parent(sb, root,
blknos[i]);
ret = scoutfs_free_meta(sb, alloc, wri, blknos[i]);
BUG_ON(ret); /* checked meta low, freed should fit */
}
/* restart walk past the subtree we just freed */
*key = par_next;
/* but done if we just freed all parents down right spine */
if (scoutfs_key_is_ones(&par_next)) {
ret = 0;
goto out;
}
}
out:
scoutfs_block_put(sb, bl);
return ret;
}

View File

@@ -82,6 +82,58 @@ int scoutfs_btree_insert_list(struct super_block *sb,
struct scoutfs_btree_root *root,
struct scoutfs_btree_item_list *lst);
int scoutfs_btree_parent_range(struct super_block *sb,
struct scoutfs_btree_root *root,
struct scoutfs_key *key,
struct scoutfs_key *start,
struct scoutfs_key *end);
int scoutfs_btree_get_parent(struct super_block *sb,
struct scoutfs_btree_root *root,
struct scoutfs_key *key,
struct scoutfs_btree_root *par_root);
int scoutfs_btree_set_parent(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
struct scoutfs_btree_root *root,
struct scoutfs_key *key,
struct scoutfs_btree_root *par_root);
int scoutfs_btree_rebalance(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
struct scoutfs_btree_root *root,
struct scoutfs_key *key);
/* merge input is a list of roots */
struct scoutfs_btree_root_head {
struct list_head head;
struct scoutfs_btree_root root;
};
/*
* Compare the values of merge input items whose keys are equal to
* determine their merge order.
*/
typedef int (*scoutfs_btree_merge_cmp_t)(void *a_val, int a_val_len,
void *b_val, int b_val_len);
/* whether merging item should be removed from destination */
typedef bool (*scoutfs_btree_merge_is_del_t)(void *val, int val_len);
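/*
 * Editor's hypothetical sketch of callbacks a merge caller might provide;
 * the value header layout here is invented for illustration and is not
 * the scoutfs item format:
 *
 *	struct example_val_hdr {
 *		__le64 vers;
 *		__u8 deletion;
 *	};
 *
 *	static int example_merge_cmp(void *a_val, int a_len, void *b_val, int b_len)
 *	{
 *		u64 a = le64_to_cpu(((struct example_val_hdr *)a_val)->vers);
 *		u64 b = le64_to_cpu(((struct example_val_hdr *)b_val)->vers);
 *
 *		return a < b ? -1 : a > b ? 1 : 0;
 *	}
 *
 *	static bool example_merge_is_del(void *val, int val_len)
 *	{
 *		return ((struct example_val_hdr *)val)->deletion != 0;
 *	}
 */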
int scoutfs_btree_merge(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
struct scoutfs_key *start,
struct scoutfs_key *end,
struct scoutfs_key *next_ret,
struct scoutfs_btree_root *root,
struct list_head *input_list,
scoutfs_btree_merge_cmp_t merge_cmp,
scoutfs_btree_merge_is_del_t merge_is_del, bool subtree,
int drop_val, int dirty_limit, int alloc_low);
int scoutfs_btree_free_blocks(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
struct scoutfs_key *key,
struct scoutfs_btree_root *root, int alloc_low);
void scoutfs_btree_put_iref(struct scoutfs_btree_item_ref *iref);
#endif

View File

@@ -31,6 +31,7 @@
#include "net.h"
#include "endian_swap.h"
#include "quorum.h"
#include "omap.h"
/*
* The client is responsible for maintaining a connection to the server.
@@ -47,6 +48,7 @@ struct client_info {
struct workqueue_struct *workq;
struct delayed_work connect_dwork;
unsigned long connect_delay_jiffies;
u64 server_term;
@@ -150,7 +152,7 @@ static int client_lock_response(struct super_block *sb,
void *resp, unsigned int resp_len,
int error, void *data)
{
if (resp_len != sizeof(struct scoutfs_net_lock_grant_response))
if (resp_len != sizeof(struct scoutfs_net_lock))
return -EINVAL;
/* XXX error? */
@@ -215,6 +217,86 @@ int scoutfs_client_srch_commit_compact(struct super_block *sb,
res, sizeof(*res), NULL, 0);
}
int scoutfs_client_get_log_merge(struct super_block *sb,
struct scoutfs_log_merge_request *req)
{
struct client_info *client = SCOUTFS_SB(sb)->client_info;
return scoutfs_net_sync_request(sb, client->conn,
SCOUTFS_NET_CMD_GET_LOG_MERGE,
NULL, 0, req, sizeof(*req));
}
int scoutfs_client_commit_log_merge(struct super_block *sb,
struct scoutfs_log_merge_complete *comp)
{
struct client_info *client = SCOUTFS_SB(sb)->client_info;
return scoutfs_net_sync_request(sb, client->conn,
SCOUTFS_NET_CMD_COMMIT_LOG_MERGE,
comp, sizeof(*comp), NULL, 0);
}
int scoutfs_client_send_omap_response(struct super_block *sb, u64 id,
struct scoutfs_open_ino_map *map)
{
struct client_info *client = SCOUTFS_SB(sb)->client_info;
return scoutfs_net_response(sb, client->conn, SCOUTFS_NET_CMD_OPEN_INO_MAP,
id, 0, map, sizeof(*map));
}
/* The client is receiving an omap request from the server */
static int client_open_ino_map(struct super_block *sb, struct scoutfs_net_connection *conn,
u8 cmd, u64 id, void *arg, u16 arg_len)
{
if (arg_len != sizeof(struct scoutfs_open_ino_map_args))
return -EINVAL;
return scoutfs_omap_client_handle_request(sb, id, arg);
}
/* The client is sending an omap request to the server */
int scoutfs_client_open_ino_map(struct super_block *sb, u64 group_nr,
struct scoutfs_open_ino_map *map)
{
struct client_info *client = SCOUTFS_SB(sb)->client_info;
struct scoutfs_open_ino_map_args args = {
.group_nr = cpu_to_le64(group_nr),
.req_id = 0,
};
return scoutfs_net_sync_request(sb, client->conn, SCOUTFS_NET_CMD_OPEN_INO_MAP,
&args, sizeof(args), map, sizeof(*map));
}
/* The client is asking the server for the current volume options */
int scoutfs_client_get_volopt(struct super_block *sb, struct scoutfs_volume_options *volopt)
{
struct client_info *client = SCOUTFS_SB(sb)->client_info;
return scoutfs_net_sync_request(sb, client->conn, SCOUTFS_NET_CMD_GET_VOLOPT,
NULL, 0, volopt, sizeof(*volopt));
}
/* The client is asking the server to update volume options */
int scoutfs_client_set_volopt(struct super_block *sb, struct scoutfs_volume_options *volopt)
{
struct client_info *client = SCOUTFS_SB(sb)->client_info;
return scoutfs_net_sync_request(sb, client->conn, SCOUTFS_NET_CMD_SET_VOLOPT,
volopt, sizeof(*volopt), NULL, 0);
}
/* The client is asking the server to clear volume options */
int scoutfs_client_clear_volopt(struct super_block *sb, struct scoutfs_volume_options *volopt)
{
struct client_info *client = SCOUTFS_SB(sb)->client_info;
return scoutfs_net_sync_request(sb, client->conn, SCOUTFS_NET_CMD_CLEAR_VOLOPT,
volopt, sizeof(*volopt), NULL, 0);
}
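These volume option wrappers are thin RPCs around struct scoutfs_volume_options (defined later in format.h, where set_bits selects which fields are meaningful). A minimal sketch, assuming the SCOUTFS_VOLOPT_* macros from format.h are in scope, of how a caller might set the data_alloc_zone_blocks option; the helper name is illustrative and not part of this change.
/*
 * Illustrative only: mark the option's bit in set_bits, fill its
 * field, and ask the server to apply it.
 */
static int example_set_zone_blocks(struct super_block *sb, u64 blocks)
{
	struct scoutfs_volume_options volopt = {
		.set_bits = cpu_to_le64(SCOUTFS_VOLOPT_DATA_ALLOC_ZONE_BLOCKS_BIT),
		.data_alloc_zone_blocks = cpu_to_le64(blocks),
	};
	return scoutfs_client_set_volopt(sb, &volopt);
}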
/* The client is receiving an invalidation request from the server */
static int client_lock(struct super_block *sb,
struct scoutfs_net_connection *conn, u8 cmd, u64 id,
@@ -288,6 +370,7 @@ static int client_greeting(struct super_block *sb,
scoutfs_net_client_greeting(sb, conn, new_server);
client->server_term = le64_to_cpu(gr->server_term);
client->connect_delay_jiffies = 0;
ret = 0;
out:
return ret;
@@ -337,6 +420,20 @@ out:
return ret;
}
/*
* If we're not seeing successful connections we want to back off. Each
* connection attempt starts by setting a long connection work delay.
* We only set a shorter delay if we see a greeting response from the
* server. At that point we'll try to immediately reconnect if the
* connection is broken.
*/
static void queue_connect_dwork(struct super_block *sb, struct client_info *client)
{
if (!atomic_read(&client->shutting_down) && !scoutfs_forcing_unmount(sb))
queue_delayed_work(client->workq, &client->connect_dwork,
client->connect_delay_jiffies);
}
/*
* This work is responsible for maintaining a connection from the client
* to the server. It's queued on mount and disconnect and we requeue
@@ -376,6 +473,9 @@ static void scoutfs_client_connect_worker(struct work_struct *work)
goto out;
}
/* always wait a bit until a greeting response sets a lower delay */
client->connect_delay_jiffies = msecs_to_jiffies(CLIENT_CONNECT_DELAY_MS);
ret = scoutfs_quorum_server_sin(sb, &sin);
if (ret < 0)
goto out;
@@ -403,16 +503,14 @@ static void scoutfs_client_connect_worker(struct work_struct *work)
if (ret)
scoutfs_net_shutdown(sb, client->conn);
out:
/* always have a small delay before retrying to avoid storms */
if (ret && !atomic_read(&client->shutting_down))
queue_delayed_work(client->workq, &client->connect_dwork,
msecs_to_jiffies(CLIENT_CONNECT_DELAY_MS));
if (ret)
queue_connect_dwork(sb, client);
}
static scoutfs_net_request_t client_req_funcs[] = {
[SCOUTFS_NET_CMD_LOCK] = client_lock,
[SCOUTFS_NET_CMD_LOCK_RECOVER] = client_lock_recover,
[SCOUTFS_NET_CMD_OPEN_INO_MAP] = client_open_ino_map,
};
/*
@@ -425,8 +523,7 @@ static void client_notify_down(struct super_block *sb,
{
struct client_info *client = SCOUTFS_SB(sb)->client_info;
if (!atomic_read(&client->shutting_down))
queue_delayed_work(client->workq, &client->connect_dwork, 0);
queue_connect_dwork(sb, client);
}
int scoutfs_client_setup(struct super_block *sb)
@@ -461,7 +558,7 @@ int scoutfs_client_setup(struct super_block *sb)
goto out;
}
queue_delayed_work(client->workq, &client->connect_dwork, 0);
queue_connect_dwork(sb, client);
ret = 0;
out:
@@ -518,7 +615,7 @@ void scoutfs_client_destroy(struct super_block *sb)
if (client == NULL)
return;
if (client->server_term != 0) {
if (client->server_term != 0 && !scoutfs_forcing_unmount(sb)) {
client->sending_farewell = true;
ret = scoutfs_net_submit_request(sb, client->conn,
SCOUTFS_NET_CMD_FAREWELL,

View File

@@ -22,6 +22,17 @@ int scoutfs_client_srch_get_compact(struct super_block *sb,
struct scoutfs_srch_compact *sc);
int scoutfs_client_srch_commit_compact(struct super_block *sb,
struct scoutfs_srch_compact *res);
int scoutfs_client_get_log_merge(struct super_block *sb,
struct scoutfs_log_merge_request *req);
int scoutfs_client_commit_log_merge(struct super_block *sb,
struct scoutfs_log_merge_complete *comp);
int scoutfs_client_send_omap_response(struct super_block *sb, u64 id,
struct scoutfs_open_ino_map *map);
int scoutfs_client_open_ino_map(struct super_block *sb, u64 group_nr,
struct scoutfs_open_ino_map *map);
int scoutfs_client_get_volopt(struct super_block *sb, struct scoutfs_volume_options *volopt);
int scoutfs_client_set_volopt(struct super_block *sb, struct scoutfs_volume_options *volopt);
int scoutfs_client_clear_volopt(struct super_block *sb, struct scoutfs_volume_options *volopt);
int scoutfs_client_setup(struct super_block *sb);
void scoutfs_client_destroy(struct super_block *sb);

View File

@@ -44,6 +44,14 @@
EXPAND_COUNTER(btree_insert) \
EXPAND_COUNTER(btree_leaf_item_hash_search) \
EXPAND_COUNTER(btree_lookup) \
EXPAND_COUNTER(btree_merge) \
EXPAND_COUNTER(btree_merge_alloc_low) \
EXPAND_COUNTER(btree_merge_delete) \
EXPAND_COUNTER(btree_merge_dirty_limit) \
EXPAND_COUNTER(btree_merge_drop_old) \
EXPAND_COUNTER(btree_merge_insert) \
EXPAND_COUNTER(btree_merge_update) \
EXPAND_COUNTER(btree_merge_walk) \
EXPAND_COUNTER(btree_next) \
EXPAND_COUNTER(btree_prev) \
EXPAND_COUNTER(btree_split) \
@@ -80,6 +88,7 @@
EXPAND_COUNTER(forest_read_items) \
EXPAND_COUNTER(forest_roots_next_hint) \
EXPAND_COUNTER(forest_set_bloom_bits) \
EXPAND_COUNTER(inode_evict_intr) \
EXPAND_COUNTER(item_clear_dirty) \
EXPAND_COUNTER(item_create) \
EXPAND_COUNTER(item_delete) \
@@ -143,6 +152,12 @@
EXPAND_COUNTER(net_recv_invalid_message) \
EXPAND_COUNTER(net_recv_messages) \
EXPAND_COUNTER(net_unknown_request) \
EXPAND_COUNTER(orphan_scan) \
EXPAND_COUNTER(orphan_scan_cached) \
EXPAND_COUNTER(orphan_scan_error) \
EXPAND_COUNTER(orphan_scan_item) \
EXPAND_COUNTER(orphan_scan_omap_set) \
EXPAND_COUNTER(orphan_scan_read) \
EXPAND_COUNTER(quorum_elected) \
EXPAND_COUNTER(quorum_fence_error) \
EXPAND_COUNTER(quorum_fence_leader) \

View File

@@ -312,10 +312,9 @@ int scoutfs_data_truncate_items(struct super_block *sb, struct inode *inode,
while (iblock <= last) {
if (inode)
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks,
true);
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, true, false);
else
ret = scoutfs_hold_trans(sb);
ret = scoutfs_hold_trans(sb, false);
if (ret)
break;
@@ -756,8 +755,7 @@ retry:
ret = scoutfs_inode_index_start(sb, &ind_seq) ?:
scoutfs_inode_index_prepare(sb, &wbd->ind_locks, inode,
true) ?:
scoutfs_inode_index_try_lock_hold(sb, &wbd->ind_locks,
ind_seq);
scoutfs_inode_index_try_lock_hold(sb, &wbd->ind_locks, ind_seq, true);
} while (ret > 0);
if (ret < 0)
goto out;
@@ -1010,7 +1008,7 @@ long scoutfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
while(iblock <= last) {
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, false);
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, false, true);
if (ret)
goto out;
@@ -1086,7 +1084,7 @@ int scoutfs_data_init_offline_extent(struct inode *inode, u64 size,
}
/* we're updating meta_seq with offline block count */
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, false);
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, false, true);
if (ret < 0)
goto out;
@@ -1135,7 +1133,8 @@ static void truncate_inode_pages_extent(struct inode *inode, u64 start, u64 len)
*/
#define MOVE_DATA_EXTENTS_PER_HOLD 16
int scoutfs_data_move_blocks(struct inode *from, u64 from_off,
u64 byte_len, struct inode *to, u64 to_off)
u64 byte_len, struct inode *to, u64 to_off, bool is_stage,
u64 data_version)
{
struct scoutfs_inode_info *from_si = SCOUTFS_I(from);
struct scoutfs_inode_info *to_si = SCOUTFS_I(to);
@@ -1145,6 +1144,7 @@ int scoutfs_data_move_blocks(struct inode *from, u64 from_off,
struct data_ext_args from_args;
struct data_ext_args to_args;
struct scoutfs_extent ext;
struct timespec cur_time;
LIST_HEAD(locks);
bool done = false;
loff_t from_size;
@@ -1180,6 +1180,11 @@ int scoutfs_data_move_blocks(struct inode *from, u64 from_off,
goto out;
}
if (is_stage && (data_version != SCOUTFS_I(to)->data_version)) {
ret = -ESTALE;
goto out;
}
from_iblock = from_off >> SCOUTFS_BLOCK_SM_SHIFT;
count = (byte_len + SCOUTFS_BLOCK_SM_MASK) >> SCOUTFS_BLOCK_SM_SHIFT;
to_iblock = to_off >> SCOUTFS_BLOCK_SM_SHIFT;
@@ -1202,7 +1207,7 @@ int scoutfs_data_move_blocks(struct inode *from, u64 from_off,
/* can't stage once data_version changes */
scoutfs_inode_get_onoff(from, &junk, &from_offline);
scoutfs_inode_get_onoff(to, &junk, &to_offline);
if (from_offline || to_offline) {
if (from_offline || (to_offline && !is_stage)) {
ret = -ENODATA;
goto out;
}
@@ -1231,7 +1236,7 @@ int scoutfs_data_move_blocks(struct inode *from, u64 from_off,
ret = scoutfs_inode_index_start(sb, &seq) ?:
scoutfs_inode_index_prepare(sb, &locks, from, true) ?:
scoutfs_inode_index_prepare(sb, &locks, to, true) ?:
scoutfs_inode_index_try_lock_hold(sb, &locks, seq);
scoutfs_inode_index_try_lock_hold(sb, &locks, seq, false);
if (ret > 0)
continue;
if (ret < 0)
@@ -1246,6 +1251,8 @@ int scoutfs_data_move_blocks(struct inode *from, u64 from_off,
/* arbitrarily limit the number of extents per trans hold */
for (i = 0; i < MOVE_DATA_EXTENTS_PER_HOLD; i++) {
struct scoutfs_extent off_ext;
/* find the next extent to move */
ret = scoutfs_ext_next(sb, &data_ext_ops, &from_args,
from_iblock, 1, &ext);
@@ -1274,10 +1281,27 @@ int scoutfs_data_move_blocks(struct inode *from, u64 from_off,
to_start = to_iblock + (from_start - from_iblock);
/* insert the new, fails if it overlaps */
ret = scoutfs_ext_insert(sb, &data_ext_ops, &to_args,
to_start, len,
map, ext.flags);
if (is_stage) {
ret = scoutfs_ext_next(sb, &data_ext_ops, &to_args,
to_start, 1, &off_ext);
if (ret)
break;
if (!scoutfs_ext_inside(to_start, len, &off_ext) ||
!(off_ext.flags & SEF_OFFLINE)) {
ret = -EINVAL;
break;
}
ret = scoutfs_ext_set(sb, &data_ext_ops, &to_args,
to_start, len,
map, ext.flags);
} else {
/* insert the new, fails if it overlaps */
ret = scoutfs_ext_insert(sb, &data_ext_ops, &to_args,
to_start, len,
map, ext.flags);
}
if (ret < 0)
break;
@@ -1285,10 +1309,18 @@ int scoutfs_data_move_blocks(struct inode *from, u64 from_off,
ret = scoutfs_ext_set(sb, &data_ext_ops, &from_args,
from_start, len, 0, 0);
if (ret < 0) {
/* remove inserted new on err */
err = scoutfs_ext_remove(sb, &data_ext_ops,
&to_args, to_start,
len);
if (is_stage) {
/* re-mark dest range as offline */
WARN_ON_ONCE(!(off_ext.flags & SEF_OFFLINE));
err = scoutfs_ext_set(sb, &data_ext_ops, &to_args,
to_start, len,
0, off_ext.flags);
} else {
/* remove inserted new on err */
err = scoutfs_ext_remove(sb, &data_ext_ops,
&to_args, to_start,
len);
}
BUG_ON(err); /* XXX inconsistent */
break;
}
@@ -1316,12 +1348,15 @@ int scoutfs_data_move_blocks(struct inode *from, u64 from_off,
up_write(&from_si->extent_sem);
up_write(&to_si->extent_sem);
from->i_ctime = from->i_mtime =
to->i_ctime = to->i_mtime = CURRENT_TIME;
cur_time = CURRENT_TIME;
if (!is_stage) {
to->i_ctime = to->i_mtime = cur_time;
scoutfs_inode_inc_data_version(to);
scoutfs_inode_set_data_seq(to);
}
from->i_ctime = from->i_mtime = cur_time;
scoutfs_inode_inc_data_version(from);
scoutfs_inode_inc_data_version(to);
scoutfs_inode_set_data_seq(from);
scoutfs_inode_set_data_seq(to);
scoutfs_update_inode_item(from, from_lock, &locks);
scoutfs_update_inode_item(to, to_lock, &locks);
@@ -1807,13 +1842,17 @@ int scoutfs_data_prepare_commit(struct super_block *sb)
return ret;
}
u64 scoutfs_data_alloc_free_bytes(struct super_block *sb)
/*
* Return true if the data allocator is lower than the caller's
* requirement and we haven't been told by the server that we're out of
* free extents.
*/
bool scoutfs_data_alloc_should_refill(struct super_block *sb, u64 blocks)
{
DECLARE_DATA_INFO(sb, datinf);
return scoutfs_dalloc_total_len(&datinf->dalloc) <<
SCOUTFS_BLOCK_SM_SHIFT;
return (scoutfs_dalloc_total_len(&datinf->dalloc) < blocks) &&
!(le32_to_cpu(datinf->dalloc.root.flags) & SCOUTFS_ALLOC_FLAG_LOW);
}
int scoutfs_data_setup(struct super_block *sb)

View File

@@ -59,7 +59,8 @@ long scoutfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len);
int scoutfs_data_init_offline_extent(struct inode *inode, u64 size,
struct scoutfs_lock *lock);
int scoutfs_data_move_blocks(struct inode *from, u64 from_off,
u64 byte_len, struct inode *to, u64 to_off);
u64 byte_len, struct inode *to, u64 to_off, bool to_stage,
u64 data_version);
int scoutfs_data_wait_check(struct inode *inode, loff_t pos, loff_t len,
u8 sef, u8 op, struct scoutfs_data_wait *ow,
@@ -85,7 +86,7 @@ void scoutfs_data_init_btrees(struct super_block *sb,
void scoutfs_data_get_btrees(struct super_block *sb,
struct scoutfs_log_trees *lt);
int scoutfs_data_prepare_commit(struct super_block *sb);
u64 scoutfs_data_alloc_free_bytes(struct super_block *sb);
bool scoutfs_data_alloc_should_refill(struct super_block *sb, u64 blocks);
int scoutfs_data_setup(struct super_block *sb);
void scoutfs_data_destroy(struct super_block *sb);

View File

@@ -30,6 +30,7 @@
#include "item.h"
#include "lock.h"
#include "hash.h"
#include "omap.h"
#include "counters.h"
#include "scoutfs_trace.h"
@@ -668,6 +669,7 @@ static struct inode *lock_hold_create(struct inode *dir, struct dentry *dentry,
umode_t mode, dev_t rdev,
struct scoutfs_lock **dir_lock,
struct scoutfs_lock **inode_lock,
struct scoutfs_lock **orph_lock,
struct list_head *ind_locks)
{
struct super_block *sb = dir->i_sb;
@@ -700,11 +702,17 @@ static struct inode *lock_hold_create(struct inode *dir, struct dentry *dentry,
if (ret)
goto out_unlock;
if (orph_lock) {
ret = scoutfs_lock_orphan(sb, SCOUTFS_LOCK_WRITE_ONLY, 0, ino, orph_lock);
if (ret < 0)
goto out_unlock;
}
retry:
ret = scoutfs_inode_index_start(sb, &ind_seq) ?:
scoutfs_inode_index_prepare(sb, ind_locks, dir, true) ?:
scoutfs_inode_index_prepare_ino(sb, ind_locks, ino, mode) ?:
scoutfs_inode_index_try_lock_hold(sb, ind_locks, ind_seq);
scoutfs_inode_index_try_lock_hold(sb, ind_locks, ind_seq, true);
if (ret > 0)
goto retry;
if (ret)
@@ -724,9 +732,13 @@ out_unlock:
if (ret) {
scoutfs_inode_index_unlock(sb, ind_locks);
scoutfs_unlock(sb, *dir_lock, SCOUTFS_LOCK_WRITE);
scoutfs_unlock(sb, *inode_lock, SCOUTFS_LOCK_WRITE);
*dir_lock = NULL;
scoutfs_unlock(sb, *inode_lock, SCOUTFS_LOCK_WRITE);
*inode_lock = NULL;
if (orph_lock) {
scoutfs_unlock(sb, *orph_lock, SCOUTFS_LOCK_WRITE_ONLY);
*orph_lock = NULL;
}
inode = ERR_PTR(ret);
}
@@ -741,6 +753,7 @@ static int scoutfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
struct inode *inode = NULL;
struct scoutfs_lock *dir_lock = NULL;
struct scoutfs_lock *inode_lock = NULL;
struct scoutfs_inode_info *si;
LIST_HEAD(ind_locks);
u64 hash;
u64 pos;
@@ -751,9 +764,10 @@ static int scoutfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
hash = dirent_name_hash(dentry->d_name.name, dentry->d_name.len);
inode = lock_hold_create(dir, dentry, mode, rdev,
&dir_lock, &inode_lock, &ind_locks);
&dir_lock, &inode_lock, NULL, &ind_locks);
if (IS_ERR(inode))
return PTR_ERR(inode);
si = SCOUTFS_I(inode);
pos = SCOUTFS_I(dir)->next_readdir_pos++;
@@ -769,6 +783,7 @@ static int scoutfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode,
i_size_write(dir, i_size_read(dir) + dentry->d_name.len);
dir->i_mtime = dir->i_ctime = CURRENT_TIME;
inode->i_mtime = inode->i_atime = inode->i_ctime = dir->i_mtime;
si->crtime = inode->i_mtime;
if (S_ISDIR(mode)) {
inc_nlink(inode);
@@ -812,12 +827,15 @@ static int scoutfs_link(struct dentry *old_dentry,
struct super_block *sb = dir->i_sb;
struct scoutfs_lock *dir_lock;
struct scoutfs_lock *inode_lock = NULL;
struct scoutfs_lock *orph_lock = NULL;
LIST_HEAD(ind_locks);
bool del_orphan = false;
u64 dir_size;
u64 ind_seq;
u64 hash;
u64 pos;
int ret;
int err;
hash = dirent_name_hash(dentry->d_name.name, dentry->d_name.len);
@@ -841,11 +859,20 @@ static int scoutfs_link(struct dentry *old_dentry,
goto out_unlock;
dir_size = i_size_read(dir) + dentry->d_name.len;
if (inode->i_nlink == 0) {
del_orphan = true;
ret = scoutfs_lock_orphan(sb, SCOUTFS_LOCK_WRITE_ONLY, 0, scoutfs_ino(inode),
&orph_lock);
if (ret < 0)
goto out_unlock;
}
retry:
ret = scoutfs_inode_index_start(sb, &ind_seq) ?:
scoutfs_inode_index_prepare(sb, &ind_locks, dir, false) ?:
scoutfs_inode_index_prepare(sb, &ind_locks, inode, false) ?:
scoutfs_inode_index_try_lock_hold(sb, &ind_locks, ind_seq);
scoutfs_inode_index_try_lock_hold(sb, &ind_locks, ind_seq, true);
if (ret > 0)
goto retry;
if (ret)
@@ -855,14 +882,23 @@ retry:
if (ret)
goto out;
if (del_orphan) {
ret = scoutfs_inode_orphan_delete(sb, scoutfs_ino(inode), orph_lock);
if (ret)
goto out;
}
pos = SCOUTFS_I(dir)->next_readdir_pos++;
ret = add_entry_items(sb, scoutfs_ino(dir), hash, pos,
dentry->d_name.name, dentry->d_name.len,
scoutfs_ino(inode), inode->i_mode, dir_lock,
inode_lock);
if (ret)
if (ret) {
err = scoutfs_inode_orphan_create(sb, scoutfs_ino(inode), orph_lock);
WARN_ON_ONCE(err); /* no orphan, might not scan and delete after crash */
goto out;
}
update_dentry_info(sb, dentry, hash, pos, dir_lock);
i_size_write(dir, dir_size);
@@ -881,6 +917,8 @@ out_unlock:
scoutfs_inode_index_unlock(sb, &ind_locks);
scoutfs_unlock(sb, dir_lock, SCOUTFS_LOCK_WRITE);
scoutfs_unlock(sb, inode_lock, SCOUTFS_LOCK_WRITE);
scoutfs_unlock(sb, orph_lock, SCOUTFS_LOCK_WRITE_ONLY);
return ret;
}
@@ -905,6 +943,7 @@ static int scoutfs_unlink(struct inode *dir, struct dentry *dentry)
struct inode *inode = dentry->d_inode;
struct timespec ts = current_kernel_time();
struct scoutfs_lock *inode_lock = NULL;
struct scoutfs_lock *orph_lock = NULL;
struct scoutfs_lock *dir_lock = NULL;
LIST_HEAD(ind_locks);
u64 ind_seq;
@@ -922,32 +961,36 @@ static int scoutfs_unlink(struct inode *dir, struct dentry *dentry)
goto unlock;
}
if (should_orphan(inode)) {
ret = scoutfs_lock_orphan(sb, SCOUTFS_LOCK_WRITE_ONLY, 0, scoutfs_ino(inode),
&orph_lock);
if (ret < 0)
goto unlock;
}
retry:
ret = scoutfs_inode_index_start(sb, &ind_seq) ?:
scoutfs_inode_index_prepare(sb, &ind_locks, dir, false) ?:
scoutfs_inode_index_prepare(sb, &ind_locks, inode, false) ?:
scoutfs_inode_index_try_lock_hold(sb, &ind_locks, ind_seq);
scoutfs_inode_index_try_lock_hold(sb, &ind_locks, ind_seq, false);
if (ret > 0)
goto retry;
if (ret)
goto unlock;
if (should_orphan(inode)) {
ret = scoutfs_inode_orphan_create(sb, scoutfs_ino(inode), orph_lock);
if (ret < 0)
goto out;
}
ret = del_entry_items(sb, scoutfs_ino(dir), dentry_info_hash(dentry),
dentry_info_pos(dentry), scoutfs_ino(inode),
dir_lock, inode_lock);
if (ret)
if (ret) {
ret = scoutfs_inode_orphan_delete(sb, scoutfs_ino(inode), orph_lock);
WARN_ON_ONCE(ret); /* should have been dirty */
goto out;
if (should_orphan(inode)) {
/*
* Insert the orphan item before we modify any inode
* metadata so we can gracefully exit should it
* fail.
*/
ret = scoutfs_orphan_inode(inode);
WARN_ON_ONCE(ret); /* XXX returning error but items deleted */
if (ret)
goto out;
}
dir->i_ctime = ts;
@@ -969,6 +1012,7 @@ unlock:
scoutfs_inode_index_unlock(sb, &ind_locks);
scoutfs_unlock(sb, dir_lock, SCOUTFS_LOCK_WRITE);
scoutfs_unlock(sb, inode_lock, SCOUTFS_LOCK_WRITE);
scoutfs_unlock(sb, orph_lock, SCOUTFS_LOCK_WRITE_ONLY);
return ret;
}
@@ -1144,6 +1188,7 @@ static int scoutfs_symlink(struct inode *dir, struct dentry *dentry,
struct inode *inode = NULL;
struct scoutfs_lock *dir_lock = NULL;
struct scoutfs_lock *inode_lock = NULL;
struct scoutfs_inode_info *si;
LIST_HEAD(ind_locks);
u64 hash;
u64 pos;
@@ -1161,9 +1206,10 @@ static int scoutfs_symlink(struct inode *dir, struct dentry *dentry,
return ret;
inode = lock_hold_create(dir, dentry, S_IFLNK|S_IRWXUGO, 0,
&dir_lock, &inode_lock, &ind_locks);
&dir_lock, &inode_lock, NULL, &ind_locks);
if (IS_ERR(inode))
return PTR_ERR(inode);
si = SCOUTFS_I(inode);
ret = symlink_item_ops(sb, SYM_CREATE, scoutfs_ino(inode), inode_lock,
symname, name_len);
@@ -1185,6 +1231,7 @@ static int scoutfs_symlink(struct inode *dir, struct dentry *dentry,
dir->i_mtime = dir->i_ctime = CURRENT_TIME;
inode->i_ctime = dir->i_mtime;
si->crtime = inode->i_ctime;
i_size_write(inode, name_len);
scoutfs_update_inode_item(inode, inode_lock, &ind_locks);
@@ -1520,6 +1567,7 @@ static int scoutfs_rename(struct inode *old_dir, struct dentry *old_dentry,
struct scoutfs_lock *new_dir_lock = NULL;
struct scoutfs_lock *old_inode_lock = NULL;
struct scoutfs_lock *new_inode_lock = NULL;
struct scoutfs_lock *orph_lock = NULL;
struct timespec now;
bool ins_new = false;
bool del_new = false;
@@ -1584,6 +1632,13 @@ static int scoutfs_rename(struct inode *old_dir, struct dentry *old_dentry,
if (ret)
goto out_unlock;
if (should_orphan(new_inode)) {
ret = scoutfs_lock_orphan(sb, SCOUTFS_LOCK_WRITE_ONLY, 0, scoutfs_ino(new_inode),
&orph_lock);
if (ret < 0)
goto out_unlock;
}
retry:
ret = scoutfs_inode_index_start(sb, &ind_seq) ?:
scoutfs_inode_index_prepare(sb, &ind_locks, old_dir, false) ?:
@@ -1592,7 +1647,7 @@ retry:
scoutfs_inode_index_prepare(sb, &ind_locks, new_dir, false)) ?:
(new_inode == NULL ? 0 :
scoutfs_inode_index_prepare(sb, &ind_locks, new_inode, false)) ?:
scoutfs_inode_index_try_lock_hold(sb, &ind_locks, ind_seq);
scoutfs_inode_index_try_lock_hold(sb, &ind_locks, ind_seq, true);
if (ret > 0)
goto retry;
if (ret)
@@ -1643,7 +1698,7 @@ retry:
ins_old = true;
if (should_orphan(new_inode)) {
ret = scoutfs_orphan_inode(new_inode);
ret = scoutfs_inode_orphan_create(sb, scoutfs_ino(new_inode), orph_lock);
if (ret)
goto out;
}
@@ -1747,6 +1802,7 @@ out_unlock:
scoutfs_unlock(sb, old_dir_lock, SCOUTFS_LOCK_WRITE);
scoutfs_unlock(sb, new_dir_lock, SCOUTFS_LOCK_WRITE);
scoutfs_unlock(sb, rename_lock, SCOUTFS_LOCK_WRITE);
scoutfs_unlock(sb, orph_lock, SCOUTFS_LOCK_WRITE_ONLY);
return ret;
}
@@ -1760,6 +1816,53 @@ static int scoutfs_dir_open(struct inode *inode, struct file *file)
}
#endif
static int scoutfs_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
{
struct super_block *sb = dir->i_sb;
struct inode *inode = NULL;
struct scoutfs_lock *dir_lock = NULL;
struct scoutfs_lock *inode_lock = NULL;
struct scoutfs_lock *orph_lock = NULL;
struct scoutfs_inode_info *si;
LIST_HEAD(ind_locks);
int ret;
if (dentry->d_name.len > SCOUTFS_NAME_LEN)
return -ENAMETOOLONG;
inode = lock_hold_create(dir, dentry, mode, 0,
&dir_lock, &inode_lock, &orph_lock, &ind_locks);
if (IS_ERR(inode))
return PTR_ERR(inode);
si = SCOUTFS_I(inode);
ret = scoutfs_inode_orphan_create(sb, scoutfs_ino(inode), orph_lock);
if (ret < 0) {
iput(inode);
goto out; /* XXX returning error but items created */
}
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
si->crtime = inode->i_mtime;
insert_inode_hash(inode);
ihold(inode); /* need to update inode modifications in d_tmpfile */
d_tmpfile(dentry, inode);
scoutfs_update_inode_item(inode, inode_lock, &ind_locks);
scoutfs_update_inode_item(dir, dir_lock, &ind_locks);
scoutfs_inode_index_unlock(sb, &ind_locks);
iput(inode);
out:
scoutfs_release_trans(sb);
scoutfs_inode_index_unlock(sb, &ind_locks);
scoutfs_unlock(sb, dir_lock, SCOUTFS_LOCK_WRITE);
scoutfs_unlock(sb, inode_lock, SCOUTFS_LOCK_WRITE);
scoutfs_unlock(sb, orph_lock, SCOUTFS_LOCK_WRITE_ONLY);
return ret;
}
const struct file_operations scoutfs_dir_fops = {
.KC_FOP_READDIR = scoutfs_readdir,
#ifdef KC_FMODE_KABI_ITERATE
@@ -1770,7 +1873,10 @@ const struct file_operations scoutfs_dir_fops = {
.llseek = generic_file_llseek,
};
const struct inode_operations scoutfs_dir_iops = {
const struct inode_operations_wrapper scoutfs_dir_iops = {
.ops = {
.lookup = scoutfs_lookup,
.mknod = scoutfs_mknod,
.create = scoutfs_create,
@@ -1787,6 +1893,8 @@ const struct inode_operations scoutfs_dir_iops = {
.removexattr = scoutfs_removexattr,
.symlink = scoutfs_symlink,
.permission = scoutfs_permission,
},
.tmpfile = scoutfs_tmpfile,
};
void scoutfs_dir_exit(void)

View File

@@ -5,7 +5,7 @@
#include "lock.h"
extern const struct file_operations scoutfs_dir_fops;
extern const struct inode_operations scoutfs_dir_iops;
extern const struct inode_operations_wrapper scoutfs_dir_iops;
extern const struct inode_operations scoutfs_symlink_iops;
struct scoutfs_link_backref_entry {
@@ -14,7 +14,7 @@ struct scoutfs_link_backref_entry {
u64 dir_pos;
u16 name_len;
struct scoutfs_dirent dent;
/* the full name is allocated and stored in dent.name[0] */
/* the full name is allocated and stored in dent.name[] */
};
int scoutfs_dir_get_backref_path(struct super_block *sb, u64 ino, u64 dir_ino,

View File

@@ -38,7 +38,7 @@ static bool ext_overlap(struct scoutfs_extent *ext, u64 start, u64 len)
return !(e_end < start || ext->start > end);
}
static bool ext_inside(u64 start, u64 len, struct scoutfs_extent *out)
bool scoutfs_ext_inside(u64 start, u64 len, struct scoutfs_extent *out)
{
u64 in_end = start + len - 1;
u64 out_end = out->start + out->len - 1;
@@ -241,7 +241,7 @@ int scoutfs_ext_remove(struct super_block *sb, struct scoutfs_ext_ops *ops,
goto out;
/* removed extent must be entirely within found */
if (!ext_inside(start, len, &found)) {
if (!scoutfs_ext_inside(start, len, &found)) {
ret = -EINVAL;
goto out;
}
@@ -341,7 +341,7 @@ int scoutfs_ext_set(struct super_block *sb, struct scoutfs_ext_ops *ops,
if (ret == 0 && ext_overlap(&found, start, len)) {
/* set extent must be entirely within found */
if (!ext_inside(start, len, &found)) {
if (!scoutfs_ext_inside(start, len, &found)) {
ret = -EINVAL;
goto out;
}

View File

@@ -31,5 +31,6 @@ int scoutfs_ext_alloc(struct super_block *sb, struct scoutfs_ext_ops *ops,
struct scoutfs_extent *ext);
int scoutfs_ext_set(struct super_block *sb, struct scoutfs_ext_ops *ops,
void *arg, u64 start, u64 len, u64 map, u8 flags);
bool scoutfs_ext_inside(u64 start, u64 len, struct scoutfs_extent *out);
#endif

kmod/src/fence.c (new file, 480 lines)
View File

@@ -0,0 +1,480 @@
/*
* Copyright (C) 2019 Versity Software, Inc. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License v2 as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <linux/device.h>
#include <linux/timer.h>
#include <asm/barrier.h>
#include "super.h"
#include "msg.h"
#include "sysfs.h"
#include "server.h"
#include "fence.h"
/*
* Fencing ensures that a given mount can no longer write to the
* metadata or data devices. It's necessary to ensure that it's safe to
* give another mount access to a resource that is currently owned by a
* mount that has stopped responding.
*
* Fencing is performed in collaboration between the currently elected
* quorum leader mount and userspace running on its host. The kernel
* creates fencing requests as it notices that mounts have stopped
* participating. The fence requests are published as directories in
* sysfs. Userspace agents watch for directories, take action, and
* write to files in the directory to indicate that the mount has been
* fenced. Once the mount is fenced the server can reclaim the
* resources previously held by the fenced mount.
*
* The fence requests contain metadata identifying the specific instance
* of the mount that needs to be fenced. This lets a fencing agent
* ensure that a specific mount has been fenced without necessarily
* destroying the node that was hosting it. Maybe the node had rebooted
* and the mount is no longer there, maybe the mount can be force
* unmounted, maybe the node can be configured to isolate the mount from
* the devices.
*
* The fencing mechanism is asynchronous and can fail but the server
* cannot make progress until it completes. If a fence request times
* out the server shuts down in the hope that another instance of a
* server might have more luck fencing a non-responsive mount.
*
* Sources of fencing are fundamentally anchored in shared persistent
* state. It is possible, though unlikely, that servers can fence a
* node and then themselves fail, leaving the next server to try and
* fence the mount again.
*/
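Concretely, the kset created below publishes one sysfs directory per pending request, containing rid, ipv4_addr, reason, elapsed_secs, fenced, and error files. The following is a minimal userspace sketch of an agent servicing one such request; the directory path is supplied by the caller and isolate_host() is a placeholder for the site-specific fencing action, neither of which comes from this change.
/*
 * Minimal userspace sketch of a fencing agent servicing one request
 * directory (e.g. .../fence/<rid>/ under the mount's sysfs dir).
 * isolate_host() is a placeholder for the real action: power cycling
 * the node, disabling its switch port, forcing an unmount, etc.
 */
#include <stdio.h>
static int isolate_host(const char *rid, const char *addr)
{
	/* placeholder: perform the site-specific fencing action */
	fprintf(stderr, "fencing rid %s addr %s\n", rid, addr);
	return 0;
}
static int read_attr(const char *dir, const char *name, char *buf, int len)
{
	char path[512];
	FILE *f;
	int ok;
	snprintf(path, sizeof(path), "%s/%s", dir, name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fgets(buf, len, f) != NULL;
	fclose(f);
	return ok ? 0 : -1;
}
int main(int argc, char **argv)
{
	char rid[64], addr[64], path[512];
	FILE *f;
	if (argc != 2)
		return 1;
	if (read_attr(argv[1], "rid", rid, sizeof(rid)) ||
	    read_attr(argv[1], "ipv4_addr", addr, sizeof(addr)))
		return 1;
	/* any write to fenced or error tells the server the outcome */
	snprintf(path, sizeof(path), "%s/%s", argv[1],
		 isolate_host(rid, addr) == 0 ? "fenced" : "error");
	f = fopen(path, "w");
	if (!f)
		return 1;
	fputs("1\n", f);
	fclose(f);
	return 0;
}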
struct fence_info {
struct kset *kset;
struct kobject fence_dir_kobj;
struct workqueue_struct *wq;
wait_queue_head_t waitq;
spinlock_t lock;
struct list_head list;
};
#define DECLARE_FENCE_INFO(sb, name) \
struct fence_info *name = SCOUTFS_SB(sb)->fence_info
struct pending_fence {
struct super_block *sb;
struct scoutfs_sysfs_attrs ssa;
struct list_head entry;
struct timer_list timer;
ktime_t start_kt;
__be32 ipv4_addr;
bool fenced;
bool error;
int reason;
u64 rid;
};
#define FENCE_FROM_KOBJ(kobj) \
container_of(SCOUTFS_SYSFS_ATTRS(kobj), struct pending_fence, ssa)
#define DECLARE_FENCE_FROM_KOBJ(name, kobj) \
struct pending_fence *name = FENCE_FROM_KOBJ(kobj)
static void destroy_fence(struct pending_fence *fence)
{
struct super_block *sb = fence->sb;
scoutfs_sysfs_destroy_attrs(sb, &fence->ssa);
del_timer_sync(&fence->timer);
kfree(fence);
}
static ssize_t elapsed_secs_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
DECLARE_FENCE_FROM_KOBJ(fence, kobj);
ktime_t now = ktime_get();
struct timeval tv = { 0, };
if (ktime_after(now, fence->start_kt))
tv = ktime_to_timeval(ktime_sub(now, fence->start_kt));
return snprintf(buf, PAGE_SIZE, "%llu", (long long)tv.tv_sec);
}
SCOUTFS_ATTR_RO(elapsed_secs);
static ssize_t fenced_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
DECLARE_FENCE_FROM_KOBJ(fence, kobj);
return snprintf(buf, PAGE_SIZE, "%u", !!fence->fenced);
}
/*
* any write to the fenced file from userspace indicates that the mount
* has been safely fenced and can no longer write to the shared device.
*/
static ssize_t fenced_store(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t count)
{
DECLARE_FENCE_FROM_KOBJ(fence, kobj);
DECLARE_FENCE_INFO(fence->sb, fi);
if (!fence->fenced) {
del_timer_sync(&fence->timer);
fence->fenced = true;
wake_up(&fi->waitq);
}
return count;
}
SCOUTFS_ATTR_RW(fenced);
static ssize_t error_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
DECLARE_FENCE_FROM_KOBJ(fence, kobj);
return snprintf(buf, PAGE_SIZE, "%u", !!fence->error);
}
/*
* The fencing agent can tell us that it was unable to fence the given mount.
* We can't continue if the mount can't be isolated so we shut down the
* server.
*/
static ssize_t error_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf,
size_t count)
{
DECLARE_FENCE_FROM_KOBJ(fence, kobj);
struct super_block *sb = fence->sb;
DECLARE_FENCE_INFO(fence->sb, fi);
if (!fence->error) {
fence->error = true;
scoutfs_err(sb, "error indicated by fence action for rid %016llx", fence->rid);
wake_up(&fi->waitq);
}
return count;
}
SCOUTFS_ATTR_RW(error);
static ssize_t ipv4_addr_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
DECLARE_FENCE_FROM_KOBJ(fence, kobj);
return snprintf(buf, PAGE_SIZE, "%pI4", &fence->ipv4_addr);
}
SCOUTFS_ATTR_RO(ipv4_addr);
static ssize_t reason_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
DECLARE_FENCE_FROM_KOBJ(fence, kobj);
unsigned r = fence->reason;
char *str = "unknown";
static char *reasons[] = {
[SCOUTFS_FENCE_CLIENT_RECOVERY] = "client_recovery",
[SCOUTFS_FENCE_CLIENT_RECONNECT] = "client_reconnect",
[SCOUTFS_FENCE_QUORUM_BLOCK_LEADER] = "quorum_block_leader",
};
if (r < ARRAY_SIZE(reasons) && reasons[r])
str = reasons[r];
return snprintf(buf, PAGE_SIZE, "%s", str);
}
SCOUTFS_ATTR_RO(reason);
static ssize_t rid_show(struct kobject *kobj, struct kobj_attribute *attr,
char *buf)
{
DECLARE_FENCE_FROM_KOBJ(fence, kobj);
return snprintf(buf, PAGE_SIZE, "%016llx", fence->rid);
}
SCOUTFS_ATTR_RO(rid);
static struct attribute *fence_attrs[] = {
SCOUTFS_ATTR_PTR(elapsed_secs),
SCOUTFS_ATTR_PTR(fenced),
SCOUTFS_ATTR_PTR(error),
SCOUTFS_ATTR_PTR(ipv4_addr),
SCOUTFS_ATTR_PTR(reason),
SCOUTFS_ATTR_PTR(rid),
NULL,
};
#define FENCE_TIMEOUT_MS (MSEC_PER_SEC * 30)
static void fence_timeout(struct timer_list *timer)
{
struct pending_fence *fence = from_timer(fence, timer, timer);
struct super_block *sb = fence->sb;
DECLARE_FENCE_INFO(sb, fi);
fence->error = true;
scoutfs_err(sb, "fence request for rid %016llx was not serviced in %lums, raising error",
fence->rid, FENCE_TIMEOUT_MS);
wake_up(&fi->waitq);
}
int scoutfs_fence_start(struct super_block *sb, u64 rid, __be32 ipv4_addr, int reason)
{
DECLARE_FENCE_INFO(sb, fi);
struct pending_fence *fence;
int ret;
fence = kzalloc(sizeof(struct pending_fence), GFP_NOFS);
if (!fence) {
ret = -ENOMEM;
goto out;
}
fence->sb = sb;
scoutfs_sysfs_init_attrs(sb, &fence->ssa);
fence->start_kt = ktime_get();
fence->ipv4_addr = ipv4_addr;
fence->fenced = false;
fence->error = false;
fence->reason = reason;
fence->rid = rid;
ret = scoutfs_sysfs_create_attrs_parent(sb, &fi->kset->kobj,
&fence->ssa, fence_attrs,
"%016llx", rid);
if (ret < 0) {
kfree(fence);
goto out;
}
timer_setup(&fence->timer, fence_timeout, 0);
fence->timer.expires = jiffies + msecs_to_jiffies(FENCE_TIMEOUT_MS);
add_timer(&fence->timer);
spin_lock(&fi->lock);
list_add_tail(&fence->entry, &fi->list);
spin_unlock(&fi->lock);
out:
return ret;
}
/*
* Give the caller the rid of the next fence request which has been
* fenced. It doesn't take a starting position from which to return the next
* request because the caller either frees the fence request it's given or shuts
* down.
*/
int scoutfs_fence_next(struct super_block *sb, u64 *rid, int *reason, bool *error)
{
DECLARE_FENCE_INFO(sb, fi);
struct pending_fence *fence;
int ret = -ENOENT;
spin_lock(&fi->lock);
list_for_each_entry(fence, &fi->list, entry) {
if (fence->fenced || fence->error) {
*rid = fence->rid;
*reason = fence->reason;
*error = fence->error;
ret = 0;
break;
}
}
spin_unlock(&fi->lock);
return ret;
}
int scoutfs_fence_reason_pending(struct super_block *sb, int reason)
{
DECLARE_FENCE_INFO(sb, fi);
struct pending_fence *fence;
bool pending = false;
spin_lock(&fi->lock);
list_for_each_entry(fence, &fi->list, entry) {
if (fence->reason == reason) {
pending = true;
break;
}
}
spin_unlock(&fi->lock);
return pending;
}
int scoutfs_fence_free(struct super_block *sb, u64 rid)
{
DECLARE_FENCE_INFO(sb, fi);
struct pending_fence *fence;
int ret = -ENOENT;
spin_lock(&fi->lock);
list_for_each_entry(fence, &fi->list, entry) {
if (fence->rid == rid) {
list_del_init(&fence->entry);
ret = 0;
break;
}
}
spin_unlock(&fi->lock);
if (ret == 0) {
destroy_fence(fence);
wake_up(&fi->waitq);
}
return ret;
}
static bool all_fenced(struct fence_info *fi, bool *error)
{
struct pending_fence *fence;
bool all = true;
*error = false;
spin_lock(&fi->lock);
list_for_each_entry(fence, &fi->list, entry) {
if (fence->error) {
*error = true;
all = true;
break;
}
if (!fence->fenced) {
all = false;
break;
}
}
spin_unlock(&fi->lock);
return all;
}
/*
* The caller waits for all the current requests to be fenced, but not
* necessarily reclaimed.
*/
int scoutfs_fence_wait_fenced(struct super_block *sb, long timeout_jiffies)
{
DECLARE_FENCE_INFO(sb, fi);
bool error;
long ret;
ret = wait_event_interruptible_timeout(fi->waitq, all_fenced(fi, &error), timeout_jiffies);
if (ret == 0)
ret = -ETIMEDOUT;
else if (ret > 0)
ret = 0;
else if (error)
ret = -EIO;
return ret;
}
/*
* This must be called early during startup so that it is guaranteed that
* no other subsystems will try and call fence_start while we're waiting
* for testing fence requests to complete.
*/
int scoutfs_fence_setup(struct super_block *sb)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct mount_options *opts = &sbi->opts;
struct fence_info *fi;
int ret;
/* can only fence if we can be elected by quorum */
if (opts->quorum_slot_nr == -1) {
ret = 0;
goto out;
}
fi = kzalloc(sizeof(struct fence_info), GFP_KERNEL);
if (!fi) {
ret = -ENOMEM;
goto out;
}
init_waitqueue_head(&fi->waitq);
spin_lock_init(&fi->lock);
INIT_LIST_HEAD(&fi->list);
sbi->fence_info = fi;
fi->kset = kset_create_and_add("fence", NULL, scoutfs_sysfs_sb_dir(sb));
if (!fi->kset) {
ret = -ENOMEM;
goto out;
}
fi->wq = alloc_workqueue("scoutfs_fence",
WQ_UNBOUND | WQ_NON_REENTRANT, 0);
if (!fi->wq) {
ret = -ENOMEM;
goto out;
}
ret = 0;
out:
if (ret)
scoutfs_fence_destroy(sb);
return ret;
}
/*
* Tear down all pending fence requests because the server is shutting down.
*/
void scoutfs_fence_stop(struct super_block *sb)
{
DECLARE_FENCE_INFO(sb, fi);
struct pending_fence *fence;
do {
spin_lock(&fi->lock);
fence = list_first_entry_or_null(&fi->list, struct pending_fence, entry);
if (fence)
list_del_init(&fence->entry);
spin_unlock(&fi->lock);
if (fence) {
destroy_fence(fence);
wake_up(&fi->waitq);
}
} while (fence);
}
void scoutfs_fence_destroy(struct super_block *sb)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct fence_info *fi = SCOUTFS_SB(sb)->fence_info;
struct pending_fence *fence;
struct pending_fence *tmp;
if (fi) {
if (fi->wq)
destroy_workqueue(fi->wq);
list_for_each_entry_safe(fence, tmp, &fi->list, entry)
destroy_fence(fence);
if (fi->kset)
kset_unregister(fi->kset);
kfree(fi);
sbi->fence_info = NULL;
}
}

kmod/src/fence.h (new file, 20 lines)
View File

@@ -0,0 +1,20 @@
#ifndef _SCOUTFS_FENCE_H_
#define _SCOUTFS_FENCE_H_
enum {
SCOUTFS_FENCE_CLIENT_RECOVERY,
SCOUTFS_FENCE_CLIENT_RECONNECT,
SCOUTFS_FENCE_QUORUM_BLOCK_LEADER,
};
int scoutfs_fence_start(struct super_block *sb, u64 rid, __be32 ipv4_addr, int reason);
int scoutfs_fence_next(struct super_block *sb, u64 *rid, int *reason, bool *error);
int scoutfs_fence_reason_pending(struct super_block *sb, int reason);
int scoutfs_fence_free(struct super_block *sb, u64 rid);
int scoutfs_fence_wait_fenced(struct super_block *sb, long timeout_jiffies);
int scoutfs_fence_setup(struct super_block *sb);
void scoutfs_fence_stop(struct super_block *sb);
void scoutfs_fence_destroy(struct super_block *sb);
#endif
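These declarations outline the server-side lifecycle: start a request, wait for the agent, then consume and free the serviced requests. A hedged sketch of how a server recovery path might drive them follows; example_reclaim_rid() and the 30 second wait are illustrative placeholders, not taken from this change.
/*
 * Illustration only: start a fence request, wait for the userspace
 * agent, then consume the serviced request.  example_reclaim_rid() is
 * a hypothetical stand-in for the server's real resource reclaim.
 */
static int example_fence_and_reclaim(struct super_block *sb, u64 stale_rid,
				     __be32 addr)
{
	bool error;
	int reason;
	u64 rid;
	int ret;
	ret = scoutfs_fence_start(sb, stale_rid, addr,
				  SCOUTFS_FENCE_CLIENT_RECOVERY);
	if (ret < 0)
		return ret;
	/* wait for the agent to write to the fenced or error files */
	ret = scoutfs_fence_wait_fenced(sb, msecs_to_jiffies(30 * MSEC_PER_SEC));
	if (ret < 0)
		return ret;
	while (scoutfs_fence_next(sb, &rid, &reason, &error) == 0) {
		if (!error)
			ret = example_reclaim_rid(sb, rid);
		scoutfs_fence_free(sb, rid);
		if (error)
			return -EIO;
		if (ret < 0)
			return ret;
	}
	return 0;
}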

View File

@@ -27,8 +27,14 @@
#include "file.h"
#include "inode.h"
#include "per_task.h"
#include "omap.h"
/* TODO: Direct I/O, AIO */
/*
* Start a high level file read. We check for offline extents in the
* read region here so that we only check the extents once. We use the
* dio count to prevent releasing while we're reading after we've
* checked the extents.
*/
ssize_t scoutfs_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t pos)
{
@@ -42,30 +48,32 @@ ssize_t scoutfs_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
int ret;
retry:
/* protect checked extents from release */
mutex_lock(&inode->i_mutex);
atomic_inc(&inode->i_dio_count);
mutex_unlock(&inode->i_mutex);
ret = scoutfs_lock_inode(sb, SCOUTFS_LOCK_READ,
SCOUTFS_LKF_REFRESH_INODE, inode, &inode_lock);
if (ret)
goto out;
if (scoutfs_per_task_add_excl(&si->pt_data_lock, &pt_ent, inode_lock)) {
/* protect checked extents from stage/release */
mutex_lock(&inode->i_mutex);
atomic_inc(&inode->i_dio_count);
mutex_unlock(&inode->i_mutex);
ret = scoutfs_data_wait_check_iov(inode, iov, nr_segs, pos,
SEF_OFFLINE,
SCOUTFS_IOC_DWO_READ,
&dw, inode_lock);
if (ret != 0)
goto out;
} else {
WARN_ON_ONCE(true);
}
ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
out:
if (scoutfs_per_task_del(&si->pt_data_lock, &pt_ent))
inode_dio_done(inode);
inode_dio_done(inode);
scoutfs_per_task_del(&si->pt_data_lock, &pt_ent);
scoutfs_unlock(sb, inode_lock, SCOUTFS_LOCK_READ);
if (scoutfs_data_wait_found(&dw)) {

View File

@@ -37,9 +37,9 @@
*
* The log btrees are modified by multiple transactions over time so
* there is no consistent ordering relationship between the items in
* different btrees. Each item in a log btree stores a version number
* for the item. Readers check log btrees for the most recent version
* that it should use.
* different btrees. Each item in a log btree stores a seq for the
* item. Readers check log btrees for the most recent seq that they
* should use.
*
* The item cache reads items in bulk from stable btrees, and writes a
* transaction's worth of dirty items into the item log btree.
@@ -52,6 +52,8 @@
*/
struct forest_info {
struct super_block *sb;
struct mutex mutex;
struct scoutfs_alloc *alloc;
struct scoutfs_block_writer *wri;
@@ -60,6 +62,9 @@ struct forest_info {
struct mutex srch_mutex;
struct scoutfs_srch_file srch_file;
struct scoutfs_block *srch_bl;
struct workqueue_struct *workq;
struct delayed_work log_merge_dwork;
};
#define DECLARE_FOREST_INFO(sb, name) \
@@ -249,7 +254,7 @@ static int forest_read_items(struct super_block *sb, struct scoutfs_key *key,
* If we hit stale blocks and retry we can call the callback for
* duplicate items. This is harmless because the items are stable while
* the caller holds their cluster lock and the caller has to filter out
* item versions anyway.
* item seqs anyway.
*/
int scoutfs_forest_read_items(struct super_block *sb,
struct scoutfs_lock *lock,
@@ -276,7 +281,6 @@ int scoutfs_forest_read_items(struct super_block *sb,
scoutfs_inc_counter(sb, forest_read_items);
calc_bloom_nrs(&bloom, &lock->start);
roots = lock->roots;
retry:
ret = scoutfs_client_get_roots(sb, &roots);
if (ret)
@@ -349,15 +353,9 @@ retry:
ret = 0;
out:
if (ret == -ESTALE) {
if (memcmp(&prev_refs, &refs, sizeof(refs)) == 0) {
ret = -EIO;
goto out;
}
if (memcmp(&prev_refs, &refs, sizeof(refs)) == 0)
return -EIO;
prev_refs = refs;
ret = scoutfs_client_get_roots(sb, &roots);
if (ret)
goto out;
goto retry;
}
@@ -433,29 +431,29 @@ out:
/*
* The caller is committing items in the transaction and has found the
* greatest item version amongst them. We store it in the log_trees root
* greatest item seq amongst them. We store it in the log_trees root
* to send to the server.
*/
void scoutfs_forest_set_max_vers(struct super_block *sb, u64 max_vers)
void scoutfs_forest_set_max_seq(struct super_block *sb, u64 max_seq)
{
DECLARE_FOREST_INFO(sb, finf);
finf->our_log.max_item_vers = cpu_to_le64(max_vers);
finf->our_log.max_item_seq = cpu_to_le64(max_seq);
}
/*
* The server is calling during setup to find the greatest item version
* The server is calling during setup to find the greatest item seq
* amongst all the log tree roots. They have the authoritative current
* super.
*
* Item versions are only used to compare items in log trees, not in the
* main fs tree. All we have to do is find the greatest version amongst
* the log_trees so that new locks will have a write_version greater
* than all the items in the log_trees.
* Item seqs are only used to compare items in log trees, not in the
* main fs tree. All we have to do is find the greatest seq amongst the
* log_trees so that the core seq will have a greater seq than all the
* items in the log_trees.
*/
int scoutfs_forest_get_max_vers(struct super_block *sb,
struct scoutfs_super_block *super,
u64 *vers)
int scoutfs_forest_get_max_seq(struct super_block *sb,
struct scoutfs_super_block *super,
u64 *seq)
{
struct scoutfs_log_trees *lt;
SCOUTFS_BTREE_ITEM_REF(iref);
@@ -463,7 +461,7 @@ int scoutfs_forest_get_max_vers(struct super_block *sb,
int ret;
scoutfs_key_init_log_trees(&ltk, 0, 0);
*vers = 0;
*seq = 0;
for (;; scoutfs_key_inc(&ltk)) {
ret = scoutfs_btree_next(sb, &super->logs_root, &ltk, &iref);
@@ -471,8 +469,7 @@ int scoutfs_forest_get_max_vers(struct super_block *sb,
if (iref.val_len == sizeof(struct scoutfs_log_trees)) {
ltk = *iref.key;
lt = iref.val;
*vers = max(*vers,
le64_to_cpu(lt->max_item_vers));
*seq = max(*seq, le64_to_cpu(lt->max_item_seq));
} else {
ret = -EIO;
}
@@ -541,7 +538,7 @@ void scoutfs_forest_init_btrees(struct super_block *sb,
memset(&finf->our_log, 0, sizeof(finf->our_log));
finf->our_log.item_root = lt->item_root;
finf->our_log.bloom_ref = lt->bloom_ref;
finf->our_log.max_item_vers = lt->max_item_vers;
finf->our_log.max_item_seq = lt->max_item_seq;
finf->our_log.rid = lt->rid;
finf->our_log.nr = lt->nr;
finf->srch_file = lt->srch_file;
@@ -571,7 +568,7 @@ void scoutfs_forest_get_btrees(struct super_block *sb,
lt->item_root = finf->our_log.item_root;
lt->bloom_ref = finf->our_log.bloom_ref;
lt->srch_file = finf->srch_file;
lt->max_item_vers = finf->our_log.max_item_vers;
lt->max_item_seq = finf->our_log.max_item_seq;
scoutfs_block_put(sb, finf->srch_bl);
finf->srch_bl = NULL;
@@ -580,6 +577,149 @@ void scoutfs_forest_get_btrees(struct super_block *sb,
&lt->bloom_ref);
}
/*
* Compare input items to merge by their log item value seq when their
* keys match.
*/
static int merge_cmp(void *a_val, int a_val_len, void *b_val, int b_val_len)
{
struct scoutfs_log_item_value *a = a_val;
struct scoutfs_log_item_value *b = b_val;
/* sort merge item by seq */
return scoutfs_cmp(le64_to_cpu(a->seq), le64_to_cpu(b->seq));
}
static bool merge_is_del(void *val, int val_len)
{
struct scoutfs_log_item_value *liv = val;
return !!(liv->flags & SCOUTFS_LOG_ITEM_FLAG_DELETION);
}
#define LOG_MERGE_DELAY_MS (5 * MSEC_PER_SEC)
/*
* Regularly try to get a log merge request from the server. If we get
* a request we walk the log_trees items to find input trees and pass
* them to btree_merge. All of our work is done in dirty blocks
* allocated from available free blocks that the server gave us. If we
* hit an error then we drop our dirty blocks without writing them and
* send an error flag to the server so they can reclaim our allocators
* and ignore the rest of our work.
*/
static void scoutfs_forest_log_merge_worker(struct work_struct *work)
{
struct forest_info *finf = container_of(work, struct forest_info,
log_merge_dwork.work);
struct super_block *sb = finf->sb;
struct scoutfs_btree_root_head *rhead = NULL;
struct scoutfs_btree_root_head *tmp;
struct scoutfs_log_merge_complete comp;
struct scoutfs_log_merge_request req;
struct scoutfs_log_trees *lt;
struct scoutfs_block_writer wri;
struct scoutfs_alloc alloc;
SCOUTFS_BTREE_ITEM_REF(iref);
struct scoutfs_key next;
struct scoutfs_key key;
unsigned long delay;
LIST_HEAD(inputs);
int ret;
ret = scoutfs_client_get_log_merge(sb, &req);
if (ret < 0)
goto resched;
comp.root = req.root;
comp.start = req.start;
comp.end = req.end;
comp.remain = req.end;
comp.rid = req.rid;
comp.seq = req.seq;
comp.flags = 0;
scoutfs_alloc_init(&alloc, &req.meta_avail, &req.meta_freed);
scoutfs_block_writer_init(sb, &wri);
/* find finalized input log trees up to last_seq */
for (scoutfs_key_init_log_trees(&key, 0, 0); ; scoutfs_key_inc(&key)) {
if (!rhead) {
rhead = kmalloc(sizeof(*rhead), GFP_NOFS);
if (!rhead) {
ret = -ENOMEM;
goto out;
}
}
ret = scoutfs_btree_next(sb, &req.logs_root, &key, &iref);
if (ret == 0) {
if (iref.val_len == sizeof(*lt)) {
key = *iref.key;
lt = iref.val;
if ((le64_to_cpu(lt->flags) &
SCOUTFS_LOG_TREES_FINALIZED) &&
(le64_to_cpu(lt->max_item_seq) <=
le64_to_cpu(req.last_seq))) {
rhead->root = lt->item_root;
list_add_tail(&rhead->head, &inputs);
rhead = NULL;
}
} else {
ret = -EIO;
}
scoutfs_btree_put_iref(&iref);
}
if (ret < 0) {
if (ret == -ENOENT) {
ret = 0;
break;
}
goto out;
}
}
/* shouldn't be possible, but it's harmless */
if (list_empty(&inputs)) {
ret = 0;
goto out;
}
ret = scoutfs_btree_merge(sb, &alloc, &wri, &req.start, &req.end,
&next, &comp.root, &inputs, merge_cmp,
merge_is_del,
!!(req.flags & cpu_to_le64(SCOUTFS_LOG_MERGE_REQUEST_SUBTREE)),
sizeof(struct scoutfs_log_item_value),
SCOUTFS_LOG_MERGE_DIRTY_BYTE_LIMIT, 10);
if (ret == -ERANGE) {
comp.remain = next;
le64_add_cpu(&comp.flags, SCOUTFS_LOG_MERGE_COMP_REMAIN);
ret = 0;
}
out:
scoutfs_alloc_prepare_commit(sb, &alloc, &wri);
if (ret == 0)
ret = scoutfs_block_writer_write(sb, &wri);
scoutfs_block_writer_forget_all(sb, &wri);
comp.meta_avail = alloc.avail;
comp.meta_freed = alloc.freed;
if (ret < 0)
le64_add_cpu(&comp.flags, SCOUTFS_LOG_MERGE_COMP_ERROR);
ret = scoutfs_client_commit_log_merge(sb, &comp);
kfree(rhead);
list_for_each_entry_safe(rhead, tmp, &inputs, head)
kfree(rhead);
resched:
delay = ret == 0 ? 0 : msecs_to_jiffies(LOG_MERGE_DELAY_MS);
queue_delayed_work(finf->workq, &finf->log_merge_dwork, delay);
}
int scoutfs_forest_setup(struct super_block *sb)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
@@ -593,10 +733,23 @@ int scoutfs_forest_setup(struct super_block *sb)
}
/* the finf fields will be setup as we open a transaction */
finf->sb = sb;
mutex_init(&finf->mutex);
mutex_init(&finf->srch_mutex);
INIT_DELAYED_WORK(&finf->log_merge_dwork,
scoutfs_forest_log_merge_worker);
sbi->forest_info = finf;
finf->workq = alloc_workqueue("scoutfs_log_merge", WQ_NON_REENTRANT |
WQ_UNBOUND | WQ_HIGHPRI, 0);
if (!finf->workq) {
ret = -ENOMEM;
goto out;
}
queue_delayed_work(finf->workq, &finf->log_merge_dwork,
msecs_to_jiffies(LOG_MERGE_DELAY_MS));
ret = 0;
out:
if (ret)
@@ -605,6 +758,16 @@ out:
return 0;
}
void scoutfs_forest_stop(struct super_block *sb)
{
DECLARE_FOREST_INFO(sb, finf);
if (finf && finf->workq) {
cancel_delayed_work_sync(&finf->log_merge_dwork);
destroy_workqueue(finf->workq);
}
}
void scoutfs_forest_destroy(struct super_block *sb)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
@@ -612,6 +775,7 @@ void scoutfs_forest_destroy(struct super_block *sb)
if (finf) {
scoutfs_block_put(sb, finf->srch_bl);
kfree(finf);
sbi->forest_info = NULL;
}

View File

@@ -23,10 +23,10 @@ int scoutfs_forest_read_items(struct super_block *sb,
scoutfs_forest_item_cb cb, void *arg);
int scoutfs_forest_set_bloom_bits(struct super_block *sb,
struct scoutfs_lock *lock);
void scoutfs_forest_set_max_vers(struct super_block *sb, u64 max_vers);
int scoutfs_forest_get_max_vers(struct super_block *sb,
struct scoutfs_super_block *super,
u64 *vers);
void scoutfs_forest_set_max_seq(struct super_block *sb, u64 max_seq);
int scoutfs_forest_get_max_seq(struct super_block *sb,
struct scoutfs_super_block *super,
u64 *seq);
int scoutfs_forest_insert_list(struct super_block *sb,
struct scoutfs_btree_item_list *lst);
int scoutfs_forest_srch_add(struct super_block *sb, u64 hash, u64 ino, u64 id);
@@ -39,6 +39,7 @@ void scoutfs_forest_get_btrees(struct super_block *sb,
struct scoutfs_log_trees *lt);
int scoutfs_forest_setup(struct super_block *sb);
void scoutfs_forest_stop(struct super_block *sb);
void scoutfs_forest_destroy(struct super_block *sb);
#endif

View File

@@ -195,9 +195,6 @@ struct scoutfs_key {
#define sklt_rid _sk_first
#define sklt_nr _sk_second
/* lock clients */
#define sklc_rid _sk_first
/* seqs */
#define skts_trans_seq _sk_first
#define skts_rid _sk_second
@@ -206,11 +203,12 @@ struct scoutfs_key {
#define skmc_rid _sk_first
/* free extents by blkno */
#define skfb_end _sk_second
#define skfb_len _sk_third
/* free extents by len */
#define skfl_neglen _sk_second
#define skfl_blkno _sk_third
#define skfb_end _sk_first
#define skfb_len _sk_second
/* free extents by order */
#define skfo_revord _sk_first
#define skfo_end _sk_second
#define skfo_len _sk_third
struct scoutfs_avl_root {
__le16 node;
@@ -259,7 +257,7 @@ struct scoutfs_btree_block {
__le16 mid_free_len;
__u8 level;
__u8 __pad[7];
struct scoutfs_btree_item items[0];
struct scoutfs_btree_item items[];
/* leaf blocks have a fixed size item offset hash table at the end */
};
@@ -288,9 +286,10 @@ struct scoutfs_alloc_list_head {
struct scoutfs_block_ref ref;
__le64 total_nr;
__le32 first_nr;
__u8 __pad[4];
__le32 flags;
};
/*
* While the main allocator uses extent items in btree blocks, metadata
* allocations for a single transaction are recorded in arrays in
@@ -307,7 +306,7 @@ struct scoutfs_alloc_list_block {
struct scoutfs_block_ref next;
__le32 start;
__le32 nr;
__le64 blknos[0]; /* naturally aligned for sorting */
__le64 blknos[]; /* naturally aligned for sorting */
};
#define SCOUTFS_ALLOC_LIST_MAX_BLOCKS \
@@ -319,17 +318,25 @@ struct scoutfs_alloc_list_block {
*/
struct scoutfs_alloc_root {
__le64 total_len;
__le32 flags;
__le32 _pad;
struct scoutfs_btree_root root;
};
/* Shared by _alloc_list_head and _alloc_root */
#define SCOUTFS_ALLOC_FLAG_LOW (1U << 0)
/* types of allocators, exposed to alloc_detail ioctl */
#define SCOUTFS_ALLOC_OWNER_NONE 0
#define SCOUTFS_ALLOC_OWNER_SERVER 1
#define SCOUTFS_ALLOC_OWNER_MOUNT 2
#define SCOUTFS_ALLOC_OWNER_SRCH 3
#define SCOUTFS_ALLOC_OWNER_LOG_MERGE 4
struct scoutfs_mounted_client_btree_val {
union scoutfs_inet_addr addr;
__u8 flags;
__u8 __pad[7];
};
#define SCOUTFS_MOUNTED_CLIENT_QUORUM (1 << 0)
@@ -362,7 +369,7 @@ struct scoutfs_srch_file {
struct scoutfs_srch_parent {
struct scoutfs_block_header hdr;
struct scoutfs_block_ref refs[0];
struct scoutfs_block_ref refs[];
};
#define SCOUTFS_SRCH_PARENT_REFS \
@@ -377,7 +384,7 @@ struct scoutfs_srch_block {
struct scoutfs_srch_entry tail;
__le32 entry_nr;
__le32 entry_bytes;
__u8 entries[0];
__u8 entries[];
};
/*
@@ -430,6 +437,10 @@ struct scoutfs_srch_compact {
/* client -> server: compaction failed */
#define SCOUTFS_SRCH_COMPACT_FLAG_ERROR (1 << 5)
#define SCOUTFS_DATA_ALLOC_MAX_ZONES 1024
#define SCOUTFS_DATA_ALLOC_ZONE_BYTES DIV_ROUND_UP(SCOUTFS_DATA_ALLOC_MAX_ZONES, 8)
#define SCOUTFS_DATA_ALLOC_ZONE_LE64S DIV_ROUND_UP(SCOUTFS_DATA_ALLOC_MAX_ZONES, 64)
/*
* XXX I imagine we should rename these now that they've evolved to track
* all the btrees that clients use during a transaction. It's not just
@@ -443,16 +454,21 @@ struct scoutfs_log_trees {
struct scoutfs_alloc_root data_avail;
struct scoutfs_alloc_root data_freed;
struct scoutfs_srch_file srch_file;
__le64 max_item_vers;
__le64 data_alloc_zone_blocks;
__le64 data_alloc_zones[SCOUTFS_DATA_ALLOC_ZONE_LE64S];
__le64 max_item_seq;
__le64 rid;
__le64 nr;
__le64 flags;
};
#define SCOUTFS_LOG_TREES_FINALIZED (1ULL << 0)
struct scoutfs_log_item_value {
__le64 vers;
__le64 seq;
__u8 flags;
__u8 __pad[7];
__u8 data[0];
__u8 data[];
};
/*
@@ -467,7 +483,7 @@ struct scoutfs_log_item_value {
struct scoutfs_bloom_block {
struct scoutfs_block_header hdr;
__le64 total_set;
__le64 bits[0];
__le64 bits[];
};
/*
@@ -484,27 +500,105 @@ struct scoutfs_bloom_block {
member_sizeof(struct scoutfs_bloom_block, bits[0]) * 8)
#define SCOUTFS_FOREST_BLOOM_FUNC_BITS (SCOUTFS_BLOCK_LG_SHIFT + 3)
/*
* A private server btree item which records the status of a log merge
* operation that is in progress.
*/
struct scoutfs_log_merge_status {
struct scoutfs_key next_range_key;
__le64 nr_requests;
__le64 nr_complete;
__le64 last_seq;
__le64 seq;
};
/*
* A request is sent to the client and stored in a server btree item to
* record resources that would be reclaimed if the client failed. It
* has all the inputs needed for the client to perform its portion of a
* merge.
*/
struct scoutfs_log_merge_request {
struct scoutfs_alloc_list_head meta_avail;
struct scoutfs_alloc_list_head meta_freed;
struct scoutfs_btree_root logs_root;
struct scoutfs_btree_root root;
struct scoutfs_key start;
struct scoutfs_key end;
__le64 last_seq;
__le64 rid;
__le64 seq;
__le64 flags;
};
/* request root is subtree of fs root at parent, restricted merging modifications */
#define SCOUTFS_LOG_MERGE_REQUEST_SUBTREE (1ULL << 0)
/*
* The output of a client's merge of log btree items into a subtree
* rooted at a parent in the fs_root. The client sends it to the
* server, who stores it in a btree item for later splicing/rebalancing.
*/
struct scoutfs_log_merge_complete {
struct scoutfs_alloc_list_head meta_avail;
struct scoutfs_alloc_list_head meta_freed;
struct scoutfs_btree_root root;
struct scoutfs_key start;
struct scoutfs_key end;
struct scoutfs_key remain;
__le64 rid;
__le64 seq;
__le64 flags;
};
/* merge failed, ignore completion and reclaim stored request */
#define SCOUTFS_LOG_MERGE_COMP_ERROR (1ULL << 0)
/* merge didn't complete range, restart from remain */
#define SCOUTFS_LOG_MERGE_COMP_REMAIN (1ULL << 1)
/*
* Range items record the ranges of the fs keyspace that still need to
* be merged. They're added as a merge starts, removed as requests are
* sent and added back if the request didn't consume its entire range.
*/
struct scoutfs_log_merge_range {
struct scoutfs_key start;
struct scoutfs_key end;
};
struct scoutfs_log_merge_freeing {
struct scoutfs_btree_root root;
struct scoutfs_key key;
__le64 seq;
};
/*
* Keys are first sorted by major key zones.
*/
#define SCOUTFS_INODE_INDEX_ZONE 1
#define SCOUTFS_RID_ZONE 2
#define SCOUTFS_ORPHAN_ZONE 2
#define SCOUTFS_FS_ZONE 3
#define SCOUTFS_LOCK_ZONE 4
/* Items only stored in server btrees */
#define SCOUTFS_LOG_TREES_ZONE 6
#define SCOUTFS_LOCK_CLIENTS_ZONE 7
#define SCOUTFS_TRANS_SEQ_ZONE 8
#define SCOUTFS_MOUNTED_CLIENT_ZONE 9
#define SCOUTFS_SRCH_ZONE 10
#define SCOUTFS_FREE_EXTENT_ZONE 11
#define SCOUTFS_TRANS_SEQ_ZONE 7
#define SCOUTFS_MOUNTED_CLIENT_ZONE 8
#define SCOUTFS_SRCH_ZONE 9
#define SCOUTFS_FREE_EXTENT_BLKNO_ZONE 10
#define SCOUTFS_FREE_EXTENT_ORDER_ZONE 11
/* Items only stored in log merge server btrees */
#define SCOUTFS_LOG_MERGE_STATUS_ZONE 12
#define SCOUTFS_LOG_MERGE_RANGE_ZONE 13
#define SCOUTFS_LOG_MERGE_REQUEST_ZONE 14
#define SCOUTFS_LOG_MERGE_COMPLETE_ZONE 15
#define SCOUTFS_LOG_MERGE_FREEING_ZONE 16
/* inode index zone */
#define SCOUTFS_INODE_INDEX_META_SEQ_TYPE 1
#define SCOUTFS_INODE_INDEX_DATA_SEQ_TYPE 2
#define SCOUTFS_INODE_INDEX_NR 3 /* don't forget to update */
/* rid zone (also used in server alloc btree) */
/* orphan zone, redundant type used for clarity */
#define SCOUTFS_ORPHAN_TYPE 1
/* fs zone */
@@ -525,10 +619,6 @@ struct scoutfs_bloom_block {
#define SCOUTFS_SRCH_PENDING_TYPE 3
#define SCOUTFS_SRCH_BUSY_TYPE 4
/* free extents in allocator btrees in client and server, by blkno or len */
#define SCOUTFS_FREE_EXTENT_BLKNO_TYPE 1
#define SCOUTFS_FREE_EXTENT_LEN_TYPE 2
/* file data extents have start and len in key */
struct scoutfs_data_extent_val {
__le64 blkno;
@@ -549,7 +639,7 @@ struct scoutfs_xattr {
__le16 val_len;
__u8 name_len;
__u8 __pad[5];
__u8 name[0];
__u8 name[];
};
@@ -586,6 +676,12 @@ struct scoutfs_xattr {
#define SCOUTFS_QUORUM_HB_IVAL_MS 100
#define SCOUTFS_QUORUM_HB_TIMEO_MS (5 * MSEC_PER_SEC)
/*
* A newly elected leader will give fencing some time before giving up and
* shutting down.
*/
#define SCOUTFS_QUORUM_FENCE_TO_MS (15 * MSEC_PER_SEC)
struct scoutfs_quorum_message {
__le64 fsid;
__le64 version;
@@ -617,18 +713,60 @@ struct scoutfs_quorum_config {
} slots[SCOUTFS_QUORUM_MAX_SLOTS];
};
struct scoutfs_quorum_block {
struct scoutfs_block_header hdr;
__le64 term;
__le64 random_write_mark;
__le64 flags;
struct scoutfs_quorum_block_event {
__le64 rid;
struct scoutfs_timespec ts;
} write, update_term, set_leader, clear_leader, fenced;
enum {
SCOUTFS_QUORUM_EVENT_BEGIN, /* quorum service starting up */
SCOUTFS_QUORUM_EVENT_TERM, /* updated persistent term */
SCOUTFS_QUORUM_EVENT_ELECT, /* won election */
SCOUTFS_QUORUM_EVENT_FENCE, /* server fenced others */
SCOUTFS_QUORUM_EVENT_STOP, /* server stopped */
SCOUTFS_QUORUM_EVENT_END, /* quorum service shutting down */
SCOUTFS_QUORUM_EVENT_NR,
};
#define SCOUTFS_QUORUM_BLOCK_LEADER (1 << 0)
struct scoutfs_quorum_block {
struct scoutfs_block_header hdr;
struct scoutfs_quorum_block_event {
__le64 rid;
__le64 term;
struct scoutfs_timespec ts;
} events[SCOUTFS_QUORUM_EVENT_NR];
};
/*
* Tunable options that apply to the entire system. They can be set in
* mkfs or in sysfs files which send an rpc to the server to make the
* change. The super version defines the options that exist.
*
* @set_bits: a bit for each 64bit field that follows set_bits,
* indicating which logical options are set.
*
* @data_alloc_zone_blocks: if set, the data device is logically divided
* into contiguous zones of this many blocks. Data allocation will try
* and isolate allocated extents for each mount to their own zone. The
* zone size must be larger than the data alloc high water mark and
* large enough such that the number of zones is kept within its static
* limit.
*/
struct scoutfs_volume_options {
__le64 set_bits;
__le64 data_alloc_zone_blocks;
__le64 __future_expansion[63];
};
#define scoutfs_volopt_nr(field) \
((offsetof(struct scoutfs_volume_options, field) - \
(offsetof(struct scoutfs_volume_options, set_bits) + \
member_sizeof(struct scoutfs_volume_options, set_bits))) / sizeof(__le64))
#define scoutfs_volopt_bit(field) \
(1ULL << scoutfs_volopt_nr(field))
#define SCOUTFS_VOLOPT_DATA_ALLOC_ZONE_BLOCKS_NR \
scoutfs_volopt_nr(data_alloc_zone_blocks)
#define SCOUTFS_VOLOPT_DATA_ALLOC_ZONE_BLOCKS_BIT \
scoutfs_volopt_bit(data_alloc_zone_blocks)
#define SCOUTFS_VOLOPT_EXPANSION_BITS \
(~(scoutfs_volopt_bit(__future_expansion) - 1))
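As an aside, here is a minimal userspace-style sketch (not from the patch) of how the set_bits field pairs with the option fields: an option's value is only meaningful when its corresponding bit is set. The zone_blocks_set() helper name, the local member_sizeof() definition, and the little-endian le64_to_cpu() stand-in are assumptions for illustration.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t __le64;		/* sketch only: assumes a little-endian host */
#define le64_to_cpu(x)	(x)
#define member_sizeof(type, memb) (sizeof(((type *)0)->memb))

struct scoutfs_volume_options {
	__le64 set_bits;
	__le64 data_alloc_zone_blocks;
	__le64 __future_expansion[63];
};

#define scoutfs_volopt_nr(field)						\
	((offsetof(struct scoutfs_volume_options, field) -			\
	  (offsetof(struct scoutfs_volume_options, set_bits) +			\
	   member_sizeof(struct scoutfs_volume_options, set_bits))) / sizeof(__le64))
#define scoutfs_volopt_bit(field)	(1ULL << scoutfs_volopt_nr(field))

/* hypothetical helper: return true and the zone size only if the option is set */
static bool zone_blocks_set(const struct scoutfs_volume_options *volopt, uint64_t *blocks)
{
	if (!(le64_to_cpu(volopt->set_bits) &
	      scoutfs_volopt_bit(data_alloc_zone_blocks)))
		return false;

	*blocks = le64_to_cpu(volopt->data_alloc_zone_blocks);
	return true;
}

With this layout scoutfs_volopt_nr(data_alloc_zone_blocks) evaluates to 0, so its bit is 1ULL << 0, and fields carved out of __future_expansion later would take the following bit positions.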
#define SCOUTFS_FLAG_IS_META_BDEV 0x01
@@ -638,8 +776,8 @@ struct scoutfs_super_block {
__le64 version;
__le64 flags;
__u8 uuid[SCOUTFS_UUID_BYTES];
__le64 seq;
__le64 next_ino;
__le64 next_trans_seq;
__le64 total_meta_blocks; /* both static and dynamic */
__le64 first_meta_blkno; /* first dynamically allocated */
__le64 last_meta_blkno;
@@ -653,10 +791,11 @@ struct scoutfs_super_block {
struct scoutfs_alloc_list_head server_meta_freed[2];
struct scoutfs_btree_root fs_root;
struct scoutfs_btree_root logs_root;
struct scoutfs_btree_root lock_clients;
struct scoutfs_btree_root log_merge;
struct scoutfs_btree_root trans_seqs;
struct scoutfs_btree_root mounted_clients;
struct scoutfs_btree_root srch_root;
struct scoutfs_volume_options volopt;
};
#define SCOUTFS_ROOT_INO 1
@@ -682,7 +821,6 @@ struct scoutfs_super_block {
* online by staging.
*
* XXX
* - otime?
* - compat flags?
* - version?
* - generation?
@@ -706,6 +844,7 @@ struct scoutfs_inode {
struct scoutfs_timespec atime;
struct scoutfs_timespec ctime;
struct scoutfs_timespec mtime;
struct scoutfs_timespec crtime;
};
#define SCOUTFS_INO_FLAG_TRUNCATE 0x1
@@ -729,7 +868,7 @@ struct scoutfs_dirent {
__le64 pos;
__u8 type;
__u8 __pad[7];
__u8 name[0];
__u8 name[];
};
#define SCOUTFS_NAME_LEN 255
@@ -827,7 +966,7 @@ struct scoutfs_net_header {
__u8 flags;
__u8 error;
__u8 __pad[3];
__u8 data[0];
__u8 data[];
};
#define SCOUTFS_NET_FLAG_RESPONSE (1 << 0)
@@ -845,6 +984,12 @@ enum scoutfs_net_cmd {
SCOUTFS_NET_CMD_LOCK_RECOVER,
SCOUTFS_NET_CMD_SRCH_GET_COMPACT,
SCOUTFS_NET_CMD_SRCH_COMMIT_COMPACT,
SCOUTFS_NET_CMD_GET_LOG_MERGE,
SCOUTFS_NET_CMD_COMMIT_LOG_MERGE,
SCOUTFS_NET_CMD_OPEN_INO_MAP,
SCOUTFS_NET_CMD_GET_VOLOPT,
SCOUTFS_NET_CMD_SET_VOLOPT,
SCOUTFS_NET_CMD_CLEAR_VOLOPT,
SCOUTFS_NET_CMD_FAREWELL,
SCOUTFS_NET_CMD_UNKNOWN,
};
@@ -889,21 +1034,16 @@ struct scoutfs_net_roots {
struct scoutfs_net_lock {
struct scoutfs_key key;
__le64 write_version;
__le64 write_seq;
__u8 old_mode;
__u8 new_mode;
__u8 __pad[6];
};
struct scoutfs_net_lock_grant_response {
struct scoutfs_net_lock nl;
struct scoutfs_net_roots roots;
};
struct scoutfs_net_lock_recover {
__le16 nr;
__u8 __pad[6];
struct scoutfs_net_lock locks[0];
struct scoutfs_net_lock locks[];
};
#define SCOUTFS_NET_LOCK_MAX_RECOVER_NR \
@@ -970,4 +1110,42 @@ enum scoutfs_corruption_sources {
#define SC_NR_LONGS DIV_ROUND_UP(SC_NR_SOURCES, BITS_PER_LONG)
#define SCOUTFS_OPEN_INO_MAP_SHIFT 10
#define SCOUTFS_OPEN_INO_MAP_BITS (1 << SCOUTFS_OPEN_INO_MAP_SHIFT)
#define SCOUTFS_OPEN_INO_MAP_MASK (SCOUTFS_OPEN_INO_MAP_BITS - 1)
#define SCOUTFS_OPEN_INO_MAP_LE64S (SCOUTFS_OPEN_INO_MAP_BITS / 64)
/*
* The request and response conversation is as follows:
*
* client[init] -> server:
* group_nr = G
* req_id = 0 (I)
* server -> client[*]
* group_nr = G
* req_id = R
* client[*] -> server
* group_nr = G (I)
* req_id = R
* bits
* server -> client[init]
* group_nr = G (I)
* req_id = R (I)
* bits
*
* Many of the fields in individual messages are ignored ("I") because
* the net id or the omap req_id can be used to identify the
* conversation. We always include them on the wire to make inspected
* messages easier to follow.
*/
struct scoutfs_open_ino_map_args {
__le64 group_nr;
__le64 req_id;
};
struct scoutfs_open_ino_map {
struct scoutfs_open_ino_map_args args;
__le64 bits[SCOUTFS_OPEN_INO_MAP_LE64S];
};
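To make the group/bit math concrete, here is a small sketch (not from the patch) of how an inode number maps into an open ino map; it mirrors the lookup the orphan scanner performs later in this patch. The uint64_t fields and the test_bit_le() stand-in assume a little-endian host, and ino_open_in_map() is a hypothetical helper.

#include <stdbool.h>
#include <stdint.h>

#define SCOUTFS_OPEN_INO_MAP_SHIFT	10
#define SCOUTFS_OPEN_INO_MAP_BITS	(1 << SCOUTFS_OPEN_INO_MAP_SHIFT)
#define SCOUTFS_OPEN_INO_MAP_MASK	(SCOUTFS_OPEN_INO_MAP_BITS - 1)
#define SCOUTFS_OPEN_INO_MAP_LE64S	(SCOUTFS_OPEN_INO_MAP_BITS / 64)

struct scoutfs_open_ino_map_args {
	uint64_t group_nr;
	uint64_t req_id;
};

struct scoutfs_open_ino_map {
	struct scoutfs_open_ino_map_args args;
	uint64_t bits[SCOUTFS_OPEN_INO_MAP_LE64S];
};

/* stand-in for the kernel's test_bit_le(), little-endian host assumed */
static bool test_bit_le(int nr, const uint64_t *bits)
{
	return (bits[nr / 64] >> (nr % 64)) & 1;
}

/* hypothetical helper: is ino recorded as open in this map's group? */
static bool ino_open_in_map(uint64_t ino, const struct scoutfs_open_ino_map *map)
{
	uint64_t group_nr = ino >> SCOUTFS_OPEN_INO_MAP_SHIFT;
	int bit_nr = ino & SCOUTFS_OPEN_INO_MAP_MASK;

	return map->args.group_nr == group_nr && test_bit_le(bit_nr, map->bits);
}

Each map covers SCOUTFS_OPEN_INO_MAP_BITS (1024) inode numbers, so group 0 covers inos 0..1023, group 1 covers 1024..2047, and so on.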
#endif

View File

@@ -33,6 +33,8 @@
#include "item.h"
#include "client.h"
#include "cmp.h"
#include "omap.h"
#include "forest.h"
/*
* XXX
@@ -53,10 +55,19 @@ struct inode_allocator {
};
struct inode_sb_info {
struct super_block *sb;
bool stopped;
spinlock_t writeback_lock;
struct rb_root writeback_inodes;
struct inode_allocator dir_ino_alloc;
struct inode_allocator ino_alloc;
struct delayed_work orphan_scan_dwork;
/* serialize multiple inode ->evict trying to delete same ino's items */
spinlock_t deleting_items_lock;
struct list_head deleting_items_list;
};
#define DECLARE_INODE_SB_INFO(sb, name) \
@@ -82,6 +93,8 @@ static void scoutfs_inode_ctor(void *obj)
init_waitqueue_head(&si->data_waitq.waitq);
init_rwsem(&si->xattr_rwsem);
RB_CLEAR_NODE(&si->writeback_node);
scoutfs_lock_init_coverage(&si->ino_lock_cov);
atomic_set(&si->inv_iput_count, 0);
inode_init_once(&si->inode);
}
@@ -141,12 +154,15 @@ static void remove_writeback_inode(struct inode_sb_info *inf,
void scoutfs_destroy_inode(struct inode *inode)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
DECLARE_INODE_SB_INFO(inode->i_sb, inf);
spin_lock(&inf->writeback_lock);
remove_writeback_inode(inf, SCOUTFS_I(inode));
spin_unlock(&inf->writeback_lock);
scoutfs_lock_del_coverage(inode->i_sb, &si->ino_lock_cov);
call_rcu(&inode->i_rcu, scoutfs_i_callback);
}
@@ -182,7 +198,8 @@ static void set_inode_ops(struct inode *inode)
inode->i_fop = &scoutfs_file_fops;
break;
case S_IFDIR:
inode->i_op = &scoutfs_dir_iops;
inode->i_op = &scoutfs_dir_iops.ops;
inode->i_flags |= S_IOPS_WRAPPER;
inode->i_fop = &scoutfs_dir_fops;
break;
case S_IFLNK:
@@ -245,6 +262,8 @@ static void load_inode(struct inode *inode, struct scoutfs_inode *cinode)
si->next_readdir_pos = le64_to_cpu(cinode->next_readdir_pos);
si->next_xattr_id = le64_to_cpu(cinode->next_xattr_id);
si->flags = le32_to_cpu(cinode->flags);
si->crtime.tv_sec = le64_to_cpu(cinode->crtime.sec);
si->crtime.tv_nsec = le32_to_cpu(cinode->crtime.nsec);
/*
* i_blocks is initialized from online and offline and is then
@@ -306,6 +325,8 @@ int scoutfs_inode_refresh(struct inode *inode, struct scoutfs_lock *lock,
if (ret == 0) {
load_inode(inode, &sinode);
atomic64_set(&si->last_refreshed, refresh_gen);
scoutfs_lock_add_coverage(sb, lock, &si->ino_lock_cov);
si->drop_invalidated = false;
}
} else {
ret = 0;
@@ -343,7 +364,7 @@ static int set_inode_size(struct inode *inode, struct scoutfs_lock *lock,
if (!S_ISREG(inode->i_mode))
return 0;
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, true);
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, true, false);
if (ret)
return ret;
@@ -370,7 +391,7 @@ static int clear_truncate_flag(struct inode *inode, struct scoutfs_lock *lock)
LIST_HEAD(ind_locks);
int ret;
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, false);
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, false, false);
if (ret)
return ret;
@@ -485,7 +506,7 @@ retry:
}
}
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, false);
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, false, false);
if (ret)
goto out;
@@ -639,12 +660,22 @@ void scoutfs_inode_get_onoff(struct inode *inode, s64 *on, s64 *off)
} while (read_seqcount_retry(&si->seqcount, seq));
}
/*
* We have inversions between getting cluster locks while performing
* final deletion on a freeing inode and waiting on a freeing inode
* while holding a cluster lock.
*
* We can avoid these deadlocks by hiding freeing inodes in our hash
* lookup function. We're fine with either returning null or populating
* a new inode overlapping with eviction freeing a previous instance of
* the inode.
*/
static int scoutfs_iget_test(struct inode *inode, void *arg)
{
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
u64 *ino = arg;
return si->ino == *ino;
return (si->ino == *ino) && !(inode->i_state & I_FREEING);
}
static int scoutfs_iget_set(struct inode *inode, void *arg)
@@ -687,6 +718,8 @@ struct inode *scoutfs_iget(struct super_block *sb, u64 ino)
atomic64_set(&si->last_refreshed, 0);
ret = scoutfs_inode_refresh(inode, lock, 0);
if (ret == 0)
ret = scoutfs_omap_inc(sb, ino);
if (ret) {
iget_failed(inode);
inode = ERR_PTR(ret);
@@ -733,6 +766,9 @@ static void store_inode(struct scoutfs_inode *cinode, struct inode *inode)
cinode->next_readdir_pos = cpu_to_le64(si->next_readdir_pos);
cinode->next_xattr_id = cpu_to_le64(si->next_xattr_id);
cinode->flags = cpu_to_le32(si->flags);
cinode->crtime.sec = cpu_to_le64(si->crtime.tv_sec);
cinode->crtime.nsec = cpu_to_le32(si->crtime.tv_nsec);
memset(cinode->crtime.__pad, 0, sizeof(cinode->crtime.__pad));
}
/*
@@ -1186,7 +1222,7 @@ int scoutfs_inode_index_start(struct super_block *sb, u64 *seq)
* Returns > 0 if the seq changed and the locks should be retried.
*/
int scoutfs_inode_index_try_lock_hold(struct super_block *sb,
struct list_head *list, u64 seq)
struct list_head *list, u64 seq, bool allocing)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct index_lock *ind_lock;
@@ -1202,7 +1238,7 @@ int scoutfs_inode_index_try_lock_hold(struct super_block *sb,
goto out;
}
ret = scoutfs_hold_trans(sb);
ret = scoutfs_hold_trans(sb, allocing);
if (ret == 0 && seq != sbi->trans_seq) {
scoutfs_release_trans(sb);
ret = 1;
@@ -1216,7 +1252,7 @@ out:
}
int scoutfs_inode_index_lock_hold(struct inode *inode, struct list_head *list,
bool set_data_seq)
bool set_data_seq, bool allocing)
{
struct super_block *sb = inode->i_sb;
int ret;
@@ -1226,7 +1262,7 @@ int scoutfs_inode_index_lock_hold(struct inode *inode, struct list_head *list,
ret = scoutfs_inode_index_start(sb, &seq) ?:
scoutfs_inode_index_prepare(sb, list, inode,
set_data_seq) ?:
scoutfs_inode_index_try_lock_hold(sb, list, seq);
scoutfs_inode_index_try_lock_hold(sb, list, seq, allocing);
} while (ret > 0);
return ret;
@@ -1383,6 +1419,8 @@ struct inode *scoutfs_new_inode(struct super_block *sb, struct inode *dir,
si->next_xattr_id = 0;
si->have_item = false;
atomic64_set(&si->last_refreshed, lock->refresh_gen);
scoutfs_lock_add_coverage(sb, lock, &si->ino_lock_cov);
si->drop_invalidated = false;
si->flags = 0;
scoutfs_inode_set_meta_seq(inode);
@@ -1398,52 +1436,115 @@ struct inode *scoutfs_new_inode(struct super_block *sb, struct inode *dir,
store_inode(&sinode, inode);
init_inode_key(&key, scoutfs_ino(inode));
ret = scoutfs_omap_inc(sb, ino);
if (ret < 0)
goto out;
ret = scoutfs_item_create(sb, &key, &sinode, sizeof(sinode), lock);
if (ret < 0)
scoutfs_omap_dec(sb, ino);
out:
if (ret) {
iput(inode);
return ERR_PTR(ret);
inode = ERR_PTR(ret);
}
return inode;
}
static void init_orphan_key(struct scoutfs_key *key, u64 rid, u64 ino)
static void init_orphan_key(struct scoutfs_key *key, u64 ino)
{
*key = (struct scoutfs_key) {
.sk_zone = SCOUTFS_RID_ZONE,
.sko_rid = cpu_to_le64(rid),
.sk_type = SCOUTFS_ORPHAN_TYPE,
.sk_zone = SCOUTFS_ORPHAN_ZONE,
.sko_ino = cpu_to_le64(ino),
.sk_type = SCOUTFS_ORPHAN_TYPE,
};
}
static int remove_orphan_item(struct super_block *sb, u64 ino)
/*
* Create an orphan item. The orphan items are maintained in their own
* zone under a write only lock while the caller has the inode protected
* by a write lock.
*/
int scoutfs_inode_orphan_create(struct super_block *sb, u64 ino, struct scoutfs_lock *lock)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct scoutfs_lock *lock = sbi->rid_lock;
struct scoutfs_key key;
int ret;
init_orphan_key(&key, sbi->rid, ino);
init_orphan_key(&key, ino);
ret = scoutfs_item_delete(sb, &key, lock);
if (ret == -ENOENT)
ret = 0;
return scoutfs_item_create_force(sb, &key, NULL, 0, lock);
}
return ret;
int scoutfs_inode_orphan_delete(struct super_block *sb, u64 ino, struct scoutfs_lock *lock)
{
struct scoutfs_key key;
init_orphan_key(&key, ino);
return scoutfs_item_delete_force(sb, &key, lock);
}
struct deleting_ino_entry {
struct list_head head;
u64 ino;
};
static bool added_deleting_ino(struct inode_sb_info *inf, struct deleting_ino_entry *del, u64 ino)
{
struct deleting_ino_entry *tmp;
bool added = true;
spin_lock(&inf->deleting_items_lock);
list_for_each_entry(tmp, &inf->deleting_items_list, head) {
if (tmp->ino == ino) {
added = false;
break;
}
}
if (added) {
del->ino = ino;
list_add_tail(&del->head, &inf->deleting_items_list);
}
spin_unlock(&inf->deleting_items_lock);
return added;
}
static void del_deleting_ino(struct inode_sb_info *inf, struct deleting_ino_entry *del)
{
if (del->ino) {
spin_lock(&inf->deleting_items_lock);
list_del_init(&del->head);
spin_unlock(&inf->deleting_items_lock);
}
}
/*
* Remove all the items associated with a given inode. This is only
* called once nlink has dropped to zero so we don't have to worry about
* dirents referencing the inode or link backrefs. Dropping nlink to 0
* also created an orphan item. That orphan item will continue
* triggering attempts to finish previous partial deletion until all
* deletion is complete and the orphan item is removed.
* called once nlink has dropped to zero and nothing has the inode open
* so we don't have to worry about dirents referencing the inode or link
* backrefs. Dropping nlink to 0 also created an orphan item. That
* orphan item will continue triggering attempts to finish previous
* partial deletion until all deletion is complete and the orphan item
* is removed.
*
* Currently this can be called multiple times for multiple cached
* inodes for a given ino number (ilookup skips freeing inodes to avoid
* cluster lock<->inode flag waiting inversions). Some items are not
* safe to delete concurrently, for example concurrent data truncation
* could free extents multiple times. We use a very silly list of inos
* being deleted. Duplicates just return success. If the first
* deletion ends up failing orphan deletion will come back around later
* and retry.
*/
static int delete_inode_items(struct super_block *sb, u64 ino)
static int delete_inode_items(struct super_block *sb, u64 ino, struct scoutfs_lock *lock,
struct scoutfs_lock *orph_lock)
{
struct scoutfs_lock *lock = NULL;
DECLARE_INODE_SB_INFO(sb, inf);
struct deleting_ino_entry del = {{NULL, }};
struct scoutfs_inode sinode;
struct scoutfs_key key;
LIST_HEAD(ind_locks);
@@ -1453,9 +1554,10 @@ static int delete_inode_items(struct super_block *sb, u64 ino)
u64 size;
int ret;
ret = scoutfs_lock_ino(sb, SCOUTFS_LOCK_WRITE, 0, ino, &lock);
if (ret)
return ret;
if (!added_deleting_ino(inf, &del, ino)) {
ret = 0;
goto out;
}
init_inode_key(&key, ino);
@@ -1494,7 +1596,7 @@ static int delete_inode_items(struct super_block *sb, u64 ino)
retry:
ret = scoutfs_inode_index_start(sb, &ind_seq) ?:
prepare_index_deletion(sb, &ind_locks, ino, mode, &sinode) ?:
scoutfs_inode_index_try_lock_hold(sb, &ind_locks, ind_seq);
scoutfs_inode_index_try_lock_hold(sb, &ind_locks, ind_seq, false);
if (ret > 0)
goto retry;
if (ret)
@@ -1516,23 +1618,36 @@ retry:
if (ret)
goto out;
ret = remove_orphan_item(sb, ino);
ret = scoutfs_inode_orphan_delete(sb, ino, orph_lock);
out:
del_deleting_ino(inf, &del);
if (release)
scoutfs_release_trans(sb);
scoutfs_inode_index_unlock(sb, &ind_locks);
scoutfs_unlock(sb, lock, SCOUTFS_LOCK_WRITE);
return ret;
}
/*
* iput_final has already written out the dirty pages to the inode
* before we get here. We're left with a clean inode that we have to
* tear down. If there are no more links to the inode then we also
* remove all its persistent structures.
* tear down. We use locking and open inode number bitmaps to decide if
* we should finally destroy an inode that is no longer open nor
* reachable through directory entries.
*
* Because lookup ignores freeing inodes we can get here from multiple
* instances of an inode that is being deleted. Orphan scanning in
* particular can race with deletion. delete_inode_items() resolves
* concurrent attempts.
*/
void scoutfs_evict_inode(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
const u64 ino = scoutfs_ino(inode);
struct scoutfs_lock *orph_lock;
struct scoutfs_lock *lock;
int ret;
trace_scoutfs_evict_inode(inode->i_sb, scoutfs_ino(inode),
inode->i_nlink, is_bad_inode(inode));
@@ -1541,83 +1656,190 @@ void scoutfs_evict_inode(struct inode *inode)
truncate_inode_pages_final(&inode->i_data);
if (inode->i_nlink == 0)
delete_inode_items(inode->i_sb, scoutfs_ino(inode));
ret = scoutfs_omap_should_delete(sb, inode, &lock, &orph_lock);
if (ret > 0) {
ret = delete_inode_items(inode->i_sb, scoutfs_ino(inode), lock, orph_lock);
scoutfs_unlock(sb, lock, SCOUTFS_LOCK_WRITE);
scoutfs_unlock(sb, orph_lock, SCOUTFS_LOCK_WRITE_ONLY);
}
if (ret == -ERESTARTSYS) {
/* can be in a task with a pending signal, the inode can still be found as an orphan */
scoutfs_inc_counter(sb, inode_evict_intr);
ret = 0;
}
if (ret < 0) {
scoutfs_err(sb, "error %d while checking to delete inode nr %llu, it might linger.",
ret, ino);
}
scoutfs_omap_dec(sb, ino);
clear:
clear_inode(inode);
}
/*
* We want to remove inodes from the cache as their count goes to 0 if
* they're no longer covered by a cluster lock or if while locked they
* were unlinked.
*
* We don't want unused cached inodes to linger outside of cluster
* locking so that they don't prevent final inode deletion on other
* nodes. We don't have specific per-inode or per-dentry locks which
* would otherwise remove the stale caches as they're invalidated.
* Stale cached inodes provide little value because they're going to be
* refreshed the next time they're locked. Populating the item cache
* and loading the inode item is a lot more expensive than initializing
* and inserting a newly allocated vfs inode.
*/
int scoutfs_drop_inode(struct inode *inode)
{
int ret = generic_drop_inode(inode);
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
struct super_block *sb = inode->i_sb;
trace_scoutfs_drop_inode(inode->i_sb, scoutfs_ino(inode),
inode->i_nlink, inode_unhashed(inode));
return ret;
trace_scoutfs_drop_inode(sb, scoutfs_ino(inode), inode->i_nlink, inode_unhashed(inode),
si->drop_invalidated);
return si->drop_invalidated || !scoutfs_lock_is_covered(sb, &si->ino_lock_cov) ||
generic_drop_inode(inode);
}
/*
* Find orphan items and process each one.
*
* Runtime of this will be bounded by the number of orphans, which could
* theoretically be very large. If that becomes a problem we might want to push
* this work off to a thread.
*
* This only scans orphans for this node. This will need to be covered by
* the rest of node zone cleanup.
* All mounts are performing this work concurrently. We introduce
* significant jitter between them to try and keep them from all
* bunching up and working on the same inodes.
*/
int scoutfs_scan_orphans(struct super_block *sb)
static void schedule_orphan_dwork(struct inode_sb_info *inf)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct scoutfs_lock *lock = sbi->rid_lock;
struct scoutfs_key key;
#define ORPHAN_SCAN_MIN_MS (10 * MSEC_PER_SEC)
#define ORPHAN_SCAN_JITTER_MS (40 * MSEC_PER_SEC)
unsigned long delay = msecs_to_jiffies(ORPHAN_SCAN_MIN_MS +
prandom_u32_max(ORPHAN_SCAN_JITTER_MS));
if (!inf->stopped) {
delay = msecs_to_jiffies(ORPHAN_SCAN_MIN_MS +
prandom_u32_max(ORPHAN_SCAN_JITTER_MS));
schedule_delayed_work(&inf->orphan_scan_dwork, delay);
}
}
/*
* Find and delete inodes whose only remaining reference is the
* persistent orphan item that was created as they were unlinked.
*
* Orphan items are created as the final directory entry referring to an
* inode is deleted. They're deleted as the final cached inode is
* evicted and the inode items are destroyed. They can linger if all
* the cached inodes pinning the inode fail to delete as they are
* evicted from the cache -- either through crashing or errors.
*
* This work runs in all mounts in the background looking for orphaned
* inodes that should be deleted.
*
* We use the forest hint call to read the persistent forest trees
* looking for orphan items without creating lock contention. Orphan
* items exist for O_TMPFILE users and we don't want to force them to
* commit by trying to acquire a conflicting read lock on the orphan zone.
* There's no rush to reclaim deleted items, eventually they will be
* found in the persistent item btrees.
*
* Once we find candidate orphan items we can first check our local
* inode cache for inodes that are already on their way to eviction and
* can be skipped. Then we ask the server for the open map containing
* the inode. Only if we don't have it cached, and no one else does, do
* we try and read it into our cache and evict it to trigger the final
* inode deletion process.
*
* Orphaned items that make it that far should be very rare. They can
* only exist if all the mounts that were using an inode after it had
* been unlinked (or created with o_tmpfile) didn't unmount cleanly.
*/
static void inode_orphan_scan_worker(struct work_struct *work)
{
struct inode_sb_info *inf = container_of(work, struct inode_sb_info,
orphan_scan_dwork.work);
struct super_block *sb = inf->sb;
struct scoutfs_open_ino_map omap;
struct scoutfs_key last;
int err = 0;
struct scoutfs_key next;
struct scoutfs_key key;
struct inode *inode;
u64 group_nr;
int bit_nr;
u64 ino;
int ret;
trace_scoutfs_scan_orphans(sb);
scoutfs_inc_counter(sb, orphan_scan);
init_orphan_key(&key, sbi->rid, 0);
init_orphan_key(&last, sbi->rid, ~0ULL);
init_orphan_key(&last, U64_MAX);
omap.args.group_nr = cpu_to_le64(U64_MAX);
while (1) {
ret = scoutfs_item_next(sb, &key, &last, NULL, 0, lock);
if (ret == -ENOENT) /* No more orphan items */
break;
if (ret < 0)
for (ino = SCOUTFS_ROOT_INO + 1; ino != 0; ino++) {
if (inf->stopped) {
ret = 0;
goto out;
ret = delete_inode_items(sb, le64_to_cpu(key.sko_ino));
if (ret && ret != -ENOENT && !err)
err = ret;
if (le64_to_cpu(key.sko_ino) == U64_MAX) {
ret = -ENOENT;
break;
}
le64_add_cpu(&key.sko_ino, 1);
/* find the next orphan item */
init_orphan_key(&key, ino);
ret = scoutfs_forest_next_hint(sb, &key, &next);
if (ret < 0) {
if (ret == -ENOENT)
break;
goto out;
}
if (scoutfs_key_compare(&next, &last) > 0)
break;
scoutfs_inc_counter(sb, orphan_scan_item);
ino = le64_to_cpu(next.sko_ino);
/* locally cached inodes will already be deleted */
inode = scoutfs_ilookup(sb, ino);
if (inode) {
scoutfs_inc_counter(sb, orphan_scan_cached);
iput(inode);
continue;
}
/* get an omap that covers the orphaned ino */
group_nr = ino >> SCOUTFS_OPEN_INO_MAP_SHIFT;
bit_nr = ino & SCOUTFS_OPEN_INO_MAP_MASK;
if (le64_to_cpu(omap.args.group_nr) != group_nr) {
ret = scoutfs_client_open_ino_map(sb, group_nr, &omap);
if (ret < 0)
goto out;
}
/* don't need to evict if someone else has it open (cached) */
if (test_bit_le(bit_nr, omap.bits)) {
scoutfs_inc_counter(sb, orphan_scan_omap_set);
continue;
}
/* try to cache and evict the unused inode to delete it, can be racing */
inode = scoutfs_iget(sb, ino);
if (IS_ERR(inode)) {
ret = PTR_ERR(inode);
if (ret == -ENOENT)
continue;
else
goto out;
}
scoutfs_inc_counter(sb, orphan_scan_read);
SCOUTFS_I(inode)->drop_invalidated = true;
iput(inode);
}
ret = 0;
out:
return err ? err : ret;
}
if (ret < 0)
scoutfs_inc_counter(sb, orphan_scan_error);
int scoutfs_orphan_inode(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct scoutfs_lock *lock = sbi->rid_lock;
struct scoutfs_key key;
int ret;
trace_scoutfs_orphan_inode(sb, inode);
init_orphan_key(&key, sbi->rid, scoutfs_ino(inode));
ret = scoutfs_item_create(sb, &key, NULL, 0, lock);
return ret;
schedule_orphan_dwork(inf);
}
/*
@@ -1726,16 +1948,43 @@ int scoutfs_inode_setup(struct super_block *sb)
if (!inf)
return -ENOMEM;
inf->sb = sb;
spin_lock_init(&inf->writeback_lock);
inf->writeback_inodes = RB_ROOT;
spin_lock_init(&inf->dir_ino_alloc.lock);
spin_lock_init(&inf->ino_alloc.lock);
INIT_DELAYED_WORK(&inf->orphan_scan_dwork, inode_orphan_scan_worker);
spin_lock_init(&inf->deleting_items_lock);
INIT_LIST_HEAD(&inf->deleting_items_list);
sbi->inode_sb_info = inf;
return 0;
}
/*
* Our inode subsystem is setup pretty early but orphan scanning uses
* many other subsystems like networking and the server. We only kick
* it off once everything is ready.
*/
int scoutfs_inode_start(struct super_block *sb)
{
DECLARE_INODE_SB_INFO(sb, inf);
schedule_orphan_dwork(inf);
return 0;
}
void scoutfs_inode_stop(struct super_block *sb)
{
DECLARE_INODE_SB_INFO(sb, inf);
if (inf) {
inf->stopped = true;
cancel_delayed_work_sync(&inf->orphan_scan_dwork);
}
}
void scoutfs_inode_destroy(struct super_block *sb)
{
struct inode_sb_info *inf = SCOUTFS_SB(sb)->inode_sb_info;

View File

@@ -20,6 +20,7 @@ struct scoutfs_inode_info {
u64 online_blocks;
u64 offline_blocks;
u32 flags;
struct timespec crtime;
/*
* Protects per-inode extent items, most particularly readers
@@ -51,6 +52,13 @@ struct scoutfs_inode_info {
struct rw_semaphore xattr_rwsem;
struct rb_node writeback_node;
struct scoutfs_lock_coverage ino_lock_cov;
/* drop if i_count hits 0, allows drop while invalidate holds coverage */
bool drop_invalidated;
struct llist_node inv_iput_llnode;
atomic_t inv_iput_count;
struct inode inode;
};
@@ -68,7 +76,6 @@ struct inode *scoutfs_alloc_inode(struct super_block *sb);
void scoutfs_destroy_inode(struct inode *inode);
int scoutfs_drop_inode(struct inode *inode);
void scoutfs_evict_inode(struct inode *inode);
int scoutfs_orphan_inode(struct inode *inode);
struct inode *scoutfs_iget(struct super_block *sb, u64 ino);
struct inode *scoutfs_ilookup(struct super_block *sb, u64 ino);
@@ -82,9 +89,9 @@ int scoutfs_inode_index_prepare_ino(struct super_block *sb,
struct list_head *list, u64 ino,
umode_t mode);
int scoutfs_inode_index_try_lock_hold(struct super_block *sb,
struct list_head *list, u64 seq);
struct list_head *list, u64 seq, bool allocing);
int scoutfs_inode_index_lock_hold(struct inode *inode, struct list_head *list,
bool set_data_seq);
bool set_data_seq, bool allocing);
void scoutfs_inode_index_unlock(struct super_block *sb, struct list_head *list);
int scoutfs_dirty_inode_item(struct inode *inode, struct scoutfs_lock *lock);
@@ -113,7 +120,8 @@ int scoutfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
struct kstat *stat);
int scoutfs_setattr(struct dentry *dentry, struct iattr *attr);
int scoutfs_scan_orphans(struct super_block *sb);
int scoutfs_inode_orphan_create(struct super_block *sb, u64 ino, struct scoutfs_lock *lock);
int scoutfs_inode_orphan_delete(struct super_block *sb, u64 ino, struct scoutfs_lock *lock);
void scoutfs_inode_queue_writeback(struct inode *inode);
int scoutfs_inode_walk_writeback(struct super_block *sb, bool write);
@@ -124,6 +132,8 @@ void scoutfs_inode_exit(void);
int scoutfs_inode_init(void);
int scoutfs_inode_setup(struct super_block *sb);
int scoutfs_inode_start(struct super_block *sb);
void scoutfs_inode_stop(struct super_block *sb);
void scoutfs_inode_destroy(struct super_block *sb);
#endif

View File

@@ -38,6 +38,7 @@
#include "hash.h"
#include "srch.h"
#include "alloc.h"
#include "server.h"
#include "scoutfs_trace.h"
/*
@@ -540,6 +541,7 @@ out:
static long scoutfs_ioc_stat_more(struct file *file, unsigned long arg)
{
struct inode *inode = file_inode(file);
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
struct scoutfs_ioctl_stat_more stm;
if (get_user(stm.valid_bytes, (__u64 __user *)arg))
@@ -551,6 +553,8 @@ static long scoutfs_ioc_stat_more(struct file *file, unsigned long arg)
stm.data_seq = scoutfs_inode_data_seq(inode);
stm.data_version = scoutfs_inode_data_version(inode);
scoutfs_inode_get_onoff(inode, &stm.online_blocks, &stm.offline_blocks);
stm.crtime_sec = si->crtime.tv_sec;
stm.crtime_nsec = si->crtime.tv_nsec;
if (copy_to_user((void __user *)arg, &stm, stm.valid_bytes))
return -EFAULT;
@@ -616,6 +620,7 @@ static long scoutfs_ioc_data_waiting(struct file *file, unsigned long arg)
static long scoutfs_ioc_setattr_more(struct file *file, unsigned long arg)
{
struct inode *inode = file->f_inode;
struct scoutfs_inode_info *si = SCOUTFS_I(inode);
struct super_block *sb = inode->i_sb;
struct scoutfs_ioctl_setattr_more __user *usm = (void __user *)arg;
struct scoutfs_ioctl_setattr_more sm;
@@ -674,7 +679,7 @@ static long scoutfs_ioc_setattr_more(struct file *file, unsigned long arg)
/* setting only so we don't see 0 data seq with nonzero data_version */
set_data_seq = sm.data_version != 0 ? true : false;
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, set_data_seq);
ret = scoutfs_inode_index_lock_hold(inode, &ind_locks, set_data_seq, false);
if (ret)
goto unlock;
@@ -684,6 +689,8 @@ static long scoutfs_ioc_setattr_more(struct file *file, unsigned long arg)
i_size_write(inode, sm.i_size);
inode->i_ctime.tv_sec = sm.ctime_sec;
inode->i_ctime.tv_nsec = sm.ctime_nsec;
si->crtime.tv_sec = sm.crtime_sec;
si->crtime.tv_nsec = sm.crtime_nsec;
scoutfs_update_inode_item(inode, lock, &ind_locks);
ret = 0;
@@ -879,6 +886,7 @@ static long scoutfs_ioc_statfs_more(struct file *file, unsigned long arg)
sfm.rid = sbi->rid;
sfm.total_meta_blocks = le64_to_cpu(super->total_meta_blocks);
sfm.total_data_blocks = le64_to_cpu(super->total_data_blocks);
sfm.reserved_meta_blocks = scoutfs_server_reserved_meta_blocks(sb);
ret = scoutfs_client_get_last_seq(sb, &sfm.committed_seq);
if (ret)
@@ -972,12 +980,18 @@ static long scoutfs_ioc_move_blocks(struct file *file, unsigned long arg)
goto out;
}
if (mb.flags & SCOUTFS_IOC_MB_UNKNOWN) {
ret = -EINVAL;
goto out;
}
ret = mnt_want_write_file(file);
if (ret < 0)
goto out;
ret = scoutfs_data_move_blocks(from, mb.from_off, mb.len,
to, mb.to_off);
to, mb.to_off, !!(mb.flags & SCOUTFS_IOC_MB_STAGE),
mb.data_version);
mnt_drop_write_file(file);
out:
fput(from_file);

View File

@@ -163,7 +163,7 @@ struct scoutfs_ioctl_ino_path_result {
__u64 dir_pos;
__u16 path_bytes;
__u8 _pad[6];
__u8 path[0];
__u8 path[];
};
/* Get a single path from the root to the given inode number */
@@ -232,6 +232,9 @@ struct scoutfs_ioctl_stat_more {
__u64 data_version;
__u64 online_blocks;
__u64 offline_blocks;
__u64 crtime_sec;
__u32 crtime_nsec;
__u8 _pad[4];
};
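For reference, a minimal userspace sketch (not from the patch) of reading the new crtime fields through stat_more. It assumes a header providing the struct and ioctl number shown here; the valid_bytes handshake follows the handler earlier in this patch, which copies back at most the number of bytes the caller declared.

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include "scoutfs_ioctl.h"	/* assumed header with the struct and ioctl number */

/* hypothetical helper: fetch an inode's creation time via STAT_MORE */
static int read_crtime(int fd, uint64_t *sec, uint32_t *nsec)
{
	struct scoutfs_ioctl_stat_more stm;

	memset(&stm, 0, sizeof(stm));
	stm.valid_bytes = sizeof(stm);	/* the kernel copies back at most this much */

	if (ioctl(fd, SCOUTFS_IOC_STAT_MORE, &stm) < 0)
		return -1;

	*sec = stm.crtime_sec;
	*nsec = stm.crtime_nsec;
	return 0;
}

Callers that pass a smaller valid_bytes simply never see the trailing crtime fields, which is what lets the struct grow at its tail without breaking older binaries.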
#define SCOUTFS_IOC_STAT_MORE _IOR(SCOUTFS_IOCTL_MAGIC, 5, \
@@ -259,7 +262,7 @@ struct scoutfs_ioctl_data_waiting {
__u8 _pad[6];
};
#define SCOUTFS_IOC_DATA_WAITING_FLAGS_UNKNOWN (U8_MAX << 0)
#define SCOUTFS_IOC_DATA_WAITING_FLAGS_UNKNOWN (U64_MAX << 0)
#define SCOUTFS_IOC_DATA_WAITING _IOR(SCOUTFS_IOCTL_MAGIC, 6, \
struct scoutfs_ioctl_data_waiting)
@@ -275,11 +278,12 @@ struct scoutfs_ioctl_setattr_more {
__u64 flags;
__u64 ctime_sec;
__u32 ctime_nsec;
__u8 _pad[4];
__u32 crtime_nsec;
__u64 crtime_sec;
};
#define SCOUTFS_IOC_SETATTR_MORE_OFFLINE (1 << 0)
#define SCOUTFS_IOC_SETATTR_MORE_UNKNOWN (U8_MAX << 1)
#define SCOUTFS_IOC_SETATTR_MORE_UNKNOWN (U64_MAX << 1)
#define SCOUTFS_IOC_SETATTR_MORE _IOW(SCOUTFS_IOCTL_MAGIC, 7, \
struct scoutfs_ioctl_setattr_more)
@@ -371,6 +375,7 @@ struct scoutfs_ioctl_statfs_more {
__u64 committed_seq;
__u64 total_meta_blocks;
__u64 total_data_blocks;
__u64 reserved_meta_blocks;
};
#define SCOUTFS_IOC_STATFS_MORE _IOR(SCOUTFS_IOCTL_MAGIC, 10, \
@@ -418,12 +423,13 @@ struct scoutfs_ioctl_alloc_detail_entry {
* on the same file system.
*
* from_fd specifies the source file and the ioctl is called on the
* destination file. Both files must have write access. from_off
* specifies the byte offset in the source, to_off is the byte offset in
* the destination, and len is the number of bytes in the region to
* move. All of the offsets and lengths must be in multiples of 4KB,
* except in the case where the from_off + len ends at the i_size of the
* source file.
* destination file. Both files must have write access. from_off specifies
* the byte offset in the source, to_off is the byte offset in the
* destination, and len is the number of bytes in the region to move. All of
* the offsets and lengths must be in multiples of 4KB, except in the case
* where the from_off + len ends at the i_size of the source
* file. data_version is only used when STAGE flag is set (see below). flags
* field is currently only used to optionally specify STAGE behavior.
*
* This interface only moves extents which are block granular, it does
* not perform RMW of sub-block byte extents and it does not overwrite
@@ -435,30 +441,41 @@ struct scoutfs_ioctl_alloc_detail_entry {
* i_size. The i_size update will maintain final partial blocks in the
* source.
*
* It will return an error if either of the files have offline extents.
* It will return 0 when all of the extents in the source region have
* been moved to the destination. Moving extents updates the ctime,
* mtime, meta_seq, data_seq, and data_version fields of both the source
* and destination inodes. If an error is returned then partial
* If STAGE flag is not set, it will return an error if either of the files
* have offline extents. It will return 0 when all of the extents in the
* source region have been moved to the destination. Moving extents updates
* the ctime, mtime, meta_seq, data_seq, and data_version fields of both the
* source and destination inodes. If an error is returned then partial
* progress may have been made and inode fields may have been updated.
*
* If STAGE flag is set, as above except destination range must be in an
* offline extent. Fields are updated only for source inode.
*
* Errors specific to this interface include:
*
* EINVAL: from_off, len, or to_off aren't a multiple of 4KB; the source
* and destination files are the same inode; either the source or
* destination is not a regular file; the destination file has
* an existing overlapping extent.
* an existing overlapping extent (if STAGE flag not set); the
* destination range is not in an offline extent (if STAGE set).
* EOVERFLOW: either from_off + len or to_off + len exceeded 64bits.
* EBADF: from_fd isn't a valid open file descriptor.
* EXDEV: the source and destination files are in different filesystems.
* EISDIR: either the source or destination is a directory.
* ENODATA: either the source or destination file have offline extents.
* ENODATA: either the source or destination file have offline extents and
* STAGE flag is not set.
* ESTALE: data_version does not match destination data_version.
*/
#define SCOUTFS_IOC_MB_STAGE (1 << 0)
#define SCOUTFS_IOC_MB_UNKNOWN (U64_MAX << 1)
struct scoutfs_ioctl_move_blocks {
__u64 from_fd;
__u64 from_off;
__u64 len;
__u64 to_off;
__u64 data_version;
__u64 flags;
};
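To illustrate the new STAGE path, a minimal userspace sketch (not from the patch): the ioctl is issued on the destination file with the source passed as from_fd, and data_version is only checked when SCOUTFS_IOC_MB_STAGE is set. The header name and the stage_move() helper are assumptions.

#include <stdint.h>
#include <sys/ioctl.h>
#include "scoutfs_ioctl.h"	/* assumed header with the struct and flags above */

/* hypothetical helper: stage src's extents into an offline region of dest */
static int stage_move(int dest_fd, int src_fd, uint64_t from_off, uint64_t to_off,
		      uint64_t len, uint64_t data_version)
{
	struct scoutfs_ioctl_move_blocks mb = {
		.from_fd = src_fd,
		.from_off = from_off,	/* 4KB multiples, except when ending at src i_size */
		.len = len,
		.to_off = to_off,
		.data_version = data_version,	/* must match dest or ESTALE */
		.flags = SCOUTFS_IOC_MB_STAGE,
	};

	/* issued on the destination file, returns 0 or -1 with errno set */
	return ioctl(dest_fd, SCOUTFS_IOC_MOVE_BLOCKS, &mb);
}

Without the STAGE flag the same call instead requires the destination range to have no existing overlapping extents, matching the error cases listed above.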
#define SCOUTFS_IOC_MOVE_BLOCKS _IOR(SCOUTFS_IOCTL_MAGIC, 13, \

View File

@@ -95,7 +95,7 @@ struct item_cache_info {
/* written by page readers, read by shrink */
spinlock_t active_lock;
struct rb_root active_root;
struct list_head active_list;
};
#define DECLARE_ITEM_CACHE_INFO(sb, name) \
@@ -127,6 +127,7 @@ struct cached_page {
unsigned long lru_time;
struct list_head dirty_list;
struct list_head dirty_head;
u64 max_liv_seq;
struct page *page;
unsigned int page_off;
unsigned int erased_bytes;
@@ -149,7 +150,8 @@ struct cached_item {
static int item_val_bytes(int val_len)
{
return round_up(offsetof(struct cached_item, val[val_len]), CACHED_ITEM_ALIGN);
return round_up(offsetof(struct cached_item, val[val_len]),
CACHED_ITEM_ALIGN);
}
/*
@@ -345,7 +347,8 @@ static struct cached_page *alloc_pg(struct super_block *sb, gfp_t gfp)
page = alloc_page(GFP_NOFS | gfp);
if (!page || !pg) {
kfree(pg);
__free_page(page);
if (page)
__free_page(page);
return NULL;
}
@@ -383,6 +386,14 @@ static void put_pg(struct super_block *sb, struct cached_page *pg)
}
}
static void update_pg_max_liv_seq(struct cached_page *pg, struct cached_item *item)
{
u64 liv_seq = le64_to_cpu(item->liv.seq);
if (liv_seq > pg->max_liv_seq)
pg->max_liv_seq = liv_seq;
}
/*
* Allocate space for a new item from the free offset at the end of a
* cached page. This isn't a blocking allocation, and it's likely that
@@ -414,14 +425,15 @@ static struct cached_item *alloc_item(struct cached_page *pg,
if (val_len)
memcpy(item->val, val, val_len);
update_pg_max_liv_seq(pg, item);
return item;
}
static void erase_item(struct cached_page *pg, struct cached_item *item)
{
rbtree_erase(&item->node, &pg->item_root);
pg->erased_bytes += round_up(item_val_bytes(item->val_len),
CACHED_ITEM_ALIGN);
pg->erased_bytes += item_val_bytes(item->val_len);
}
static void lru_add(struct super_block *sb, struct item_cache_info *cinf,
@@ -621,6 +633,8 @@ static void mark_item_dirty(struct super_block *sb,
list_add_tail(&item->dirty_head, &pg->dirty_list);
item->dirty = 1;
}
update_pg_max_liv_seq(pg, item);
}
static void clear_item_dirty(struct super_block *sb,
@@ -852,8 +866,7 @@ static void compact_page_items(struct super_block *sb,
for (from = first_item(&pg->item_root); from; from = next_item(from)) {
to = page_address(empty->page) + page_off;
page_off += round_up(item_val_bytes(from->val_len),
CACHED_ITEM_ALIGN);
page_off += item_val_bytes(from->val_len);
/* copy the entire item, struct members and all */
memcpy(to, from, item_val_bytes(from->val_len));
@@ -1260,46 +1273,76 @@ static int cache_empty_page(struct super_block *sb,
return 0;
}
/*
* Readers operate independently from dirty items and transactions.
* They read a set of persistent items and insert them into the cache
* when there aren't already pages whose key range contains the items.
* This naturally prefers cached dirty items over stale read items.
*
* We have to deal with the case where dirty items are written and
* invalidated while a read is in flight. The reader won't have seen
* the items that were dirty in their persistent roots as they started
* reading. By the time they insert their read pages the previously
* dirty items have been reclaimed and are not in the cache. The old
* stale items will be inserted in their place, effectively corrupting
* the cache by having the dirty items disappear.
*
* We fix this by tracking the max seq of items in pages. As readers
* start they record the current transaction seq. Invalidation skips
* pages with a max seq greater than the first reader seq because the
* items in the page have to stick around to prevent the reader's stale
* items from being inserted.
*
* This naturally only affects a small set of pages with items that were
* written relatively recently. If we're under memory pressure then we
* probably have a lot of pages and they'll naturally have items that
* were visible to any readers. We don't bother with the complicated and
* expensive further refinement of tracking the ranges that are being
* read and comparing those with pages to invalidate.
*/
struct active_reader {
struct rb_node node;
struct scoutfs_key start;
struct scoutfs_key end;
struct list_head head;
u64 seq;
};
static struct active_reader *active_rbtree_walk(struct rb_root *root,
struct scoutfs_key *start,
struct scoutfs_key *end,
struct rb_node **par,
struct rb_node ***pnode)
#define INIT_ACTIVE_READER(rdr) \
struct active_reader rdr = { .head = LIST_HEAD_INIT(rdr.head) }
static void add_active_reader(struct super_block *sb, struct active_reader *active)
{
DECLARE_ITEM_CACHE_INFO(sb, cinf);
BUG_ON(!list_empty(&active->head));
active->seq = scoutfs_trans_sample_seq(sb);
spin_lock(&cinf->active_lock);
list_add_tail(&active->head, &cinf->active_list);
spin_unlock(&cinf->active_lock);
}
static u64 first_active_reader_seq(struct item_cache_info *cinf)
{
struct rb_node **node = &root->rb_node;
struct rb_node *parent = NULL;
struct active_reader *ret = NULL;
struct active_reader *active;
int cmp;
u64 first;
while (*node) {
parent = *node;
active = container_of(*node, struct active_reader, node);
/* only the calling task adds or deletes this active */
spin_lock(&cinf->active_lock);
active = list_first_entry_or_null(&cinf->active_list, struct active_reader, head);
first = active ? active->seq : U64_MAX;
spin_unlock(&cinf->active_lock);
cmp = scoutfs_key_compare_ranges(start, end, &active->start,
&active->end);
if (cmp < 0) {
node = &(*node)->rb_left;
} else if (cmp > 0) {
node = &(*node)->rb_right;
} else {
ret = active;
node = &(*node)->rb_left;
}
return first;
}
static void del_active_reader(struct item_cache_info *cinf, struct active_reader *active)
{
/* only the calling task adds or deletes this active */
if (!list_empty(&active->head)) {
spin_lock(&cinf->active_lock);
list_del_init(&active->head);
spin_unlock(&cinf->active_lock);
}
if (par)
*par = parent;
if (pnode)
*pnode = node;
return ret;
}
/*
@@ -1308,10 +1351,10 @@ static struct active_reader *active_rbtree_walk(struct rb_root *root,
* on our root and aren't in dirty or lru lists.
*
* We need to store deletion items here as we read items from all the
* btrees so that they can override older versions of the items. The
* deletion items will be deleted before we insert the pages into the
* cache. We don't insert old versions of items into the tree here so
* that the trees don't have to compare versions.
* btrees so that they can override older items. The deletion items
* will be deleted before we insert the pages into the cache. We don't
* insert old versions of items into the tree here so that the trees
* don't have to compare seqs.
*/
static int read_page_item(struct super_block *sb, struct scoutfs_key *key,
struct scoutfs_log_item_value *liv, void *val,
@@ -1331,7 +1374,7 @@ static int read_page_item(struct super_block *sb, struct scoutfs_key *key,
pg = page_rbtree_walk(sb, root, key, key, NULL, NULL, &p_par, &p_pnode);
found = item_rbtree_walk(&pg->item_root, key, NULL, &par, &pnode);
if (found && (le64_to_cpu(found->liv.vers) >= le64_to_cpu(liv->vers)))
if (found && (le64_to_cpu(found->liv.seq) >= le64_to_cpu(liv->seq)))
return 0;
if (!page_has_room(pg, val_len)) {
@@ -1399,22 +1442,15 @@ static int read_page_item(struct super_block *sb, struct scoutfs_key *key,
* locks held, but without locking the cache. The regions we read can
* be stale with respect to the current cache, which can be read and
* dirtied by other cluster lock holders on our node, but the cluster
* locks protect the stable items we read.
*
* There's also the exciting case where a reader can populate the cache
* with stale old persistent data which was read before another local
* cluster lock holder was able to read, dirty, write, and then shrink
* the cache. In this case the cache couldn't be cleared by lock
* invalidation because the caller is actively holding the lock. But
* shrinking could evict the cache within the held lock. So we record
* that we're an active reader in the range covered by the lock and
* shrink will refuse to reclaim any pages that intersect with our read.
* locks protect the stable items we read. Invalidation is careful not
* to drop pages that have items that we couldn't see because they were
* dirty when we started reading.
*/
static int read_pages(struct super_block *sb, struct item_cache_info *cinf,
struct scoutfs_key *key, struct scoutfs_lock *lock)
{
struct rb_root root = RB_ROOT;
struct active_reader active;
INIT_ACTIVE_READER(active);
struct cached_page *right = NULL;
struct cached_page *pg;
struct cached_page *rd;
@@ -1430,15 +1466,6 @@ static int read_pages(struct super_block *sb, struct item_cache_info *cinf,
int pgi;
int ret;
/* stop shrink from freeing new clean data, would let us cache stale */
active.start = lock->start;
active.end = lock->end;
spin_lock(&cinf->active_lock);
active_rbtree_walk(&cinf->active_root, &active.start, &active.end,
&par, &pnode);
rbtree_insert(&active.node, par, pnode, &cinf->active_root);
spin_unlock(&cinf->active_lock);
/* start with an empty page that covers the whole lock */
pg = alloc_pg(sb, 0);
if (!pg) {
@@ -1449,6 +1476,9 @@ static int read_pages(struct super_block *sb, struct item_cache_info *cinf,
pg->end = lock->end;
rbtree_insert(&pg->node, NULL, &root.rb_node, &root);
/* set active reader seq before reading persistent roots */
add_active_reader(sb, &active);
ret = scoutfs_forest_read_items(sb, lock, key, &start, &end,
read_page_item, &root);
if (ret < 0)
@@ -1526,9 +1556,7 @@ retry:
ret = 0;
out:
spin_lock(&cinf->active_lock);
rbtree_erase(&active.node, &cinf->active_root);
spin_unlock(&cinf->active_lock);
del_active_reader(cinf, &active);
/* free any pages we left dangling on error */
for_each_page_safe(&root, rd, pg_tmp) {
@@ -1783,6 +1811,21 @@ out:
return ret;
}
/*
* An item's seq is the greater of the client transaction's seq and the
* lock's write_seq. This ensures that multiple commits in one lock
* grant will have increasing seqs, and new locks in open commits will
* also increase the seqs. It lets us limit the inputs of item merging
* to the last stable seq and ensure that all the items in open
* transactions and granted locks will have greater seqs.
*/
static __le64 item_seq(struct super_block *sb, struct scoutfs_lock *lock)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
return cpu_to_le64(max(sbi->trans_seq, lock->write_seq));
}
/*
* Mark the item dirty. Dirtying while holding a transaction pins the
* page holding the item and guarantees that the item can be deleted or
@@ -1815,8 +1858,8 @@ int scoutfs_item_dirty(struct super_block *sb, struct scoutfs_key *key,
if (!item || item->deletion) {
ret = -ENOENT;
} else {
item->liv.seq = item_seq(sb, lock);
mark_item_dirty(sb, cinf, pg, NULL, item);
item->liv.vers = cpu_to_le64(lock->write_version);
ret = 0;
}
@@ -1836,7 +1879,7 @@ static int item_create(struct super_block *sb, struct scoutfs_key *key,
{
DECLARE_ITEM_CACHE_INFO(sb, cinf);
struct scoutfs_log_item_value liv = {
.vers = cpu_to_le64(lock->write_version),
.seq = item_seq(sb, lock),
};
struct cached_item *found;
struct cached_item *item;
@@ -1911,7 +1954,7 @@ int scoutfs_item_update(struct super_block *sb, struct scoutfs_key *key,
{
DECLARE_ITEM_CACHE_INFO(sb, cinf);
struct scoutfs_log_item_value liv = {
.vers = cpu_to_le64(lock->write_version),
.seq = item_seq(sb, lock),
};
struct cached_item *item;
struct cached_item *found;
@@ -1944,9 +1987,10 @@ int scoutfs_item_update(struct super_block *sb, struct scoutfs_key *key,
if (val_len)
memcpy(found->val, val, val_len);
if (val_len < found->val_len)
pg->erased_bytes += found->val_len - val_len;
pg->erased_bytes += item_val_bytes(found->val_len) -
item_val_bytes(val_len);
found->val_len = val_len;
found->liv.vers = liv.vers;
found->liv.seq = liv.seq;
mark_item_dirty(sb, cinf, pg, NULL, found);
} else {
item = alloc_item(pg, key, &liv, val, val_len);
@@ -1978,7 +2022,7 @@ static int item_delete(struct super_block *sb, struct scoutfs_key *key,
{
DECLARE_ITEM_CACHE_INFO(sb, cinf);
struct scoutfs_log_item_value liv = {
.vers = cpu_to_le64(lock->write_version),
.seq = item_seq(sb, lock),
};
struct cached_item *item;
struct cached_page *pg;
@@ -2020,10 +2064,11 @@ static int item_delete(struct super_block *sb, struct scoutfs_key *key,
erase_item(pg, item);
} else {
/* must emit deletion to clobber old persistent item */
item->liv.vers = cpu_to_le64(lock->write_version);
item->liv.seq = liv.seq;
item->liv.flags |= SCOUTFS_LOG_ITEM_FLAG_DELETION;
item->deletion = 1;
pg->erased_bytes += item->val_len;
pg->erased_bytes += item_val_bytes(item->val_len) -
item_val_bytes(0);
item->val_len = 0;
mark_item_dirty(sb, cinf, pg, NULL, item);
}
@@ -2106,7 +2151,7 @@ int scoutfs_item_write_dirty(struct super_block *sb)
struct page *page;
LIST_HEAD(pages);
LIST_HEAD(pos);
u64 max_vers = 0;
u64 max_seq = 0;
int val_len;
int bytes;
int off;
@@ -2171,7 +2216,7 @@ int scoutfs_item_write_dirty(struct super_block *sb)
val_len = sizeof(item->liv) + item->val_len;
bytes = offsetof(struct scoutfs_btree_item_list,
val[val_len]);
max_vers = max(max_vers, le64_to_cpu(item->liv.vers));
max_seq = max(max_seq, le64_to_cpu(item->liv.seq));
if (off + bytes > PAGE_SIZE) {
page = second;
@@ -2201,8 +2246,8 @@ int scoutfs_item_write_dirty(struct super_block *sb)
read_unlock(&pg->rwlock);
}
/* store max item vers in forest's log_trees */
scoutfs_forest_set_max_vers(sb, max_vers);
/* store max item seq in forest's log_trees */
scoutfs_forest_set_max_seq(sb, max_seq);
/* write all the dirty items into log btree blocks */
ret = scoutfs_forest_insert_list(sb, first);
@@ -2389,9 +2434,9 @@ retry:
/*
* Shrink the size the item cache. We're operating against the fast
* path lock ordering and we skip pages if we can't acquire locks.
* Similarly, we can run into dirty pages or pages which intersect with
* active readers that we can't shrink and also choose to skip.
* path lock ordering and we skip pages if we can't acquire locks. We
* can run into dirty pages or pages with items that weren't visible to
* the earliest active reader which must be skipped.
*/
static int item_lru_shrink(struct shrinker *shrink,
struct shrink_control *sc)
@@ -2400,26 +2445,24 @@ static int item_lru_shrink(struct shrinker *shrink,
struct item_cache_info,
shrinker);
struct super_block *sb = cinf->sb;
struct active_reader *active;
struct cached_page *tmp;
struct cached_page *pg;
u64 first_reader_seq;
int nr;
if (sc->nr_to_scan == 0)
goto out;
nr = sc->nr_to_scan;
/* can't invalidate pages with items that weren't visible to first reader */
first_reader_seq = first_active_reader_seq(cinf);
write_lock(&cinf->rwlock);
spin_lock(&cinf->lru_lock);
list_for_each_entry_safe(pg, tmp, &cinf->lru_list, lru_head) {
/* can't invalidate ranges being read, reader might be stale */
spin_lock(&cinf->active_lock);
active = active_rbtree_walk(&cinf->active_root, &pg->start,
&pg->end, NULL, NULL);
spin_unlock(&cinf->active_lock);
if (active) {
if (first_reader_seq <= pg->max_liv_seq) {
scoutfs_inc_counter(sb, item_shrink_page_reader);
continue;
}
@@ -2488,7 +2531,7 @@ int scoutfs_item_setup(struct super_block *sb)
spin_lock_init(&cinf->lru_lock);
INIT_LIST_HEAD(&cinf->lru_list);
spin_lock_init(&cinf->active_lock);
cinf->active_root = RB_ROOT;
INIT_LIST_HEAD(&cinf->active_list);
cinf->pcpu_pages = alloc_percpu(struct item_percpu_pages);
if (!cinf->pcpu_pages)
@@ -2519,7 +2562,7 @@ void scoutfs_item_destroy(struct super_block *sb)
int cpu;
if (cinf) {
BUG_ON(!RB_EMPTY_ROOT(&cinf->active_root));
BUG_ON(!list_empty(&cinf->active_list));
unregister_hotcpu_notifier(&cinf->notifier);
unregister_shrinker(&cinf->shrinker);

View File

@@ -108,6 +108,16 @@ static inline void scoutfs_key_set_ones(struct scoutfs_key *key)
memset(key->__pad, 0, sizeof(key->__pad));
}
static inline bool scoutfs_key_is_ones(struct scoutfs_key *key)
{
return key->sk_zone == U8_MAX &&
key->_sk_first == cpu_to_le64(U64_MAX) &&
key->sk_type == U8_MAX &&
key->_sk_second == cpu_to_le64(U64_MAX) &&
key->_sk_third == cpu_to_le64(U64_MAX) &&
key->_sk_fourth == U8_MAX;
}
/*
* Return a -1/0/1 comparison of keys.
*

View File

@@ -34,6 +34,7 @@
#include "data.h"
#include "xattr.h"
#include "item.h"
#include "omap.h"
/*
* scoutfs uses a lock service to manage item cache consistency between
@@ -74,6 +75,7 @@ struct lock_info {
struct super_block *sb;
spinlock_t lock;
bool shutdown;
bool unmounting;
struct rb_root lock_tree;
struct rb_root lock_range_tree;
struct shrinker shrinker;
@@ -87,6 +89,9 @@ struct lock_info {
struct work_struct shrink_work;
struct list_head shrink_list;
atomic64_t next_refresh_gen;
struct work_struct inv_iput_work;
struct llist_head inv_iput_llist;
struct dentry *tseq_dentry;
struct scoutfs_tseq_tree tseq_tree;
};
@@ -122,21 +127,79 @@ static bool lock_modes_match(int granted, int requested)
}
/*
* invalidate cached data associated with an inode whose lock is going
* Final iput can get into evict and perform final inode deletion which
* can delete a lot of items under locks and transactions. We really
* don't want to be doing all that in an iput during invalidation. When
* invalidation sees that iput might perform final deletion it puts the
* inode on a list and queues this work.
*
* Nothing stops multiple puts for multiple invalidations of an inode
* before the work runs so we can track multiple puts in flight.
*/
static void lock_inv_iput_worker(struct work_struct *work)
{
struct lock_info *linfo = container_of(work, struct lock_info, inv_iput_work);
struct scoutfs_inode_info *si;
struct scoutfs_inode_info *tmp;
struct llist_node *inodes;
bool more;
inodes = llist_del_all(&linfo->inv_iput_llist);
llist_for_each_entry_safe(si, tmp, inodes, inv_iput_llnode) {
do {
more = atomic_dec_return(&si->inv_iput_count) > 0;
iput(&si->inode);
} while (more);
}
}
/*
* Invalidate cached data associated with an inode whose lock is going
* away.
*
* We try to drop cached dentries and inodes covered by the lock if they
* aren't referenced. This removes them from the mount's open map and
* allows deletions to be performed by unlink without having to wait for
* remote cached inodes to be dropped.
*
* If the cached inode was already deferring final inode deletion then
* we can't perform that inline in invalidation. The locking alone
* could deadlock, and it might also take multiple transactions to fully
* delete an inode with significant metadata. We only perform the iput
* inline if we know that possible eviction can't perform the final
* deletion, otherwise we kick it off to async work.
*/
static void invalidate_inode(struct super_block *sb, u64 ino)
{
DECLARE_LOCK_INFO(sb, linfo);
struct scoutfs_inode_info *si;
struct inode *inode;
inode = scoutfs_ilookup(sb, ino);
if (inode) {
si = SCOUTFS_I(inode);
scoutfs_inc_counter(sb, lock_invalidate_inode);
if (S_ISREG(inode->i_mode)) {
truncate_inode_pages(inode->i_mapping, 0);
scoutfs_data_wait_changed(inode);
}
iput(inode);
/* can't touch during unmount, dcache destroys w/o locks */
if (!linfo->unmounting)
d_prune_aliases(inode);
si->drop_invalidated = true;
if (scoutfs_lock_is_covered(sb, &si->ino_lock_cov) && inode->i_nlink > 0) {
iput(inode);
} else {
/* defer iput to work context so we don't evict inodes from invalidation */
if (atomic_inc_return(&si->inv_iput_count) == 1)
llist_add(&si->inv_iput_llnode, &linfo->inv_iput_llist);
smp_wmb(); /* count and list visible before work executes */
queue_work(linfo->workq, &linfo->inv_iput_work);
}
}
}
@@ -172,6 +235,16 @@ static int lock_invalidate(struct super_block *sb, struct scoutfs_lock *lock,
/* have to invalidate if we're not in the only usable case */
if (!(prev == SCOUTFS_LOCK_WRITE && mode == SCOUTFS_LOCK_READ)) {
retry:
/* invalidate inodes before removing coverage */
if (lock->start.sk_zone == SCOUTFS_FS_ZONE) {
ino = le64_to_cpu(lock->start.ski_ino);
last = le64_to_cpu(lock->end.ski_ino);
while (ino <= last) {
invalidate_inode(sb, ino);
ino++;
}
}
/* remove cov items to tell users that their cache is stale */
spin_lock(&lock->cov_list_lock);
list_for_each_entry_safe(cov, tmp, &lock->cov_list, head) {
@@ -187,15 +260,6 @@ retry:
}
spin_unlock(&lock->cov_list_lock);
if (lock->start.sk_zone == SCOUTFS_FS_ZONE) {
ino = le64_to_cpu(lock->start.ski_ino);
last = le64_to_cpu(lock->end.ski_ino);
while (ino <= last) {
invalidate_inode(sb, ino);
ino++;
}
}
scoutfs_item_invalidate(sb, &lock->start, &lock->end);
}
@@ -229,6 +293,7 @@ static void lock_free(struct lock_info *linfo, struct scoutfs_lock *lock)
BUG_ON(!list_empty(&lock->shrink_head));
BUG_ON(!list_empty(&lock->cov_list));
scoutfs_omap_free_lock_data(lock->omap_data);
kfree(lock);
}
@@ -264,6 +329,7 @@ static struct scoutfs_lock *lock_alloc(struct super_block *sb,
lock->mode = SCOUTFS_LOCK_NULL;
atomic64_set(&lock->forest_bloom_nr, 0);
spin_lock_init(&lock->omap_spinlock);
trace_scoutfs_lock_alloc(sb, lock);
@@ -553,7 +619,7 @@ static void queue_grant_work(struct lock_info *linfo)
{
assert_spin_locked(&linfo->lock);
if (!list_empty(&linfo->grant_list) && !linfo->shutdown)
if (!list_empty(&linfo->grant_list))
queue_work(linfo->workq, &linfo->grant_work);
}
@@ -569,7 +635,7 @@ static void queue_inv_work(struct lock_info *linfo)
{
assert_spin_locked(&linfo->lock);
if (!list_empty(&linfo->inv_list) && !linfo->shutdown)
if (!list_empty(&linfo->inv_list))
mod_delayed_work(linfo->workq, &linfo->inv_dwork, 0);
}
@@ -638,7 +704,6 @@ static void lock_grant_worker(struct work_struct *work)
struct lock_info *linfo = container_of(work, struct lock_info,
grant_work);
struct super_block *sb = linfo->sb;
struct scoutfs_net_lock_grant_response *gr;
struct scoutfs_net_lock *nl;
struct scoutfs_lock *lock;
struct scoutfs_lock *tmp;
@@ -648,8 +713,7 @@ static void lock_grant_worker(struct work_struct *work)
spin_lock(&linfo->lock);
list_for_each_entry_safe(lock, tmp, &linfo->grant_list, grant_head) {
gr = &lock->grant_resp;
nl = &lock->grant_resp.nl;
nl = &lock->grant_nl;
/* wait for reordered invalidation to finish */
if (lock->mode != nl->old_mode)
@@ -666,8 +730,7 @@ static void lock_grant_worker(struct work_struct *work)
lock->request_pending = 0;
lock->mode = nl->new_mode;
lock->write_version = le64_to_cpu(nl->write_version);
lock->roots = gr->roots;
lock->write_seq = le64_to_cpu(nl->write_seq);
if (lock_count_match_exists(nl->new_mode, lock->waiters))
extend_grace(sb, lock);
@@ -689,9 +752,8 @@ static void lock_grant_worker(struct work_struct *work)
* work to process.
*/
int scoutfs_lock_grant_response(struct super_block *sb,
struct scoutfs_net_lock_grant_response *gr)
struct scoutfs_net_lock *nl)
{
struct scoutfs_net_lock *nl = &gr->nl;
DECLARE_LOCK_INFO(sb, linfo);
struct scoutfs_lock *lock;
@@ -705,7 +767,7 @@ int scoutfs_lock_grant_response(struct super_block *sb,
trace_scoutfs_lock_grant_response(sb, lock);
BUG_ON(!lock->request_pending);
lock->grant_resp = *gr;
lock->grant_nl = *nl;
list_add_tail(&lock->grant_head, &linfo->grant_list);
queue_grant_work(linfo);
@@ -717,7 +779,9 @@ int scoutfs_lock_grant_response(struct super_block *sb,
/*
* Each lock has received a lock invalidation request from the server
* which specifies a new mode for the lock. The server will only send
* one invalidation request at a time for each lock.
* one invalidation request at a time for each lock. The server can
* send another invalidate request after we send the response but before
* we reacquire the lock and finish invalidation.
*
* This is an unsolicited request from the server so it can arrive at
* any time after we make the server aware of the lock by initially
@@ -804,8 +868,14 @@ static void lock_invalidate_worker(struct work_struct *work)
nl = &lock->inv_nl;
net_id = lock->inv_net_id;
ret = lock_invalidate(sb, lock, nl->old_mode, nl->new_mode);
BUG_ON(ret);
/* only lock protocol, inv can't call subsystems after shutdown */
if (!linfo->shutdown) {
ret = lock_invalidate(sb, lock, nl->old_mode, nl->new_mode);
BUG_ON(ret);
}
/* allow another request after we respond but before we finish */
lock->inv_net_id = 0;
/* respond with the key and modes from the request */
ret = scoutfs_client_lock_response(sb, net_id, nl);
@@ -818,11 +888,16 @@ static void lock_invalidate_worker(struct work_struct *work)
spin_lock(&linfo->lock);
list_for_each_entry_safe(lock, tmp, &ready, inv_head) {
list_del_init(&lock->inv_head);
lock->invalidate_pending = 0;
trace_scoutfs_lock_invalidated(sb, lock);
wake_up(&lock->waitq);
if (lock->inv_net_id == 0) {
/* finish if another request didn't arrive */
list_del_init(&lock->inv_head);
lock->invalidate_pending = 0;
wake_up(&lock->waitq);
} else {
/* another request filled nl/net_id, put it back on the list */
list_move_tail(&lock->inv_head, &linfo->inv_list);
}
put_lock(linfo, lock);
}
@@ -837,34 +912,47 @@ out:
}
/*
* Record an incoming invalidate request from the server and add its lock
* to the list for processing.
* Record an incoming invalidate request from the server and add its
* lock to the list for processing. This request can be from a new
* server and racing with invalidation that frees from an old server.
* It's fine to not find the requested lock and send an immediate
* response.
*
* This is trusting the server and will crash if it's sent bad requests :/
* The invalidation process drops the linfo lock to send responses. The
* moment it does so we can receive another invalidation request (the
* server can ask us to go from write->read then read->null). We allow
* for one chain like this but it's a bug if we receive more concurrent
* invalidation requests than that. The server should be only sending
* one at a time.
*/
int scoutfs_lock_invalidate_request(struct super_block *sb, u64 net_id,
struct scoutfs_net_lock *nl)
{
DECLARE_LOCK_INFO(sb, linfo);
struct scoutfs_lock *lock;
int ret = 0;
scoutfs_inc_counter(sb, lock_invalidate_request);
spin_lock(&linfo->lock);
lock = get_lock(sb, &nl->key);
BUG_ON(!lock);
if (lock) {
BUG_ON(lock->invalidate_pending);
lock->invalidate_pending = 1;
lock->inv_nl = *nl;
BUG_ON(lock->inv_net_id != 0);
lock->inv_net_id = net_id;
list_add_tail(&lock->inv_head, &linfo->inv_list);
lock->inv_nl = *nl;
if (list_empty(&lock->inv_head)) {
list_add_tail(&lock->inv_head, &linfo->inv_list);
lock->invalidate_pending = 1;
}
trace_scoutfs_lock_invalidate_request(sb, lock);
queue_inv_work(linfo);
}
spin_unlock(&linfo->lock);
return 0;
if (!lock)
ret = scoutfs_client_lock_response(sb, net_id, nl);
return ret;
}
/*
@@ -900,7 +988,7 @@ int scoutfs_lock_recover_request(struct super_block *sb, u64 net_id,
for (i = 0; lock && i < SCOUTFS_NET_LOCK_MAX_RECOVER_NR; i++) {
nlr->locks[i].key = lock->start;
nlr->locks[i].write_version = cpu_to_le64(lock->write_version);
nlr->locks[i].write_seq = cpu_to_le64(lock->write_seq);
nlr->locks[i].old_mode = lock->mode;
nlr->locks[i].new_mode = lock->mode;
@@ -995,7 +1083,7 @@ static int lock_key_range(struct super_block *sb, enum scoutfs_lock_mode mode, i
lock_inc_count(lock->waiters, mode);
for (;;) {
if (linfo->shutdown) {
if (WARN_ON_ONCE(linfo->shutdown)) {
ret = -ESHUTDOWN;
break;
}
@@ -1259,29 +1347,28 @@ int scoutfs_lock_inode_index(struct super_block *sb, enum scoutfs_lock_mode mode
}
/*
* The rid lock protects a mount's private persistent items in the rid
* zone. It's held for the duration of the mount. It lets the mount
* modify the rid items at will and signals to other mounts that we're
* still alive and our rid items shouldn't be reclaimed.
* Orphan items are stored in their own zone.  They're modified with
* shared write_only locks and are read inconsistently without locks by
* background scanning work.
*
* Being held for the entire mount prevents other nodes from reclaiming
* our items, like free blocks, when it would make sense for them to be
* able to. Maybe we have a bunch free and they're trying to allocate
* and are getting ENOSPC.
* Since we only use write_only locks we just lock the entire zone, but
* the api provides the inode in case we ever change the locking scheme.
*/
int scoutfs_lock_rid(struct super_block *sb, enum scoutfs_lock_mode mode, int flags,
u64 rid, struct scoutfs_lock **lock)
int scoutfs_lock_orphan(struct super_block *sb, enum scoutfs_lock_mode mode, int flags, u64 ino,
struct scoutfs_lock **lock)
{
struct scoutfs_key start;
struct scoutfs_key end;
scoutfs_key_set_zeros(&start);
start.sk_zone = SCOUTFS_RID_ZONE;
start.sko_rid = cpu_to_le64(rid);
start.sk_zone = SCOUTFS_ORPHAN_ZONE;
start.sko_ino = 0;
start.sk_type = SCOUTFS_ORPHAN_TYPE;
scoutfs_key_set_ones(&end);
end.sk_zone = SCOUTFS_RID_ZONE;
end.sko_rid = cpu_to_le64(rid);
scoutfs_key_set_zeros(&end);
end.sk_zone = SCOUTFS_ORPHAN_ZONE;
end.sko_ino = cpu_to_le64(U64_MAX);
end.sk_type = SCOUTFS_ORPHAN_TYPE;
return lock_key_range(sb, mode, flags, &start, &end, lock);
}
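A hypothetical caller sketch (the item manipulation is elided and assumed; it isn't shown in this diff): because the whole orphan zone is covered by one lock, a writer only needs a shared write_only lock around its orphan item update.
static int example_orphan_update(struct super_block *sb, u64 ino)
{
	struct scoutfs_lock *orph_lock = NULL;
	int ret;
	ret = scoutfs_lock_orphan(sb, SCOUTFS_LOCK_WRITE_ONLY, 0, ino, &orph_lock);
	if (ret < 0)
		return ret;
	/* ... create or delete the orphan item for ino here ... */
	scoutfs_unlock(sb, orph_lock, SCOUTFS_LOCK_WRITE_ONLY);
	return ret;
}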
@@ -1477,7 +1564,7 @@ restart:
BUG_ON(lock->mode == SCOUTFS_LOCK_NULL);
BUG_ON(!list_empty(&lock->shrink_head));
if (linfo->shutdown || nr-- == 0)
if (nr-- == 0)
break;
__lock_del_lru(linfo, lock);
@@ -1504,7 +1591,7 @@ out:
return ret;
}
void scoutfs_free_unused_locks(struct super_block *sb, unsigned long nr)
void scoutfs_free_unused_locks(struct super_block *sb)
{
struct lock_info *linfo = SCOUTFS_SB(sb)->lock_info;
struct shrink_control sc = {
@@ -1532,15 +1619,40 @@ static void lock_tseq_show(struct seq_file *m, struct scoutfs_tseq_entry *ent)
}
/*
* The caller is going to be calling _destroy soon and, critically, is
* about to shutdown networking before calling us so that we don't get
* any callbacks while we're destroying. We have to ensure that we
* won't call networking after this returns.
* shrink_dcache_for_umount() tears down dentries with no locking. We
* need to make sure that our invalidation won't touch dentries after
* we return and the caller enters the generic vfs unmount path.
*/
void scoutfs_lock_unmount_begin(struct super_block *sb)
{
DECLARE_LOCK_INFO(sb, linfo);
if (linfo) {
linfo->unmounting = true;
flush_delayed_work(&linfo->inv_dwork);
}
}
/*
* The caller is going to be shutting down transactions and the client.
* We need to make sure that locking won't call either after we return.
*
* Internal fs threads can be using locking, and locking can have async
* work pending. We use ->shutdown to force callers to return
* -ESHUTDOWN and to prevent the future queueing of work that could call
* networking. Locks whose work is stopped will be torn down by _destroy.
* At this point all fs callers and internal services that use locks
* should have stopped. We won't have any callers initiating lock
* transitions and sending requests. We set the shutdown flag to catch
* anyone who breaks this rule.
*
* We unregister the shrinker so that we won't try and send null
* requests in response to memory pressure. The locks will all be
* unceremoniously dropped once we get a farewell response from the
* server which indicates that they destroyed our locking state.
*
* We will still respond to invalidation requests that have to be
* processed to let unmount in other mounts acquire locks and make
* progress. However, we don't fully process the invalidation because
* we're shutting down. We only update the lock state and send the
* response. We shouldn't have any users of locking that require
* invalidation correctness at this point.
*/
void scoutfs_lock_shutdown(struct super_block *sb)
{
@@ -1553,19 +1665,18 @@ void scoutfs_lock_shutdown(struct super_block *sb)
trace_scoutfs_lock_shutdown(sb, linfo);
spin_lock(&linfo->lock);
/* stop the shrinker from queueing work */
unregister_shrinker(&linfo->shrinker);
flush_work(&linfo->shrink_work);
/* cause current and future lock calls to return errors */
spin_lock(&linfo->lock);
linfo->shutdown = true;
for (node = rb_first(&linfo->lock_tree); node; node = rb_next(node)) {
lock = rb_entry(node, struct scoutfs_lock, node);
wake_up(&lock->waitq);
}
spin_unlock(&linfo->lock);
flush_work(&linfo->grant_work);
flush_delayed_work(&linfo->inv_dwork);
flush_work(&linfo->shrink_work);
}
/*
@@ -1593,8 +1704,6 @@ void scoutfs_lock_destroy(struct super_block *sb)
trace_scoutfs_lock_destroy(sb, linfo);
/* stop the shrinker from queueing work */
unregister_shrinker(&linfo->shrinker);
/* make sure that no one's actively using locks */
spin_lock(&linfo->lock);
@@ -1640,8 +1749,10 @@ void scoutfs_lock_destroy(struct super_block *sb)
__lock_del_lru(linfo, lock);
if (!list_empty(&lock->grant_head))
list_del_init(&lock->grant_head);
if (!list_empty(&lock->inv_head))
if (!list_empty(&lock->inv_head)) {
list_del_init(&lock->inv_head);
lock->invalidate_pending = 0;
}
if (!list_empty(&lock->shrink_head))
list_del_init(&lock->shrink_head);
lock_remove(linfo, lock);
@@ -1678,6 +1789,8 @@ int scoutfs_lock_setup(struct super_block *sb)
INIT_WORK(&linfo->shrink_work, lock_shrink_worker);
INIT_LIST_HEAD(&linfo->shrink_list);
atomic64_set(&linfo->next_refresh_gen, 0);
INIT_WORK(&linfo->inv_iput_work, lock_inv_iput_worker);
init_llist_head(&linfo->inv_iput_llist);
scoutfs_tseq_tree_init(&linfo->tseq_tree, lock_tseq_show);
sbi->lock_info = linfo;

View File

@@ -10,8 +10,10 @@
#define SCOUTFS_LOCK_NR_MODES SCOUTFS_LOCK_INVALID
struct scoutfs_omap_lock;
/*
* A few fields (start, end, refresh_gen, write_version, granted_mode)
* A few fields (start, end, refresh_gen, write_seq, granted_mode)
* are referenced by code outside lock.c.
*/
struct scoutfs_lock {
@@ -21,9 +23,8 @@ struct scoutfs_lock {
struct rb_node node;
struct rb_node range_node;
u64 refresh_gen;
u64 write_version;
u64 write_seq;
u64 dirty_trans_seq;
struct scoutfs_net_roots roots;
struct list_head lru_head;
wait_queue_head_t waitq;
ktime_t grace_deadline;
@@ -31,7 +32,7 @@ struct scoutfs_lock {
invalidate_pending:1;
struct list_head grant_head;
struct scoutfs_net_lock_grant_response grant_resp;
struct scoutfs_net_lock grant_nl;
struct list_head inv_head;
struct scoutfs_net_lock inv_nl;
u64 inv_net_id;
@@ -48,6 +49,10 @@ struct scoutfs_lock {
/* the forest tracks which log tree last saw bloom bit updates */
atomic64_t forest_bloom_nr;
/* open ino mapping has a valid map for a held write lock */
spinlock_t omap_spinlock;
struct scoutfs_omap_lock_data *omap_data;
};
struct scoutfs_lock_coverage {
@@ -57,7 +62,7 @@ struct scoutfs_lock_coverage {
};
int scoutfs_lock_grant_response(struct super_block *sb,
struct scoutfs_net_lock_grant_response *gr);
struct scoutfs_net_lock *nl);
int scoutfs_lock_invalidate_request(struct super_block *sb, u64 net_id,
struct scoutfs_net_lock *nl);
int scoutfs_lock_recover_request(struct super_block *sb, u64 net_id,
@@ -80,8 +85,8 @@ int scoutfs_lock_inodes(struct super_block *sb, enum scoutfs_lock_mode mode, int
struct inode *d, struct scoutfs_lock **D_lock);
int scoutfs_lock_rename(struct super_block *sb, enum scoutfs_lock_mode mode, int flags,
struct scoutfs_lock **lock);
int scoutfs_lock_rid(struct super_block *sb, enum scoutfs_lock_mode mode, int flags,
u64 rid, struct scoutfs_lock **lock);
int scoutfs_lock_orphan(struct super_block *sb, enum scoutfs_lock_mode mode, int flags,
u64 ino, struct scoutfs_lock **lock);
void scoutfs_unlock(struct super_block *sb, struct scoutfs_lock *lock,
enum scoutfs_lock_mode mode);
@@ -96,9 +101,10 @@ void scoutfs_lock_del_coverage(struct super_block *sb,
bool scoutfs_lock_protected(struct scoutfs_lock *lock, struct scoutfs_key *key,
enum scoutfs_lock_mode mode);
void scoutfs_free_unused_locks(struct super_block *sb, unsigned long nr);
void scoutfs_free_unused_locks(struct super_block *sb);
int scoutfs_lock_setup(struct super_block *sb);
void scoutfs_lock_unmount_begin(struct super_block *sb);
void scoutfs_lock_shutdown(struct super_block *sb);
void scoutfs_lock_destroy(struct super_block *sb);

View File

@@ -20,10 +20,10 @@
#include "tseq.h"
#include "spbm.h"
#include "block.h"
#include "btree.h"
#include "msg.h"
#include "scoutfs_trace.h"
#include "lock_server.h"
#include "recov.h"
/*
* The scoutfs server implements a simple lock service. Client mounts
@@ -56,14 +56,11 @@
* Message requests and responses are reliably delivered in order across
* reconnection.
*
* The server maintains a persistent record of connected clients. A new
* server instance discovers these and waits for previously connected
* clients to reconnect and recover their state before proceeding. If
* clients don't reconnect they are forcefully prevented from unsafely
* accessing the shared persistent storage. (fenced, according to the
* rules of the platform.. could range from being powered off to having
* their switch port disabled to having their local block device set
* read-only.)
* As a new server comes up it recovers lock state from existing clients
* which were connected to a previous lock server. Recover requests are
* sent to clients as they connect and they respond with all their
* locks. Once all clients and locks are accounted for normal
* processing can resume.
*
* The lock server doesn't respond to memory pressure. The only way
* locks are freed is if they are invalidated to null on behalf of a
@@ -77,19 +74,13 @@ struct lock_server_info {
struct super_block *sb;
spinlock_t lock;
struct mutex mutex;
struct rb_root locks_root;
struct scoutfs_spbm recovery_pending;
struct delayed_work recovery_dwork;
struct scoutfs_tseq_tree tseq_tree;
struct dentry *tseq_dentry;
struct scoutfs_alloc *alloc;
struct scoutfs_block_writer *wri;
atomic64_t write_version;
};
#define DECLARE_LOCK_SERVER_INFO(sb, name) \
@@ -430,7 +421,7 @@ int scoutfs_lock_server_response(struct super_block *sb, u64 rid,
goto out;
}
/* XXX should always have a server lock here? recovery? */
/* XXX should always have a server lock here? */
snode = get_server_lock(inf, &nl->key, NULL, false);
if (!snode) {
ret = -EINVAL;
@@ -473,31 +464,27 @@ out:
* so we unlock the snode mutex.
*
* All progress must wait for all clients to finish with recovery
* because we don't know which locks they'll hold. The unlocked
* recovery_pending test here is OK. It's filled by setup before
* anything runs. It's emptied by recovery completion. We can get a
* false nonempty result if we race with recovery completion, but that's
* OK because recovery completion processes all the locks that have
* requests after emptying, including the unlikely loser of that race.
* because we don't know which locks they'll hold. Once recover
* finishes the server calls us to kick all the locks that were waiting
* during recovery.
*/
static int process_waiting_requests(struct super_block *sb,
struct server_lock_node *snode)
{
DECLARE_LOCK_SERVER_INFO(sb, inf);
struct scoutfs_net_lock_grant_response gres;
struct scoutfs_net_lock nl;
struct client_lock_entry *req;
struct client_lock_entry *req_tmp;
struct client_lock_entry *gr;
struct client_lock_entry *gr_tmp;
u64 wv;
u64 seq;
int ret;
BUG_ON(!mutex_is_locked(&snode->mutex));
/* processing waits for all invalidation responses or recovery */
if (!list_empty(&snode->invalidated) ||
!scoutfs_spbm_empty(&inf->recovery_pending)) {
scoutfs_recov_next_pending(sb, 0, SCOUTFS_RECOV_LOCKS) != 0) {
ret = 0;
goto out;
}
@@ -531,6 +518,7 @@ static int process_waiting_requests(struct super_block *sb,
nl.key = snode->key;
nl.new_mode = req->mode;
nl.write_seq = 0;
/* see if there's an existing compatible grant to replace */
gr = find_entry(snode, &snode->granted, req->rid);
@@ -543,15 +531,13 @@ static int process_waiting_requests(struct super_block *sb,
if (nl.new_mode == SCOUTFS_LOCK_WRITE ||
nl.new_mode == SCOUTFS_LOCK_WRITE_ONLY) {
wv = atomic64_inc_return(&inf->write_version);
nl.write_version = cpu_to_le64(wv);
/* doesn't commit seq update, recovered with locks */
seq = scoutfs_server_next_seq(sb);
nl.write_seq = cpu_to_le64(seq);
}
gres.nl = nl;
scoutfs_server_get_roots(sb, &gres.roots);
ret = scoutfs_server_lock_response(sb, req->rid,
req->net_id, &gres);
req->net_id, &nl);
if (ret)
goto out;
@@ -573,89 +559,39 @@ out:
return ret;
}
static void init_lock_clients_key(struct scoutfs_key *key, u64 rid)
{
*key = (struct scoutfs_key) {
.sk_zone = SCOUTFS_LOCK_CLIENTS_ZONE,
.sklc_rid = cpu_to_le64(rid),
};
}
/*
* The server received a greeting from a client for the first time. If
* the client had already talked to the server then we must find an
* existing record for it and should begin recovery. If it doesn't have
* a record then it's timed out and we can't allow it to reconnect. If
* we're creating a new record for a client we can see EEXIST if the
* greeting is resent to a new server after the record was committed but
* before the response was received by the client.
* the client is in lock recovery then we send the initial lock request.
*
* This is running in concurrent client greeting processing contexts.
*/
int scoutfs_lock_server_greeting(struct super_block *sb, u64 rid,
bool should_exist)
int scoutfs_lock_server_greeting(struct super_block *sb, u64 rid)
{
DECLARE_LOCK_SERVER_INFO(sb, inf);
struct scoutfs_super_block *super = &SCOUTFS_SB(sb)->super;
SCOUTFS_BTREE_ITEM_REF(iref);
struct scoutfs_key key;
int ret;
init_lock_clients_key(&key, rid);
mutex_lock(&inf->mutex);
if (should_exist) {
ret = scoutfs_btree_lookup(sb, &super->lock_clients, &key,
&iref);
if (ret == 0)
scoutfs_btree_put_iref(&iref);
} else {
ret = scoutfs_btree_insert(sb, inf->alloc, inf->wri,
&super->lock_clients,
&key, NULL, 0);
if (ret == -EEXIST)
ret = 0;
}
mutex_unlock(&inf->mutex);
if (should_exist && ret == 0) {
if (scoutfs_recov_is_pending(sb, rid, SCOUTFS_RECOV_LOCKS)) {
scoutfs_key_set_zeros(&key);
ret = scoutfs_server_lock_recover_request(sb, rid, &key);
if (ret)
goto out;
} else {
ret = 0;
}
out:
return ret;
}
/*
* A client sent their last recovery response and can exit recovery. If
* they were the last client in recovery then we can process all the
* server locks that had requests.
* All clients have finished lock recovery, so we can make forward progress
* on all the queued requests that were waiting on recovery.
*/
static int finished_recovery(struct super_block *sb, u64 rid, bool cancel)
int scoutfs_lock_server_finished_recovery(struct super_block *sb)
{
DECLARE_LOCK_SERVER_INFO(sb, inf);
struct server_lock_node *snode;
struct scoutfs_key key;
bool still_pending;
int ret = 0;
spin_lock(&inf->lock);
scoutfs_spbm_clear(&inf->recovery_pending, rid);
still_pending = !scoutfs_spbm_empty(&inf->recovery_pending);
spin_unlock(&inf->lock);
if (still_pending)
return 0;
if (cancel)
cancel_delayed_work_sync(&inf->recovery_dwork);
scoutfs_key_set_zeros(&key);
scoutfs_info(sb, "all lock clients recovered");
while ((snode = get_server_lock(inf, &key, NULL, true))) {
key = snode->key;
@@ -673,14 +609,6 @@ static int finished_recovery(struct super_block *sb, u64 rid, bool cancel)
return ret;
}
static void set_max_write_version(struct lock_server_info *inf, u64 new)
{
u64 old;
while (new > (old = atomic64_read(&inf->write_version)) &&
(atomic64_cmpxchg(&inf->write_version, old, new) != old));
}
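The helper above, which this diff replaces with scoutfs_server_set_seq_if_greater, is a classic lock-free "advance to at least this value" loop. A minimal userspace C11 sketch of the same pattern, which the replacement presumably also follows for the core sequence:
#include <stdatomic.h>
#include <stdint.h>
/* raise *seq to at least new without ever letting it move backwards */
static void set_if_greater(_Atomic uint64_t *seq, uint64_t new)
{
	uint64_t old = atomic_load(seq);
	/* a failed exchange refreshes old with the current value, then retry */
	while (new > old && !atomic_compare_exchange_weak(seq, &old, new))
		;
}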
/*
* We sent a lock recover request to the client when we received its
* greeting while in recovery. Here we instantiate all the locks it
@@ -699,16 +627,15 @@ int scoutfs_lock_server_recover_response(struct super_block *sb, u64 rid,
int i;
/* client must be in recovery */
spin_lock(&inf->lock);
if (!scoutfs_spbm_test(&inf->recovery_pending, rid))
if (!scoutfs_recov_is_pending(sb, rid, SCOUTFS_RECOV_LOCKS)) {
ret = -EINVAL;
spin_unlock(&inf->lock);
if (ret)
goto out;
}
/* client has sent us all their locks */
if (nlr->nr == 0) {
ret = finished_recovery(sb, rid, true);
scoutfs_server_recov_finish(sb, rid, SCOUTFS_RECOV_LOCKS);
ret = 0;
goto out;
}
@@ -745,9 +672,9 @@ int scoutfs_lock_server_recover_response(struct super_block *sb, u64 rid,
put_server_lock(inf, snode);
/* make sure next write lock is greater than all recovered */
set_max_write_version(inf,
le64_to_cpu(nlr->locks[i].write_version));
/* make sure next core seq is greater than all lock write seq */
scoutfs_server_set_seq_if_greater(sb,
le64_to_cpu(nlr->locks[i].write_seq));
}
/* send request for next batch of keys */
@@ -759,101 +686,15 @@ out:
return ret;
}
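A toy userspace model of the batched recovery exchange implied above (the way the start key advances is an assumption; the real exchange carries scoutfs keys in net messages): the server keeps requesting from just past the last recovered lock until the client answers with an empty batch.
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#define BATCH_MAX 4
/* toy stand-in for the client's granted lock keys */
static const uint64_t client_keys[] = { 3, 7, 9, 20, 21, 42 };
#define NR_KEYS (sizeof(client_keys) / sizeof(client_keys[0]))
/* client side: fill a batch with keys >= start, return how many */
static int recover_batch(uint64_t start, uint64_t *batch)
{
	int nr = 0;
	size_t i;
	for (i = 0; i < NR_KEYS && nr < BATCH_MAX; i++) {
		if (client_keys[i] >= start)
			batch[nr++] = client_keys[i];
	}
	return nr;
}
/* server side: a zeroed start key begins recovery, an empty batch ends it */
int main(void)
{
	uint64_t batch[BATCH_MAX];
	uint64_t start = 0;
	int nr, i;
	while ((nr = recover_batch(start, batch)) > 0) {
		for (i = 0; i < nr; i++)
			printf("recovered lock key %llu\n",
			       (unsigned long long)batch[i]);
		start = batch[nr - 1] + 1;
	}
	printf("client finished recovery\n");
	return 0;
}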
static int get_rid_and_put_ref(struct scoutfs_btree_item_ref *iref, u64 *rid)
{
int ret;
if (iref->val_len == 0) {
*rid = le64_to_cpu(iref->key->sklc_rid);
ret = 0;
} else {
ret = -EIO;
}
scoutfs_btree_put_iref(iref);
return ret;
}
/*
* This work executes if enough time passes without all of the clients
* finishing with recovery and canceling the work. We walk through the
* client records and find any that still have their recovery pending.
*/
static void scoutfs_lock_server_recovery_timeout(struct work_struct *work)
{
struct lock_server_info *inf = container_of(work,
struct lock_server_info,
recovery_dwork.work);
struct super_block *sb = inf->sb;
struct scoutfs_super_block *super = &SCOUTFS_SB(sb)->super;
SCOUTFS_BTREE_ITEM_REF(iref);
struct scoutfs_key key;
bool timed_out;
u64 rid;
int ret;
ret = scoutfs_server_hold_commit(sb);
if (ret)
goto out;
/* we enter recovery if there are any client records */
for (rid = 0; ; rid++) {
init_lock_clients_key(&key, rid);
ret = scoutfs_btree_next(sb, &super->lock_clients, &key, &iref);
if (ret == -ENOENT) {
ret = 0;
break;
}
if (ret == 0)
ret = get_rid_and_put_ref(&iref, &rid);
if (ret < 0)
break;
spin_lock(&inf->lock);
if (scoutfs_spbm_test(&inf->recovery_pending, rid)) {
scoutfs_spbm_clear(&inf->recovery_pending, rid);
timed_out = true;
} else {
timed_out = false;
}
spin_unlock(&inf->lock);
if (!timed_out)
continue;
scoutfs_err(sb, "client rid %016llx lock recovery timed out",
rid);
init_lock_clients_key(&key, rid);
ret = scoutfs_btree_delete(sb, inf->alloc, inf->wri,
&super->lock_clients, &key);
if (ret)
break;
}
ret = scoutfs_server_apply_commit(sb, ret);
out:
/* force processing all pending lock requests */
if (ret == 0)
ret = finished_recovery(sb, 0, false);
if (ret < 0) {
scoutfs_err(sb, "lock server saw err %d while timing out clients, shutting down", ret);
scoutfs_server_abort(sb);
}
}
/*
* A client is leaving the lock service. They aren't using locks and
* won't send any more requests. We tear down all the state we had for
* them. This can be called multiple times for a given client as their
* farewell is resent to new servers. It's OK to not find any state.
* If we fail to delete a persistent entry then we have to shut down and
* hope that the next server has more luck.
*/
int scoutfs_lock_server_farewell(struct super_block *sb, u64 rid)
{
DECLARE_LOCK_SERVER_INFO(sb, inf);
struct scoutfs_super_block *super = &SCOUTFS_SB(sb)->super;
struct client_lock_entry *clent;
struct client_lock_entry *tmp;
struct server_lock_node *snode;
@@ -862,20 +703,7 @@ int scoutfs_lock_server_farewell(struct super_block *sb, u64 rid)
bool freed;
int ret = 0;
mutex_lock(&inf->mutex);
init_lock_clients_key(&key, rid);
ret = scoutfs_btree_delete(sb, inf->alloc, inf->wri,
&super->lock_clients, &key);
mutex_unlock(&inf->mutex);
if (ret == -ENOENT) {
ret = 0;
goto out;
}
if (ret < 0)
goto out;
scoutfs_key_set_zeros(&key);
while ((snode = get_server_lock(inf, &key, NULL, true))) {
freed = false;
@@ -960,23 +788,14 @@ static void lock_server_tseq_show(struct seq_file *m,
/*
* Setup the lock server. This is called before networking can deliver
* requests. If we find existing client records then we enter recovery.
* Lock request processing is deferred until recovery is resolved for
* all the existing clients, either they reconnect and replay locks or
* we time them out.
* requests.
*/
int scoutfs_lock_server_setup(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri, u64 max_vers)
struct scoutfs_block_writer *wri)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct scoutfs_super_block *super = &SCOUTFS_SB(sb)->super;
struct lock_server_info *inf;
SCOUTFS_BTREE_ITEM_REF(iref);
struct scoutfs_key key;
unsigned int nr;
u64 rid;
int ret;
inf = kzalloc(sizeof(struct lock_server_info), GFP_KERNEL);
if (!inf)
@@ -984,15 +803,10 @@ int scoutfs_lock_server_setup(struct super_block *sb,
inf->sb = sb;
spin_lock_init(&inf->lock);
mutex_init(&inf->mutex);
inf->locks_root = RB_ROOT;
scoutfs_spbm_init(&inf->recovery_pending);
INIT_DELAYED_WORK(&inf->recovery_dwork,
scoutfs_lock_server_recovery_timeout);
scoutfs_tseq_tree_init(&inf->tseq_tree, lock_server_tseq_show);
inf->alloc = alloc;
inf->wri = wri;
atomic64_set(&inf->write_version, max_vers); /* inc_return gives +1 */
inf->tseq_dentry = scoutfs_tseq_create("server_locks", sbi->debug_root,
&inf->tseq_tree);
@@ -1003,36 +817,7 @@ int scoutfs_lock_server_setup(struct super_block *sb,
sbi->lock_server_info = inf;
/* we enter recovery if there are any client records */
nr = 0;
for (rid = 0; ; rid++) {
init_lock_clients_key(&key, rid);
ret = scoutfs_btree_next(sb, &super->lock_clients, &key, &iref);
if (ret == -ENOENT)
break;
if (ret == 0)
ret = get_rid_and_put_ref(&iref, &rid);
if (ret < 0)
goto out;
ret = scoutfs_spbm_set(&inf->recovery_pending, rid);
if (ret)
goto out;
nr++;
if (rid == U64_MAX)
break;
}
ret = 0;
if (nr) {
schedule_delayed_work(&inf->recovery_dwork,
msecs_to_jiffies(LOCK_SERVER_RECOVERY_MS));
scoutfs_info(sb, "waiting for %u lock clients to recover", nr);
}
out:
return ret;
return 0;
}
/*
@@ -1050,8 +835,6 @@ void scoutfs_lock_server_destroy(struct super_block *sb)
LIST_HEAD(list);
if (inf) {
cancel_delayed_work_sync(&inf->recovery_dwork);
debugfs_remove(inf->tseq_dentry);
rbtree_postorder_for_each_entry_safe(snode, stmp,
@@ -1070,8 +853,6 @@ void scoutfs_lock_server_destroy(struct super_block *sb)
kfree(snode);
}
scoutfs_spbm_destroy(&inf->recovery_pending);
kfree(inf);
sbi->lock_server_info = NULL;
}

View File

@@ -3,17 +3,17 @@
int scoutfs_lock_server_recover_response(struct super_block *sb, u64 rid,
struct scoutfs_net_lock_recover *nlr);
int scoutfs_lock_server_finished_recovery(struct super_block *sb);
int scoutfs_lock_server_request(struct super_block *sb, u64 rid,
u64 net_id, struct scoutfs_net_lock *nl);
int scoutfs_lock_server_greeting(struct super_block *sb, u64 rid,
bool should_exist);
int scoutfs_lock_server_greeting(struct super_block *sb, u64 rid);
int scoutfs_lock_server_response(struct super_block *sb, u64 rid,
struct scoutfs_net_lock *nl);
int scoutfs_lock_server_farewell(struct super_block *sb, u64 rid);
int scoutfs_lock_server_setup(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri, u64 max_vers);
struct scoutfs_block_writer *wri);
void scoutfs_lock_server_destroy(struct super_block *sb);
#endif

View File

@@ -30,6 +30,7 @@
#include "net.h"
#include "endian_swap.h"
#include "tseq.h"
#include "fence.h"
/*
* scoutfs networking delivers requests and responses between nodes.
@@ -330,6 +331,9 @@ static int submit_send(struct super_block *sb,
WARN_ON_ONCE(id == 0 && (flags & SCOUTFS_NET_FLAG_RESPONSE)))
return -EINVAL;
if (scoutfs_forcing_unmount(sb))
return -EIO;
msend = kmalloc(offsetof(struct message_send,
nh.data[data_len]), GFP_NOFS);
if (!msend)
@@ -420,6 +424,16 @@ static int process_request(struct scoutfs_net_connection *conn,
mrecv->nh.data, le16_to_cpu(mrecv->nh.data_len));
}
static int call_resp_func(struct super_block *sb, struct scoutfs_net_connection *conn,
scoutfs_net_response_t resp_func, void *resp_data,
void *resp, unsigned int resp_len, int error)
{
if (resp_func)
return resp_func(sb, conn, resp, resp_len, error, resp_data);
else
return 0;
}
/*
* An incoming response finds the queued request and calls its response
* function. The response function for a given request will only be
@@ -434,7 +448,6 @@ static int process_response(struct scoutfs_net_connection *conn,
struct message_send *msend;
scoutfs_net_response_t resp_func = NULL;
void *resp_data;
int ret = 0;
spin_lock(&conn->lock);
@@ -449,11 +462,8 @@ static int process_response(struct scoutfs_net_connection *conn,
spin_unlock(&conn->lock);
if (resp_func)
ret = resp_func(sb, conn, mrecv->nh.data,
le16_to_cpu(mrecv->nh.data_len),
net_err_to_host(mrecv->nh.error), resp_data);
return ret;
return call_resp_func(sb, conn, resp_func, resp_data, mrecv->nh.data,
le16_to_cpu(mrecv->nh.data_len), net_err_to_host(mrecv->nh.error));
}
/*
@@ -823,9 +833,15 @@ static void scoutfs_net_destroy_worker(struct work_struct *work)
if (conn->listening_conn && conn->notify_down)
conn->notify_down(sb, conn, conn->info, conn->rid);
/* free all messages, refactor and complete for forced unmount? */
/*
* Usually networking is idle and we destroy pending sends, but when forcing unmount
* we may have to wake up waiters by failing pending sends.
*/
list_splice_init(&conn->resend_queue, &conn->send_queue);
list_for_each_entry_safe(msend, tmp, &conn->send_queue, head) {
if (scoutfs_forcing_unmount(sb))
call_resp_func(sb, conn, msend->resp_func, msend->resp_data,
NULL, 0, -ECONNABORTED);
free_msend(ninf, msend);
}
@@ -925,6 +941,8 @@ static int sock_opts_and_names(struct scoutfs_net_connection *conn,
ret = -EAFNOSUPPORT;
if (ret)
goto out;
conn->last_peername = conn->peername;
out:
return ret;
}
@@ -944,7 +962,6 @@ static void scoutfs_net_listen_worker(struct work_struct *work)
struct scoutfs_net_connection *acc_conn;
DECLARE_WAIT_QUEUE_HEAD(waitq);
struct socket *acc_sock;
LIST_HEAD(conn_list);
int ret;
trace_scoutfs_net_listen_work_enter(sb, 0, 0);
@@ -1206,6 +1223,7 @@ static void scoutfs_net_reconn_free_worker(struct work_struct *work)
unsigned long now = jiffies;
unsigned long deadline = 0;
bool requeue = false;
int ret;
trace_scoutfs_net_reconn_free_work_enter(sb, 0, 0);
@@ -1219,10 +1237,18 @@ restart:
time_after_eq(now, acc->reconn_deadline))) {
set_conn_fl(acc, reconn_freeing);
spin_unlock(&conn->lock);
if (!test_conn_fl(conn, shutting_down))
scoutfs_info(sb, "client timed out "SIN_FMT" -> "SIN_FMT", can not reconnect",
SIN_ARG(&acc->sockname),
SIN_ARG(&acc->peername));
if (!test_conn_fl(conn, shutting_down)) {
scoutfs_info(sb, "client "SIN_FMT" reconnect timed out, fencing",
SIN_ARG(&acc->last_peername));
ret = scoutfs_fence_start(sb, acc->rid,
acc->last_peername.sin_addr.s_addr,
SCOUTFS_FENCE_CLIENT_RECONNECT);
if (ret) {
scoutfs_err(sb, "client fence returned err %d, shutting down server",
ret);
scoutfs_server_abort(sb);
}
}
destroy_conn(acc);
goto restart;
}
@@ -1293,6 +1319,7 @@ scoutfs_net_alloc_conn(struct super_block *sb,
init_waitqueue_head(&conn->waitq);
conn->sockname.sin_family = AF_INET;
conn->peername.sin_family = AF_INET;
conn->last_peername.sin_family = AF_INET;
INIT_LIST_HEAD(&conn->accepted_head);
INIT_LIST_HEAD(&conn->accepted_list);
conn->next_send_seq = 1;

View File

@@ -49,6 +49,7 @@ struct scoutfs_net_connection {
u64 greeting_id;
struct sockaddr_in sockname;
struct sockaddr_in peername;
struct sockaddr_in last_peername;
struct list_head accepted_head;
struct scoutfs_net_connection *listening_conn;
@@ -99,6 +100,16 @@ static inline void scoutfs_addr_to_sin(struct sockaddr_in *sin,
sin->sin_port = cpu_to_be16(le16_to_cpu(addr->v4.port));
}
static inline void scoutfs_sin_to_addr(union scoutfs_inet_addr *addr, struct sockaddr_in *sin)
{
BUG_ON(sin->sin_family != AF_INET);
memset(addr, 0, sizeof(union scoutfs_inet_addr));
addr->v4.family = cpu_to_le16(SCOUTFS_AF_IPV4);
addr->v4.addr = be32_to_le32(sin->sin_addr.s_addr);
addr->v4.port = be16_to_le16(sin->sin_port);
}
struct scoutfs_net_connection *
scoutfs_net_alloc_conn(struct super_block *sb,
scoutfs_net_notify_t notify_up,

1052
kmod/src/omap.c Normal file

File diff suppressed because it is too large

24
kmod/src/omap.h Normal file
View File

@@ -0,0 +1,24 @@
#ifndef _SCOUTFS_OMAP_H_
#define _SCOUTFS_OMAP_H_
int scoutfs_omap_inc(struct super_block *sb, u64 ino);
void scoutfs_omap_dec(struct super_block *sb, u64 ino);
int scoutfs_omap_should_delete(struct super_block *sb, struct inode *inode,
struct scoutfs_lock **lock_ret, struct scoutfs_lock **orph_lock_ret);
void scoutfs_omap_free_lock_data(struct scoutfs_omap_lock_data *ldata);
int scoutfs_omap_client_handle_request(struct super_block *sb, u64 id,
struct scoutfs_open_ino_map_args *args);
int scoutfs_omap_add_rid(struct super_block *sb, u64 rid);
int scoutfs_omap_remove_rid(struct super_block *sb, u64 rid);
int scoutfs_omap_finished_recovery(struct super_block *sb);
int scoutfs_omap_server_handle_request(struct super_block *sb, u64 rid, u64 id,
struct scoutfs_open_ino_map_args *args);
int scoutfs_omap_server_handle_response(struct super_block *sb, u64 rid,
struct scoutfs_open_ino_map *resp_map);
void scoutfs_omap_server_shutdown(struct super_block *sb);
int scoutfs_omap_setup(struct super_block *sb);
void scoutfs_omap_destroy(struct super_block *sb);
#endif

View File

@@ -32,6 +32,7 @@
#include "block.h"
#include "net.h"
#include "sysfs.h"
#include "fence.h"
#include "scoutfs_trace.h"
/*
@@ -60,10 +61,9 @@
* running (maybe they've deadlocked, or lost network communications).
* In addition to a configuration slot in the super block, each quorum
* member also has a known block location that represents their slot.
* They set a flag in their block indicating that they've been elected
* leader, then read slots for all the other blocks looking for
* previously active leaders to fence. After that it can start the
* server.
* The block contains an array of events which are updated during the
* lifetime of the quorum agent. The elected leader sets its elected
* event and can then start the server.
*
* It's critical to raft elections that a participant's term not go
* backwards in time so each mount also uses its quorum block to store
@@ -334,17 +334,18 @@ static int recv_msg(struct super_block *sb, struct quorum_host_msg *msg,
}
/*
* The caller can provide a mark that they're using to track their
* written blocks. It's updated as they write the block and we can
* compare it with what we read to see if there have been unexpected
* intervening writes to the block -- the caller is supposed to have
* exclusive access to the block (or was fenced).
* Read and verify block fields before giving it to the caller. We
* should have exclusive write access to the block. We know that
* something has gone horribly wrong if we don't see our rid in the
* begin event after we've written it as we started up.
*/
static int read_quorum_block(struct super_block *sb, u64 blkno,
struct scoutfs_quorum_block *blk, __le64 *mark)
static int read_quorum_block(struct super_block *sb, u64 blkno, struct scoutfs_quorum_block *blk,
bool check_rid)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct scoutfs_super_block *super = &sbi->super;
const u64 rid = sbi->rid;
char msg[150];
__le32 crc;
int ret;
@@ -355,162 +356,208 @@ static int read_quorum_block(struct super_block *sb, u64 blkno,
ret = scoutfs_block_read_sm(sb, sbi->meta_bdev, blkno,
&blk->hdr, sizeof(*blk), &crc);
if (ret < 0) {
scoutfs_err(sb, "quorum block read error %d", ret);
goto out;
}
/* detect invalid blocks */
if (ret == 0 &&
((blk->hdr.crc != crc) ||
(le32_to_cpu(blk->hdr.magic) != SCOUTFS_BLOCK_MAGIC_QUORUM) ||
(blk->hdr.fsid != super->hdr.fsid) ||
(le64_to_cpu(blk->hdr.blkno) != blkno))) {
scoutfs_inc_counter(sb, quorum_read_invalid_block);
if (blk->hdr.crc != crc)
snprintf(msg, sizeof(msg), "blk crc %08x != %08x",
le32_to_cpu(blk->hdr.crc), le32_to_cpu(crc));
else if (le32_to_cpu(blk->hdr.magic) != SCOUTFS_BLOCK_MAGIC_QUORUM)
snprintf(msg, sizeof(msg), "blk magic %08x != %08x",
le32_to_cpu(blk->hdr.magic), SCOUTFS_BLOCK_MAGIC_QUORUM);
else if (blk->hdr.fsid != super->hdr.fsid)
snprintf(msg, sizeof(msg), "blk fsid %016llx != %016llx",
le64_to_cpu(blk->hdr.fsid), le64_to_cpu(super->hdr.fsid));
else if (le64_to_cpu(blk->hdr.blkno) != blkno)
snprintf(msg, sizeof(msg), "blk blkno %llu != %llu",
le64_to_cpu(blk->hdr.blkno), blkno);
else if (check_rid && le64_to_cpu(blk->events[SCOUTFS_QUORUM_EVENT_BEGIN].rid) != rid)
snprintf(msg, sizeof(msg), "quorum block begin rid %016llx != our rid %016llx, are multiple mounts configured with this slot?",
le64_to_cpu(blk->events[SCOUTFS_QUORUM_EVENT_BEGIN].rid), rid);
else
msg[0] = '\0';
if (msg[0] != '\0') {
scoutfs_err(sb, "read invalid quorum block, %s", msg);
ret = -EIO;
goto out;
}
if (mark && *mark != 0 && blk->random_write_mark != *mark) {
scoutfs_err(sb, "read unexpected quorum block write mark, are multiple mounts configured with the same slot?");
ret = -EIO;
}
if (ret < 0)
scoutfs_err(sb, "quorum block read error %d", ret);
out:
return ret;
}
static void set_quorum_block_event(struct super_block *sb,
struct scoutfs_quorum_block *blk,
struct scoutfs_quorum_block_event *ev)
static void set_quorum_block_event(struct super_block *sb, struct scoutfs_quorum_block *blk,
int event, u64 term)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct scoutfs_quorum_block_event *ev;
struct timespec64 ts;
if (WARN_ON_ONCE(event < 0 || event >= SCOUTFS_QUORUM_EVENT_NR))
return;
getnstimeofday64(&ts);
ev = &blk->events[event];
ev->rid = cpu_to_le64(sbi->rid);
ev->term = cpu_to_le64(term);
ev->ts.sec = cpu_to_le64(ts.tv_sec);
ev->ts.nsec = cpu_to_le32(ts.tv_nsec);
}
/*
* Every time we write a block we update the write stamp and random
* write mark so readers can see our write.
*/
static int write_quorum_block(struct super_block *sb, u64 blkno,
struct scoutfs_quorum_block *blk, __le64 *mark)
static int write_quorum_block(struct super_block *sb, u64 blkno, struct scoutfs_quorum_block *blk)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
int ret;
if (WARN_ON_ONCE(blkno < SCOUTFS_QUORUM_BLKNO) ||
WARN_ON_ONCE(blkno >= (SCOUTFS_QUORUM_BLKNO +
SCOUTFS_QUORUM_BLOCKS)))
return -EINVAL;
do {
get_random_bytes(&blk->random_write_mark,
sizeof(blk->random_write_mark));
} while (blk->random_write_mark == 0);
if (mark)
*mark = blk->random_write_mark;
set_quorum_block_event(sb, blk, &blk->write);
ret = scoutfs_block_write_sm(sb, sbi->meta_bdev, blkno,
&blk->hdr, sizeof(*blk));
if (ret < 0)
scoutfs_err(sb, "quorum block write error %d", ret);
return ret;
return scoutfs_block_write_sm(sb, sbi->meta_bdev, blkno, &blk->hdr, sizeof(*blk));
}
/*
* Read the caller's slot's current quorum block, make a change, and
* write it back out. If the caller provides a mark it can cause read
* errors if we read a mark that doesn't match the last mark that the
* caller wrote.
* Read the caller's slot's quorum block, make a change, and write it
* back out.
*/
static int update_quorum_block(struct super_block *sb, u64 blkno,
__le64 *mark, int role, u64 term)
static int update_quorum_block(struct super_block *sb, int event, u64 term, bool check_rid)
{
struct mount_options *opts = &SCOUTFS_SB(sb)->opts;
u64 blkno = SCOUTFS_QUORUM_BLKNO + opts->quorum_slot_nr;
struct scoutfs_quorum_block blk;
u64 flags;
u64 bits;
u64 set;
int ret;
ret = read_quorum_block(sb, blkno, &blk, mark);
ret = read_quorum_block(sb, blkno, &blk, check_rid);
if (ret == 0) {
if (blk.term != cpu_to_le64(term)) {
blk.term = cpu_to_le64(term);
set_quorum_block_event(sb, &blk, &blk.update_term);
}
flags = le64_to_cpu(blk.flags);
bits = SCOUTFS_QUORUM_BLOCK_LEADER;
set = role == LEADER ? SCOUTFS_QUORUM_BLOCK_LEADER : 0;
if ((flags & bits) != set)
set_quorum_block_event(sb, &blk,
set ? &blk.set_leader :
&blk.clear_leader);
blk.flags = cpu_to_le64((flags & ~bits) | set);
ret = write_quorum_block(sb, blkno, &blk, mark);
set_quorum_block_event(sb, &blk, event, term);
ret = write_quorum_block(sb, blkno, &blk);
if (ret < 0)
scoutfs_err(sb, "error %d reading quorum block %llu to update event %d term %llu",
ret, blkno, event, term);
} else {
scoutfs_err(sb, "error %d writing quorum block %llu after updating event %d term %llu",
ret, blkno, event, term);
}
return ret;
}
/*
* The calling server has fenced previous leaders and reclaimed their
* resources. We can now update our fence event with a greater term to
* stop future leaders from doing the same.
*/
int scoutfs_quorum_fence_complete(struct super_block *sb, u64 term)
{
return update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_FENCE, term, true);
}
/*
* The calling server has been elected and updated their block, but
* can't yet assume that it has exclusive access to the metadata device.
* We read all the quorum blocks looking for previously elected leaders
* to fence so that we're the only leader running.
* The calling server has been elected and has started running but can't
* yet assume that it has exclusive access to the metadata device. We
* read all the quorum blocks looking for previously elected leaders to
* fence so that we're the only leader running.
*
* We're relying on the invariant that there can't be two mounts running
* with the same slot nr at the same time. With this constraint there
* can be at most two previous leaders per slot that need to be fenced:
* a persistent record of an old mount on the slot, and an active mount.
*
* If we start fence requests then we only wait for them to complete
* before returning. The server will reclaim their resources once it is
* up and running and will call us to update the fence event. If we
* don't start fence requests then we update the fence event
* immediately; the server has nothing more to do.
*
* Quorum will be sending heartbeats while we wait for fencing. That
* keeps us from being fenced while we allow userspace fencing to take a
* reasonably long time. We still want to timeout eventually.
*/
static int fence_leader_blocks(struct super_block *sb)
int scoutfs_quorum_fence_leaders(struct super_block *sb, u64 term)
{
#define NR_OLD 2
struct scoutfs_quorum_block_event old[SCOUTFS_QUORUM_MAX_SLOTS][NR_OLD] = {{{0,}}};
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct scoutfs_super_block *super = &sbi->super;
struct mount_options *opts = &sbi->opts;
struct scoutfs_quorum_block blk;
struct sockaddr_in sin;
u64 blkno;
const u64 rid = sbi->rid;
bool fence_started = false;
u64 fenced = 0;
__le64 fence_rid;
int ret = 0;
int err;
int i;
int j;
BUILD_BUG_ON(SCOUTFS_QUORUM_BLOCKS < SCOUTFS_QUORUM_MAX_SLOTS);
for (i = 0; i < SCOUTFS_QUORUM_MAX_SLOTS; i++) {
if (i == opts->quorum_slot_nr)
if (!quorum_slot_present(super, i))
continue;
blkno = SCOUTFS_QUORUM_BLKNO + i;
ret = read_quorum_block(sb, blkno, &blk, NULL);
ret = read_quorum_block(sb, SCOUTFS_QUORUM_BLKNO + i, &blk, false);
if (ret < 0)
goto out;
if (!(le64_to_cpu(blk.flags) & SCOUTFS_QUORUM_BLOCK_LEADER))
continue;
/* elected leader still running */
if (le64_to_cpu(blk.events[SCOUTFS_QUORUM_EVENT_ELECT].term) >
le64_to_cpu(blk.events[SCOUTFS_QUORUM_EVENT_STOP].term))
old[i][0] = blk.events[SCOUTFS_QUORUM_EVENT_ELECT];
scoutfs_inc_counter(sb, quorum_fence_leader);
scoutfs_quorum_slot_sin(super, i, &sin);
/* persistent record of previous server before elected */
if ((le64_to_cpu(blk.events[SCOUTFS_QUORUM_EVENT_FENCE].term) >
le64_to_cpu(blk.events[SCOUTFS_QUORUM_EVENT_STOP].term)) &&
(le64_to_cpu(blk.events[SCOUTFS_QUORUM_EVENT_FENCE].term) <
le64_to_cpu(blk.events[SCOUTFS_QUORUM_EVENT_ELECT].term)))
old[i][1] = blk.events[SCOUTFS_QUORUM_EVENT_FENCE];
scoutfs_err(sb, "fencing "SCSBF" at "SIN_FMT,
SCSB_LEFR_ARGS(super->hdr.fsid, blk.set_leader.rid),
SIN_ARG(&sin));
/* find greatest term that has fenced everything before it */
fenced = max(fenced, le64_to_cpu(blk.events[SCOUTFS_QUORUM_EVENT_FENCE].term));
}
blk.flags &= ~cpu_to_le64(SCOUTFS_QUORUM_BLOCK_LEADER);
set_quorum_block_event(sb, &blk, &blk.fenced);
/* now actually fence any old leaders which haven't been fenced yet */
for (i = 0; i < SCOUTFS_QUORUM_MAX_SLOTS; i++) {
for (j = 0; j < NR_OLD; j++) {
if (le64_to_cpu(old[i][j].term) == 0 || /* uninitialized */
le64_to_cpu(old[i][j].term) < fenced || /* already fenced */
le64_to_cpu(old[i][j].term) > term || /* newer than us */
le64_to_cpu(old[i][j].rid) == rid) /* us */
continue;
ret = write_quorum_block(sb, blkno, &blk, NULL);
if (ret < 0)
goto out;
scoutfs_inc_counter(sb, quorum_fence_leader);
scoutfs_quorum_slot_sin(super, i, &sin);
fence_rid = old[i][j].rid;
scoutfs_info(sb, "fencing previous leader "SCSBF" at term %llu in slot %u with address "SIN_FMT,
SCSB_LEFR_ARGS(super->hdr.fsid, fence_rid),
le64_to_cpu(old[i][j].term), i, SIN_ARG(&sin));
ret = scoutfs_fence_start(sb, le64_to_cpu(fence_rid), sin.sin_addr.s_addr,
SCOUTFS_FENCE_QUORUM_BLOCK_LEADER);
if (ret < 0)
goto out;
fence_started = true;
}
}
out:
if (fence_started) {
err = scoutfs_fence_wait_fenced(sb, msecs_to_jiffies(SCOUTFS_QUORUM_FENCE_TO_MS));
if (ret == 0)
ret = err;
} else {
err = scoutfs_quorum_fence_complete(sb, term);
if (ret == 0)
ret = err;
}
if (ret < 0) {
scoutfs_err(sb, "error %d fencing active", ret);
scoutfs_err(sb, "error %d attempting to find and fence previous leaders", ret);
scoutfs_inc_counter(sb, quorum_fence_error);
}
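The checks above boil down to comparing per-slot event terms. A small helper sketch, assuming the event layout used in this diff, of the "is this slot's leader still running" test that the fencing loop above and the client's server lookup below both perform:
static bool slot_has_live_leader(const struct scoutfs_quorum_block *blk)
{
	/* an elected leader is live until it records a later stop event */
	return le64_to_cpu(blk->events[SCOUTFS_QUORUM_EVENT_ELECT].term) >
	       le64_to_cpu(blk->events[SCOUTFS_QUORUM_EVENT_STOP].term);
}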
@@ -533,23 +580,22 @@ static void scoutfs_quorum_worker(struct work_struct *work)
struct sockaddr_in unused;
struct quorum_host_msg msg;
struct quorum_status qst;
__le64 mark;
u64 blkno;
int ret;
int err;
/* recording votes from slots as native single word bitmap */
BUILD_BUG_ON(SCOUTFS_QUORUM_MAX_SLOTS > BITS_PER_LONG);
/* get our starting term from our persistent block */
mark = 0;
blkno = SCOUTFS_QUORUM_BLKNO + opts->quorum_slot_nr;
ret = read_quorum_block(sb, blkno, &blk, &mark);
ret = read_quorum_block(sb, blkno, &blk, false);
if (ret < 0)
goto out;
/* start out as a follower */
qst.role = FOLLOWER;
qst.term = le64_to_cpu(blk.term);
qst.term = le64_to_cpu(blk.events[SCOUTFS_QUORUM_EVENT_TERM].term);
qst.vote_for = -1;
qst.vote_bits = 0;
@@ -559,6 +605,11 @@ static void scoutfs_quorum_worker(struct work_struct *work)
else
qst.timeout = election_timeout();
/* record that we're up and running, readers check that it isn't updated */
ret = update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_BEGIN, qst.term, false);
if (ret < 0)
goto out;
while (!qinf->shutdown) {
ret = recv_msg(sb, &msg, qst.timeout);
@@ -589,11 +640,6 @@ static void scoutfs_quorum_worker(struct work_struct *work)
send_msg_others(sb, SCOUTFS_QUORUM_MSG_RESIGNATION,
qst.term);
scoutfs_inc_counter(sb, quorum_send_resignation);
ret = update_quorum_block(sb, blkno, &mark,
qst.role, qst.term);
if (ret < 0)
goto out;
}
spin_lock(&qinf->show_lock);
@@ -624,8 +670,7 @@ static void scoutfs_quorum_worker(struct work_struct *work)
qst.timeout = election_timeout();
/* store our increased term */
ret = update_quorum_block(sb, blkno, &mark,
qst.role, qst.term);
ret = update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_TERM, qst.term, true);
if (ret < 0)
goto out;
}
@@ -642,6 +687,11 @@ static void scoutfs_quorum_worker(struct work_struct *work)
qst.term);
qst.timeout = election_timeout();
scoutfs_inc_counter(sb, quorum_send_request);
/* store our increased term */
ret = update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_TERM, qst.term, true);
if (ret < 0)
goto out;
}
/* candidates count votes in their term */
@@ -670,10 +720,8 @@ static void scoutfs_quorum_worker(struct work_struct *work)
qst.term);
qst.timeout = heartbeat_interval();
/* set our leader flag and fence */
ret = update_quorum_block(sb, blkno, &mark,
qst.role, qst.term) ?:
fence_leader_blocks(sb);
/* record that we've been elected before starting up server */
ret = update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_ELECT, qst.term, true);
if (ret < 0)
goto out;
@@ -684,8 +732,13 @@ static void scoutfs_quorum_worker(struct work_struct *work)
ret = scoutfs_server_start(sb, qst.term);
if (ret < 0) {
scoutfs_err(sb, "server startup failed with %d",
ret);
clear_bit(QINF_FLAG_SERVER, &qinf->flags);
scoutfs_err(sb, "server startup failed with %d", ret);
/* store our increased term */
err = update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_STOP, qst.term,
true);
if (err < 0 && ret == 0)
ret = err;
goto out;
}
}
@@ -727,17 +780,13 @@ static void scoutfs_quorum_worker(struct work_struct *work)
/* always try to stop a running server as we stop */
if (test_bit(QINF_FLAG_SERVER, &qinf->flags)) {
scoutfs_server_stop(sb);
scoutfs_fence_stop(sb);
send_msg_others(sb, SCOUTFS_QUORUM_MSG_RESIGNATION,
qst.term);
}
/* always try to clear leader block as we stop to avoid fencing */
if (qst.role == LEADER) {
ret = update_quorum_block(sb, blkno, &mark,
FOLLOWER, qst.term);
if (ret < 0)
goto out;
}
/* informational event that we're shutting down, nothing relies on it */
update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_END, qst.term, true);
out:
if (ret < 0) {
scoutfs_err(sb, "quorum service saw error %d, shutting down. Cluster will be degraded until this slot is remounted to restart the quorum service",
@@ -746,58 +795,60 @@ out:
}
/*
* Set a flag for the quorum work's next iteration to indicate that the
* server has shutdown and that it should step down as leader, update
* quorum blocks, and stop sending heartbeats.
* The calling server has shutdown and is no longer using shared
* resources. Clear the bit so that we stop sending heartbeats and
* allow the next server to be elected. Update the stop event so that
* it won't be considered available by clients or fenced by the next
* leader.
*/
void scoutfs_quorum_server_shutdown(struct super_block *sb)
void scoutfs_quorum_server_shutdown(struct super_block *sb, u64 term)
{
DECLARE_QUORUM_INFO(sb, qinf);
set_bit(QINF_FLAG_SERVER, &qinf->flags);
clear_bit(QINF_FLAG_SERVER, &qinf->flags);
update_quorum_block(sb, SCOUTFS_QUORUM_EVENT_STOP, term, true);
}
/*
* Clients read quorum blocks looking for the leader with a server whose
* address it can try and connect to.
*
* There can be multiple running servers if a client checks before a
* server has had a chance to fence any old servers. We try to use the
* block with the most recent timestamp. If we get it wrong the
* connection will timeout and the client will try again, presumably
* finding a single server block.
* There can be records of multiple previous elected leaders if the
* current server hasn't yet fenced any old servers. We use the elected
* leader with the greatest elected term. If we get it wrong the
* connection will timeout and the client will try again.
*/
int scoutfs_quorum_server_sin(struct super_block *sb, struct sockaddr_in *sin)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct scoutfs_super_block *super = &sbi->super;
struct scoutfs_quorum_block blk;
struct timespec64 recent = {0,};
struct timespec64 ts;
int ret;
u64 elect_term;
u64 term = 0;
int ret = 0;
int i;
for (i = 0; i < SCOUTFS_QUORUM_MAX_SLOTS; i++) {
ret = read_quorum_block(sb, SCOUTFS_QUORUM_BLKNO + i, &blk,
NULL);
if (!quorum_slot_present(super, i))
continue;
ret = read_quorum_block(sb, SCOUTFS_QUORUM_BLKNO + i, &blk, false);
if (ret < 0) {
scoutfs_err(sb, "error reading quorum block nr %u: %d",
i, ret);
goto out;
}
ts.tv_sec = le64_to_cpu(blk.set_leader.ts.sec);
ts.tv_nsec = le32_to_cpu(blk.set_leader.ts.nsec);
if ((le64_to_cpu(blk.flags) & SCOUTFS_QUORUM_BLOCK_LEADER) &&
(timespec64_to_ns(&ts) > timespec64_to_ns(&recent))) {
recent = ts;
elect_term = le64_to_cpu(blk.events[SCOUTFS_QUORUM_EVENT_ELECT].term);
if (elect_term > term &&
elect_term > le64_to_cpu(blk.events[SCOUTFS_QUORUM_EVENT_STOP].term)) {
term = elect_term;
scoutfs_quorum_slot_sin(super, i, sin);
continue;
}
}
if (timespec64_to_ns(&recent) == 0)
if (term == 0)
ret = -ENOENT;
out:

View File

@@ -2,12 +2,15 @@
#define _SCOUTFS_QUORUM_H_
int scoutfs_quorum_server_sin(struct super_block *sb, struct sockaddr_in *sin);
void scoutfs_quorum_server_shutdown(struct super_block *sb);
void scoutfs_quorum_server_shutdown(struct super_block *sb, u64 term);
u8 scoutfs_quorum_votes_needed(struct super_block *sb);
void scoutfs_quorum_slot_sin(struct scoutfs_super_block *super, int i,
struct sockaddr_in *sin);
int scoutfs_quorum_fence_leaders(struct super_block *sb, u64 term);
int scoutfs_quorum_fence_complete(struct super_block *sb, u64 term);
int scoutfs_quorum_setup(struct super_block *sb);
void scoutfs_quorum_shutdown(struct super_block *sb);
void scoutfs_quorum_destroy(struct super_block *sb);

305
kmod/src/recov.c Normal file
View File

@@ -0,0 +1,305 @@
/*
* Copyright (C) 2021 Versity Software, Inc. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License v2 as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/rhashtable.h>
#include <linux/rcupdate.h>
#include <linux/list_sort.h>
#include "super.h"
#include "recov.h"
#include "cmp.h"
/*
* There are a few server messages which can't be processed until the
* server knows that it has state for all possibly active clients. These
* little helpers track which clients have recovered what state and give
* those message handlers a call to check if recovery has completed. We
* track the timeout here, but all we do is call back into the server to
* take steps to evict timed out clients and then let us know that their
* recovery has finished.
*/
struct recov_info {
struct super_block *sb;
spinlock_t lock;
struct list_head pending;
struct timer_list timer;
void (*timeout_fn)(struct super_block *);
};
#define DECLARE_RECOV_INFO(sb, name) \
struct recov_info *name = SCOUTFS_SB(sb)->recov_info
struct recov_pending {
struct list_head head;
u64 rid;
int which;
};
static struct recov_pending *next_pending(struct recov_info *recinf, u64 rid, int which)
{
struct recov_pending *pend;
list_for_each_entry(pend, &recinf->pending, head) {
if (pend->rid > rid && pend->which & which)
return pend;
}
return NULL;
}
static struct recov_pending *lookup_pending(struct recov_info *recinf, u64 rid, int which)
{
struct recov_pending *pend;
pend = next_pending(recinf, rid - 1, which);
if (pend && pend->rid == rid)
return pend;
return NULL;
}
/*
* We keep the pending list sorted by rid so that we can iterate over
* them. The list should be small and shouldn't be used often.
*/
static int cmp_pending_rid(void *priv, struct list_head *A, struct list_head *B)
{
struct recov_pending *a = list_entry(A, struct recov_pending, head);
struct recov_pending *b = list_entry(B, struct recov_pending, head);
return scoutfs_cmp_u64s(a->rid, b->rid);
}
/*
* Record that we'll be waiting for a client to recover something.
* _finished will eventually be called for every _prepare, either
* because recovery naturally finished or because it timed out and the
* server evicted the client.
*/
int scoutfs_recov_prepare(struct super_block *sb, u64 rid, int which)
{
DECLARE_RECOV_INFO(sb, recinf);
struct recov_pending *alloc;
struct recov_pending *pend;
if (WARN_ON_ONCE(which & SCOUTFS_RECOV_INVALID))
return -EINVAL;
alloc = kmalloc(sizeof(*pend), GFP_NOFS);
if (!alloc)
return -ENOMEM;
spin_lock(&recinf->lock);
pend = lookup_pending(recinf, rid, SCOUTFS_RECOV_ALL);
if (pend) {
pend->which |= which;
} else {
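/* no entry yet: consume the preallocation; swap() leaves alloc NULL so the kfree below is a no-op */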
swap(pend, alloc);
pend->rid = rid;
pend->which = which;
list_add_tail(&pend->head, &recinf->pending);
list_sort(NULL, &recinf->pending, cmp_pending_rid);
}
spin_unlock(&recinf->lock);
kfree(alloc);
return 0;
}
/*
* Recovery is only finished once we've begun (which sets the timer) and
* all clients have finished. If we didn't test the timer we could
* claim it finished prematurely as clients are being prepared.
*/
static int recov_finished(struct recov_info *recinf)
{
return !!(recinf->timeout_fn != NULL && list_empty(&recinf->pending));
}
static void timer_callback(struct timer_list *timer)
{
struct recov_info *recinf = from_timer(recinf, timer, timer);
recinf->timeout_fn(recinf->sb);
}
/*
* Begin waiting for recovery once we've prepared all the clients. If
* the timeout period elapses before _finish is called on all prepared
* clients then the timer will call the callback.
*
* Returns > 0 if all the prepared clients finish recovery before begin
* is called.
*/
int scoutfs_recov_begin(struct super_block *sb, void (*timeout_fn)(struct super_block *),
unsigned int timeout_ms)
{
DECLARE_RECOV_INFO(sb, recinf);
int ret;
spin_lock(&recinf->lock);
recinf->timeout_fn = timeout_fn;
recinf->timer.expires = jiffies + msecs_to_jiffies(timeout_ms);
add_timer(&recinf->timer);
ret = recov_finished(recinf);
spin_unlock(&recinf->lock);
if (ret > 0)
del_timer_sync(&recinf->timer);
return ret;
}
/*
* A given client has recovered the given state. If it's finished all
* recovery then we free it, and if all clients have finished recovery
* then we cancel the timeout timer.
*
* Returns > 0 if _begin has been called and all clients have finished.
* The caller will only see > 0 returned once.
*/
int scoutfs_recov_finish(struct super_block *sb, u64 rid, int which)
{
DECLARE_RECOV_INFO(sb, recinf);
struct recov_pending *pend;
int ret = 0;
spin_lock(&recinf->lock);
pend = lookup_pending(recinf, rid, which);
if (pend) {
pend->which &= ~which;
if (pend->which) {
pend = NULL;
} else {
list_del(&pend->head);
ret = recov_finished(recinf);
}
}
spin_unlock(&recinf->lock);
if (ret > 0)
del_timer_sync(&recinf->timer);
kfree(pend);
return ret;
}
/*
* Returns true if the given client is still trying to recover
* the given state.
*/
bool scoutfs_recov_is_pending(struct super_block *sb, u64 rid, int which)
{
DECLARE_RECOV_INFO(sb, recinf);
bool is_pending;
spin_lock(&recinf->lock);
is_pending = lookup_pending(recinf, rid, which) != NULL;
spin_unlock(&recinf->lock);
return is_pending;
}
/*
* Return the next rid after the given rid of a client waiting for the
* given state to be recovered. Start with rid 0; 0 is returned when
* there are no more clients waiting for recovery.
*
* This is inherently racy. Callers are responsible for reconciling any
* actions taken based on a pending entry with recovery finishing,
* perhaps before we even return.
*/
u64 scoutfs_recov_next_pending(struct super_block *sb, u64 rid, int which)
{
DECLARE_RECOV_INFO(sb, recinf);
struct recov_pending *pend;
spin_lock(&recinf->lock);
pend = next_pending(recinf, rid, which);
rid = pend ? pend->rid : 0;
spin_unlock(&recinf->lock);
return rid;
}
/*
* The server is shutting down and doesn't need to worry about recovery
* anymore. It'll be built up again by the next server, if needed.
*/
void scoutfs_recov_shutdown(struct super_block *sb)
{
DECLARE_RECOV_INFO(sb, recinf);
struct recov_pending *pend;
struct recov_pending *tmp;
LIST_HEAD(list);
del_timer_sync(&recinf->timer);
spin_lock(&recinf->lock);
list_splice_init(&recinf->pending, &list);
recinf->timeout_fn = NULL;
spin_unlock(&recinf->lock);
list_for_each_entry_safe(pend, tmp, &list, head) {
list_del(&pend->head);
kfree(pend);
}
}
int scoutfs_recov_setup(struct super_block *sb)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct recov_info *recinf;
int ret;
recinf = kzalloc(sizeof(struct recov_info), GFP_KERNEL);
if (!recinf) {
ret = -ENOMEM;
goto out;
}
recinf->sb = sb;
spin_lock_init(&recinf->lock);
INIT_LIST_HEAD(&recinf->pending);
timer_setup(&recinf->timer, timer_callback, 0);
sbi->recov_info = recinf;
ret = 0;
out:
return ret;
}
void scoutfs_recov_destroy(struct super_block *sb)
{
DECLARE_RECOV_INFO(sb, recinf);
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
if (recinf) {
scoutfs_recov_shutdown(sb);
kfree(recinf);
sbi->recov_info = NULL;
}
}

23
kmod/src/recov.h Normal file
View File

@@ -0,0 +1,23 @@
#ifndef _SCOUTFS_RECOV_H_
#define _SCOUTFS_RECOV_H_
enum {
SCOUTFS_RECOV_GREETING = ( 1 << 0),
SCOUTFS_RECOV_LOCKS = ( 1 << 1),
SCOUTFS_RECOV_INVALID = (~0 << 2),
SCOUTFS_RECOV_ALL = (~SCOUTFS_RECOV_INVALID),
};
int scoutfs_recov_prepare(struct super_block *sb, u64 rid, int which);
int scoutfs_recov_begin(struct super_block *sb, void (*timeout_fn)(struct super_block *),
unsigned int timeout_ms);
int scoutfs_recov_finish(struct super_block *sb, u64 rid, int which);
bool scoutfs_recov_is_pending(struct super_block *sb, u64 rid, int which);
u64 scoutfs_recov_next_pending(struct super_block *sb, u64 rid, int which);
void scoutfs_recov_shutdown(struct super_block *sb);
int scoutfs_recov_setup(struct super_block *sb);
void scoutfs_recov_destroy(struct super_block *sb);
#endif
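
The header above only declares the interface. Below is a minimal illustrative sketch of the server-side calling pattern it suggests; the rids array, the 30 second timeout, and the evict step are assumptions for the example, not code from these patches:

/* Sketch: wait for greeting recovery from a set of known client rids. */
static void recovery_timed_out(struct super_block *sb)
{
	u64 rid = 0;

	/* walk clients still pending; the server is expected to evict them
	 * and call scoutfs_recov_finish() for each evicted rid */
	while ((rid = scoutfs_recov_next_pending(sb, rid,
						 SCOUTFS_RECOV_GREETING)) != 0) {
		/* hypothetical: evict_client(sb, rid); */
	}
}

static int wait_for_greetings(struct super_block *sb, u64 *rids, int nr)
{
	int ret;
	int i;

	for (i = 0; i < nr; i++) {
		ret = scoutfs_recov_prepare(sb, rids[i], SCOUTFS_RECOV_GREETING);
		if (ret < 0)
			return ret;
	}

	/* > 0 means every prepared client had already finished */
	ret = scoutfs_recov_begin(sb, recovery_timed_out, 30 * 1000);
	return ret > 0 ? 0 : ret;
}

As each greeting later arrives, its handler would call scoutfs_recov_finish(sb, rid, SCOUTFS_RECOV_GREETING) and, when that returns > 0, proceed with the work that was waiting on recovery.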

View File

@@ -424,14 +424,15 @@ TRACE_EVENT(scoutfs_trans_write_func,
);
DECLARE_EVENT_CLASS(scoutfs_trans_hold_release_class,
TP_PROTO(struct super_block *sb, void *journal_info, int holders),
TP_PROTO(struct super_block *sb, void *journal_info, int holders, int ret),
TP_ARGS(sb, journal_info, holders),
TP_ARGS(sb, journal_info, holders, ret),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(unsigned long, journal_info)
__field(int, holders)
__field(int, ret)
),
TP_fast_assign(
@@ -440,17 +441,17 @@ DECLARE_EVENT_CLASS(scoutfs_trans_hold_release_class,
__entry->holders = holders;
),
TP_printk(SCSBF" journal_info 0x%0lx holders %d",
SCSB_TRACE_ARGS, __entry->journal_info, __entry->holders)
TP_printk(SCSBF" journal_info 0x%0lx holders %d ret %d",
SCSB_TRACE_ARGS, __entry->journal_info, __entry->holders, __entry->ret)
);
DEFINE_EVENT(scoutfs_trans_hold_release_class, scoutfs_trans_acquired_hold,
TP_PROTO(struct super_block *sb, void *journal_info, int holders),
TP_ARGS(sb, journal_info, holders)
DEFINE_EVENT(scoutfs_trans_hold_release_class, scoutfs_hold_trans,
TP_PROTO(struct super_block *sb, void *journal_info, int holders, int ret),
TP_ARGS(sb, journal_info, holders, ret)
);
DEFINE_EVENT(scoutfs_trans_hold_release_class, scoutfs_release_trans,
TP_PROTO(struct super_block *sb, void *journal_info, int holders),
TP_ARGS(sb, journal_info, holders)
TP_PROTO(struct super_block *sb, void *journal_info, int holders, int ret),
TP_ARGS(sb, journal_info, holders, ret)
);
TRACE_EVENT(scoutfs_ioc_release,
@@ -690,15 +691,16 @@ TRACE_EVENT(scoutfs_evict_inode,
TRACE_EVENT(scoutfs_drop_inode,
TP_PROTO(struct super_block *sb, __u64 ino, unsigned int nlink,
unsigned int unhashed),
unsigned int unhashed, bool drop_invalidated),
TP_ARGS(sb, ino, nlink, unhashed),
TP_ARGS(sb, ino, nlink, unhashed, drop_invalidated),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, ino)
__field(unsigned int, nlink)
__field(unsigned int, unhashed)
__field(unsigned int, drop_invalidated)
),
TP_fast_assign(
@@ -706,10 +708,12 @@ TRACE_EVENT(scoutfs_drop_inode,
__entry->ino = ino;
__entry->nlink = nlink;
__entry->unhashed = unhashed;
__entry->drop_invalidated = !!drop_invalidated;
),
TP_printk(SCSBF" ino %llu nlink %u unhashed %d", SCSB_TRACE_ARGS,
__entry->ino, __entry->nlink, __entry->unhashed)
TP_printk(SCSBF" ino %llu nlink %u unhashed %d drop_invalidated %u", SCSB_TRACE_ARGS,
__entry->ino, __entry->nlink, __entry->unhashed,
__entry->drop_invalidated)
);
TRACE_EVENT(scoutfs_inode_walk_writeback,
@@ -982,22 +986,6 @@ TRACE_EVENT(scoutfs_delete_inode,
__entry->mode, __entry->size)
);
TRACE_EVENT(scoutfs_scan_orphans,
TP_PROTO(struct super_block *sb),
TP_ARGS(sb),
TP_STRUCT__entry(
__field(dev_t, dev)
),
TP_fast_assign(
__entry->dev = sb->s_dev;
),
TP_printk("dev %d,%d", MAJOR(__entry->dev), MINOR(__entry->dev))
);
DECLARE_EVENT_CLASS(scoutfs_key_class,
TP_PROTO(struct super_block *sb, struct scoutfs_key *key),
TP_ARGS(sb, key),
@@ -1641,6 +1629,164 @@ TRACE_EVENT(scoutfs_btree_walk,
__entry->level, __entry->ref_blkno, __entry->ref_seq)
);
TRACE_EVENT(scoutfs_btree_set_parent,
TP_PROTO(struct super_block *sb,
struct scoutfs_btree_root *root, struct scoutfs_key *key,
struct scoutfs_btree_root *par_root),
TP_ARGS(sb, root, key, par_root),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, root_blkno)
__field(__u64, root_seq)
__field(__u8, root_height)
sk_trace_define(key)
__field(__u64, par_root_blkno)
__field(__u64, par_root_seq)
__field(__u8, par_root_height)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->root_blkno = le64_to_cpu(root->ref.blkno);
__entry->root_seq = le64_to_cpu(root->ref.seq);
__entry->root_height = root->height;
sk_trace_assign(key, key);
__entry->par_root_blkno = le64_to_cpu(par_root->ref.blkno);
__entry->par_root_seq = le64_to_cpu(par_root->ref.seq);
__entry->par_root_height = par_root->height;
),
TP_printk(SCSBF" root blkno %llu seq %llu height %u, key "SK_FMT", par_root blkno %llu seq %llu height %u",
SCSB_TRACE_ARGS, __entry->root_blkno, __entry->root_seq,
__entry->root_height, sk_trace_args(key),
__entry->par_root_blkno, __entry->par_root_seq,
__entry->par_root_height)
);
TRACE_EVENT(scoutfs_btree_merge,
TP_PROTO(struct super_block *sb, struct scoutfs_btree_root *root,
struct scoutfs_key *start, struct scoutfs_key *end),
TP_ARGS(sb, root, start, end),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, root_blkno)
__field(__u64, root_seq)
__field(__u8, root_height)
sk_trace_define(start)
sk_trace_define(end)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->root_blkno = le64_to_cpu(root->ref.blkno);
__entry->root_seq = le64_to_cpu(root->ref.seq);
__entry->root_height = root->height;
sk_trace_assign(start, start);
sk_trace_assign(end, end);
),
TP_printk(SCSBF" root blkno %llu seq %llu height %u start "SK_FMT" end "SK_FMT,
SCSB_TRACE_ARGS, __entry->root_blkno, __entry->root_seq,
__entry->root_height, sk_trace_args(start),
sk_trace_args(end))
);
TRACE_EVENT(scoutfs_btree_merge_items,
TP_PROTO(struct super_block *sb,
struct scoutfs_btree_root *m_root,
struct scoutfs_key *m_key, int m_val_len,
struct scoutfs_btree_root *f_root,
struct scoutfs_key *f_key, int f_val_len,
int is_del),
TP_ARGS(sb, m_root, m_key, m_val_len, f_root, f_key, f_val_len, is_del),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, m_root_blkno)
__field(__u64, m_root_seq)
__field(__u8, m_root_height)
sk_trace_define(m_key)
__field(int, m_val_len)
__field(__u64, f_root_blkno)
__field(__u64, f_root_seq)
__field(__u8, f_root_height)
sk_trace_define(f_key)
__field(int, f_val_len)
__field(int, is_del)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->m_root_blkno = m_root ?
le64_to_cpu(m_root->ref.blkno) : 0;
__entry->m_root_seq = m_root ? le64_to_cpu(m_root->ref.seq) : 0;
__entry->m_root_height = m_root ? m_root->height : 0;
sk_trace_assign(m_key, m_key);
__entry->m_val_len = m_val_len;
__entry->f_root_blkno = f_root ?
le64_to_cpu(f_root->ref.blkno) : 0;
__entry->f_root_seq = f_root ? le64_to_cpu(f_root->ref.seq) : 0;
__entry->f_root_height = f_root ? f_root->height : 0;
sk_trace_assign(f_key, f_key);
__entry->f_val_len = f_val_len;
__entry->is_del = !!is_del;
),
TP_printk(SCSBF" merge item root blkno %llu seq %llu height %u key "SK_FMT" val_len %d, fs item root blkno %llu seq %llu height %u key "SK_FMT" val_len %d, is_del %d",
SCSB_TRACE_ARGS, __entry->m_root_blkno, __entry->m_root_seq,
__entry->m_root_height, sk_trace_args(m_key),
__entry->m_val_len, __entry->f_root_blkno,
__entry->f_root_seq, __entry->f_root_height,
sk_trace_args(f_key), __entry->f_val_len, __entry->is_del)
);
DECLARE_EVENT_CLASS(scoutfs_btree_free_blocks,
TP_PROTO(struct super_block *sb, struct scoutfs_btree_root *root,
u64 blkno),
TP_ARGS(sb, root, blkno),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, root_blkno)
__field(__u64, root_seq)
__field(__u8, root_height)
__field(__u64, blkno)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->root_blkno = le64_to_cpu(root->ref.blkno);
__entry->root_seq = le64_to_cpu(root->ref.seq);
__entry->root_height = root->height;
__entry->blkno = blkno;
),
TP_printk(SCSBF" root blkno %llu seq %llu height %u, free blkno %llu",
SCSB_TRACE_ARGS, __entry->root_blkno, __entry->root_seq,
__entry->root_height, __entry->blkno)
);
DEFINE_EVENT(scoutfs_btree_free_blocks, scoutfs_btree_free_blocks_single,
TP_PROTO(struct super_block *sb, struct scoutfs_btree_root *root,
u64 blkno),
TP_ARGS(sb, root, blkno)
);
DEFINE_EVENT(scoutfs_btree_free_blocks, scoutfs_btree_free_blocks_leaf,
TP_PROTO(struct super_block *sb, struct scoutfs_btree_root *root,
u64 blkno),
TP_ARGS(sb, root, blkno)
);
DEFINE_EVENT(scoutfs_btree_free_blocks, scoutfs_btree_free_blocks_parent,
TP_PROTO(struct super_block *sb, struct scoutfs_btree_root *root,
u64 blkno),
TP_ARGS(sb, root, blkno)
);
TRACE_EVENT(scoutfs_online_offline_blocks,
TP_PROTO(struct inode *inode, s64 on_delta, s64 off_delta,
u64 on_now, u64 off_now),
@@ -1897,6 +2043,116 @@ TRACE_EVENT(scoutfs_trans_seq_last,
SCSB_TRACE_ARGS, __entry->s_rid, __entry->trans_seq)
);
TRACE_EVENT(scoutfs_get_log_merge_status,
TP_PROTO(struct super_block *sb, u64 rid, struct scoutfs_key *next_range_key,
u64 nr_requests, u64 nr_complete, u64 last_seq, u64 seq),
TP_ARGS(sb, rid, next_range_key, nr_requests, nr_complete, last_seq, seq),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, s_rid)
sk_trace_define(next_range_key)
__field(__u64, nr_requests)
__field(__u64, nr_complete)
__field(__u64, last_seq)
__field(__u64, seq)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->s_rid = rid;
sk_trace_assign(next_range_key, next_range_key);
__entry->nr_requests = nr_requests;
__entry->nr_complete = nr_complete;
__entry->last_seq = last_seq;
__entry->seq = seq;
),
TP_printk(SCSBF" rid %016llx next_range_key "SK_FMT" nr_requests %llu nr_complete %llu last_seq %llu seq %llu",
SCSB_TRACE_ARGS, __entry->s_rid, sk_trace_args(next_range_key),
__entry->nr_requests, __entry->nr_complete, __entry->last_seq, __entry->seq)
);
TRACE_EVENT(scoutfs_get_log_merge_request,
TP_PROTO(struct super_block *sb, u64 rid,
struct scoutfs_btree_root *root, struct scoutfs_key *start,
struct scoutfs_key *end, u64 last_seq, u64 seq),
TP_ARGS(sb, rid, root, start, end, last_seq, seq),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, s_rid)
__field(__u64, root_blkno)
__field(__u64, root_seq)
__field(__u8, root_height)
sk_trace_define(start)
sk_trace_define(end)
__field(__u64, last_seq)
__field(__u64, seq)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->s_rid = rid;
__entry->root_blkno = le64_to_cpu(root->ref.blkno);
__entry->root_seq = le64_to_cpu(root->ref.seq);
__entry->root_height = root->height;
sk_trace_assign(start, start);
sk_trace_assign(end, end);
__entry->last_seq = last_seq;
__entry->seq = seq;
),
TP_printk(SCSBF" rid %016llx root blkno %llu seq %llu height %u start "SK_FMT" end "SK_FMT" last_seq %llu seq %llu",
SCSB_TRACE_ARGS, __entry->s_rid, __entry->root_blkno,
__entry->root_seq, __entry->root_height,
sk_trace_args(start), sk_trace_args(end), __entry->last_seq,
__entry->seq)
);
TRACE_EVENT(scoutfs_get_log_merge_complete,
TP_PROTO(struct super_block *sb, u64 rid,
struct scoutfs_btree_root *root, struct scoutfs_key *start,
struct scoutfs_key *end, struct scoutfs_key *remain,
u64 seq, u64 flags),
TP_ARGS(sb, rid, root, start, end, remain, seq, flags),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, s_rid)
__field(__u64, root_blkno)
__field(__u64, root_seq)
__field(__u8, root_height)
sk_trace_define(start)
sk_trace_define(end)
sk_trace_define(remain)
__field(__u64, seq)
__field(__u64, flags)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->s_rid = rid;
__entry->root_blkno = le64_to_cpu(root->ref.blkno);
__entry->root_seq = le64_to_cpu(root->ref.seq);
__entry->root_height = root->height;
sk_trace_assign(start, start);
sk_trace_assign(end, end);
sk_trace_assign(remain, remain);
__entry->seq = seq;
__entry->flags = flags;
),
TP_printk(SCSBF" rid %016llx root blkno %llu seq %llu height %u start "SK_FMT" end "SK_FMT" remain "SK_FMT" seq %llu flags 0x%llx",
SCSB_TRACE_ARGS, __entry->s_rid, __entry->root_blkno,
__entry->root_seq, __entry->root_height,
sk_trace_args(start), sk_trace_args(end),
sk_trace_args(remain), __entry->seq, __entry->flags)
);
DECLARE_EVENT_CLASS(scoutfs_forest_bloom_class,
TP_PROTO(struct super_block *sb, struct scoutfs_key *key,
u64 rid, u64 nr, u64 blkno, u64 seq, unsigned int count),
@@ -2402,6 +2658,89 @@ TRACE_EVENT(scoutfs_item_invalidate_page,
sk_trace_args(pg_start), sk_trace_args(pg_end), __entry->pgi)
);
DECLARE_EVENT_CLASS(scoutfs_omap_group_class,
TP_PROTO(struct super_block *sb, void *grp, u64 group_nr, unsigned int group_total,
int bit_nr, int bit_count),
TP_ARGS(sb, grp, group_nr, group_total, bit_nr, bit_count),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(void *, grp)
__field(__u64, group_nr)
__field(unsigned int, group_total)
__field(int, bit_nr)
__field(int, bit_count)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->grp = grp;
__entry->group_nr = group_nr;
__entry->group_total = group_total;
__entry->bit_nr = bit_nr;
__entry->bit_count = bit_count;
),
TP_printk(SCSBF" grp %p group_nr %llu group_total %u bit_nr %d bit_count %d",
SCSB_TRACE_ARGS, __entry->grp, __entry->group_nr, __entry->group_total,
__entry->bit_nr, __entry->bit_count)
);
DEFINE_EVENT(scoutfs_omap_group_class, scoutfs_omap_group_alloc,
TP_PROTO(struct super_block *sb, void *grp, u64 group_nr, unsigned int group_total,
int bit_nr, int bit_count),
TP_ARGS(sb, grp, group_nr, group_total, bit_nr, bit_count)
);
DEFINE_EVENT(scoutfs_omap_group_class, scoutfs_omap_group_free,
TP_PROTO(struct super_block *sb, void *grp, u64 group_nr, unsigned int group_total,
int bit_nr, int bit_count),
TP_ARGS(sb, grp, group_nr, group_total, bit_nr, bit_count)
);
DEFINE_EVENT(scoutfs_omap_group_class, scoutfs_omap_group_inc,
TP_PROTO(struct super_block *sb, void *grp, u64 group_nr, unsigned int group_total,
int bit_nr, int bit_count),
TP_ARGS(sb, grp, group_nr, group_total, bit_nr, bit_count)
);
DEFINE_EVENT(scoutfs_omap_group_class, scoutfs_omap_group_dec,
TP_PROTO(struct super_block *sb, void *grp, u64 group_nr, unsigned int group_total,
int bit_nr, int bit_count),
TP_ARGS(sb, grp, group_nr, group_total, bit_nr, bit_count)
);
DEFINE_EVENT(scoutfs_omap_group_class, scoutfs_omap_group_request,
TP_PROTO(struct super_block *sb, void *grp, u64 group_nr, unsigned int group_total,
int bit_nr, int bit_count),
TP_ARGS(sb, grp, group_nr, group_total, bit_nr, bit_count)
);
DEFINE_EVENT(scoutfs_omap_group_class, scoutfs_omap_group_destroy,
TP_PROTO(struct super_block *sb, void *grp, u64 group_nr, unsigned int group_total,
int bit_nr, int bit_count),
TP_ARGS(sb, grp, group_nr, group_total, bit_nr, bit_count)
);
TRACE_EVENT(scoutfs_omap_should_delete,
TP_PROTO(struct super_block *sb, u64 ino, unsigned int nlink, int ret),
TP_ARGS(sb, ino, nlink, ret),
TP_STRUCT__entry(
SCSB_TRACE_FIELDS
__field(__u64, ino)
__field(unsigned int, nlink)
__field(int, ret)
),
TP_fast_assign(
SCSB_TRACE_ASSIGN(sb);
__entry->ino = ino;
__entry->nlink = nlink;
__entry->ret = ret;
),
TP_printk(SCSBF" ino %llu nlink %u ret %d",
SCSB_TRACE_ARGS, __entry->ino, __entry->nlink, __entry->ret)
);
#endif /* _TRACE_SCOUTFS_H */
/* This part must be outside protection */

File diff suppressed because it is too large

View File

@@ -56,19 +56,27 @@ do { \
__entry->name##_data_len, __entry->name##_cmd, __entry->name##_flags, \
__entry->name##_error
u64 scoutfs_server_reserved_meta_blocks(struct super_block *sb);
int scoutfs_server_lock_request(struct super_block *sb, u64 rid,
struct scoutfs_net_lock *nl);
int scoutfs_server_lock_response(struct super_block *sb, u64 rid, u64 id,
struct scoutfs_net_lock_grant_response *gr);
struct scoutfs_net_lock *nl);
int scoutfs_server_lock_recover_request(struct super_block *sb, u64 rid,
struct scoutfs_key *key);
void scoutfs_server_get_roots(struct super_block *sb,
struct scoutfs_net_roots *roots);
int scoutfs_server_hold_commit(struct super_block *sb);
void scoutfs_server_hold_commit(struct super_block *sb);
int scoutfs_server_apply_commit(struct super_block *sb, int err);
void scoutfs_server_recov_finish(struct super_block *sb, u64 rid, int which);
int scoutfs_server_send_omap_request(struct super_block *sb, u64 rid,
struct scoutfs_open_ino_map_args *args);
int scoutfs_server_send_omap_response(struct super_block *sb, u64 rid, u64 id,
struct scoutfs_open_ino_map *map, int err);
u64 scoutfs_server_seq(struct super_block *sb);
u64 scoutfs_server_next_seq(struct super_block *sb);
void scoutfs_server_set_seq_if_greater(struct super_block *sb, u64 seq);
struct sockaddr_in;
struct scoutfs_quorum_elected_info;
int scoutfs_server_start(struct super_block *sb, u64 term);
void scoutfs_server_abort(struct super_block *sb);
void scoutfs_server_stop(struct super_block *sb);

View File

@@ -989,12 +989,13 @@ int scoutfs_srch_rotate_log(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
struct scoutfs_btree_root *root,
struct scoutfs_srch_file *sfl)
struct scoutfs_srch_file *sfl, bool force)
{
struct scoutfs_key key;
int ret;
if (le64_to_cpu(sfl->blocks) < SCOUTFS_SRCH_LOG_BLOCK_LIMIT)
if (sfl->ref.blkno == 0 ||
(!force && le64_to_cpu(sfl->blocks) < SCOUTFS_SRCH_LOG_BLOCK_LIMIT))
return 0;
init_srch_key(&key, SCOUTFS_SRCH_LOG_TYPE,

View File

@@ -37,7 +37,7 @@ int scoutfs_srch_rotate_log(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,
struct scoutfs_btree_root *root,
struct scoutfs_srch_file *sfl);
struct scoutfs_srch_file *sfl, bool force);
int scoutfs_srch_get_compact(struct super_block *sb,
struct scoutfs_alloc *alloc,
struct scoutfs_block_writer *wri,

View File

@@ -44,6 +44,10 @@
#include "srch.h"
#include "item.h"
#include "alloc.h"
#include "recov.h"
#include "omap.h"
#include "volopt.h"
#include "fence.h"
#include "scoutfs_trace.h"
static struct dentry *scoutfs_debugfs_root;
@@ -166,7 +170,7 @@ out:
* try to free as many locks as possible.
*/
if (scoutfs_trigger(sb, STATFS_LOCK_PURGE))
scoutfs_free_unused_locks(sb, -1UL);
scoutfs_free_unused_locks(sb);
return ret;
}
@@ -243,28 +247,30 @@ static void scoutfs_put_super(struct super_block *sb)
trace_scoutfs_put_super(sb);
sbi->shutdown = true;
scoutfs_data_destroy(sb);
scoutfs_inode_stop(sb);
scoutfs_forest_stop(sb);
scoutfs_srch_destroy(sb);
scoutfs_unlock(sb, sbi->rid_lock, SCOUTFS_LOCK_WRITE);
sbi->rid_lock = NULL;
scoutfs_lock_shutdown(sb);
scoutfs_shutdown_trans(sb);
scoutfs_volopt_destroy(sb);
scoutfs_client_destroy(sb);
scoutfs_inode_destroy(sb);
scoutfs_item_destroy(sb);
scoutfs_forest_destroy(sb);
scoutfs_data_destroy(sb);
scoutfs_quorum_destroy(sb);
scoutfs_lock_shutdown(sb);
scoutfs_server_destroy(sb);
scoutfs_recov_destroy(sb);
scoutfs_net_destroy(sb);
scoutfs_lock_destroy(sb);
scoutfs_omap_destroy(sb);
scoutfs_block_destroy(sb);
scoutfs_destroy_triggers(sb);
scoutfs_fence_destroy(sb);
scoutfs_options_destroy(sb);
scoutfs_sysfs_destroy_attrs(sb, &sbi->mopts_ssa);
debugfs_remove(sbi->debug_root);
@@ -278,6 +284,21 @@ static void scoutfs_put_super(struct super_block *sb)
sb->s_fs_info = NULL;
}
/*
* Record that we're performing a forced unmount. As put_super drives
* destruction of the filesystem we won't issue more network or storage
* operations because we assume that they'll hang. Pending operations
* can return errors when it's possible to do so. We may be racing with
* pending operations which can't be canceled.
*/
static void scoutfs_umount_begin(struct super_block *sb)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
scoutfs_warn(sb, "forcing unmount, can return errors and lose unsynced data");
sbi->forced_unmount = true;
}
static const struct super_operations scoutfs_super_ops = {
.alloc_inode = scoutfs_alloc_inode,
.drop_inode = scoutfs_drop_inode,
@@ -287,6 +308,7 @@ static const struct super_operations scoutfs_super_ops = {
.statfs = scoutfs_statfs,
.show_options = scoutfs_show_options,
.put_super = scoutfs_put_super,
.umount_begin = scoutfs_umount_begin,
};
/*
@@ -585,21 +607,24 @@ static int scoutfs_fill_super(struct super_block *sb, void *data, int silent)
scoutfs_sysfs_create_attrs(sb, &sbi->mopts_ssa,
mount_options_attrs, "mount_options") ?:
scoutfs_setup_triggers(sb) ?:
scoutfs_fence_setup(sb) ?:
scoutfs_block_setup(sb) ?:
scoutfs_forest_setup(sb) ?:
scoutfs_item_setup(sb) ?:
scoutfs_inode_setup(sb) ?:
scoutfs_data_setup(sb) ?:
scoutfs_setup_trans(sb) ?:
scoutfs_omap_setup(sb) ?:
scoutfs_lock_setup(sb) ?:
scoutfs_net_setup(sb) ?:
scoutfs_recov_setup(sb) ?:
scoutfs_server_setup(sb) ?:
scoutfs_quorum_setup(sb) ?:
scoutfs_client_setup(sb) ?:
scoutfs_lock_rid(sb, SCOUTFS_LOCK_WRITE, 0, sbi->rid,
&sbi->rid_lock) ?:
scoutfs_volopt_setup(sb) ?:
scoutfs_trans_get_log_trees(sb) ?:
scoutfs_srch_setup(sb);
scoutfs_srch_setup(sb) ?:
scoutfs_inode_start(sb);
if (ret)
goto out;
@@ -620,7 +645,6 @@ static int scoutfs_fill_super(struct super_block *sb, void *data, int silent)
goto out;
scoutfs_trans_restart_sync_deadline(sb);
// scoutfs_scan_orphans(sb);
ret = 0;
out:
/* on error, generic_shutdown_super calls put_super if s_root */
@@ -643,6 +667,9 @@ static void scoutfs_kill_sb(struct super_block *sb)
{
trace_scoutfs_kill_sb(sb);
if (SCOUTFS_HAS_SBI(sb))
scoutfs_lock_unmount_begin(sb);
kill_block_super(sb);
}

View File

@@ -26,13 +26,16 @@ struct net_info;
struct block_info;
struct forest_info;
struct srch_info;
struct recov_info;
struct omap_info;
struct volopt_info;
struct fence_info;
struct scoutfs_sb_info {
struct super_block *sb;
/* assigned once at the start of each mount, read-only */
u64 rid;
struct scoutfs_lock *rid_lock;
struct scoutfs_super_block super;
@@ -48,7 +51,10 @@ struct scoutfs_sb_info {
struct block_info *block_info;
struct forest_info *forest_info;
struct srch_info *srch_info;
struct omap_info *omap_info;
struct volopt_info *volopt_info;
struct item_cache_info *item_cache_info;
struct fence_info *fence_info;
wait_queue_head_t trans_hold_wq;
struct task_struct *trans_task;
@@ -70,6 +76,7 @@ struct scoutfs_sb_info {
struct lock_server_info *lock_server_info;
struct client_info *client_info;
struct server_info *server_info;
struct recov_info *recov_info;
struct sysfs_info *sfsinfo;
struct scoutfs_counters *counters;
@@ -81,7 +88,7 @@ struct scoutfs_sb_info {
struct dentry *debug_root;
bool shutdown;
bool forced_unmount;
unsigned long corruption_messages_once[SC_NR_LONGS];
};
@@ -103,6 +110,13 @@ static inline bool SCOUTFS_IS_META_BDEV(struct scoutfs_super_block *super_block)
#define SCOUTFS_META_BDEV_MODE (FMODE_READ | FMODE_WRITE | FMODE_EXCL)
static inline bool scoutfs_forcing_unmount(struct super_block *sb)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
return sbi->forced_unmount;
}
/*
* A small string embedded in messages that's used to identify a
* specific mount. It's the three most significant bytes of the fsid

View File

@@ -131,9 +131,10 @@ void scoutfs_sysfs_init_attrs(struct super_block *sb,
* If this returns success then the file will be visible and show can
* be called until unmount.
*/
int scoutfs_sysfs_create_attrs(struct super_block *sb,
struct scoutfs_sysfs_attrs *ssa,
struct attribute **attrs, char *fmt, ...)
int scoutfs_sysfs_create_attrs_parent(struct super_block *sb,
struct kobject *parent,
struct scoutfs_sysfs_attrs *ssa,
struct attribute **attrs, char *fmt, ...)
{
va_list args;
size_t name_len;
@@ -174,8 +175,8 @@ int scoutfs_sysfs_create_attrs(struct super_block *sb,
goto out;
}
ret = kobject_init_and_add(&ssa->kobj, &ssa->ktype,
scoutfs_sysfs_sb_dir(sb), "%s", ssa->name);
ret = kobject_init_and_add(&ssa->kobj, &ssa->ktype, parent,
"%s", ssa->name);
out:
if (ret) {
kfree(ssa->name);

View File

@@ -10,6 +10,8 @@
#define SCOUTFS_ATTR_RO(_name) \
static struct kobj_attribute scoutfs_attr_##_name = __ATTR_RO(_name)
#define SCOUTFS_ATTR_RW(_name) \
static struct kobj_attribute scoutfs_attr_##_name = __ATTR_RW(_name)
#define SCOUTFS_ATTR_PTR(_name) \
&scoutfs_attr_##_name.attr
@@ -34,9 +36,14 @@ struct scoutfs_sysfs_attrs {
void scoutfs_sysfs_init_attrs(struct super_block *sb,
struct scoutfs_sysfs_attrs *ssa);
int scoutfs_sysfs_create_attrs(struct super_block *sb,
struct scoutfs_sysfs_attrs *ssa,
struct attribute **attrs, char *fmt, ...);
int scoutfs_sysfs_create_attrs_parent(struct super_block *sb,
struct kobject *parent,
struct scoutfs_sysfs_attrs *ssa,
struct attribute **attrs, char *fmt, ...);
#define scoutfs_sysfs_create_attrs(sb, ssa, attrs, fmt, args...) \
scoutfs_sysfs_create_attrs_parent(sb, scoutfs_sysfs_sb_dir(sb), \
ssa, attrs, fmt, ##args)
void scoutfs_sysfs_destroy_attrs(struct super_block *sb,
struct scoutfs_sysfs_attrs *ssa);

View File

@@ -185,6 +185,11 @@ void scoutfs_trans_write_func(struct work_struct *work)
wait_event(sbi->trans_hold_wq, drained_holders(tri));
if (scoutfs_forcing_unmount(sb)) {
ret = -EIO;
goto out;
}
trace_scoutfs_trans_write_func(sb,
scoutfs_block_writer_dirty_bytes(sb, &tri->wri));
@@ -202,7 +207,7 @@ void scoutfs_trans_write_func(struct work_struct *work)
if (ret < 0)
s = "clean advance seq";
}
goto out;
goto err;
}
if (sbi->trans_deadline_expired)
@@ -222,11 +227,12 @@ void scoutfs_trans_write_func(struct work_struct *work)
scoutfs_item_write_done(sb) ?:
(s = "advance seq", scoutfs_client_advance_seq(sb, &trans_seq)) ?:
(s = "get log trees", scoutfs_trans_get_log_trees(sb));
out:
err:
if (ret < 0)
scoutfs_err(sb, "critical transaction commit failure: %s, %d",
s, ret);
out:
spin_lock(&sbi->trans_write_lock);
sbi->trans_write_count++;
sbi->trans_write_ret = ret;
@@ -430,8 +436,8 @@ static bool commit_before_hold(struct super_block *sb, struct trans_info *tri)
return true;
}
/* Try to refill data allocator before premature enospc */
if (scoutfs_data_alloc_free_bytes(sb) <= SCOUTFS_TRANS_DATA_ALLOC_LWM) {
/* if we're low and can't refill then alloc could empty and return enospc */
if (scoutfs_data_alloc_should_refill(sb, SCOUTFS_ALLOC_DATA_REFILL_THRESH)) {
scoutfs_inc_counter(sb, trans_commit_data_alloc_low);
return true;
}
@@ -439,38 +445,15 @@ static bool commit_before_hold(struct super_block *sb, struct trans_info *tri)
return false;
}
static bool acquired_hold(struct super_block *sb)
/*
* Called as a wait_event condition, this needs to be careful not to
* change task state and is racing with waking paths that sub_return,
* test, and wake.
*/
static bool holders_no_writer(struct trans_info *tri)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
DECLARE_TRANS_INFO(sb, tri);
bool acquired;
/* if a caller already has a hold we acquire unconditionally */
if (inc_journal_info_holders()) {
atomic_inc(&tri->holders);
acquired = true;
goto out;
}
/* wait if the writer is blocking holds */
if (!inc_holders_unless_writer(tri)) {
dec_journal_info_holders();
acquired = false;
goto out;
}
/* wait if we're triggering another commit */
if (commit_before_hold(sb, tri)) {
release_holders(sb);
queue_trans_work(sbi);
acquired = false;
goto out;
}
trace_scoutfs_trans_acquired_hold(sb, current->journal_info, atomic_read(&tri->holders));
acquired = true;
out:
return acquired;
smp_mb(); /* make sure task in wait_event queue before atomic read */
return !(atomic_read(&tri->holders) & TRANS_HOLDERS_WRITE_FUNC_BIT);
}
/*
@@ -486,15 +469,64 @@ out:
* The writing thread marks itself as a global trans_task which
* short-circuits all the hold machinery so it can call code that would
* otherwise try to hold transactions while it is writing.
*
* If the caller is adding metadata items that will eventually consume
* free space -- not dirtying existing items or adding deletion items --
* then we can return enospc if our metadata allocator indicates that
* we're low on space.
*/
int scoutfs_hold_trans(struct super_block *sb)
int scoutfs_hold_trans(struct super_block *sb, bool allocing)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
DECLARE_TRANS_INFO(sb, tri);
u64 seq;
int ret;
if (current == sbi->trans_task)
return 0;
return wait_event_interruptible(sbi->trans_hold_wq, acquired_hold(sb));
for (;;) {
/* if a caller already has a hold we acquire unconditionally */
if (inc_journal_info_holders()) {
atomic_inc(&tri->holders);
ret = 0;
break;
}
/* wait until the writer work is finished */
if (!inc_holders_unless_writer(tri)) {
dec_journal_info_holders();
ret = wait_event_interruptible(sbi->trans_hold_wq, holders_no_writer(tri));
if (ret < 0)
break;
continue;
}
/* return enospc if server is into reserved blocks and we're allocating */
if (allocing && scoutfs_alloc_test_flag(sb, &tri->alloc, SCOUTFS_ALLOC_FLAG_LOW)) {
release_holders(sb);
ret = -ENOSPC;
break;
}
/* see if we need to trigger and wait for a commit before holding */
if (commit_before_hold(sb, tri)) {
seq = scoutfs_trans_sample_seq(sb);
release_holders(sb);
queue_trans_work(sbi);
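/* wait for the transaction seq to advance past our sample, which indicates the commit we queued has finished */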
ret = wait_event_interruptible(sbi->trans_hold_wq,
scoutfs_trans_sample_seq(sb) != seq);
if (ret < 0)
break;
continue;
}
ret = 0;
break;
}
trace_scoutfs_hold_trans(sb, current->journal_info, atomic_read(&tri->holders), ret);
return ret;
}
/*
@@ -519,7 +551,7 @@ void scoutfs_release_trans(struct super_block *sb)
release_holders(sb);
trace_scoutfs_release_trans(sb, current->journal_info, atomic_read(&tri->holders));
trace_scoutfs_release_trans(sb, current->journal_info, atomic_read(&tri->holders), 0);
}
/*
@@ -564,8 +596,15 @@ int scoutfs_setup_trans(struct super_block *sb)
}
/*
* kill_sb calls sync before getting here so we know that dirty data
* should be in flight. We just have to wait for it to quiesce.
* While the vfs will have done an fs level sync before calling
* put_super, we may have done work down in our level after all the fs
* ops were done. An example is final inode deletion in iput, that's
* done in generic_shutdown_super after the sync and before calling our
* put_super.
*
* So we always try to write any remaining dirty transactions before
* shutting down. Typically there won't be any dirty data and the
* worker will just return.
*/
void scoutfs_shutdown_trans(struct super_block *sb)
{
@@ -573,13 +612,18 @@ void scoutfs_shutdown_trans(struct super_block *sb)
DECLARE_TRANS_INFO(sb, tri);
if (tri) {
scoutfs_block_writer_forget_all(sb, &tri->wri);
if (sbi->trans_write_workq) {
/* immediately queues pending timer */
flush_delayed_work(&sbi->trans_write_work);
/* prevents re-arming if it has to wait */
cancel_delayed_work_sync(&sbi->trans_write_work);
destroy_workqueue(sbi->trans_write_workq);
/* trans work schedules after shutdown see null */
sbi->trans_write_workq = NULL;
}
scoutfs_block_writer_forget_all(sb, &tri->wri);
kfree(tri);
sbi->trans_info = NULL;
}

View File

@@ -1,18 +1,13 @@
#ifndef _SCOUTFS_TRANS_H_
#define _SCOUTFS_TRANS_H_
/* the server will attempt to fill data allocs for each trans */
#define SCOUTFS_TRANS_DATA_ALLOC_HWM (2ULL * 1024 * 1024 * 1024)
/* the client will force commits if data allocators get too low */
#define SCOUTFS_TRANS_DATA_ALLOC_LWM (256ULL * 1024 * 1024)
void scoutfs_trans_write_func(struct work_struct *work);
int scoutfs_trans_sync(struct super_block *sb, int wait);
int scoutfs_file_fsync(struct file *file, loff_t start, loff_t end,
int datasync);
void scoutfs_trans_restart_sync_deadline(struct super_block *sb);
int scoutfs_hold_trans(struct super_block *sb);
int scoutfs_hold_trans(struct super_block *sb, bool allocing);
bool scoutfs_trans_held(void);
void scoutfs_release_trans(struct super_block *sb);
u64 scoutfs_trans_sample_seq(struct super_block *sb);
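
As a rough sketch of the new caller pattern, assuming the usual hold/release pairing around item updates (the item work itself is elided and the function name is hypothetical):

static int add_new_items(struct super_block *sb)
{
	int ret;

	/* pass true when the new items will consume metadata free space so
	 * a low allocator fails early with -ENOSPC instead of mid-commit */
	ret = scoutfs_hold_trans(sb, true);
	if (ret < 0)
		return ret;

	/* ... create or dirty items under the held transaction ... */

	scoutfs_release_trans(sb);
	return 0;
}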

188
kmod/src/volopt.c Normal file
View File

@@ -0,0 +1,188 @@
/*
* Copyright (C) 2021 Versity Software, Inc. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License v2 as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>
#include "super.h"
#include "client.h"
#include "volopt.h"
/*
* Volume options are exposed through a sysfs directory. Getting and
* setting the values sends RPCs to the server, which owns the options in
* the super block.
*/
struct volopt_info {
struct super_block *sb;
struct scoutfs_sysfs_attrs ssa;
};
#define DECLARE_VOLOPT_INFO(sb, name) \
struct volopt_info *name = SCOUTFS_SB(sb)->volopt_info
#define DECLARE_VOLOPT_INFO_KOBJ(kobj, name) \
DECLARE_VOLOPT_INFO(SCOUTFS_SYSFS_ATTRS_SB(kobj), name)
/*
* attribute arrays need to be dense but the options we export could
* well become sparse over time. .show and .store are generic and we
* have a lookup table to map the attribute array indexes to the number
* and name of the option.
*/
static struct volopt_nr_name {
int nr;
char *name;
} volopt_table[] = {
{ SCOUTFS_VOLOPT_DATA_ALLOC_ZONE_BLOCKS_NR, "data_alloc_zone_blocks" },
};
/* initialized by setup, pointer array is null terminated */
static struct kobj_attribute volopt_attrs[ARRAY_SIZE(volopt_table)];
static struct attribute *volopt_attr_ptrs[ARRAY_SIZE(volopt_table) + 1];
static void get_opt_data(struct kobj_attribute *attr, struct scoutfs_volume_options *volopt,
u64 *bit, __le64 **opt)
{
size_t index = attr - &volopt_attrs[0];
int nr = volopt_table[index].nr;
*bit = 1ULL << nr;
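/* the 64 option values are laid out as __le64s immediately after set_bits */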
*opt = &volopt->set_bits + 1 + nr;
}
static ssize_t volopt_attr_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
{
DECLARE_VOLOPT_INFO_KOBJ(kobj, vinf);
struct super_block *sb = vinf->sb;
struct scoutfs_volume_options volopt;
__le64 *opt;
u64 bit;
int ret;
ret = scoutfs_client_get_volopt(sb, &volopt);
if (ret < 0)
return ret;
get_opt_data(attr, &volopt, &bit, &opt);
if (le64_to_cpu(volopt.set_bits) & bit) {
return snprintf(buf, PAGE_SIZE, "%llu", le64_to_cpup(opt));
} else {
buf[0] = '\0';
return 0;
}
}
static ssize_t volopt_attr_store(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t count)
{
DECLARE_VOLOPT_INFO_KOBJ(kobj, vinf);
struct super_block *sb = vinf->sb;
struct scoutfs_volume_options volopt = {0,};
u8 chars[32];
__le64 *opt;
u64 bit;
u64 val;
int ret;
if (count == 0)
return 0;
if (count > sizeof(chars) - 1)
return -ERANGE;
get_opt_data(attr, &volopt, &bit, &opt);
if (buf[0] == '\n' || buf[0] == '\r') {
volopt.set_bits = cpu_to_le64(bit);
ret = scoutfs_client_clear_volopt(sb, &volopt);
} else {
memcpy(chars, buf, count);
chars[count] = '\0';
ret = kstrtoull(chars, 0, &val);
if (ret < 0)
return ret;
volopt.set_bits = cpu_to_le64(bit);
*opt = cpu_to_le64(val);
ret = scoutfs_client_set_volopt(sb, &volopt);
}
if (ret == 0)
ret = count;
return ret;
}
/*
* The volume option sysfs files are slim shims around RPCs so this
* should be called after the client is setup and before it is torn
* down.
*/
int scoutfs_volopt_setup(struct super_block *sb)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct volopt_info *vinf;
int ret;
int i;
/* persistent volume options are always a bitmap u64 then the 64 options */
BUILD_BUG_ON(sizeof(struct scoutfs_volume_options) != (1 + 64) * 8);
vinf = kzalloc(sizeof(struct volopt_info), GFP_KERNEL);
if (!vinf) {
ret = -ENOMEM;
goto out;
}
scoutfs_sysfs_init_attrs(sb, &vinf->ssa);
vinf->sb = sb;
sbi->volopt_info = vinf;
for (i = 0; i < ARRAY_SIZE(volopt_table); i++) {
volopt_attrs[i] = (struct kobj_attribute) {
.attr = { .name = volopt_table[i].name, .mode = S_IWUSR | S_IRUGO },
.show = volopt_attr_show,
.store = volopt_attr_store,
};
volopt_attr_ptrs[i] = &volopt_attrs[i].attr;
}
BUILD_BUG_ON(ARRAY_SIZE(volopt_table) != ARRAY_SIZE(volopt_attr_ptrs) - 1);
volopt_attr_ptrs[i] = NULL;
ret = scoutfs_sysfs_create_attrs(sb, &vinf->ssa, volopt_attr_ptrs, "volume_options");
if (ret < 0)
goto out;
out:
if (ret)
scoutfs_volopt_destroy(sb);
return ret;
}
void scoutfs_volopt_destroy(struct super_block *sb)
{
struct scoutfs_sb_info *sbi = SCOUTFS_SB(sb);
struct volopt_info *vinf = SCOUTFS_SB(sb)->volopt_info;
if (vinf) {
scoutfs_sysfs_destroy_attrs(sb, &vinf->ssa);
kfree(vinf);
sbi->volopt_info = NULL;
}
}

7
kmod/src/volopt.h Normal file
View File

@@ -0,0 +1,7 @@
#ifndef _SCOUTFS_VOLOPT_H_
#define _SCOUTFS_VOLOPT_H_
int scoutfs_volopt_setup(struct super_block *sb);
void scoutfs_volopt_destroy(struct super_block *sb);
#endif
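
From userspace the options read and write like any other sysfs file. A small illustrative program follows; the f.FSID.r.RID directory name and the 262144 value are placeholders, not taken from these patches:

/* Sketch: set, read back, and clear a volume option through sysfs. */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/fs/scoutfs/f.FSID.r.RID/"
			   "volume_options/data_alloc_zone_blocks";
	char buf[32] = "";
	FILE *f;

	f = fopen(path, "w");			/* set the option */
	if (!f)
		return 1;
	fprintf(f, "%llu\n", 262144ULL);
	fclose(f);

	f = fopen(path, "r");			/* an empty read means unset */
	if (!f)
		return 1;
	if (fgets(buf, sizeof(buf), f))
		printf("data_alloc_zone_blocks: %s\n", buf);
	fclose(f);

	f = fopen(path, "w");			/* a bare newline clears it */
	if (!f)
		return 1;
	fputc('\n', f);
	fclose(f);
	return 0;
}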

View File

@@ -577,7 +577,7 @@ static int scoutfs_xattr_set(struct dentry *dentry, const char *name,
retry:
ret = scoutfs_inode_index_start(sb, &ind_seq) ?:
scoutfs_inode_index_prepare(sb, &ind_locks, inode, false) ?:
scoutfs_inode_index_try_lock_hold(sb, &ind_locks, ind_seq);
scoutfs_inode_index_try_lock_hold(sb, &ind_locks, ind_seq, true);
if (ret > 0)
goto retry;
if (ret)
@@ -778,7 +778,7 @@ int scoutfs_xattr_drop(struct super_block *sb, u64 ino,
&tgs) != 0)
memset(&tgs, 0, sizeof(tgs));
ret = scoutfs_hold_trans(sb);
ret = scoutfs_hold_trans(sb, false);
if (ret < 0)
break;
release = true;

2
tests/.gitignore vendored
View File

@@ -4,3 +4,5 @@ src/dumb_setxattr
src/handle_cat
src/bulk_create_paths
src/find_xattrs
src/stage_tmpfile
src/create_xattr_loop

View File

@@ -1,4 +1,4 @@
CFLAGS := -Wall -O2 -Werror -D_FILE_OFFSET_BITS=64 -fno-strict-aliasing
CFLAGS := -Wall -O2 -Werror -D_FILE_OFFSET_BITS=64 -fno-strict-aliasing -I ../kmod/src
SHELL := /usr/bin/bash
# each binary command is built from a single .c file
@@ -6,7 +6,9 @@ BIN := src/createmany \
src/dumb_setxattr \
src/handle_cat \
src/bulk_create_paths \
src/find_xattrs
src/stage_tmpfile \
src/find_xattrs \
src/create_xattr_loop
DEPS := $(wildcard src/*.d)

View File

@@ -52,8 +52,8 @@ t_filter_dmesg()
# tests that drop unmount io triggers fencing
re="$re|scoutfs .* error: fencing "
re="$re|scoutfs .*: waiting for .* lock clients"
re="$re|scoutfs .*: all lock clients recovered"
re="$re|scoutfs .*: waiting for .* clients"
re="$re|scoutfs .*: all clients recovered"
re="$re|scoutfs .* error: client rid.*lock recovery timed out"
# some tests mount w/o options
@@ -62,5 +62,16 @@ t_filter_dmesg()
# in debugging kernels we can slow things down a bit
re="$re|hrtimer: interrupt took .*"
# fencing tests force unmounts and trigger timeouts
re="$re|scoutfs .* forcing unmount"
re="$re|scoutfs .* reconnect timed out"
re="$re|scoutfs .* recovery timeout expired"
re="$re|scoutfs .* fencing previous leader"
re="$re|scoutfs .* reclaimed resources"
re="$re|scoutfs .* quorum .* error"
re="$re|scoutfs .* error reading quorum block"
re="$re|scoutfs .* error .* writing quorum block"
re="$re|scoutfs .* error .* while checking to delete inode"
egrep -v "($re)"
}

View File

@@ -17,6 +17,17 @@ t_sync_seq_index()
t_quiet sync
}
t_mount_rid()
{
local nr="${1:-0}"
local mnt="$(eval echo \$T_M$nr)"
local rid
rid=$(scoutfs statfs -s rid -p "$mnt")
echo "$rid"
}
#
# Output the "f.$fsid.r.$rid" identifier string for the given mount
# number, 0 is used by default if none is specified.
@@ -129,7 +140,17 @@ t_umount()
test "$nr" -lt "$T_NR_MOUNTS" || \
t_fail "fs nr $nr invalid"
eval t_quiet umount \$T_M$i
eval t_quiet umount \$T_M$nr
}
t_force_umount()
{
local nr="$1"
test "$nr" -lt "$T_NR_MOUNTS" || \
t_fail "fs nr $nr invalid"
eval t_quiet umount -f \$T_M$nr
}
#
@@ -277,3 +298,67 @@ t_counter_diff_changed() {
echo "counter $which didn't change" ||
echo "counter $which changed"
}
#
# See if we can find a local mount with the caller's rid.
#
t_rid_is_mounted() {
local rid="$1"
local fr
for fr in /sys/fs/scoutfs/*; do
if [ "$(cat $fr/rid)" == "$rid" ]; then
return 0
fi
done
return 1
}
#
# A given mount is being fenced if any mount has a fence request pending
# for it which hasn't finished and been removed.
#
t_rid_is_fencing() {
local rid="$1"
local fr
for fr in /sys/fs/scoutfs/*; do
if [ -d "$fr/fence/$rid" ]; then
return 0
fi
done
return 1
}
#
# Wait until the mount identified by the first rid arg is not in any
# states specified by the remaining state description word args.
#
t_wait_if_rid_is() {
local rid="$1"
while ( [[ $* =~ mounted ]] && t_rid_is_mounted $rid ) ||
( [[ $* =~ fencing ]] && t_rid_is_fencing $rid ) ; do
sleep .5
done
}
#
# Wait until any mount identifies itself as the elected leader. We can
# be waiting while tests mount and unmount so mounts may not be mounted
# at the test's expected mount points.
#
t_wait_for_leader() {
local i
while sleep .25; do
for i in $(t_fs_nrs); do
local ldr="$(t_sysfs_path $i 2>/dev/null)/quorum/is_leader"
if [ "$(cat $ldr 2>/dev/null)" == "1" ]; then
return
fi
done
done
}

8
tests/golden/enospc Normal file
View File

@@ -0,0 +1,8 @@
== prepare directories and files
== fallocate until enospc
== remove all the files and verify free data blocks
== make small meta fs
== create large xattrs until we fill up metadata
== remove files with xattrs after enospc
== make sure we can create again
== cleanup small meta fs

View File

View File

@@ -0,0 +1,5 @@
== make sure all mounts can see each other
== force unmount one client, connection timeout, fence nop, mount
== force unmount all non-server, connection timeout, fence nop, mount
== force unmount server, quorum elects new leader, fence nop, mount
== force unmount everything, new server fences all previous

View File

@@ -0,0 +1,27 @@
== basic unlink deletes
ino found in dseq index
ino not found in dseq index
== local open-unlink waits for close to delete
contents after rm: contents
ino found in dseq index
ino not found in dseq index
== multiple local opens are protected
contents after rm 1: contents
contents after rm 2: contents
ino found in dseq index
ino not found in dseq index
== remote unopened unlink deletes
ino not found in dseq index
ino not found in dseq index
== unlink wait for open on other mount
mount 0 contents after mount 1 rm: contents
ino found in dseq index
ino found in dseq index
stat: cannot stat /mnt/test/test/inode-deletion/file: No such file or directory
ino not found in dseq index
ino not found in dseq index
== lots of deletions use one open map
== open files survive remote scanning orphans
mount 0 contents after mount 1 remounted: contents
ino not found in dseq index
ino not found in dseq index

View File

View File

@@ -0,0 +1,4 @@
== test our inode existance function
== unlinked and opened inodes still exist
== orphan from failed evict deletion is picked up
== orphaned inos in all mounts all deleted

View File

@@ -0,0 +1,18 @@
total file size 33669120
00000000 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 |AAAAAAAAAAAAAAAA|
*
00400000 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 |BBBBBBBBBBBBBBBB|
*
00801000 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 43 |CCCCCCCCCCCCCCCC|
*
00c03000 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 |DDDDDDDDDDDDDDDD|
*
01006000 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 45 |EEEEEEEEEEEEEEEE|
*
0140a000 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 46 |FFFFFFFFFFFFFFFF|
*
0180f000 47 47 47 47 47 47 47 47 47 47 47 47 47 47 47 47 |GGGGGGGGGGGGGGGG|
*
01c15000 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 |HHHHHHHHHHHHHHHH|
*
0201c000

View File

@@ -1,6 +1,7 @@
Ran:
generic/001
generic/002
generic/004
generic/005
generic/006
generic/007
@@ -73,7 +74,6 @@ generic/376
generic/377
Not
run:
generic/004
generic/008
generic/009
generic/012
@@ -278,4 +278,4 @@ shared/004
shared/032
shared/051
shared/289
Passed all 72 tests
Passed all 73 tests

View File

@@ -18,10 +18,15 @@ die() {
exit 1
}
timestamp()
{
date '+%F %T.%N'
}
# output a message with a timestamp to the run.log
log()
{
echo "[$(date '+%F %T.%N')] $*" >> "$T_RESULTS/run.log"
echo "[$(timestamp)] $*" >> "$T_RESULTS/run.log"
}
# run a logged command, exiting if it fails
@@ -66,6 +71,7 @@ $(basename $0) options:
-X | xfstests git repo. Used by tests/xfstests.sh.
-x | xfstests git branch to checkout and track.
-y | xfstests ./check additional args
-z <nr> | set data-alloc-zone-blocks in mkfs
EOF
}
@@ -169,6 +175,11 @@ while true; do
T_XFSTESTS_ARGS="$2"
shift
;;
-z)
test -n "$2" || die "-z must have nr mounts argument"
T_DATA_ALLOC_ZONE_BLOCKS="-z $2"
shift
;;
-h|-\?|--help)
show_help
exit 1
@@ -319,7 +330,8 @@ if [ -n "$T_MKFS" ]; then
done
msg "making new filesystem with $T_QUORUM quorum members"
cmd scoutfs mkfs -f $quo "$T_META_DEVICE" "$T_DATA_DEVICE"
cmd scoutfs mkfs -f $quo $T_DATA_ALLOC_ZONE_BLOCKS \
"$T_META_DEVICE" "$T_DATA_DEVICE"
fi
if [ -n "$T_INSMOD" ]; then
@@ -360,6 +372,39 @@ cmd cat /sys/kernel/debug/tracing/set_event
cmd grep . /sys/kernel/debug/tracing/options/trace_printk \
/proc/sys/kernel/ftrace_dump_on_oops
#
# Build a fenced config that runs scripts out of the repository rather
# than the default system directory
#
conf="$T_RESULTS/scoutfs-fencd.conf"
cat > $conf << EOF
SCOUTFS_FENCED_DELAY=1
SCOUTFS_FENCED_RUN=$T_UTILS/fenced/local-force-unmount
SCOUTFS_FENCED_RUN_ARGS=""
EOF
export SCOUTFS_FENCED_CONFIG_FILE="$conf"
#
# Run the agent in the background, log its output, and kill it if we
# exit
#
fenced_log()
{
echo "[$(timestamp)] $*" >> "$T_RESULTS/fenced.stdout.log"
}
fenced_pid=""
kill_fenced()
{
if test -n "$fenced_pid" -a -d "/proc/$fenced_pid" ; then
fenced_log "killing fenced pid $fenced_pid"
kill "$fenced_pid"
fi
}
trap kill_fenced EXIT
$T_UTILS/fenced/scoutfs-fenced > "$T_RESULTS/fenced.stdout.log" 2> "$T_RESULTS/fenced.stderr.log" &
fenced_pid=$!
fenced_log "started fenced pid $fenced_pid in the background"
#
# mount concurrently so that a quorum is present to elect the leader and
# start a server.

View File

@@ -7,26 +7,33 @@ simple-release-extents.sh
setattr_more.sh
offline-extent-waiting.sh
move-blocks.sh
enospc.sh
srch-basic-functionality.sh
simple-xattr-unit.sh
lock-refleak.sh
lock-shrink-consistency.sh
lock-pr-cw-conflict.sh
lock-revoke-getcwd.sh
export-lookup-evict-race.sh
createmany-parallel.sh
createmany-large-names.sh
createmany-rename-large-dir.sh
stage-release-race-alloc.sh
stage-multi-part.sh
stage-tmpfile.sh
basic-posix-consistency.sh
dirent-consistency.sh
mkdir-rename-rmdir.sh
lock-ex-race-processes.sh
lock-conflicting-batch-commit.sh
cross-mount-data-free.sh
persistent-item-vers.sh
setup-error-teardown.sh
fence-and-reclaim.sh
orphan-inodes.sh
mount-unmount-race.sh
createmany-parallel-mounts.sh
archive-light-cycle.sh
block-stale-reads.sh
inode-deletion.sh
xfstests.sh

View File

@@ -0,0 +1,113 @@
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <ctype.h>
#include <string.h>
#include <errno.h>
#include <limits.h>
static void exit_usage(void)
{
printf(" -h/-? output this usage message and exit\n"
" -c <count> number of xattrs to create\n"
" -n <string> xattr name prefix, -NR is appended\n"
" -p <path> string with path to file with xattrs\n"
" -s <size> xattr value size\n");
exit(1);
}
int main(int argc, char **argv)
{
char *pref = NULL;
char *path = NULL;
char *val;
char *name;
unsigned long long count = 0;
unsigned long long size = 0;
unsigned long long i;
int ret;
int c;
while ((c = getopt(argc, argv, "+c:n:p:s:")) != -1) {
switch (c) {
case 'c':
count = strtoull(optarg, NULL, 0);
break;
case 'n':
pref = strdup(optarg);
break;
case 'p':
path = strdup(optarg);
break;
case 's':
size = strtoull(optarg, NULL, 0);
break;
case '?':
printf("unknown argument: %c\n", optind);
case 'h':
exit_usage();
}
}
if (count == 0) {
printf("specify count of xattrs to create with -c\n");
exit(1);
}
if (count == ULLONG_MAX) {
printf("invalid -c count\n");
exit(1);
}
if (size == 0) {
printf("specify xattrs value size with -s\n");
exit(1);
}
if (size == ULLONG_MAX || size < 2) {
printf("invalid -s size\n");
exit(1);
}
if (path == NULL) {
printf("specify path to file with -p\n");
exit(1);
}
if (pref == NULL) {
printf("specify xattr name prefix string with -n\n");
exit(1);
}
ret = snprintf(NULL, 0, "%s-%llu", pref, ULLONG_MAX) + 1;
name = malloc(ret);
if (!name) {
printf("couldn't allocate xattr name buffer\n");
exit(1);
}
val = malloc(size);
if (!val) {
printf("couldn't allocate xattr value buffer\n");
exit(1);
}
memset(val, 'a', size - 1);
val[size - 1] = '\0';
for (i = 0; i < count; i++) {
sprintf(name, "%s-%llu", pref, i);
ret = setxattr(path, name, val, size, 0);
if (ret) {
printf("returned %d errno %d (%s)\n",
ret, errno, strerror(errno));
return 1;
}
}
return 0;
}

145
tests/src/stage_tmpfile.c Normal file
View File

@@ -0,0 +1,145 @@
/*
* Exercise O_TMPFILE creation as well as staging from tmpfiles into
* a released destination file.
*
* Copyright (C) 2021 Versity Software, Inc. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License v2 as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <errno.h>
#include <linux/types.h>
#include <assert.h>
#include "ioctl.h"
#define array_size(arr) (sizeof(arr) / sizeof(arr[0]))
/*
* Write known data into 8 tmpfiles.
* Make a new file X and release it
* Move contents of 8 tmpfiles into X.
*/
struct sub_tmp_info {
int fd;
unsigned int offset;
unsigned int length;
};
#define SZ 4096
char buf[SZ];
int main(int argc, char **argv)
{
struct scoutfs_ioctl_release ioctl_args = {0};
struct scoutfs_ioctl_move_blocks mb;
struct sub_tmp_info sub_tmps[8];
int tot_size = 0;
char *dest_file;
int dest_fd;
char *mnt;
int ret;
int i;
if (argc < 3) {
printf("%s <mountpoint> <dest_file>\n", argv[0]);
return 1;
}
mnt = argv[1];
dest_file = argv[2];
for (i = 0; i < array_size(sub_tmps); i++) {
struct sub_tmp_info *sub_tmp = &sub_tmps[i];
int remaining;
sub_tmp->fd = open(mnt, O_RDWR | O_TMPFILE, S_IRUSR | S_IWUSR);
if (sub_tmp->fd < 0) {
perror("error");
exit(1);
}
sub_tmp->offset = tot_size;
/* First tmp file is 4MB */
/* Each is 4k bigger than last */
sub_tmp->length = (i + 1024) * sizeof(buf);
remaining = sub_tmp->length;
/* Each sub tmpfile written with 'A', 'B', etc. */
memset(buf, 'A' + i, sizeof(buf));
while (remaining) {
int written;
written = write(sub_tmp->fd, buf, sizeof(buf));
assert(written == sizeof(buf));
tot_size += sizeof(buf);
remaining -= written;
}
}
printf("total file size %d\n", tot_size);
dest_fd = open(dest_file, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
if (dest_fd == -1) {
perror("error");
exit(1);
}
// make dest file big
ret = posix_fallocate(dest_fd, 0, tot_size);
if (ret) {
perror("error");
exit(1);
}
// release everything in dest file
ioctl_args.offset = 0;
ioctl_args.length = tot_size;
ioctl_args.data_version = 0;
ret = ioctl(dest_fd, SCOUTFS_IOC_RELEASE, &ioctl_args);
if (ret < 0) {
perror("error");
exit(1);
}
// move contents into dest in reverse order
for (i = array_size(sub_tmps) - 1; i >= 0 ; i--) {
struct sub_tmp_info *sub_tmp = &sub_tmps[i];
mb.from_fd = sub_tmp->fd;
mb.from_off = 0;
mb.len = sub_tmp->length;
mb.to_off = sub_tmp->offset;
mb.data_version = 0;
mb.flags = SCOUTFS_IOC_MB_STAGE;
ret = ioctl(dest_fd, SCOUTFS_IOC_MOVE_BLOCKS, &mb);
if (ret < 0) {
perror("error");
exit(1);
}
}
return 0;
}

100
tests/tests/enospc.sh Normal file
View File

@@ -0,0 +1,100 @@
#
# test hitting enospc by filling with data or metadata and
# then recovering by removing what we filled.
#
# Type Size Total Used Free Use%
#MetaData 64KB 1048576 32782 1015794 3
# Data 4KB 16777152 0 16777152 0
free_blocks() {
local md="$1"
local mnt="$2"
scoutfs df -p "$mnt" | awk '($1 == "'$md'") { print $5; exit }'
}
t_require_commands scoutfs stat fallocate createmany
echo "== prepare directories and files"
for n in $(t_fs_nrs); do
eval path="\$T_D${n}/dir-$n/file-$n"
mkdir -p $(dirname $path)
touch $path
done
sync
echo "== fallocate until enospc"
before=$(free_blocks Data "$T_M0")
finished=0
while [ $finished != 1 ]; do
for n in $(t_fs_nrs); do
eval path="\$T_D${n}/dir-$n/file-$n"
off=$(stat -c "%s" "$path")
LC_ALL=C fallocate -o $off -l 128MiB "$path" > $T_TMP.fallocate 2>&1
err="$?"
if grep -qi "no space" $T_TMP.fallocate; then
finished=1
break
fi
if [ "$err" != "0" ]; then
t_fail "fallocate failed with $err"
fi
done
done
echo "== remove all the files and verify free data blocks"
for n in $(t_fs_nrs); do
eval dir="\$T_D${n}/dir-$n"
rm -rf "$dir"
done
sync
after=$(free_blocks Data "$T_M0")
# nothing else should be modifying data blocks
test "$before" == "$after" || \
t_fail "$after free data blocks after rm, expected $before"
# XXX this is all pretty manual, would be nice to have helpers
echo "== make small meta fs"
# meta device just big enough for reserves and the metadata we'll fill
scoutfs mkfs -A -f -Q 0,127.0.0.1,53000 -m 10G "$T_EX_META_DEV" "$T_EX_DATA_DEV" > $T_TMP.mkfs.out 2>&1 || \
t_fail "mkfs failed"
SCR="/mnt/scoutfs.enospc"
mkdir -p "$SCR"
mount -t scoutfs -o metadev_path=$T_EX_META_DEV,quorum_slot_nr=0 \
"$T_EX_DATA_DEV" "$SCR"
echo "== create large xattrs until we fill up metadata"
mkdir -p "$SCR/xattrs"
for f in $(seq 1 100000); do
file="$SCR/xattrs/file-$f"
touch "$file"
LC_ALL=C create_xattr_loop -c 1000 -n user.scoutfs-enospc -p "$file" -s 65535 > $T_TMP.cxl 2>&1
err="$?"
if grep -qi "no space" $T_TMP.cxl; then
echo "enospc at f $f" >> $T_TMP.cxl
break
fi
if [ "$err" != "0" ]; then
t_fail "create_xattr_loop failed with $err"
fi
done
echo "== remove files with xattrs after enospc"
rm -rf "$SCR/xattrs"
echo "== make sure we can create again"
file="$SCR/file-after"
touch $file
setfattr -n user.scoutfs-enospc -v 1 "$file"
sync
rm -f "$file"
echo "== cleanup small meta fs"
umount "$SCR"
rmdir "$SCR"
t_pass

View File

@@ -0,0 +1,32 @@
#
# test racing fh_to_dentry with evict from lock invalidation. We've
# had deadlocks between the ordering of iget and evict when they acquire
# cluster locks.
#
t_require_commands touch stat handle_cat
t_require_mounts 2
CPUS=$(getconf _NPROCESSORS_ONLN)
NR=$((CPUS * 4))
END=$((SECONDS + 30))
touch "$T_D0/file"
ino=$(stat -c "%i" "$T_D0/file")
while test $SECONDS -lt $END; do
for i in $(seq 1 $NR); do
fs=$((RANDOM % T_NR_MOUNTS))
eval dir="\$T_D${fs}"
write=$((RANDOM & 1))
if [ "$write" == 1 ]; then
touch "$dir/file" &
else
handle_cat "$dir" "$ino" &
fi
done
wait
done
t_pass

View File

@@ -0,0 +1,127 @@
#
# Fence nodes and reclaim their resources.
#
t_require_commands sleep touch grep sync scoutfs
t_require_mounts 2
#
# Make sure that all mounts can read the results of a write from each
# mount. And make sure that the greatest of all the written seqs is
# visible after the writes were committed by remote reads.
#
check_read_write()
{
local expected
local greatest=0
local seq
local path
local saw
local w
local r
for w in $(t_fs_nrs); do
expected="$w wrote at $(date --rfc-3339=ns)"
eval path="\$T_D${w}/written"
echo "$expected" > "$path"
seq=$(scoutfs stat -s meta_seq $path)
if [ "$seq" -gt "$greatest" ]; then
greatest=$seq
fi
for r in $(t_fs_nrs); do
eval path="\$T_D${r}/written"
saw=$(cat "$path")
if [ "$saw" != "$expected" ]; then
echo "mount $r read '$saw' after mount $w wrote '$expected'"
fi
done
done
seq=$(scoutfs statfs -s committed_seq -p $T_D0)
if [ "$seq" -lt "$greatest" ]; then
echo "committed_seq $seq less than greatest $greatest"
fi
}
echo "== make sure all mounts can see each other"
check_read_write
echo "== force unmount one client, connection timeout, fence nop, mount"
cl=$(t_first_client_nr)
sv=$(t_server_nr)
rid=$(t_mount_rid $cl)
echo "cl $cl sv $sv rid $rid" >> "$T_TMP.log"
sync
t_force_umount $cl
# wait for client reconnection to timeout
while grep -q $rid $(t_debugfs_path $sv)/connections; do
sleep .5
done
while t_rid_is_fencing $rid; do
sleep .5
done
t_mount $cl
check_read_write
echo "== force unmount all non-server, connection timeout, fence nop, mount"
sv=$(t_server_nr)
pattern="nonsense"
sync
for cl in $(t_fs_nrs); do
if [ $cl == $sv ]; then
continue;
fi
rid=$(t_mount_rid $cl)
pattern="$pattern|$rid"
echo "cl $cl sv $sv rid $rid" >> "$T_TMP.log"
t_force_umount $cl
done
# wait for all client reconnections to timeout
while egrep -q "($pattern)" $(t_debugfs_path $sv)/connections; do
sleep .5
done
# wait for all fence requests to complete
while test -d $(echo /sys/fs/scoutfs/*/fence/* | cut -d " " -f 1); do
sleep .5
done
# remount all the clients
for cl in $(t_fs_nrs); do
if [ $cl == $sv ]; then
continue;
fi
t_mount $cl
done
check_read_write
echo "== force unmount server, quorum elects new leader, fence nop, mount"
sv=$(t_server_nr)
rid=$(t_mount_rid $sv)
echo "sv $sv rid $rid" >> "$T_TMP.log"
sync
t_force_umount $sv
t_wait_for_leader
# wait until new server is done fencing unmounted leader rid
while t_rid_is_fencing $rid; do
sleep .5
done
t_mount $sv
check_read_write
echo "== force unmount everything, new server fences all previous"
sync
for nr in $(t_fs_nrs); do
t_force_umount $nr
done
t_mount_all
# wait for all fence requests to complete
while test -d $(echo /sys/fs/scoutfs/*/fence/* | cut -d " " -f 1); do
sleep .5
done
check_read_write
t_pass

View File

@@ -0,0 +1,98 @@
#
# test deleting an inode once all its links and references are gone.
#
t_require_commands cat scoutfs
t_require_mounts 2
FILE="$T_D0/file"
check_ino_index() {
local ino="$1"
local dseq="$2"
local mnt="$3"
t_sync_seq_index
scoutfs walk-inodes -p "$mnt" -- data_seq $dseq $(($dseq + 1)) |
awk 'BEGIN { not = "not " }
($4 == '$ino') { not = ""; exit; }
END { print "ino " not "found in dseq index" }'
}
echo "== basic unlink deletes"
echo "contents" > "$FILE"
ino=$(stat -c "%i" "$FILE")
dseq=$(scoutfs stat -s data_seq "$FILE")
check_ino_index "$ino" "$dseq" "$T_M0"
rm -f "$FILE"
check_ino_index "$ino" "$dseq" "$T_M0"
echo "== local open-unlink waits for close to delete"
echo "contents" > "$FILE"
ino=$(stat -c "%i" "$FILE")
dseq=$(scoutfs stat -s data_seq "$FILE")
exec {FD}<"$FILE" # open unused fd, assign to FD
rm -f "$FILE"
echo "contents after rm: $(cat <&$FD)"
check_ino_index "$ino" "$dseq" "$T_M0"
exec {FD}>&- # close
check_ino_index "$ino" "$dseq" "$T_M0"
echo "== multiple local opens are protected"
echo "contents" > "$FILE"
ino=$(stat -c "%i" "$FILE")
dseq=$(scoutfs stat -s data_seq "$FILE")
exec {FD1}<"$FILE"
exec {FD2}<"$FILE"
rm -f "$FILE"
echo "contents after rm 1: $(cat <&$FD1)"
echo "contents after rm 2: $(cat <&$FD2)"
check_ino_index "$ino" "$dseq" "$T_M0"
exec {FD1}>&- # close
exec {FD2}>&- # close
check_ino_index "$ino" "$dseq" "$T_M0"
echo "== remote unopened unlink deletes"
echo "contents" > "$T_D0/file"
ino=$(stat -c "%i" "$T_D0/file")
dseq=$(scoutfs stat -s data_seq "$T_D0/file")
rm -f "$T_D1/file"
check_ino_index "$ino" "$dseq" "$T_M0"
check_ino_index "$ino" "$dseq" "$T_M1"
echo "== unlink wait for open on other mount"
echo "contents" > "$T_D0/file"
ino=$(stat -c "%i" "$T_D0/file")
dseq=$(scoutfs stat -s data_seq "$T_D0/file")
exec {FD}<"$T_D0/file"
rm -f "$T_D1/file"
echo "mount 0 contents after mount 1 rm: $(cat <&$FD)"
check_ino_index "$ino" "$dseq" "$T_M0"
check_ino_index "$ino" "$dseq" "$T_M1"
exec {FD}>&- # close
# we know that revalidating will unhash the remote dentry
stat "$T_D0/file" 2>&1 | t_filter_fs
check_ino_index "$ino" "$dseq" "$T_M0"
check_ino_index "$ino" "$dseq" "$T_M1"
echo "== lots of deletions use one open map"
mkdir "$T_D0/dir"
touch "$T_D0/dir"/files-{1..5}
rm -f "$T_D0/dir"/files-*
rmdir "$T_D0/dir"
echo "== open files survive remote scanning orphans"
echo "contents" > "$T_D0/file"
ino=$(stat -c "%i" "$T_D0/file")
dseq=$(scoutfs stat -s data_seq "$T_D0/file")
exec {FD}<"$T_D0/file"
rm -f "$T_D0/file"
t_umount 1
t_mount 1
echo "mount 0 contents after mount 1 remounted: $(cat <&$FD)"
exec {FD}>&- # close
check_ino_index "$ino" "$dseq" "$T_M0"
check_ino_index "$ino" "$dseq" "$T_M1"
t_pass

View File

@@ -0,0 +1,59 @@
#
# Sequentially perform operations on a dir (mkdir; rename*2; rmdir) on
# all possible combinations of different mounts that could perform the
# operations.
#
# We're testing that the tracking of the entry key in our cached dirents
# stays consistent with the persistent entry items as they're modified
# around the cluster.
#
t_require_commands mkdir mv rmdir
NR_OPS=4
unset op_mnt
for op in $(seq 0 $NR_OPS); do
op_mnt[$op]=0
done
if [ $T_NR_MOUNTS -gt $NR_OPS ]; then
NR_MNTS=$NR_OPS
else
NR_MNTS=$T_NR_MOUNTS
fi
# test until final op mount dir wraps
while [ ${op_mnt[$NR_OPS]} == 0 ]; do
# sequentially perform each op from its mount dir
for op in $(seq 0 $((NR_OPS - 1))); do
m=${op_mnt[$op]}
eval dir="\$T_D${m}/dir"
case "$op" in
0) mkdir "$dir" ;;
1) mv "$dir" "$dir-1" ;;
2) mv "$dir-1" "$dir-2" ;;
3) rmdir "$dir-2" ;;
esac
if [ $? != 0 ]; then
t_fail "${op_mnt[*]} failed at op $op"
fi
done
# advance through mnt nrs for each op
i=0
while [ ${op_mnt[$NR_OPS]} == 0 ]; do
((op_mnt[$i]++))
if [ ${op_mnt[$i]} -ge $NR_MNTS ]; then
op_mnt[$i]=0
((i++))
else
break
fi
done
done
t_pass

View File

@@ -0,0 +1,77 @@
#
# make sure we clean up orphaned inodes
#
t_require_commands sleep touch sync stat handle_cat kill rm
t_require_mounts 2
#
# usually bash prints an annoying output message when jobs
# are killed. We can avoid that by redirecting stderr for
# the bash process when it reaps the jobs that are killed.
#
silent_kill() {
exec {ERR}>&2 2>/dev/null
kill "$@"
wait "$@"
exec 2>&$ERR {ERR}>&-
}
#
# We don't have a great way to test that inode items still exist. We
# don't prevent opening handles with nlink 0 today, so we'll use that.
# If that ever changes, this check will need some other method.
#
inode_exists()
{
local ino="$1"
handle_cat "$T_M0" "$ino" > "$T_TMP.handle_cat.log" 2>&1
}
echo "== test our inode existance function"
path="$T_D0/file"
touch "$path"
ino=$(stat -c "%i" "$path")
inode_exists $ino || echo "$ino didn't exist"
echo "== unlinked and opened inodes still exist"
sleep 1000000 < "$path" &
pid="$!"
rm -f "$path"
inode_exists $ino || echo "$ino didn't exist"
echo "== orphan from failed evict deletion is picked up"
# pending kill signal stops evict from getting locks and deleting
silent_kill $pid
sleep 55
inode_exists $ino && echo "$ino still exists"
echo "== orphaned inos in all mounts all deleted"
pids=""
inos=""
for nr in $(t_fs_nrs); do
eval path="\$T_D${nr}/file-$nr"
touch "$path"
inos="$inos $(stat -c %i $path)"
sleep 1000000 < "$path" &
pids="$pids $!"
rm -f "$path"
done
sync
silent_kill $pids
for nr in $(t_fs_nrs); do
t_force_umount $nr
done
t_mount_all
# wait for all fence requests to complete
while test -d $(echo /sys/fs/scoutfs/*/fence/* | cut -d " " -f 1); do
sleep .5
done
# wait for orphan scans to run
sleep 55
for ino in $inos; do
inode_exists $ino && echo "$ino still exists"
done
t_pass

View File

@@ -0,0 +1,15 @@
#
# Run stage_tmpfile and check the output with hexdump.
#
t_require_commands stage_tmpfile hexdump
DEST_FILE="$T_D0/dest_file"
stage_tmpfile $T_D0 $DEST_FILE
hexdump -C "$DEST_FILE"
rm -fr "$DEST_FILE"
t_pass

View File

@@ -0,0 +1,35 @@
#!/usr/bin/bash
echo_fail() {
echo "$@" > /dev/stderr
exit 1
}
rid="$SCOUTFS_FENCED_REQ_RID"
#
# Look for a local mount with the rid to fence. Typically we'll at
# least find the mount with the server that requested the fence that
# we're processing. But it's possible that mounts are unmounted
# before, or while, we're running.
#
mnts=$(findmnt -l -n -t scoutfs -o TARGET) || \
echo_fail "findmnt -t scoutfs failed" > /dev/stderr
for mnt in $mnts; do
mnt_rid=$(scoutfs statfs -p "$mnt" -s rid) || \
echo_fail "scoutfs statfs $mnt failed"
if [ "$mnt_rid" == "$rid" ]; then
umount -f "$mnt" || \
echo_fail "umout -f $mnt"
exit 0
fi
done
#
# If the mount doesn't exist on this host then it can't access the
# devices by definition and can be considered fenced.
#
exit 0

utils/fenced/scoutfs-fenced Executable file
View File

@@ -0,0 +1,94 @@
#!/usr/bin/bash
message_output()
{
printf "[%s] %s\n" "$(date '+%F %T.%N')" "$@"
}
error_message()
{
message_output "$@" >> /dev/stderr
}
error_exit()
{
error_message "$@, exiting"
exit 1
}
log_message()
{
message_output "$@" >> /dev/stdout
}
# restart if we catch hup to re-read the config
hup_restart()
{
log_message "caught SIGHUP, restarting"
exec "$@"
}
trap hup_restart SIGHUP
# defaults
SCOUTFS_FENCED_CONFIG_FILE=${SCOUTFS_FENCED_CONFIG_FILE:-/etc/scoutfs/scoutfs-fenced.conf}
SCOUTFS_FENCED_DELAY=2
#SCOUTFS_FENCED_RUN
#SCOUTFS_FENCED_RUN_ARGS
test -n "$SCOUTFS_FENCED_CONFIG_FILE" || \
error_exit "SCOUTFS_FENCED_CONFIG_FILE isn't set"
test -r "$SCOUTFS_FENCED_CONFIG_FILE" || \
error_exit "SCOUTFS_FENCED_CONFIG_FILE isn't readable file"
log_message "reading config file $SCOUTFS_FENCED_CONFIG_FILE"
. "$SCOUTFS_FENCED_CONFIG_FILE" || \
error_exit "error sourcing $SCOUTFS_FENCED_CONFIG_FILE as bash script"
for conf in "${!SCOUTFS_FENCED_@}"; do
log_message " config var $conf=${!conf}"
done
test -n "$SCOUTFS_FENCED_RUN" || \
error_exit "SCOUTFS_FENCED_RUN must be set"
test -x "$SCOUTFS_FENCED_RUN" || \
error_exit "SCOUTFS_FENCED_RUN '$SCOUTFS_FENCED_RUN' isn't executable"
#
# main loop watching for fence request across all filesystems
#
while sleep $SCOUTFS_FENCED_DELAY; do
for fence in /sys/fs/scoutfs/*/fence/*; do
# catches the unmatched glob when there are no request dirs
if [ ! -d "$fence" ]; then
continue
fi
# skip requests that have been handled
if [ $(cat "$fence/fenced") == 1 -o $(cat "$fence/error") == 1 ]; then
continue
fi
srv=$(basename $(dirname $(dirname $fence)))
rid="$(cat $fence/rid)"
ip="$(cat $fence/ipv4_addr)"
reason="$(cat $fence/reason)"
log_message "server $srv fencing rid $rid at IP $ip for $reason"
# export _REQ_ vars for run to use
export SCOUTFS_FENCED_REQ_RID="$rid"
export SCOUTFS_FENCED_REQ_IP="$ip"
"$SCOUTFS_FENCED_RUN" $SCOUTFS_FENCED_RUN_ARGS
rc=$?
if [ "$rc" != 0 ]; then
log_message "server $srv fencing rid $rid saw error status $rc from $SCOUTFS_FENCED_RUN"
echo 1 > "$fence/error"
continue
fi
echo 1 > "$fence/fenced"
done
done

View File

@@ -0,0 +1,66 @@
.TH scoutfs-fenced 8
.SH NAME
scoutfs-fenced \- scoutfs fence request monitoring and dispatch daemon
.SH DESCRIPTION
The
.B scoutfs-fenced
daemon runs on hosts with mounts that are configured as quorum members
and could create fence requests. It watches the sysfs directories of
mounted scoutfs volumes for directories that store requests to fence a
mount.
.SH ENVIRONMENT
scoutfs-fenced reads the
.I SCOUTFS_FENCED_CONFIG_FILE
environment variable for the path to the config file that contains its
configuration. The file must be readable; it is sourced as a bash
script and is expected to set the following configuration variables.
.SH CONFIGURATION
.TP
.B SCOUTFS_FENCED_DELAY
The number of seconds to wait between checks for fence request
directories in the sysfs directories of all mounts on the host.
.TP
.B SCOUTFS_FENCED_RUN
The path to the command to execute for each fence request. The file at
the path must be executable.
.TP
.B SCOUTFS_FENCED_RUN_ARGS
The arguments that are unconditionally passed through to the run
command.
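.sp
For example, a minimal configuration file that runs the packaged
local-force-unmount script (installed under the libexec directory,
typically /usr/libexec; the exact path and values shown here are
illustrative, not shipped defaults) could look like:
.sp
.nf
# poll for new fence requests every 2 seconds
SCOUTFS_FENCED_DELAY=2
# force unmount a local mount that matches the fenced RID
SCOUTFS_FENCED_RUN=/usr/libexec/scoutfs-fenced/run/local-force-unmount
SCOUTFS_FENCED_RUN_ARGS=""
.fi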
.SH DAEMONIZING AND LOGGING
scoutfs-fenced runs in the foreground and writes to stderr and stdout.
Disconnecting it from parents and redirecting its output are the
responsibility of the host environment.
.SH RUN COMMAND INTERFACE
scoutfs-fenced sets environment variables for the run command with
information about the mount that must be fenced:
.TP
.B SCOUTFS_FENCED_REQ_RID
The RID of the mount to be fenced.
.TP
.B SCOUTFS_FENCED_REQ_IP
The dotted quad IPv4 address of the last connection from the mount.
.RE
The return status of the run command indicates whether the mount was
fenced. If the mount was successfully fenced then the command
should return a 0 success status. If the run command returns a non-zero
failure status then the request will be set as errored and the server
will shut down. The next server that starts will create another fence
request for the mount.
.SH SEE ALSO
.BR scoutfs (5),
.SH AUTHORS
Zach Brown <zab@versity.com>

View File

@@ -1,6 +1,6 @@
.TH scoutfs 5
.SH NAME
scoutfs \- overview and mount options for the scoutfs filesystem
scoutfs \- high level overview of the scoutfs filesystem
.SH DESCRIPTION
A scoutfs filesystem is stored on two block devices. Multiple mounts of
the filesystem are supported between hosts that share access to the
@@ -34,7 +34,116 @@ the server for the filesystem if it is elected leader.
The assigned number must match one of the slots defined with \-Q options
when the filesystem was created with mkfs. If the number assigned
doesn't match a number created during mkfs then the mount will fail.
.SH FURTHER READING
.SH VOLUME OPTIONS
Volume options are persistent options which are stored in the super
block in the metadata device and which apply to all mounts of the volume.
.sp
Volume options may be initially specified as the volume is created
as described in the mkfs command in
.BR scoutfs (8).
.sp
Volume options may be changed at runtime by writing to files in sysfs
while the volume is mounted. Volume options are found in the
volume_options/ directory with a file for each option. Reading the
file provides the current setting of the option and an empty string
is returned if the option is not set. To set the option, write
the new value of the option to the file. To clear the option, write
a blank line with a newline to the file. The write syscall will
return an error if the set operation fails and a message will be written
to the console.
.sp
The following volume options are supported:
.TP
.B data_alloc_zone_blocks=<zone size in 4KiB blocks>
When the data_alloc_zone_blocks option is set the data device is
logically divided into zones of equal length as specified by the value
of the option. The size of the zones must be greater than a minimum
allocation pool size, large enough to result in no more than 1024 zones,
and not more than the total number of blocks in the data device.
.sp
When set, the server will try to provide each mount with free data
extents that don't share a zone with other mounts. When a mount has free
extents in a given zone the server will try and find more free extents
in that zone. When the mount is not in a zone, or its zone has no more
free extents, the server will try and find free extents in a zone that
no other mount currently occupies. The result is to try and produce
write streams where only one mount is writing into each zone.
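.sp
As an illustration (the per-mount sysfs directory name is assumed here
and matched with a wildcard; 1048576 4KiB blocks gives 4GiB zones,
subject to the constraints above), the option could be read, set, and
cleared with:
.sp
.nf
# read the current setting (empty if unset)
cat /sys/fs/scoutfs/*/volume_options/data_alloc_zone_blocks
# set zones of 1048576 4KiB blocks (4GiB)
echo 1048576 > /sys/fs/scoutfs/*/volume_options/data_alloc_zone_blocks
# clear the option by writing a blank line
echo > /sys/fs/scoutfs/*/volume_options/data_alloc_zone_blocks
.fi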
.SH FENCING
.B scoutfs
mounts coordinate exclusive access to shared resources through
communication with the mount that was elected leader.
A mount can malfunction and stop participating at which point it needs
to be safely isolated ("fenced off") from shared resources before other mounts can
have their turn at exclusive access.
.sp
Only the elected leader can fence mounts. As the leader decides that a
mount must be fenced, typically by timeouts expiring without
communication from the mount, it creates a fence request. Fence
requests are visible as directories in the leader mount's sysfs
directory. The fence request directory is named for the RID of the
mount being fenced. The directory contains the following files:
.RS
.TP
.B elapsed_secs
Reading this file gives the number of seconds that have passed since
this fence request was created.
.TP
.B error
This file contains 0 when the fence request is created. Userspace
fencing agents write 1 into this file if they are unable to fence the
mount. The volume cannot make progress until the mount is fenced, so
this will cause the server to stop and another mount will be elected
leader.
.TP
.B fenced
This file contains 0 when the fence request is created. Userspace
fencing agents write 1 into this file once the mount has been fenced.
.TP
.B ipv4_addr
This file contains the dotted quad IPv4 peer address of the last
connected socket from the mount. Userspace fencing agents can use this
to find the host that contains the mount.
.TP
.B reason
This file contains a text string that indicates the reason that the
mount is being fenced:
.B client_recovery
- During startup the server found persistent items recording the presence
of a mount that didn't reconnect to the server in time.
.sp
.B client_reconnect
- A mount disconnected from the server and didn't reconnect in time.
.sp
.B quorum_block_leader
- As a leader was elected it read persistent blocks that indicated that
a previous leader had not shut down and cleared their quorum block.
.TP
.B rid
This file contains the hex string of the RID of the mount to be fenced.
.RE
The request directories enable userspace processes to gather the
information needed to find the host with the mount to fence, isolate the
mount by whatever means are appropriate (e.g. cut off network and storage
communication, force unmount the mount, isolate storage fabric ports,
reboot the host), and write to the
.I fenced
file.
.sp
Once the
.I fenced
file is written to, the server reclaims the resources
associated with the fenced mount and resumes normal operations.
.sp
If the
.I error
file is written to, then the server cannot make forward progress and
shuts down. The request can similarly enter an errored state if enough
time passes before userspace completes the request.
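.sp
A minimal sketch of the agent side of this flow, using the same sysfs
wildcard path that the scoutfs-fenced daemon scans (assuming a single
pending request):
.sp
.nf
req=$(echo /sys/fs/scoutfs/*/fence/*)
# identify the mount to fence
cat $req/rid $req/ipv4_addr $req/reason
# ... isolate or force unmount the mount with that RID ...
echo 1 > $req/fenced    # or: echo 1 > $req/error on failure
.fi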
.SH CORRUPTION DETECTION
A
.B scoutfs
filesystem can detect corruption at runtime. A catalog of kernel log

View File

@@ -32,10 +32,18 @@ A path within a ScoutFS filesystem.
.PD
.TP
.BI "mkfs META-DEVICE DATA-DEVICE {-Q|--quorum-slot} NR,ADDR,PORT [-m|--max-meta-size SIZE] [-d|--max-data-size SIZE] [-f|--force]"
.BI "mkfs META-DEVICE DATA-DEVICE {-Q|--quorum-slot} NR,ADDR,PORT [-m|--max-meta-size SIZE] [-d|--max-data-size SIZE] [-z|--data-alloc-zone-blocks BLOCKS] [-f|--force] [-A|--allow-small-size]"
.sp
Initialize a new ScoutFS filesystem on the target devices. Since ScoutFS uses
separate block devices for its metadata and data storage, two are required.
The internal structures and nature of metadata and data transactions
lead to minimum viable device sizes.
.B mkfs
will check both devices and fail with an error if either is under the
minimum size. If
.B --allow-small-size
is given then sizes under the minimum size will be
allowed after printing an informational warning.
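.sp
For example, a small test filesystem like the one exercised by the
enospc test (device paths and sizes here are illustrative) could be
created with:
.sp
.nf
scoutfs mkfs -A -f -Q 0,127.0.0.1,53000 -m 10G /dev/test_meta /dev/test_data
.fi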
.sp
If
.B --force
@@ -81,6 +89,14 @@ kibibytes, mebibytes, etc.
.B "-d, --max-data-size SIZE"
Same as previous, but for limiting the size of the data device.
.TP
.B "-A, --allow-small-size"
Allows use of specified device sizes less than the minimum. This can
result in bad behaviour and is only intended for testing.
.TP
.B "-z, --data-alloc-zone-blocks BLOCKS"
Set the data_alloc_zone_blocks volume option, as described in
.BR scoutfs (5).
.TP
.B "-f, --force"
Ignore presence of existing data on the data and metadata devices.
.RE

View File

@@ -54,12 +54,15 @@ cp man/*.8.gz $RPM_BUILD_ROOT%{_mandir}/man8/.
install -m 755 -D src/scoutfs $RPM_BUILD_ROOT%{_sbindir}/scoutfs
install -m 644 -D src/ioctl.h $RPM_BUILD_ROOT%{_includedir}/scoutfs/ioctl.h
install -m 644 -D src/format.h $RPM_BUILD_ROOT%{_includedir}/scoutfs/format.h
install -m 755 -D fenced/scoutfs-fenced $RPM_BUILD_ROOT%{_libexecdir}/scoutfs-fenced/scoutfs-fenced
install -m 755 -D fenced/local-force-unmount $RPM_BUILD_ROOT%{_libexecdir}/scoutfs-fenced/run/local-force-unmount
%files
%defattr(644,root,root,755)
%{_mandir}/man*/scoutfs*.gz
%defattr(755,root,root,755)
%{_sbindir}/scoutfs
%{_libexecdir}/scoutfs-fenced
%files -n scoutfs-devel
%defattr(644,root,root,755)

View File

@@ -40,7 +40,7 @@ static void *alloc_val(struct scoutfs_btree_block *bt, int len)
{
le16_add_cpu(&bt->mid_free_len, -len);
le16_add_cpu(&bt->total_item_bytes, len);
return (void *)bt + le16_to_cpu(bt->mid_free_len);
return (void *)&bt->items[le16_to_cpu(bt->nr_items)] + le16_to_cpu(bt->mid_free_len);
}
/*

View File

@@ -6,12 +6,13 @@
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <errno.h>
#include <stdbool.h>
#include "sparse.h"
#include "dev.h"
int device_size(char *path, int fd,
u64 min_size, u64 max_size,
u64 min_size, u64 max_size, bool allow_small_size,
char *use_type, u64 *size_ret)
{
struct stat st;
@@ -63,10 +64,13 @@ int device_size(char *path, int fd,
if (size < min_size) {
fprintf(stderr,
BASE_SIZE_FMT" %s too small for min "
BASE_SIZE_FMT" %s device\n",
BASE_SIZE_FMT" %s device%s\n",
BASE_SIZE_ARGS(size), target_type,
BASE_SIZE_ARGS(min_size), use_type);
return -EINVAL;
BASE_SIZE_ARGS(min_size), use_type,
allow_small_size ? ", allowing with -A" : "");
if (!allow_small_size)
return -EINVAL;
}
*size_ret = size;

View File

@@ -1,6 +1,8 @@
#ifndef _DEV_H_
#define _DEV_H_
#include <stdbool.h>
#define BASE_SIZE_FMT "%.2f%s"
#define BASE_SIZE_ARGS(sz) size_flt(sz, 1), size_str(sz, 1)
@@ -8,7 +10,7 @@
#define SIZE_ARGS(nr, sz) (nr), size_flt(nr, sz), size_str(nr, sz)
int device_size(char *path, int fd,
u64 min_size, u64 max_size,
u64 min_size, u64 max_size, bool allow_small_size,
char *use_type, u64 *size_ret);
float size_flt(u64 nr, unsigned size);
char *size_str(u64 nr, unsigned size);

View File

@@ -86,6 +86,11 @@ static int do_df(struct df_args *args)
data_free += ade[i].blocks;
}
if (meta_free >= sfm.reserved_meta_blocks)
meta_free -= sfm.reserved_meta_blocks;
else
meta_free = 0;
snprintf(cells[0][0], CHARS, "Type");
snprintf(cells[0][1], CHARS, "Size");
snprintf(cells[0][2], CHARS, "Total");

View File

@@ -57,6 +57,15 @@ static int write_block(int fd, u32 magic, __le64 fsid, u64 seq, u64 blkno,
return 0;
}
/*
* Return the order of the length of a free extent, which we define as
* floor(log_8(len)): 0..7 = 0, 8..63 = 1, etc.
*/
static u64 free_extent_order(u64 len)
{
return (flsll(len | 1) - 1) / 3;
}
/*
* Write the single btree block that contains the blkno and len indexed
* items to store the given extent, and update the root to point to it.
@@ -72,31 +81,61 @@ static int write_alloc_root(int fd, __le64 fsid,
root->total_len = cpu_to_le64(len);
memset(&key, 0, sizeof(key));
key.sk_zone = SCOUTFS_FREE_EXTENT_ZONE;
key.sk_type = SCOUTFS_FREE_EXTENT_BLKNO_TYPE;
key.skii_ino = cpu_to_le64(SCOUTFS_ROOT_INO);
key.sk_zone = SCOUTFS_FREE_EXTENT_BLKNO_ZONE;
key.skfb_end = cpu_to_le64(start + len - 1);
key.skfb_len = cpu_to_le64(len);
btree_append_item(bt, &key, NULL, 0);
memset(&key, 0, sizeof(key));
key.sk_zone = SCOUTFS_FREE_EXTENT_ZONE;
key.sk_type = SCOUTFS_FREE_EXTENT_LEN_TYPE;
key.skii_ino = cpu_to_le64(SCOUTFS_ROOT_INO);
key.skfl_neglen = cpu_to_le64(-len);
key.skfl_blkno = cpu_to_le64(start);
key.sk_zone = SCOUTFS_FREE_EXTENT_ORDER_ZONE;
key.skfo_revord = cpu_to_le64(U64_MAX - free_extent_order(len));
key.skfo_end = cpu_to_le64(start + len - 1);
key.skfo_len = cpu_to_le64(len);
btree_append_item(bt, &key, NULL, 0);
return write_block(fd, SCOUTFS_BLOCK_MAGIC_BTREE, fsid, seq, blkno,
SCOUTFS_BLOCK_LG_SHIFT, &bt->hdr);
}
#define SCOUTFS_SERVER_DATA_FILL_TARGET \
((4ULL * 1024 * 1024 * 1024) >> SCOUTFS_BLOCK_SM_SHIFT)
static bool invalid_data_alloc_zone_blocks(u64 total_data_blocks, u64 zone_blocks)
{
u64 nr;
if (zone_blocks == 0)
return false;
if (zone_blocks < SCOUTFS_SERVER_DATA_FILL_TARGET) {
fprintf(stderr, "setting data_alloc_zone_blocks to '%llu' failed, must be at least %llu mount data allocation target blocks",
zone_blocks, SCOUTFS_SERVER_DATA_FILL_TARGET);
return true;
}
nr = total_data_blocks / SCOUTFS_DATA_ALLOC_MAX_ZONES;
if (zone_blocks < nr) {
fprintf(stderr, "setting data_alloc_zone_blocks to '%llu' failed, must be greater than %llu blocks which results in max %u zones",
zone_blocks, nr, SCOUTFS_DATA_ALLOC_MAX_ZONES);
return true;
}
if (zone_blocks > total_data_blocks) {
fprintf(stderr, "setting data_alloc_zone_blocks to '%llu' failed, must be at most %llu total data device blocks",
zone_blocks, total_data_blocks);
return true;
}
return false;
}
struct mkfs_args {
char *meta_device;
char *data_device;
unsigned long long max_meta_size;
unsigned long long max_data_size;
u64 data_alloc_zone_blocks;
bool force;
bool allow_small_size;
int nr_slots;
struct scoutfs_quorum_slot slots[SCOUTFS_QUORUM_MAX_SLOTS];
};
@@ -177,13 +216,15 @@ static int do_mkfs(struct mkfs_args *args)
goto out;
}
ret = device_size(args->meta_device, meta_fd, 2ULL * (1024 * 1024 * 1024),
args->max_meta_size, "meta", &meta_size);
/* minimum meta device size to make reserved blocks reasonably large */
ret = device_size(args->meta_device, meta_fd, 64ULL * (1024 * 1024 * 1024),
args->max_meta_size, args->allow_small_size, "meta", &meta_size);
if (ret)
goto out;
ret = device_size(args->data_device, data_fd, 8ULL * (1024 * 1024 * 1024),
args->max_data_size, "data", &data_size);
/* .. then arbitrarily the same minimum data device size */
ret = device_size(args->data_device, data_fd, 64ULL * (1024 * 1024 * 1024),
args->max_data_size, args->allow_small_size, "data", &data_size);
if (ret)
goto out;
@@ -198,7 +239,7 @@ static int do_mkfs(struct mkfs_args *args)
super->version = cpu_to_le64(SCOUTFS_INTEROP_VERSION);
uuid_generate(super->uuid);
super->next_ino = cpu_to_le64(SCOUTFS_ROOT_INO + 1);
super->next_trans_seq = cpu_to_le64(1);
super->seq = cpu_to_le64(1);
super->total_meta_blocks = cpu_to_le64(last_meta + 1);
super->first_meta_blkno = cpu_to_le64(next_meta);
super->last_meta_blkno = cpu_to_le64(last_meta);
@@ -210,6 +251,17 @@ static int do_mkfs(struct mkfs_args *args)
member_sizeof(struct scoutfs_super_block, qconf.slots));
memcpy(super->qconf.slots, args->slots, sizeof(args->slots));
if (invalid_data_alloc_zone_blocks(le64_to_cpu(super->total_data_blocks),
args->data_alloc_zone_blocks)) {
ret = -EINVAL;
goto out;
}
if (args->data_alloc_zone_blocks) {
super->volopt.set_bits |= cpu_to_le64(SCOUTFS_VOLOPT_DATA_ALLOC_ZONE_BLOCKS_BIT);
super->volopt.data_alloc_zone_blocks = cpu_to_le64(args->data_alloc_zone_blocks);
}
/* fs root starts with root inode and its index items */
blkno = next_meta++;
btree_init_root_single(&super->fs_root, bt, 1, blkno);
@@ -471,6 +523,20 @@ static int parse_opt(int key, char *arg, struct argp_state *state)
prev_val, args->max_data_size);
break;
}
case 'A':
args->allow_small_size = true;
break;
case 'z': /* data-alloc-zone-blocks */
{
ret = parse_u64(arg, &args->data_alloc_zone_blocks);
if (ret)
return ret;
if (args->data_alloc_zone_blocks == 0)
argp_error(state, "must provide non-zero data-alloc-zone-blocks");
break;
}
case ARGP_KEY_ARG:
if (!args->meta_device)
args->meta_device = strdup_or_error(state, arg);
@@ -499,8 +565,10 @@ static int parse_opt(int key, char *arg, struct argp_state *state)
static struct argp_option options[] = {
{ "quorum-slot", 'Q', "NR,ADDR,PORT", 0, "Specify quorum slot addresses [Required]"},
{ "force", 'f', NULL, 0, "Overwrite existing data on block devices"},
{ "allow-small-size", 'A', NULL, 0, "Allow specified meta/data devices less than minimum, still warns"},
{ "max-meta-size", 'm', "SIZE", 0, "Use a size less than the base metadata device size (bytes or KMGTP units)"},
{ "max-data-size", 'd', "SIZE", 0, "Use a size less than the base data device size (bytes or KMGTP units)"},
{ "data-alloc-zone-blocks", 'z', "BLOCKS", 0, "Divide data device into block zones so each mounts writes to a zone (4KB blocks)"},
{ NULL }
};

View File

@@ -32,7 +32,7 @@ struct move_blocks_args {
static int do_move_blocks(struct move_blocks_args *args)
{
struct scoutfs_ioctl_move_blocks mb;
struct scoutfs_ioctl_move_blocks mb = {0};
int from_fd = -1;
int to_fd = -1;
int ret;

View File

@@ -1,3 +1,4 @@
#define _GNU_SOURCE /* ffsll for glibc < 2.27 */
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
@@ -157,7 +158,7 @@ static print_func_t find_printer(u8 zone, u8 type)
type <= SCOUTFS_INODE_INDEX_DATA_SEQ_TYPE)
return print_inode_index;
if (zone == SCOUTFS_RID_ZONE) {
if (zone == SCOUTFS_ORPHAN_ZONE) {
if (type == SCOUTFS_ORPHAN_TYPE)
return print_orphan;
}
@@ -209,8 +210,8 @@ static int print_logs_item(struct scoutfs_key *key, void *val,
/* only items in leaf blocks have values */
if (val) {
liv = val;
printf(" log_item_value: vers %llu flags %x\n",
le64_to_cpu(liv->vers), liv->flags);
printf(" log_item_value: seq %llu flags %x\n",
le64_to_cpu(liv->seq), liv->flags);
/* deletion items don't have values */
if (!(liv->flags & SCOUTFS_LOG_ITEM_FLAG_DELETION)) {
@@ -244,15 +245,15 @@ static int print_logs_item(struct scoutfs_key *key, void *val,
le64_to_cpu((p)->blkno), le64_to_cpu((p)->seq)
#define AL_HEAD_F \
AL_REF_F" total_nr %llu first_nr %u"
AL_REF_F" total_nr %llu first_nr %u flags 0x%x"
#define AL_HEAD_A(p) \
AL_REF_A(&(p)->ref), le64_to_cpu((p)->total_nr),\
le32_to_cpu((p)->first_nr)
le32_to_cpu((p)->first_nr), le32_to_cpu((p)->flags)
#define ALCROOT_F \
BTROOT_F" total_len %llu"
BTROOT_F" total_len %llu flags 0x%x"
#define ALCROOT_A(ar) \
BTROOT_A(&(ar)->root), le64_to_cpu((ar)->total_len)
BTROOT_A(&(ar)->root), le64_to_cpu((ar)->total_len), le32_to_cpu((ar)->flags)
#define SRE_FMT "%016llx.%llu.%llu"
#define SRE_A(sre) \
@@ -272,6 +273,9 @@ static int print_log_trees_item(struct scoutfs_key *key, void *val,
unsigned val_len, void *arg)
{
struct scoutfs_log_trees *lt = val;
u64 zones;
int bit;
int i;
printf(" rid %llu nr %llu\n",
le64_to_cpu(key->sklt_rid), le64_to_cpu(key->sklt_nr));
@@ -285,9 +289,12 @@ static int print_log_trees_item(struct scoutfs_key *key, void *val,
" data_avail: "ALCROOT_F"\n"
" data_freed: "ALCROOT_F"\n"
" srch_file: "SRF_FMT"\n"
" max_item_vers: %llu\n"
" max_item_seq: %llu\n"
" rid: %016llx\n"
" nr: %llu\n",
" nr: %llu\n"
" flags: %llx\n"
" data_alloc_zone_blocks: %llu\n"
" data_alloc_zones: ",
AL_HEAD_A(&lt->meta_avail),
AL_HEAD_A(&lt->meta_freed),
lt->item_root.height,
@@ -298,9 +305,24 @@ static int print_log_trees_item(struct scoutfs_key *key, void *val,
ALCROOT_A(&lt->data_avail),
ALCROOT_A(&lt->data_freed),
SRF_A(&lt->srch_file),
le64_to_cpu(lt->max_item_vers),
le64_to_cpu(lt->max_item_seq),
le64_to_cpu(lt->rid),
le64_to_cpu(lt->nr));
le64_to_cpu(lt->nr),
le64_to_cpu(lt->flags),
le64_to_cpu(lt->data_alloc_zone_blocks));
for (i = 0; i < SCOUTFS_DATA_ALLOC_ZONE_LE64S; i++) {
if (lt->data_alloc_zones[i] == 0)
continue;
zones = le64_to_cpu(lt->data_alloc_zones[i]);
while (zones) {
bit = ffsll(zones) - 1;
printf("%u ", (i * 64) + bit);
zones ^= (1ULL << bit);
}
}
printf("\n");
}
return 0;
@@ -339,14 +361,6 @@ static int print_srch_root_item(struct scoutfs_key *key, void *val,
return 0;
}
static int print_lock_clients_entry(struct scoutfs_key *key, void *val,
unsigned val_len, void *arg)
{
printf(" rid %016llx\n", le64_to_cpu(key->sklc_rid));
return 0;
}
static int print_trans_seqs_entry(struct scoutfs_key *key, void *val,
unsigned val_len, void *arg)
{
@@ -360,9 +374,79 @@ static int print_mounted_client_entry(struct scoutfs_key *key, void *val,
unsigned val_len, void *arg)
{
struct scoutfs_mounted_client_btree_val *mcv = val;
struct in_addr in;
printf(" rid %016llx flags 0x%x\n",
le64_to_cpu(key->skmc_rid), mcv->flags);
memset(&in, 0, sizeof(in));
in.s_addr = htonl(le32_to_cpu(mcv->addr.v4.addr));
printf(" rid %016llx ipv4_addr %s flags 0x%x\n",
le64_to_cpu(key->skmc_rid), inet_ntoa(in), mcv->flags);
return 0;
}
static int print_log_merge_item(struct scoutfs_key *key, void *val,
unsigned val_len, void *arg)
{
struct scoutfs_log_merge_status *stat;
struct scoutfs_log_merge_range *rng;
struct scoutfs_log_merge_request *req;
struct scoutfs_log_merge_complete *comp;
struct scoutfs_log_merge_freeing *fr;
switch (key->sk_zone) {
case SCOUTFS_LOG_MERGE_STATUS_ZONE:
stat = val;
printf(" status: next_range_key "SK_FMT" nr_req %llu nr_comp %llu"
" last_seq %llu seq %llu\n",
SK_ARG(&stat->next_range_key),
le64_to_cpu(stat->nr_requests),
le64_to_cpu(stat->nr_complete),
le64_to_cpu(stat->last_seq),
le64_to_cpu(stat->seq));
break;
case SCOUTFS_LOG_MERGE_RANGE_ZONE:
rng = val;
printf(" range: start "SK_FMT" end "SK_FMT"\n",
SK_ARG(&rng->start),
SK_ARG(&rng->end));
break;
case SCOUTFS_LOG_MERGE_REQUEST_ZONE:
req = val;
printf(" request: logs_root "BTROOT_F" logs_root "BTROOT_F" start "SK_FMT
" end "SK_FMT" last_seq %llu rid %016llx seq %llu flags 0x%llx\n",
BTROOT_A(&req->logs_root),
BTROOT_A(&req->root),
SK_ARG(&req->start),
SK_ARG(&req->end),
le64_to_cpu(req->last_seq),
le64_to_cpu(req->rid),
le64_to_cpu(req->seq),
le64_to_cpu(req->flags));
break;
case SCOUTFS_LOG_MERGE_COMPLETE_ZONE:
comp = val;
printf(" complete: root "BTROOT_F" start "SK_FMT" end "SK_FMT
" remain "SK_FMT" rid %016llx seq %llu flags %llx\n",
BTROOT_A(&comp->root),
SK_ARG(&comp->start),
SK_ARG(&comp->end),
SK_ARG(&comp->remain),
le64_to_cpu(comp->rid),
le64_to_cpu(comp->seq),
le64_to_cpu(comp->flags));
break;
case SCOUTFS_LOG_MERGE_FREEING_ZONE:
fr = val;
printf(" freeing: root "BTROOT_F" key "SK_FMT" seq %llu\n",
BTROOT_A(&fr->root),
SK_ARG(&fr->key),
le64_to_cpu(fr->seq));
break;
default:
printf(" (unknown log merge key zone %u)\n", key->sk_zone);
break;
}
return 0;
}
@@ -370,17 +454,17 @@ static int print_mounted_client_entry(struct scoutfs_key *key, void *val,
static int print_alloc_item(struct scoutfs_key *key, void *val,
unsigned val_len, void *arg)
{
if (key->sk_type == SCOUTFS_FREE_EXTENT_BLKNO_TYPE)
if (key->sk_zone == SCOUTFS_FREE_EXTENT_BLKNO_ZONE)
printf(" free extent: blkno %llu len %llu end %llu\n",
le64_to_cpu(key->skfb_end) -
le64_to_cpu(key->skfb_len) + 1,
le64_to_cpu(key->skfb_len),
le64_to_cpu(key->skfb_end));
else
printf(" free extent: blkno %llu len %llu neglen %lld\n",
le64_to_cpu(key->skfl_blkno),
-le64_to_cpu(key->skfl_neglen),
(long long)le64_to_cpu(key->skfl_neglen));
printf(" free extent: blkno %llu len %llu order %llu\n",
le64_to_cpu(key->skfo_end) - le64_to_cpu(key->skfo_len) + 1,
le64_to_cpu(key->skfo_len),
(long long)(U64_MAX - le64_to_cpu(key->skfo_revord)));
return 0;
}
@@ -800,16 +884,16 @@ static char *alloc_addr_str(union scoutfs_inet_addr *ia)
static int print_quorum_blocks(int fd, struct scoutfs_super_block *super)
{
struct print_events {
size_t offset;
char *name;
} events[] = {
OFF_NAME(write), OFF_NAME(update_term), OFF_NAME(set_leader),
OFF_NAME(clear_leader), OFF_NAME(fenced),
const static char *event_names[] = {
[SCOUTFS_QUORUM_EVENT_BEGIN] = "begin",
[SCOUTFS_QUORUM_EVENT_TERM] = "term",
[SCOUTFS_QUORUM_EVENT_ELECT] = "elect",
[SCOUTFS_QUORUM_EVENT_FENCE] = "fence",
[SCOUTFS_QUORUM_EVENT_STOP] = "stop",
[SCOUTFS_QUORUM_EVENT_END] = "end",
};
struct scoutfs_quorum_block *blk = NULL;
struct scoutfs_quorum_block_event *ev;
char *log_addr = NULL;
u64 blkno;
int ret;
int i;
@@ -818,6 +902,7 @@ static int print_quorum_blocks(int fd, struct scoutfs_super_block *super)
for (i = 0; i < SCOUTFS_QUORUM_BLOCKS; i++) {
blkno = SCOUTFS_QUORUM_BLKNO + i;
free(blk);
blk = NULL;
ret = read_block(fd, blkno, SCOUTFS_BLOCK_SM_SHIFT, (void **)&blk);
if (ret)
goto out;
@@ -825,28 +910,27 @@ static int print_quorum_blocks(int fd, struct scoutfs_super_block *super)
printf("quorum blkno %llu (slot %llu)\n",
blkno, blkno - SCOUTFS_QUORUM_BLKNO);
print_block_header(&blk->hdr, SCOUTFS_BLOCK_SM_SIZE);
printf(" term %llu random_write_mark 0x%llx flags 0x%llx\n",
le64_to_cpu(blk->term),
le64_to_cpu(blk->random_write_mark),
le64_to_cpu(blk->flags));
for (e = 0; e < array_size(events); e++) {
ev = (void *)blk + events[e].offset;
for (e = 0; e < array_size(event_names); e++) {
ev = &blk->events[e];
printf(" %12s: rid %016llx ts %llu.%08u\n",
events[e].name, le64_to_cpu(ev->rid),
le64_to_cpu(ev->ts.sec),
le32_to_cpu(ev->ts.nsec));
printf(" %12s: rid %016llx term %llu ts %llu.%08u\n",
event_names[e], le64_to_cpu(ev->rid), le64_to_cpu(ev->term),
le64_to_cpu(ev->ts.sec), le32_to_cpu(ev->ts.nsec));
}
}
ret = 0;
out:
free(log_addr);
free(blk);
return ret;
}
#define BTR_FMT "blkno %llu seq %016llx height %u"
#define BTR_ARG(rt) \
le64_to_cpu((rt)->ref.blkno), le64_to_cpu((rt)->ref.seq), (rt)->height
static void print_super_block(struct scoutfs_super_block *super, u64 blkno)
{
char uuid_str[37];
@@ -866,7 +950,7 @@ static void print_super_block(struct scoutfs_super_block *super, u64 blkno)
printf(" flags: 0x%016llx\n", le64_to_cpu(super->flags));
/* XXX these are all in a crazy order */
printf(" next_ino %llu next_trans_seq %llu\n"
printf(" next_ino %llu seq %llu\n"
" total_meta_blocks %llu first_meta_blkno %llu last_meta_blkno %llu\n"
" total_data_blocks %llu first_data_blkno %llu last_data_blkno %llu\n"
" meta_alloc[0]: "ALCROOT_F"\n"
@@ -876,13 +960,14 @@ static void print_super_block(struct scoutfs_super_block *super, u64 blkno)
" server_meta_avail[1]: "AL_HEAD_F"\n"
" server_meta_freed[0]: "AL_HEAD_F"\n"
" server_meta_freed[1]: "AL_HEAD_F"\n"
" lock_clients root: height %u blkno %llu seq %llu\n"
" mounted_clients root: height %u blkno %llu seq %llu\n"
" srch_root root: height %u blkno %llu seq %llu\n"
" trans_seqs root: height %u blkno %llu seq %llu\n"
" fs_root btree root: height %u blkno %llu seq %llu\n",
" fs_root: "BTR_FMT"\n"
" logs_root: "BTR_FMT"\n"
" log_merge: "BTR_FMT"\n"
" trans_seqs: "BTR_FMT"\n"
" mounted_clients: "BTR_FMT"\n"
" srch_root: "BTR_FMT"\n",
le64_to_cpu(super->next_ino),
le64_to_cpu(super->next_trans_seq),
le64_to_cpu(super->seq),
le64_to_cpu(super->total_meta_blocks),
le64_to_cpu(super->first_meta_blkno),
le64_to_cpu(super->last_meta_blkno),
@@ -896,21 +981,20 @@ static void print_super_block(struct scoutfs_super_block *super, u64 blkno)
AL_HEAD_A(&super->server_meta_avail[1]),
AL_HEAD_A(&super->server_meta_freed[0]),
AL_HEAD_A(&super->server_meta_freed[1]),
super->lock_clients.height,
le64_to_cpu(super->lock_clients.ref.blkno),
le64_to_cpu(super->lock_clients.ref.seq),
super->mounted_clients.height,
le64_to_cpu(super->mounted_clients.ref.blkno),
le64_to_cpu(super->mounted_clients.ref.seq),
super->srch_root.height,
le64_to_cpu(super->srch_root.ref.blkno),
le64_to_cpu(super->srch_root.ref.seq),
super->trans_seqs.height,
le64_to_cpu(super->trans_seqs.ref.blkno),
le64_to_cpu(super->trans_seqs.ref.seq),
super->fs_root.height,
le64_to_cpu(super->fs_root.ref.blkno),
le64_to_cpu(super->fs_root.ref.seq));
BTR_ARG(&super->fs_root),
BTR_ARG(&super->logs_root),
BTR_ARG(&super->log_merge),
BTR_ARG(&super->trans_seqs),
BTR_ARG(&super->mounted_clients),
BTR_ARG(&super->srch_root));
printf(" volume options:\n"
" set_bits: %016llx\n",
le64_to_cpu(super->volopt.set_bits));
if (le64_to_cpu(super->volopt.set_bits) & SCOUTFS_VOLOPT_DATA_ALLOC_ZONE_BLOCKS_BIT) {
printf(" data_alloc_zone_blocks: %llu\n",
le64_to_cpu(super->volopt.data_alloc_zone_blocks));
}
printf(" quorum config version %llu\n",
le64_to_cpu(super->qconf.version));
@@ -947,11 +1031,6 @@ static int print_volume(int fd)
ret = print_quorum_blocks(fd, super);
err = print_btree(fd, super, "lock_clients", &super->lock_clients,
print_lock_clients_entry, NULL);
if (err && !ret)
ret = err;
err = print_btree(fd, super, "mounted_clients", &super->mounted_clients,
print_mounted_client_entry, NULL);
if (err && !ret)
@@ -962,6 +1041,11 @@ static int print_volume(int fd)
if (err && !ret)
ret = err;
err = print_btree(fd, super, "log_merge", &super->log_merge,
print_log_merge_item, NULL);
if (err && !ret)
ret = err;
for (i = 0; i < array_size(super->server_meta_avail); i++) {
snprintf(str, sizeof(str), "server_meta_avail[%u]", i);
err = print_alloc_list_block(fd, str,

View File

@@ -37,6 +37,7 @@ static struct stat_more_field inode_fields[] = {
INODE_FIELD(data_version),
INODE_FIELD(online_blocks),
INODE_FIELD(offline_blocks),
{ .name = "crtime", .offset = INODE_FIELD_OFF(crtime_sec) },
{ NULL, }
};
@@ -60,6 +61,9 @@ static void print_inode_field(void *st, size_t off)
case INODE_FIELD_OFF(offline_blocks):
printf("%llu", stm->offline_blocks);
break;
case INODE_FIELD_OFF(crtime_sec):
printf("%llu.%09u", stm->crtime_sec, stm->crtime_nsec);
break;
};
}