The orphan inodes test needs to test if inode items exist as it
manipulates inodes. It used to open the inode by a handle but we're
fixing that to not allow opening unlinked files. The
get-allocated-inos ioctl tests for the presence of items owned by the
inode regardless of any other vfs state so we can use it to verify what
scoutfs is doing as we work with the vfs inodes.
Signed-off-by: Zach Brown <zab@versity.com>
Add the get-allocated-inos scoutfs command which wraps the
GET_ALLOCATED_INOS ioctl. It'll be used by tests to find items
associated with an inode instead of trying to open the inode by a
constructed handle after it was unlinked.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can give some indication of inodes that have inode
items. We're exposing this for tests that verify the handling of open
unlinked inodes.
Signed-off-by: Zach Brown <zab@versity.com>
We're adding an ioctl that wants to build inode item keys so let's
export the private inode key initializer.
Signed-off-by: Zach Brown <zab@versity.com>
This reverts commit 61ad844891.
This fix was trying to ensure that lock recovery response handling
can't run after farewell calls reclaim_rid() by jumping through a bunch
of hoops to tear down locking state as the first farewell request
arrived.
It introduced very slippery use after free during shutdown. It appears
that it was from drain_workqueue() previously being able to stop
chaining work. That's no longer possible when you're trying to drain
two workqueues that can queue work in each other.
We found a much clearer way to solve the problem so we can toss this.
Signed-off-by: Zach Brown <zab@versity.com>
We recently found that the server can send a farewell response and try
to tear down a client's lock state while it was still in lock recovery
with the client. The lock recovery response could add a lock
for the client after farewell's reclaim_rid() had thought the client was
gone forever and tore down its locks.
This left a lock in the lock server that wasn't associated with any
clients and so could never be invalidated. Attempts to acquire
conflicting locks with it would hang forever, which we saw as hangs in
testing with lots of unmounting.
We tried to fix it by serializing incoming request handling and
forcefully clobbering the client's lock state as we first got
the farewell request. That went very badly.
This takes another approach of trying to explicitly wait for lock
recovery to finish before sending farewell responses. It's more in
line with the overall pattern of having the client be up and functional
until farewell tears it down.
With this in place we can revert the other attempted fix that was
causing so many problems.
Signed-off-by: Zach Brown <zab@versity.com>
The local-force-unmount fenced fencing script only works when all the
mounts are on the local host and it uses force unmount. It is only
used in our specific local testing scripts. Packaging it as an example
led people to believe that it could be used to cobble together a
multi-host testing network, however temporary.
Move it from being in utils and packaged to being private to our tests so
that it doesn't present an attractive nuisance.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_recov_shutdown() tried to move the recovery tracking structs off
the shared list and into a private list so they could be freed. But
then it went and walked the now empty shared list to free entries. It
should walk the private list.
This would leak a small amount of memory in the rare cases where the
server was shut down while recovery was still pending.
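
The corrected teardown pattern can be sketched in plain userspace C. This
is only an illustration under stated assumptions: struct pending,
shared_list, pending_add() and pending_shutdown() are hypothetical names,
and the locking that a real shared list needs is left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a recovery tracking struct. */
struct pending {
	struct pending *next;
};

/* Hypothetical shared list head; real code would protect it with a lock. */
static struct pending *shared_list;

/* Add one tracking entry to the shared list. */
static int pending_add(void)
{
	struct pending *p = malloc(sizeof(*p));

	if (!p)
		return -1;
	p->next = shared_list;
	shared_list = p;
	return 0;
}

/* Free every pending entry and return how many were freed. */
static int pending_shutdown(void)
{
	struct pending *private_list;
	struct pending *p;
	int freed = 0;

	/* Detach all entries; the shared list is empty from here on. */
	private_list = shared_list;
	shared_list = NULL;

	/* Walk the private list.  Walking shared_list here would free nothing. */
	while ((p = private_list) != NULL) {
		private_list = p->next;
		free(p);
		freed++;
	}

	assert(shared_list == NULL);
	return freed;
}
```

The return count is only there to make the leak visible: with the buggy
walk of the emptied shared list it would always be zero.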
Signed-off-by: Zach Brown <zab@versity.com>
Back when we added the get/commit transaction sequence numbers to the
log_trees we forgot to add them to the scoutfs print output.
Signed-off-by: Zach Brown <zab@versity.com>
The server's little set_shutting_down() helper accidentally used a read
barrier instead of a write barrier.
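
As a rough userspace analog only, the pairing can be sketched with C11
atomics standing in for the kernel's barrier primitives. The struct, the
last_error field and is_shutting_down() are hypothetical, and the real
helper's arguments are not shown in this message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical server state; the field names are illustrative. */
struct server_state {
	int last_error;
	atomic_bool shutting_down;
};

/* Store the error, then the flag.  Release ordering plays the write barrier's role. */
static void set_shutting_down(struct server_state *s, int err)
{
	assert(s != NULL);
	s->last_error = err;
	atomic_store_explicit(&s->shutting_down, true, memory_order_release);
}

/* Readers pair with an acquire load, the analog of a read barrier after the test. */
static bool is_shutting_down(struct server_state *s, int *err)
{
	if (!atomic_load_explicit(&s->shutting_down, memory_order_acquire))
		return false;
	*err = s->last_error;
	return true;
}
```

The writer side needs the store-ordering primitive and the reader side
the load-ordering one; using the reader's primitive in the writer gives
no guarantee that earlier stores are visible before the flag.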
Signed-off-by: Zach Brown <zab@versity.com>
Tear down client lock server state and set a boolean so that
there is no race between client/server processing lock recovery
at the same time as farewell.
Currently there is a bug where if server and clients are unmounted
then work from the client is processed out of order, which leaves
behind a server_lock for a RID that no longer exists.
In order to fix this we need to serialize SCOUTFS_NET_CMD_FAREWELL
in recv_worker.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
This unit test reproduces the race we have between
client and server doing lock recovery while farewell
is processed.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
The max_seq and active reader mechanisms in the item cache stop readers
from reading old items and inserting them in the cache after newer items
have been reclaimed by memory pressure. The max_seq field in the pages
must reflect the greatest seq of the items in the page so that reclaim
knows that the page contains items newer than old readers and must not
be removed.
We update the page max_seq as items are inserted or as they're dirtied
in the page. There's an additional subtle effect that the max_seq can
also protect items which have been erased. Deletion items are erased
from the pages as a commit completes. The max_seq in that page will
still protect it from being reclaimed even though no items have that seq
value themselves.
That protection fails if the range of keys containing the erased item is
moved to another page with a lower max_seq. The item mover only
updated the destination page's max_seq for each item that was moved. It
missed that the empty space between the items might have a larger
max_seq from an erased item. We don't know where the erased item is so
we have to assume that a larger max_seq in the source page must be set
on the destination page.
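
The rule described above can be sketched as follows. This is a minimal
illustration, not the item cache code: struct cache_page and
page_moved_range() are hypothetical and keep only the max_seq field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified cached page: only the field that matters here. */
struct cache_page {
	uint64_t max_seq;
};

static uint64_t max_u64(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/*
 * Called when a key range moves from src to dst.  Per-item seqs are not
 * enough: an erased item may have left src->max_seq higher than any
 * surviving item, so the whole source max_seq is carried over.
 */
static void page_moved_range(struct cache_page *dst, const struct cache_page *src)
{
	assert(dst != NULL && src != NULL);
	dst->max_seq = max_u64(dst->max_seq, src->max_seq);
}
```

This is deliberately conservative: the destination may end up protected
by a seq that belongs to a range it did not receive, which only delays
reclaim.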
This could explain very rare item cache corruption where nodes were
seeing deleted directory entry items reappearing. It would take a
specific sequence of events involving large directories with an isolated
removal, a delayed item cache reader, a commit, and then enough
insertions to split the page all happening in precisely the wrong
sequence.
Signed-off-by: Zach Brown <zab@versity.com>
Add a command to change the quorum config. It starts by only supporting
updating the super block while the file system is offline.
Signed-off-by: Zach Brown <zab@versity.com>
We're adding a command to change the quorum config which updates its
version number. Let's make the version a little more visible and start
it at the more humane 1.
Signed-off-by: Zach Brown <zab@versity.com>
Move the code that checks that the super is in use from
change-format-version into its own function in util.c. We'll use it in
an upcoming command to change the quorum config.
Signed-off-by: Zach Brown <zab@versity.com>
Move functions for printing and validating the quorum config from mkfs.c
to quorum.c so that they can be used in an upcoming command to change
the quorum config.
Signed-off-by: Zach Brown <zab@versity.com>
The change from --quorum-count to --quorum-slot forgot to update a
mention of the option in an error message in mkfs when it wasn't
provided.
Signed-off-by: Zach Brown <zab@versity.com>
We want to enable the test cases for:
generic/023 - tests that renameat2 syscall exists
generic/024 - renameat2 with NOREPLACE flag
Move both generic/025 and generic/078 to the no run list so that
we can test the output for flags that are passed but are not
supported.
Example output:
generic/025 [not run] fs doesn't support RENAME_EXCHANGE
generic/078 [not run] fs doesn't support RENAME_WHITEOUT
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
The goal of the test case is to have two mount points
with two async calls made to do renameat2. This allows
for two calls to race to call renameat2 RENAME_NOREPLACE.
When this happens you expect one of them to fail with a
-EEXIST. This would validate that the new flag works.
Essentially one of the two calls to renameat2 should hit the
new RENAME_NOREPLACE code and exit early.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
Support the generic renameat2 syscall, then add support for the
RENAME_NOREPLACE flag. To support the flag we need to check
the existence of both entries and return -EEXIST.
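
From userspace the flag's behavior can be exercised with a small wrapper
like the one below. It is a sketch, not part of this change:
rename_noreplace() is a hypothetical helper, and it uses the raw syscall
so that it does not depend on a libc wrapper being present.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef RENAME_NOREPLACE
#define RENAME_NOREPLACE (1 << 0)	/* value from the Linux uapi headers */
#endif

/*
 * Rename old_path to new_path only if new_path does not exist.
 * Returns 0 on success or a negative errno, such as -EEXIST.
 */
static int rename_noreplace(const char *old_path, const char *new_path)
{
	long ret;

	assert(old_path != NULL && new_path != NULL);
	ret = syscall(SYS_renameat2, AT_FDCWD, old_path, AT_FDCWD, new_path,
		      RENAME_NOREPLACE);
	return ret == 0 ? 0 : -errno;
}
```

Calling it with an existing destination is expected to return -EEXIST
and leave both files in place. File systems without support for the flag
typically return -EINVAL instead.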
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
The current test case attempts to create a state to read
by calling setattr and getattr in an attempt to force block
cache reads. It so happens that this does not always force
cache block reads, which in rare cases causes this test case
to fail.
The new test case removes all the extra bouncing around of mount
points and we just directly call scoutfs df which will walk
everyone's allocators to summarize the block counts, which is
guaranteed to exist. Therefore, we do not have to create any sort
of state prior to trying to force a read.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
Let's try maintaining release notes in a file in the repo. There are
lots of schemes for associating commits and release notes and this seems
like the simplest place to start.
Signed-off-by: Zach Brown <zab@versity.com>
[85164.299902] scoutfs f.8c19e1.r.facf2e error: server error writing btree blocks: -5
[144308.589596] scoutfs f.c9397a.r.8ae97f error: server error -5 freeing merged btree blocks: looping commit del/upd freeing item
[174646.005596] scoutfs f.15f0b3.r.1862df error: server error -5 freeing merged btree blocks: final commit del/upd freeing item
[146653.893676] scoutfs f.c7f188.r.34e23c error: server error writing super block: -5
[273218.436675] scoutfs f.dd4157.r.f0da7e error: server failed to bind to 127.0.0.1:42002, err -98
[376832.542823] scoutfs f.049985.r.1a8987 error: error -5 reading quorum block 19 to update event 1 term 3
The above is example output that will be filtered out.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
We do not want to short-circuit btree_walk early, it is
better to handle the force unmount on the caller side.
Therefore, remove this from btree_walk.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
If there is a forced unmount we call _net_shutdown from
umount_begin in order to tell the server and clients to
break out of pending network replies. We then add the call
to abort within the shutdown_worker since most of the mucking
with send and resend queues is done there.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
Only BUG_ON for inconsistency. Don't do it for commit errors
or for failure to delete the original request.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
In scoutfs_server_worker we do not properly handle the clean up
of _block_writer_init and alloc_init. On error paths, if either
of those is initialized, we can call alloc_prepare_commit or
writer_forget_all to ensure we drop the block references and
clear the dirty status of all the blocks in the writer.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
Remove a bunch of old language from the README. We're no longer in the
early days of the open release so we can remove all the alpha quality
language. And the system has grown sufficiently that the repo README
isn't a great place for a small getting started doc. There just isn't
room to do the subject justice. If we need such a thing for the
project we'll put it as a first order doc in the repo that'd be
distributed along with everything else.
Signed-off-by: Zach Brown <zab@versity.com>
In order to safely free blocks we need to first dirty
the work. This allows for resume later on without a double
free.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
As we update xattrs we need to update any existing old items with the
contents of the new xattr that uses those items. The loop that updated
existing items only took the old xattr size into account and assumed
that the new xattr would use those items. If the new xattr size used
fewer parts then the attempt to update all the old parts that weren't
covered by the new size would go very wrong. The length of the region
in the new xattr would be negative so it'd try to use the max part
length. Worse, it'd copy these max part length regions outside the
input new xattr buffer. Typically this would land in addressable memory
and copy garbage into the unused old items before they were later
deleted.
However, it could access so far outside the input buffer that it could
cross a page boundary into inaccessible memory and fault. We saw this in
the field while trying to repeatedly incrementally shrink a large xattr.
This fixes the loop that updates overlapping items between the new and
old xattr to start with the smaller of their two item counts. Now it
will only update items that are actually used by both xattrs and will
only safely access the new xattr input buffer.
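
A minimal sketch of the fixed loop bound follows, under simplified
assumptions: PART_BYTES, nr_parts() and update_shared_parts() are
hypothetical, and items are modeled as fixed-size byte arrays rather
than real xattr part items.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PART_BYTES 8	/* hypothetical max bytes per xattr part item */

/* Number of part items needed to hold an xattr of the given size. */
static unsigned int nr_parts(size_t bytes)
{
	return (bytes + PART_BYTES - 1) / PART_BYTES;
}

/*
 * Overwrite the items shared by the old and new xattr with the new bytes.
 * items[] has room for old_bytes; returns how many items were updated.
 * Old items past the returned count are left for the caller to delete.
 */
static unsigned int update_shared_parts(char items[][PART_BYTES], size_t old_bytes,
					const char *new_val, size_t new_bytes)
{
	unsigned int old_parts = nr_parts(old_bytes);
	unsigned int new_parts = nr_parts(new_bytes);
	unsigned int shared = old_parts < new_parts ? old_parts : new_parts;
	unsigned int i;

	for (i = 0; i < shared; i++) {
		size_t off = (size_t)i * PART_BYTES;
		size_t len = new_bytes - off;	/* cannot underflow: i < new_parts */

		if (len > PART_BYTES)
			len = PART_BYTES;
		assert(off + len <= new_bytes);
		memcpy(items[i], new_val + off, len);
	}

	return shared;
}
```

Bounding the loop by the smaller part count is what keeps the remaining
length from going negative, so every copy stays inside the new value's
buffer.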
Signed-off-by: Zach Brown <zab@versity.com>
From now on if we make incompatible changes to structures or messages
then we update the format version and ensure that the code can deal with
all the versions in its supported range.
Signed-off-by: Zach Brown <zab@versity.com>
We had arbitrarily chosen an ioctl code 's' to match scoutfs, but of
course that conflicts. This chooses an arbitrary hole in the upstream
reservations from ioctl-number.rst.
Then we make sure to have our _IO[WR] usage reflect the direction of the
final type parameter. For most of our ioctls userspace is writing an
argument parameter to perform an operation (that often has side
effects). Most of our ioctls should be _IOW because userspace is
writing the parameter, not _IOR (though the operation tends to read
state). A few ioctls copy output back to userspace in the parameter so
they're _IOWR.
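
The direction convention can be illustrated with two made-up ioctl
definitions. Everything named EXAMPLE_ or example_ here is hypothetical,
including the type byte, which is a placeholder and not taken from this
change.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>

#define EXAMPLE_IOCTL_MAGIC 0xE8	/* placeholder type byte for this sketch */

/* Hypothetical argument that userspace fills in; nothing is copied back. */
struct example_release_args {
	uint64_t offset;
	uint64_t length;
};

/* Hypothetical argument whose result field the kernel writes back. */
struct example_stat_args {
	uint64_t ino;
	uint64_t result;
};

/* Userspace writes the argument: _IOW, even if the operation reads state. */
#define EXAMPLE_IOC_RELEASE _IOW(EXAMPLE_IOCTL_MAGIC, 1, struct example_release_args)

/* Userspace writes the argument and reads output from it: _IOWR. */
#define EXAMPLE_IOC_STAT _IOWR(EXAMPLE_IOCTL_MAGIC, 2, struct example_stat_args)

/* The direction bits encode who writes the parameter, checked at compile time. */
static_assert(_IOC_DIR(EXAMPLE_IOC_RELEASE) == _IOC_WRITE,
	      "release argument is only written by userspace");
static_assert(_IOC_DIR(EXAMPLE_IOC_STAT) == (_IOC_READ | _IOC_WRITE),
	      "stat argument is written and then read by userspace");
```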
Signed-off-by: Zach Brown <zab@versity.com>
The idea here was that we'd expand the size of the struct and
valid_bytes would tell the kernel which fields were present in
userspace's struct. That doesn't combine well with the ioctl convention
of having the size of the type baked into the ioctl number. We'll
remove this to make the world less surprising. If we expand the
interface we'd add additional ioctls and types.
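
The conflict with the ioctl convention can be seen in a small sketch.
The struct and macro names are hypothetical and the type byte is a
placeholder. Growing the argument struct changes the size encoded in the
ioctl number, so the grown struct is already a different ioctl.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>

#define SKETCH_IOCTL_MAGIC 0xE8	/* placeholder type byte for this sketch */

/* Hypothetical original argument struct. */
struct sketch_stat_v1 {
	uint64_t meta_seq;
	uint64_t data_seq;
};

/* Hypothetical expanded struct with one more field at the end. */
struct sketch_stat_v2 {
	uint64_t meta_seq;
	uint64_t data_seq;
	uint64_t extra;
};

/* Same direction, type and command number; only the struct differs. */
#define SKETCH_IOC_STAT_V1 _IOR(SKETCH_IOCTL_MAGIC, 5, struct sketch_stat_v1)
#define SKETCH_IOC_STAT_V2 _IOR(SKETCH_IOCTL_MAGIC, 5, struct sketch_stat_v2)

/* The encoded size makes the two request numbers distinct. */
static_assert(SKETCH_IOC_STAT_V1 != SKETCH_IOC_STAT_V2,
	      "expanding the struct yields a new ioctl number");
```

Because the kernel can already tell the two apart by request number, a
separate valid_bytes field adds a second, redundant way to describe the
struct size.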
Signed-off-by: Zach Brown <zab@versity.com>
While checking in on some other code I noticed that we have lingering
allocator and writer contexts over in the lock server. The lock server
used to manage its own client state and recovery. We've since moved
that into shared recov functionality in the server. The lock server no
longer manipulates its own btrees and doesn't need these unused
references to the server's contexts.
Signed-off-by: Zach Brown <zab@versity.com>
Introduce some space between the current key zone and type values so
that we have room to insert new keys amongst the current keys if we need
to. A spacing of 4 is arbitrarily chosen as small enough to still give
us intuitively small numbers while leaving enough room to grow, given
how long it's taken to come to the current number of keys.
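
As an illustration of the spacing only, with hypothetical zone names and
an assumed starting value (the real zone and type values are not listed
in this message):

```c
#include <assert.h>

#define KEY_SPACING 4	/* the spacing named in the text */

/* Hypothetical zone names; only the spacing pattern is the point. */
enum sketch_key_zone {
	SKETCH_ZONE_A = 1 * KEY_SPACING,	/* 4 */
	SKETCH_ZONE_B = 2 * KEY_SPACING,	/* 8 */
	SKETCH_ZONE_C = 3 * KEY_SPACING,	/* 12 */
};

/* Number of unused values strictly between two adjacent zones. */
static int free_slots_between(int lower, int upper)
{
	assert(lower < upper);
	return upper - lower - 1;
}
```

With a spacing of 4, three new values fit between any two neighbors
without renumbering the existing keys.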
Signed-off-by: Zach Brown <zab@versity.com>