The k-way merge function at the core of the srch file entry merging had
some bookkeeping math (calculating the number of parents) that couldn't
handle merging a single incoming entry stream, so it threw a warning and
returned an error. In refusing to handle that case it assumed that the
caller was trying to merge down a single log file, which doesn't make
any sense.
But in the case of multiple small unsorted logs we can absolutely end up
with their entries stored in one sorted page. We have one sorted input
page that's merging multiple log files. The merge function is also the
path that writes to the output file so we absolutely need to handle this
case.
We now calculate the number of parents more carefully, clamping to one
parent where we'd otherwise get "(roundup(1) -> 1) - 1 == 0" from the
number of inputs. The warning and error are relaxed so they only refuse
to merge nothing at all.
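Roughly, and with hypothetical names rather than the exact srch code, the
calculation now looks something like:

    /*
     * Sketch: derive the number of parent nodes from the number of
     * input streams, clamping to at least one parent so that a single
     * sorted input still gets a valid tree to merge through.
     */
    static unsigned int calc_nr_parents(unsigned int nr_inputs)
    {
            unsigned int nr = DIV_ROUND_UP(nr_inputs, MERGE_FANOUT) - 1;

            return nr ? nr : 1;
    }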
The test triggers this case by putting single search entries in the log
files for mounts and unmounting them to force rotation of the mount log
files into mergeable rotated log files.
Signed-off-by: Zach Brown <zab@versity.com>
Our statfs implementation had clients reading the super block and using
the next free inode number to guess how many inodes there might be. We
are very aggressive with giving directories private pools of inode
numbers to allocate from. They're often not used at all, creating huge
gaps in allocated inode numbers. The ratio of the average number of
allocations per directory to the batch size given to each directory is
the factor that the used inode count can be off by.
Now that we have a precise count of active inodes we can use that to
return accurate counts of inodes in the files fields in the statfs
struct. We still don't have static inode allocation so the fields don't
make a ton of sense. We fake the total to give a reasonable estimate of
the total files that doesn't change, while the free count is calculated
from the correct count of used inodes.
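As a sketch of the idea, with a hypothetical constant standing in for the
faked total:

    /* statfs inode fields: a fixed round estimate for the total so it
     * doesn't wander as inode number batches are handed out, and a
     * free count derived from the precise used count */
    kst->f_files = FAKE_TOTAL_INODES;
    kst->f_ffree = FAKE_TOTAL_INODES - used_inodes;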
While we're at it we add a request to get the summed fields that the
server can cheaply discover in cache rather than having the client
always perform read IOs.
Signed-off-by: Zach Brown <zab@versity.com>
Add an alloc_foreach variant which uses the caller's super to walk the
allocators rather than always reading it off the device.
Signed-off-by: Zach Brown <zab@versity.com>
Add a count of used inodes to the super block and a change in the inode
count to the log_trees struct. Client transactions track the change in
inode count as they create and delete inodes. The log_trees delta is
added to the count in the super as finalized log_trees are deleted.
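In rough terms, with hypothetical field names standing in for the format
structs:

    /* the server folds the client's signed delta into the super's
     * total as it deletes each finalized log_trees */
    total = le64_to_cpu(super->inode_count) +
            (s64)le64_to_cpu(lt->inode_count_delta);
    super->inode_count = cpu_to_le64(total);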
Signed-off-by: Zach Brown <zab@versity.com>
We had previously started on a relatively simple notion of an
interoperability version which wasn't quite right. This fleshes out
support for a more functional format version. The super blocks have a
single version that defines behaviour of the running system. The code
supports a range of versions and we add some initial interfaces for
updating the version while the system is offline. All of this together
should let us safely change the underlying format over time.
Signed-off-by: Zach Brown <zab@versity.com>
Add a write_nr field to the quorum block header which is incremented
with every write. Each event also gets a write_nr field that is set to
the incremented value from the header. This gives us a history of the
order of event updates that isn't sensitive to misconfigured time.
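Roughly, with hypothetical struct names:

    /* every quorum block write bumps the header counter and stamps the
     * event being recorded with the new value */
    le64_add_cpu(&blk->write_nr, 1);
    event->write_nr = blk->write_nr;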
Signed-off-by: Zach Brown <zab@versity.com>
We're adding another command that does block IO so move some block
reading and writing functions out of mkfs. We also grow a few function
variants and call the write_sync variant from mkfs instead of having it
manually sync.
Signed-off-by: Zach Brown <zab@versity.com>
The code that shows the note sections as files uses the section size to
define the size of the notes payload. We don't need to null terminate
the strings to define their lengths. Doing so puts a null in the notes
file which isn't appreciated by many readers.
Signed-off-by: Zach Brown <zab@versity.com>
The test harness might as well use all cpus when building. It's
reasonably safe to assume both that the test systems are otherwise idle
and that the build is likely to succeed.
Signed-off-by: Zach Brown <zab@versity.com>
TCP keepalive probes only work when the connection is idle. They're not
sent when there's unacked send data being retransmitted. If the server
fails while we're retransmitting we don't break the connection and try
to elect and connect to a new server until the very long default
connection timeouts fire or the server comes back and the stale connection is
aborted.
We can set TCP_USER_TIMEOUT to break an unresponsive connection when
there's unacked written data. It changes the behavior of the keepalive probes
so we rework them a bit to clearly apply our timeout consistently
between the two mechanisms.
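In userspace terms the combination looks something like the following; our
kernel socket code sets the equivalent in-kernel options, and the specific
values here are just illustrative:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static void set_timeouts(int fd, unsigned int timeout_secs)
    {
            int ka = 1;
            int idle = 10, intvl = 10, cnt = timeout_secs / 10;
            unsigned int user_timeout_ms = timeout_secs * 1000;

            /* keepalive covers the idle case */
            setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &ka, sizeof(ka));
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));

            /* TCP_USER_TIMEOUT covers unacked written data, so both
             * mechanisms break the connection after the same interval */
            setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &user_timeout_ms,
                       sizeof(user_timeout_ms));
    }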
Signed-off-by: Zach Brown <zab@versity.com>
As the server comes up it needs to fence any previous servers before it
assumes exclusive access to the device. If fencing fails it can leave
fence requests behind. The error path for these very early failures
didn't shut down fencing so we'd have lingering fence requests span the
life cycle of server startup and shutdown. The next time the server
starts up in this mount it can try to create the fence request again,
get an error because a lingering one already exists, and immediately
shut down.
The result is that fencing errors that hit that initial attempt during
server startup can become persistent fencing errors for the lifetime of
that mount, preventing it from ever successfully starting the server.
Moving the fence stop call so that it's hit by all the exiting error paths
consistently cleans up fence requests and avoids this problem. The next server
instance will get a chance to process the fence request again. It might
well hit the same error, but at least it gets a chance.
Signed-off-by: Zach Brown <zab@versity.com>
The current script gets stuck in an infinite loop when the test suite is
started with 1 mount point. The problem is in the part of the script
that advances the ops for each mount. The while loop detects when
op_mnt wraps by checking if it equals 0, but we set each of the op_mnts
to 0 during the advancement, so after wrapping it still equals 0 and the
loop never exits. The fix is to check at the end of the loop whether
the last op's mount number wrapped and, if so, break out.
Signed-off-by: Bryant Duffy-Ly <bduffyly@versity.com>
In some of the allocation paths there are goto statements
that end up calling kfree(). That is fine, but in cases
where the pointer is not initially set to NULL we can end up freeing an
uninitialized pointer, which is undefined behavior. kfree() on a NULL
pointer does nothing, so these changes shouldn't change behavior in
practice, but they make the error paths clearer.
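The pattern being cleaned up looks roughly like this (hypothetical
structs):

    struct foo *a = NULL;
    struct bar *b = NULL;
    int ret;

    a = kzalloc(sizeof(*a), GFP_NOFS);
    if (!a) {
            ret = -ENOMEM;
            goto out;
    }

    b = kzalloc(sizeof(*b), GFP_NOFS);
    if (!b) {
            ret = -ENOMEM;
            goto out;
    }

    ret = 0;
    out:
            /* kfree(NULL) is a no-op, so the shared error path can
             * free pointers that were never allocated */
            kfree(b);
            kfree(a);
            return ret;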
Signed-off-by: Bryant Duffy-Ly <bduffyly@versity.com>
Unfortunately, we're back in kernels that don't yet have d_op->d_init.
We allocate our dentry info manually as we're given dentries. The
recent verification work forgot to consistently make sure the info was
allocated before using it. Fix that up, and while we're at it be a bit
more robust in how we check to see that it's been initialized without
grabbing the d_lock.
Signed-off-by: Zach Brown <zab@versity.com>
This adds i_version to our inode and maintains it as we allocate, load,
modify, and store inodes. We set the flag in the superblock so
in-kernel users can use i_version to see changes in our inodes.
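Roughly (the exact flag and helper names vary across the kernel versions
we support):

    /* advertise i_version support on the super block ... */
    sb->s_flags |= MS_I_VERSION;    /* SB_I_VERSION on newer kernels */

    /* ... and bump it whenever we modify an inode we're logging */
    inode_inc_iversion(inode);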
Signed-off-by: Zach Brown <zab@versity.com>
More recent gcc notices that ret in delete_files can be undefined if nr
is 0 while missing that we won't call delete_files in that case. Seems
worth fixing, regardless.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quick test to make sure that create is validating stale dentries
before deciding if it should create or return -EEXIST.
Signed-off-by: Zach Brown <zab@versity.com>
Add the .totl. xattr tag. When the tag is set the end of the name
specifies a total name with 3 encoded u64s separated by dots. The value
of the xattr is a u64 that is added to the named total. An ioctl is
added to read the totals.
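For example, from userspace, adding 100 to a total might look something
like this (the full xattr name prefix and path here are illustrative, not
the literal format):

    #include <stdint.h>
    #include <sys/xattr.h>

    uint64_t value = 100;

    /* the three dotted u64s after the .totl. tag name the total that
     * the value is added to */
    setxattr("/mnt/fs/file", "scoutfs.totl.1.2.3", &value, sizeof(value), 0);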
Signed-off-by: Zach Brown <zab@versity.com>
The fs log btrees have values that start with a header that stores the
item's seq and flags. There's a lot of sketchy code that manipulates
the value header as items are passed around.
This adds the seq and flags as core item fields in the btree. They're
only set by the interfaces that are used to store fs items: _insert_list
and _merge. The rest of the btree items that use the main interface
don't work with the fields.
This was done to help delta items discover when logged items have been
merged before the finalized log btrees are deleted, and the code ends up
being quite a bit cleaner.
Signed-off-by: Zach Brown <zab@versity.com>
Add an inode creation time field. It's created for all new inodes.
It's visible to stat_more. setattr_more can set it during
restore.
Signed-off-by: Zach Brown <zab@versity.com>
Our dir methods were trusting dentry args. The vfs code paths use
i_mutex to protect dentries across revalidate or lookup and method
calls. But that doesn't protect methods running in other mounts.
Multiple nodes can interleave between the initial lookup or revalidate
and the actual method call.
Rename got this right. It is very paranoid about verifying inputs after
acquiring all the locks it needs.
We extend this pattern to the rest of the methods that need to use the
mapping of name to inode (and our hash and pos) in dentries. Once we
acquire the parent dir lock we verify that the dentry is still current,
returning -EEXIST or -ENOENT as appropriate.
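The shape of the check in each method is roughly the following, with
hypothetical helper names:

    /* after taking the dir's cluster lock, make sure the dentry still
     * maps (or still doesn't map) the name to the inode the vfs saw */
    ret = lock_dir(dir, &lock);
    if (ret)
            return ret;

    if (!dentry_still_current(dir, dentry, lock)) {
            /* -EEXIST or -ENOENT depending on the method */
            ret = stale_ret;
            goto unlock;
    }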
Along these lines, we tighten up dentry info correctness a bit by
updating our dentry info (recording lock coverage and hash/pos) for
negative dentries produced by lookup or as the result of unlink.
Signed-off-by: Zach Brown <zab@versity.com>
Client lock invalidation handling was very strict about not receiving
duplicate invalidation requests from the server because it could only
track one pending request. The promise to only send one invalidate at a
time is made by a single server; it can't be enforced across server
failover, particularly because invalidation processing can require quite
a lot of work with the server as it tears down the state associated with
the lock.
We fix this by recording and processing each individual incoming
invalidation request on the lock.
The code that handled reordering of incoming grant responses and
invalidation requests waited for the lock's mode to match the old mode
in the invalidation request before proceeding. That would have
prevented duplicate invalidation requests from making forward progress.
To fix this we make lock client receive processing synchronous instead
of going through async work which can reorder. Now grant responses are
processed as they're received and will always be resolved before all the
invalidation requests are queued and processed in order.
Signed-off-by: Zach Brown <zab@versity.com>
The forest reader reads items from the fs_root and all log btrees and
gives them to the caller who tracks them to resolve version differences.
The reads can run into stale blocks which have been overwritten. The
forest reader was implementing the retry under the item state in the
caller. This can corrupt items that are only seen first in an old fs
root before a merge and then only seen in the fs_root after a merge. In
this case the item won't have any versioning and the existing version
from the old fs_root is preferred. This is particularly bad when the
new version was deleted -- in that case we have no metadata which would
tell us to drop the old item that was read from the old fs_root.
This is fixed by pushing the retry up to callers who wipe the item state
before each retry. Now each set of items is related to a single
snapshot of the fs_root and logs at one point in time.
I haven't seen definitive evidence of this happening in practice. I
found this problem after putting on my craziest thinking toque and
auditing the code for places where we could lose item updates.
Signed-off-by: Zach Brown <zab@versity.com>
Btree merging attempted to build an rbtree of the input roots with only
one version of an item present in the rbtree at a time. It really
messed this up by completely dropping an input root when a root with a
newer version of its item tried to take its place in the rbtree. What
it should have done is advance to the next item in the older root, which
itself could have required advancing some other older root. Dropping
the root entirely is catastrophically wrong because it hides the rest of
the items in the root from merging. This has been manifesting as
occasional mysterious item loss during tests where memory pressure, item
update patterns, and merging all lined up just so.
This fixes the problem by more clearly keeping the next item from each
root in the rbtree. We sort by newest to oldest version so that once we
merge the most recent version of an item it's easy to skip all the older
versions of that item in the following rbtree entries from the rest of
the input roots.
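The ordering boils down to a comparison along these lines (hypothetical
types and helpers):

    /* sort merge positions by item key, then newest seq first, so the
     * first entry merged for a key is the newest and the older
     * duplicates behind it can be skipped as their roots advance */
    static int merge_pos_cmp(struct merge_pos *a, struct merge_pos *b)
    {
            int cmp = compare_keys(&a->key, &b->key);

            if (cmp)
                    return cmp;
            if (a->seq != b->seq)
                    return a->seq > b->seq ? -1 : 1;
            return 0;
    }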
While we're at it we work with references to the static cached input
btree blocks. The old code was a first pass that used an expensive
btree walk per item and copied the value payload.
Signed-off-by: Zach Brown <zab@versity.com>
When the xattr inode searches fail the test will eventually fail when the
output differs, but that could take a while. Have it fail much sooner
so that we can have tighter debugging iterations and trace ring buffer
contents that are likely to be a lot closer to the first failure.
Signed-off-by: Zach Brown <zab@versity.com>
The current orphan scan uses the forest_next_hint to look for candidate
orphan items to delete. It doesn't skip deleted items and checks the
forest of log btrees so it'd return hints for every single item that
existed in all the log btrees across the system. And we make that hint
call once per item.
When the system is deleting a lot of files we end up generating a huge
load where all mounts are constantly getting the btree roots from the
server, reading all the newest log btree blocks, finding deleted orphan
items for inodes that have already been deleted, and moving on to the
next deleted orphan item.
The fix is to use a read-only traversal of only one version of the fs
root for all the items in one scan. This avoids all the deleted orphan
items that exist in the log btrees which will disappear when they're
merged. It lets the item iteration happen in a single read-only cached
btree instead of constantly reading in the most recently written root
block of every log btree.
The result is an enormous speedup of large deletions. I don't want to
describe exactly how enormous.
Signed-off-by: Zach Brown <zab@versity.com>
We can be performing final deletion as inodes are evicted during
unmount. We have to keep full locking, transactions, and networking up
and running for the evict_inodes() call in generic_shutdown_super().
Unfortunately, this means that workers can be using inode references
during evict_inodes() which prevents them from being evicted. Those
workers can then remain running as we tear down the system, causing
crashes and deadlocks as the final iputs try to use resources that have
been destroyed.
The fix is to first properly stop orphan scanning, which can instantiate
new cached inodes, before the call to kill_block_super ends up trying
to evict all inodes. Then we just need to wait for any pending iput and
invalidate work to finish and perform the final iput, which will always
evict because generic_shutdown_super has cleared MS_ACTIVE.
Signed-off-by: Zach Brown <zab@versity.com>
Add some simple tracking of message counts for each lock in the lock
server so that we can start to see where conflicts may be happening in a
running system.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quick helper that can be used to avoid doing work if we know that
we're already shutting down. This can be a single coarser indicator
than adding functions to each subsystem to track that we're shutting
down.
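Something along these lines, with hypothetical names:

    static inline bool shutting_down(struct sb_info *sbi)
    {
            return test_bit(FLAG_SHUTTING_DOWN, &sbi->flags);
    }

    /* callers can cheaply skip optional work */
    if (shutting_down(sbi))
            return 0;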
Signed-off-by: Zach Brown <zab@versity.com>
Currently the first inode number that can be allocated directly follows
the root inode. This means the first batch of allocated inodes are in
the same lock group as the root inode.
The root inode is a bit special. It is always hot as absolute path
lookups and inode-to-path resolution always read directory entries from
the root.
Let's try aligning the first free inode number to the next inode lock
group boundary. This will stop work in those inodes from necessarily
conflicting with work in the root inode.
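The change amounts to rounding up, roughly (hypothetical constant names):

    /* start the free inode space at the next lock group boundary past
     * the root inode so the first allocations land in their own group */
    first_free_ino = roundup(ROOT_INO + 1, INODES_PER_LOCK_GROUP);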
Signed-off-by: Zach Brown <zab@versity.com>
We had some logic to try and delay lock invalidation while the lock was
still actively in use. This was trying to reduce the cost of
pathological lock conflict cases but it had some severe fairness
problems.
It was first introduced to deal with bad patterns in userspace that no
longer exist and it was built on top of the LSM transaction machinery
that also no longer exists. It hasn't aged well.
Instead of introducing invalidation latency in the hopes that it leads
to more batched work, which it can't always, let's aim more towards
reducing latency in all parts of the write-invalidate-read path and
also aim towards reducing contention in the first place.
Signed-off-by: Zach Brown <zab@versity.com>
We have a problem where items can appear to go backwards in time because
of the way we chose which log btrees to finalize and merge.
Because we don't have versions in items in the fs_root, and even might
not have items at all if they were deleted, we always assume items in
log btrees are newer than items in the fs root.
This creates the requirement that we can't merge a log btree if it has
items that are also present in older versions in other log btrees which
are not being merged. The unmerged old item in the log btree would take
precedence over the newer merged item in the fs root.
We weren't enforcing this requirement at all. We used the max_item_seq
to ensure that all items were older than the current stable seq but that
says nothing about the relationship between older items in the finalized
and active log btrees. Nothing at all stops an active btree from having
an old version of a newer item that is present in another mount's
finalized log btree.
To reliably fix this we create a strict item seq discontinuity between
all the finalized merge inputs and all the active log btrees. Once any
log btree is naturally finalized, the server forces all the clients to
group up and finalize all their open log btrees. A merge operation can
then safely operate on all the finalized trees before any new trees are
given to clients, who would start using increasing item seqs.
Signed-off-by: Zach Brown <zab@versity.com>
Add a command for the server to request that clients commit their open
transaction. This will be used to create groups of finalized log
btrees for consistent merging.
Signed-off-by: Zach Brown <zab@versity.com>
We were checking that quorum_slot_nr was within the range of possible
slots allowed by the format as it was parsed. We weren't checking that
it referenced a configured slot. Make sure, and give a nice error
message that shows the configured slots.
Signed-off-by: Zach Brown <zab@versity.com>
During rough forced unmount testing we saw a seemingly mysterious
concurrent election. It could be explained if mounts coming up don't
start with the same term. Let's try having mounts initialize their term
to the greatest of all the terms they can see in the quorum blocks.
This will prevent the situation where some new quorum actors with
greater terms start out ignoring all the messages from others.
Signed-off-by: Zach Brown <zab@versity.com>
Nothing interesting here, just a minor convenience to use test and set
instead of testing and then setting.
Signed-off-by: Zach Brown <zab@versity.com>
The server doesn't give us much to go on when it gets an error handling
requests to work with log trees from the client. This adds a lot of
specific error messages so we can get a better understanding of
failures.
Signed-off-by: Zach Brown <zab@versity.com>
We were trusting the rid in the log trees struct that the client sent.
Compare it to our recorded rid on the connection and fail if the client
sent the wrong rid.
Signed-off-by: Zach Brown <zab@versity.com>
The locking protocol only allows one outstanding invalidation request
for a lock at a time. The client invalidation state is a bit hairy and
involves removing the lock from the invalidation list while it is being
processed which includes sending the response. This means that another
request can arrive while the lock is not on the invalidation list. We
have fields in the lock to record another incoming request which puts
the lock back on the list.
But the invalidation work wasn't always queued again in this case. It
*looks* like the incoming request path would queue the work, but by
definition the lock isn't on the invalidation list during this race. If
it's the only lock in play then the invalidation list will be empty and
the work won't be queued. The lock can get stuck with a pending
invalidation if nothing else kicks the invalidation worker. We saw this
in testing when the root inode lock group missed the wakeup.
The fix is to have the work requeue itself after putting the lock back
on the invalidation list when it notices that another request came in.
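Roughly, with hypothetical names:

    /* after finishing an invalidation, notice that another request
     * arrived while the lock was off the list, put it back, and
     * requeue the work ourselves instead of relying on the request
     * path to do it */
    spin_lock(&linfo->lock);
    if (lock->another_request_pending) {
            list_add_tail(&lock->inv_entry, &linfo->inv_list);
            queue_work(linfo->wq, &linfo->inv_work);
    }
    spin_unlock(&linfo->lock);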
Signed-off-by: Zach Brown <zab@versity.com>