The server was checking all client log_trees items to search for the
lowest commit seq that was still open. This can be expensive when there
are a lot of finalized log_trees items that won't have open seqs. Only
the last log_trees item for each client rid can be open, and the items
are sorted by rid and nr, so we can easily only check the last item for
each client rid.
Signed-off-by: Zach Brown <zab@versity.com>
During get_log_trees the server checks log_trees items to see if it
should start a log merge operation. It did this by iterating over all
log_trees items and there can be quite a lot of them.
It doesn't need to see all of the items. It only needs to see the most
recent log_trees item for each mount. That's enough to make the
decisions that start the log merging process.
Signed-off-by: Zach Brown <zab@versity.com>
Add a sysfs file for getting and setting the delay between srch
compaction requests from the client. We'll use this in testing to
ensure compaction runs promptly.
Signed-off-by: Zach Brown <zab@versity.com>
Compacting sorted srch files can take multiple transactions because they
can be very large. Each transaction resumes at a byte offset in a block
where the previous transaction stopped.
The resuming code tests that the byte offsets are sane but had a mistake
in testing the offset to skip to. It returned an error if the
compaction resumed from the last possible safe offset for decoding
entries.
If a system is unlucky enough to have a compaction transaction stop at
just this offset then compaction stops making forward progress as each
attempt to resume returns an error.
The fix allows continuation from this last safe offset while returning
errors for attempts to continue *past* that offset. This matches all
the encoding code which allows encoding the last entry in the block at
this offset.
Signed-off-by: Zach Brown <zab@versity.com>
Add a test for srch compaction getting stuck hitting errors continuing a
partial operation. It ensures that a block has an encoded entry at
the _SAFE_BYTES offset, that an operaton stops precisely at that
offset, and then watches for errors.
Signed-off-by: Zach Brown <zab@versity.com>
The srch compaction request building function and the srch compaction
worker both have logic to recognize a valid response with no input files
indicating that there's no work to do. The server unfortunately
translated nr == 0 into ENOENT and send that error response to the
client. This caused the client to increment error counters in the
common case when there's no compaction work to perform. We'd like the
error counter to reflect actual errors, we're about to check it in a
test, so let's fix this up to the server sends a sucessful response with
nr == 0 to indicate that there's no work to do.
Signed-off-by: Zach Brown <zab@versity.com>
The server had a few lower level seqcounts that it used to protect
state. One user got it wrong by forgetting to disable pre-emption
around writers. Debug kernels warned as write_seqcount_begin() was
called without preemption disabled.
We fix that user and make it easier to get right in the future by having
one higher level seqlock and using that consistently for seq read
begin/retry and write lock/unlock patterns.
Signed-off-by: Zach Brown <zab@versity.com>
The rpmbuild support files no longer define the previously used kernel
module macros. This carves out the differences between el7 and el8 with
conditionals based on the distro we are building for.
Signed-off-by: Ben McClelland <ben.mcclelland@versity.com>
In rhel7 this is a nested struct with ktime_t. However, in rhel8
ktime_t is a simple s64, and not a union, and thus we can't do
this as easily. Just memset it.
Signed-off-by: Auke Kok <auke.kok@versity.com>
New kernels expect to do a partial match when a .prefix is used here,
and provide a .name member in case matching should look at the whole
string. This is what we want.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The caller takes care of caching for us. Us doing caching
messes with memory management of cached ACLs and breaks.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The aio_read and aio_write callbacks are no longer used by newer
kernels which now uses iter based readers and writers.
We can avoid implementing plain .read and .write as an iter will
be generated when needed for us automatically.
We add a new data_wait_check_iter() function accordingly.
With these methods removed from the kernel, the el8 kernel no
longer uses the extended ops wrapper struct and is much closer now
to upstream. As a result, a lot of methods are moving around from
inode_dir_operations to and from inode_file_operations etc, and
perhaps things will look a bit more structured as a result.
As a result, we need a slightly different data_wait_check() that
accounts for the iter and offset properly.
Signed-off-by: Auke Kok <auke.kok@versity.com>
.readpages is obsolete in el8 kernels. We implement the .readahead
method instead which is passed a struct readahead_control. We use
the readahead_page(rac) accessor to retrieve page by page from the
struct.
Signed-off-by: Auke Kok <auke.kok@versity.com>
v4.9-12228-g530e9b76ae8f Drops all (un)register_(hot)cpu_notifier()
API functions. From here on we need to use the new cpuhp_* API.
We avoid this entirely for now, at the cost of leaking pages until
the filesystem is unmounted.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Convert the timeout struct unto a u64 nsecs value before passing it to
the trace point event, as to not overflow the 64bit limitation on args.
Signed-off-by: Auke Kok <auke.kok@versity.com>
v4.16-rc1-1-g9b2c45d479d0
This interface now returns (sizeof (addr)) on success, instead of 0.
Therefore, we have to change the error condition detection.
The compat for older kernels handles the addrlen check internally.
Signed-off-by: Auke Kok <auke.kok@versity.com>
MS_* flags from <linux/mount.h> should not be used in the kernel
anymore from 4.x onwards. Instead, we need to use the SB_* versions
Signed-off-by: Auke Kok <auke.kok@versity.com>
Move to the more recent interfaces for counting and scanning cached
objects to shrink.
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
Move towards modern bio intefaces, while unfortunately carrying along a
bunch of compat functions that let us still work with the old
incompatible interfaces.
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
memalloc_nofs_save() was introduced as preferential to trying to use GFP
flags to indicate that a task should not recurse during reclaim. We use
it instead of the _noio_ we were using before.
Signed-off-by: Zach Brown <zab@versity.com>
__percpu_counter_add_batch was renamed to make it clear that the __
doesn't mean it's less safe, as it means in other calls in the API, but
just that it takes an additional parameter.
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
There are new interfaces available but the old one has been retained
for us to use. In case of older kernels, we will need to fall back
to the previous name of these functions.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Provide fallback in degraded mode for kernels pre-v4.15-rc3 by directly
manipulating the member as needed.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Since v4.6-rc3-27-g9902af79c01a, inode->i_mutex has been replaced
with ->i_rwsem. However, long since whenever, inode_lock() and
related functions already worked as intended and provided fully
exclusive locking to the inode.
To avoid a name clash on pre-rhel8 kernels, we have to rename a
stack variable in `src/file.c`.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Since v4.15-rc3-4-gae5e165d855d, <linux/iversion.h> contains a new
inode->i_version API and it is not included by default.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The new variant of the code that recomputes the augmented value
is designed to handle non-scalar types and to facilitate that, it
has new semantics for the _compute callback. It is now passed a
boolean flag `exit` that indicates that if the value isn't changed,
it should exit and halt propagation.
The callback function now shall return whether that propagation should
stop or not, and not the computed new value. The callback can now
directly update the new computed value in the node.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Fixes: Error: implicit declaration of function ‘blkdev_put’
Previously this was an `extern` in <fs.h> and included implicitly,
hence the need to hard include it now.
Signed-off-by: Auke Kok <auke.kok@versity.com>
v4.1-rc4-22-g92cf211874e9 merges this into preempt.h, and on
rhel7 kernels we don't need this include anymore either.
Signed-off-by: Auke Kok <auke.kok@versity.com>
v3.15-rc1-6-g1a56f2aa4752 removes flush_work_sync entirely, but
ever since v3.6-rc1-25-g606a5020b9bd which made all workqueues
non-reentrant, it has been equivalent to flush_work.
This is safe because in all cases only one server->work can be
in flight at a time.
Signed-off-by: Auke Kok <auke.kok@versity.com>
v3.18-rc3-2-g230fa253df63 forces us to remove ACCESS_ONCE() with
READ_ONCE(), but it is probably the better interface and works with
non-scalar types.
Signed-off-by: Auke Kok <auke.kok@versity.com>
PAGE_CACHE_SIZE was previously defined to be equivalent to PAGE_SIZE.
This symbol was removed in v4.6-rc1-32-g1fa64f198b9f.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Because we `-include src/kernelcompat.h` from the command line,
this header gets included before any of the kernel includes in
most .c and .h files. We should at least make sure we pull in
<fs> and <kernel> since they're required.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Server code that wants to dirty blocks by holding a commit won't be
allowed to until the current allocators for the server transaction have
enough space for the holder. As an active holder applies the commit the
allocators are refilled and the waiting holders will proceed.
But the current allocators can have no resources as the server starts
up. There will never be active holders to apply the commit and refill
the allocators. In this case all the holders will block indefinitely.
The fix is to trigger a server commit when a holder doesn't have room.
It used to be that commits were only triggered when apply callers were
waiting. We transfer some of that logic into a new 'committing' field
so that we can have commits in flight without apply callers waiting. We
add it to the server commit tracing.
While we're at it we clean up the logic that tests if a hold can
proceed. It used to be confusingly split across two functions that both
could sample the current allocator space remaining. This could lead to
weird cases where the first holder could use the second alloc remaining
call, not the one whose values were tested to see if the holder could
fit. Now each hold check only samples the allocators once.
And finally we fix a subtle case where the budget exceeded message can
spuriously trigger in the case where dirtying the freed list created a
new empty block after the holder recorded the amount of space in the
freed block.
Signed-off-by: Zach Brown <zab@versity.com>
Data preallocation attempts to allocate large aligned regions of
extents. It tried to fill the hole around a write offset that
didn't contain an extent. It missed the case where there can be
multiple extents between the start of the region and the hole.
It could try to overwrite these additional existing extents and writes
could return EINVAL.
We fix this by trimming the preallocation to start at the write offset
if there are any extents in the region before the write offset. The
data preallocation test output has to be updated now that allocation
extents won't grow towards the start of the region when there are
existing extents.
Signed-off-by: Zach Brown <zab@versity.com>
Log merge completions were spliced in one server commit. It's possible
to get enough completion work pending that it all can't be completed in
one server commit. Operations fail with ENOSPC and because these
changes can't be unwound cleanly the server asserts.
This allows the completion splicing to break the work up into multiple
commits.
Processing completions in multiple commits means that request creation
can observe the merge status in states that weren't possible before.
Splicing is careful to maintain an elevated nr_complete count while the
client can't get requests because the tree is rebalancing.
Signed-off-by: Zach Brown <zab@versity.com>