Compacting sorted srch files can take multiple transactions because they
can be very large. Each transaction resumes at a byte offset in a block
where the previous transaction stopped.
The resuming code tests that the byte offsets are sane but had a mistake
in testing the offset to skip to. It returned an error if the
compaction resumed from the last possible safe offset for decoding
entries.
If a system is unlucky enough to have a compaction transaction stop at
just this offset then compaction stops making forward progress as each
attempt to resume returns an error.
The fix allows continuation from this last safe offset while returning
errors for attempts to continue *past* that offset. This matches all
the encoding code which allows encoding the last entry in the block at
this offset.
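As an illustrative sketch only (the real field and constant names in the
code differ), the resume bound effectively changes from rejecting the
boundary offset to allowing it:

    /* hypothetical resume sanity check */
    if (pos > SRCH_BLOCK_SAFE_BYTES)    /* was: pos >= SRCH_BLOCK_SAFE_BYTES */
            return -EINVAL;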
Signed-off-by: Zach Brown <zab@versity.com>
Add a test for srch compaction getting stuck with errors while
continuing a partial operation. It ensures that a block has an encoded
entry at the _SAFE_BYTES offset, that an operation stops precisely at
that offset, and then watches for errors.
Signed-off-by: Zach Brown <zab@versity.com>
The srch compaction request building function and the srch compaction
worker both have logic to recognize a valid response with no input files
indicating that there's no work to do. The server unfortunately
translated nr == 0 into ENOENT and sent that error response to the
client. This caused the client to increment error counters in the
common case when there's no compaction work to perform. We'd like the
error counter to reflect actual errors, and we're about to check it in
a test, so let's fix this up so that the server sends a successful
response with nr == 0 to indicate that there's no work to do.
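A minimal sketch of the intended server behaviour, with made-up names
standing in for the real request handling code:

    /* illustrative only: nothing-to-do is a successful, empty response */
    nr = count_compaction_input_files(req);     /* hypothetical helper */
    if (nr < 0)
            return nr;                          /* real errors still fail */

    resp->nr = nr;                              /* nr == 0: no work to do */
    return 0;                                   /* was: returned -ENOENT */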
Signed-off-by: Zach Brown <zab@versity.com>
The server had a few lower level seqcounts that it used to protect
state. One user got it wrong by forgetting to disable preemption
around writers. Debug kernels warned when write_seqcount_begin() was
called without preemption disabled.
We fix that user and make it easier to get right in the future by having
one higher level seqlock and using that consistently for seq read
begin/retry and write lock/unlock patterns.
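The pattern looks roughly like this; the lock and field names are
illustrative rather than the server's actual state:

    static DEFINE_SEQLOCK(stat_seqlock);    /* one higher level seqlock */
    static u64 stat_value;

    static u64 read_stat(void)
    {
            unsigned int seq;
            u64 val;

            do {
                    seq = read_seqbegin(&stat_seqlock);
                    val = stat_value;
            } while (read_seqretry(&stat_seqlock, seq));

            return val;
    }

    static void write_stat(u64 val)
    {
            /* the embedded spinlock serializes writers and takes care of
             * preemption, unlike a bare write_seqcount_begin() */
            write_seqlock(&stat_seqlock);
            stat_value = val;
            write_sequnlock(&stat_seqlock);
    }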
Signed-off-by: Zach Brown <zab@versity.com>
The rpmbuild support files no longer define the previously used kernel
module macros. This carves out the differences between el7 and el8 with
conditionals based on the distro we are building for.
Signed-off-by: Ben McClelland <ben.mcclelland@versity.com>
In rhel7 this is a nested struct with ktime_t. However, in rhel8
ktime_t is a simple s64, and not a union, and thus we can't do
this as easily. Just memset it.
Signed-off-by: Auke Kok <auke.kok@versity.com>
New kernels expect to do a partial match when a .prefix is used here,
and provide a .name member in case matching should look at the whole
string. This is what we want.
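For illustration, with a made-up handler (the actual prefix string in
the code may differ), the two matching modes look like:

    /* .prefix asks the VFS for a partial (prefix) match, which is what
     * we want; .name would require an exact whole-string match */
    static const struct xattr_handler example_xattr_handler = {
            .prefix = "scoutfs.",   /* hypothetical prefix */
            /* .get and .set callbacks omitted from this sketch */
    };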
Signed-off-by: Auke Kok <auke.kok@versity.com>
The caller takes care of caching for us. If we also cache, it
interferes with the memory management of cached ACLs and breaks.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The aio_read and aio_write callbacks are no longer used by newer
kernels, which now use iter based readers and writers.
We can avoid implementing plain .read and .write as an iter will
be generated for us automatically when needed.
We add a new data_wait_check_iter() function accordingly.
With these methods removed from the kernel, the el8 kernel no
longer uses the extended ops wrapper struct and is much closer now
to upstream. As a result, a lot of methods move between
inode_dir_operations and inode_file_operations etc, and perhaps
things will look a bit more structured.
We also need a slightly different data_wait_check() that
accounts for the iter and offset properly.
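A rough sketch of the resulting shape, with hypothetical scoutfs_*
names; only the use of the iter methods and the absence of plain
.read/.write reflect the actual change:

    static ssize_t scoutfs_file_read_iter(struct kiocb *iocb,
                                          struct iov_iter *to)
    {
            /* hypothetical call into the new data_wait_check_iter() path */
            int ret = scoutfs_data_wait_check_iter(iocb->ki_filp,
                                                   iocb->ki_pos, to);
            if (ret < 0)
                    return ret;

            return generic_file_read_iter(iocb, to);
    }

    static const struct file_operations scoutfs_file_fops = {
            .read_iter  = scoutfs_file_read_iter,
            /* .write_iter is analogous; no .read/.write needed, the VFS
             * builds an iter for callers that still use them */
    };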
Signed-off-by: Auke Kok <auke.kok@versity.com>
.readpages is obsolete in el8 kernels. We implement the .readahead
method instead which is passed a struct readahead_control. We use
the readahead_page(rac) accessor to retrieve pages one by one from the
control struct.
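The conversion follows the usual pattern; the per-page helper here is
hypothetical and stands in for the existing read path, which is
expected to unlock each page:

    static void scoutfs_readahead(struct readahead_control *rac)
    {
            struct page *page;

            /* readahead_page() hands back each page locked and with a
             * reference that we drop once the read path has taken over */
            while ((page = readahead_page(rac))) {
                    scoutfs_readpage(rac->file, page);      /* hypothetical */
                    put_page(page);
            }
    }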
Signed-off-by: Auke Kok <auke.kok@versity.com>
v4.9-12228-g530e9b76ae8f drops all (un)register_(hot)cpu_notifier()
API functions. From here on we need to use the new cpuhp_* API.
We avoid this entirely for now, at the cost of leaking pages until
the filesystem is unmounted.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Convert the timeout struct into a u64 nsecs value before passing it to
the trace point event, so as not to overflow the 64bit limitation on args.
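Something along these lines, assuming the timeout is a struct
timespec64; the helper and event names are made up:

    /* hypothetical: flatten the timeout before handing it to tracing */
    static void trace_timeout(struct timespec64 *timeout)
    {
            u64 timeout_ns = timespec64_to_ns(timeout);

            trace_example_lock_wait(timeout_ns);    /* fits in one u64 arg */
    }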
Signed-off-by: Auke Kok <auke.kok@versity.com>
v4.16-rc1-1-g9b2c45d479d0
This interface now returns (sizeof (addr)) on success, instead of 0.
Therefore, we have to change the error condition detection.
The compat for older kernels handles the addrlen check internally.
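The adjusted check looks roughly like this; the surrounding socket code
is only for illustration:

    static int example_get_name(struct socket *sock, struct sockaddr_in *sin)
    {
            int ret;

            /* now returns the address length on success, negative on error;
             * previously it returned 0 and filled in a separate addrlen */
            ret = kernel_getsockname(sock, (struct sockaddr *)sin);
            if (ret < 0)
                    return ret;

            return 0;
    }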
Signed-off-by: Auke Kok <auke.kok@versity.com>
MS_* flags from <linux/mount.h> should not be used in the kernel
anymore from 4.x onwards. Instead, we need to use the SB_* versions.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Move to the more recent interfaces for counting and scanning cached
objects to shrink.
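In outline, with hypothetical cache-walking helpers:

    static unsigned long example_count_objects(struct shrinker *shrink,
                                               struct shrink_control *sc)
    {
            return example_nr_cached_items();       /* hypothetical */
    }

    static unsigned long example_scan_objects(struct shrinker *shrink,
                                              struct shrink_control *sc)
    {
            /* returns how many items were actually freed */
            return example_free_cached_items(sc->nr_to_scan);   /* hypothetical */
    }

    static struct shrinker example_shrinker = {
            .count_objects = example_count_objects,
            .scan_objects  = example_scan_objects,
            .seeks         = DEFAULT_SEEKS,
    };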
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
Move towards modern bio interfaces, while unfortunately carrying along a
bunch of compat functions that let us still work with the old
incompatible interfaces.
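The modern submission path that the compat wrappers translate to looks
roughly like this; the completion handler and surrounding variables are
illustrative:

    bio = bio_alloc(GFP_NOFS, nr_pages);
    bio_set_dev(bio, bdev);
    bio->bi_iter.bi_sector = sector;
    bio->bi_opf = REQ_OP_READ;
    bio->bi_end_io = example_end_io;    /* hypothetical completion handler */
    bio_add_page(bio, page, PAGE_SIZE, 0);
    submit_bio(bio);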
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
memalloc_nofs_save() was introduced as preferable to trying to use GFP
flags to indicate that a task should not recurse during reclaim. We use
it instead of the _noio_ variant we were using before.
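The pattern is scope-based rather than per-allocation:

    unsigned int nofs_flags;

    /* from <linux/sched/mm.h>: allocations in this scope implicitly
     * behave as GFP_NOFS, so reclaim won't recurse into the filesystem */
    nofs_flags = memalloc_nofs_save();
    /* ... allocate and dirty whatever is needed ... */
    memalloc_nofs_restore(nofs_flags);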
Signed-off-by: Zach Brown <zab@versity.com>
__percpu_counter_add was renamed to percpu_counter_add_batch to make it
clear that the __ doesn't mean it's less safe, as it does in other calls
in the API, but just that it takes an additional batch parameter.
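The call itself only changes name, e.g. with a made-up counter and
batch size:

    /* on older kernels this is still __percpu_counter_add() */
    percpu_counter_add_batch(&example_counter, 1, 32);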
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Auke Kok <auke.kok@versity.com>
There are new interfaces available but the old one has been retained
for us to use. On older kernels we need to fall back to the previous
names of these functions.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Provide a fallback in degraded mode for kernels pre-v4.15-rc3 by directly
manipulating the member as needed.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Since v4.6-rc3-27-g9902af79c01a, inode->i_mutex has been replaced
with ->i_rwsem. However, inode_lock() and related functions have long
worked as intended and provide fully exclusive locking of the inode.
To avoid a name clash on pre-rhel8 kernels, we have to rename a
stack variable in `src/file.c`.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Since v4.15-rc3-4-gae5e165d855d, <linux/iversion.h> contains a new
inode->i_version API and it is not included by default.
Signed-off-by: Auke Kok <auke.kok@versity.com>
The new variant of the code that recomputes the augmented value
is designed to handle non-scalar types, and to facilitate that it
has new semantics for the _compute callback. The callback is now
passed a boolean flag `exit` indicating that, if the value isn't
changed, it should exit and halt propagation. It now returns whether
propagation should stop, rather than the newly computed value, and it
updates the computed value directly in the node.
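Under the new semantics a compute callback looks roughly like this; the
node type and its augmented 'total' field are hypothetical:

    static bool example_compute_total(struct example_node *node, bool exit)
    {
            u64 total = node->value;

            if (node->rb.rb_left)
                    total += rb_entry(node->rb.rb_left,
                                      struct example_node, rb)->total;
            if (node->rb.rb_right)
                    total += rb_entry(node->rb.rb_right,
                                      struct example_node, rb)->total;

            if (exit && node->total == total)
                    return true;    /* unchanged: halt propagation */

            node->total = total;    /* update the node directly */
            return false;
    }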
Signed-off-by: Auke Kok <auke.kok@versity.com>
Fixes: Error: implicit declaration of function ‘blkdev_put’
Previously this was an `extern` in <fs.h> and was picked up implicitly,
hence the need to include its header explicitly now.
Signed-off-by: Auke Kok <auke.kok@versity.com>
v4.1-rc4-22-g92cf211874e9 merges this into preempt.h, and on
rhel7 kernels we don't need this include anymore either.
Signed-off-by: Auke Kok <auke.kok@versity.com>
v3.15-rc1-6-g1a56f2aa4752 removes flush_work_sync entirely, but
ever since v3.6-rc1-25-g606a5020b9bd which made all workqueues
non-reentrant, it has been equivalent to flush_work.
This is safe because in all cases only one server->work can be
in flight at a time.
Signed-off-by: Auke Kok <auke.kok@versity.com>
v3.18-rc3-2-g230fa253df63 forces us to replace ACCESS_ONCE() with
READ_ONCE(), which is probably the better interface anyway and works
with non-scalar types.
Signed-off-by: Auke Kok <auke.kok@versity.com>
PAGE_CACHE_SIZE was previously defined to be equivalent to PAGE_SIZE.
This symbol was removed in v4.6-rc1-32-g1fa64f198b9f.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Because we `-include src/kernelcompat.h` from the command line,
this header gets included before any of the kernel includes in
most .c and .h files. We should at least make sure we pull in
<linux/fs.h> and <linux/kernel.h> since they're required.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Server code that wants to dirty blocks by holding a commit won't be
allowed to until the current allocators for the server transaction have
enough space for the holder. As an active holder applies the commit the
allocators are refilled and the waiting holders will proceed.
But the current allocators can have no resources as the server starts
up. There will never be active holders to apply the commit and refill
the allocators. In this case all the holders will block indefinitely.
The fix is to trigger a server commit when a holder doesn't have room.
It used to be that commits were only triggered when apply callers were
waiting. We transfer some of that logic into a new 'committing' field
so that we can have commits in flight without apply callers waiting. We
add it to the server commit tracing.
While we're at it we clean up the logic that tests if a hold can
proceed. It used to be confusingly split across two functions that both
could sample the current allocator space remaining. This could lead to
weird cases where the first holder could use the second alloc remaining
call, not the one whose values were tested to see if the holder could
fit. Now each hold check only samples the allocators once.
And finally we fix a subtle case where the budget exceeded message could
spuriously trigger when dirtying the freed list created a new empty
block after the holder recorded the amount of space in the freed block.
Signed-off-by: Zach Brown <zab@versity.com>
Data preallocation attempts to allocate large aligned regions of
extents. It tried to fill the hole around a write offset that
didn't contain an extent. It missed the case where there can be
multiple extents between the start of the region and the hole.
It could try to overwrite these additional existing extents and writes
could return EINVAL.
We fix this by trimming the preallocation to start at the write offset
if there are any extents in the region before the write offset. The
data preallocation test output has to be updated now that allocation
extents won't grow towards the start of the region when there are
existing extents.
Signed-off-by: Zach Brown <zab@versity.com>
Log merge completions were spliced in one server commit. It's possible
to get enough completion work pending that it all can't be completed in
one server commit. Operations fail with ENOSPC and because these
changes can't be unwound cleanly the server asserts.
This allows the completion splicing to break the work up into multiple
commits.
Processing completions in multiple commits means that request creation
can observe the merge status in states that weren't possible before.
Splicing is careful to maintain an elevated nr_complete count while the
client can't get requests because the tree is rebalancing.
Signed-off-by: Zach Brown <zab@versity.com>
The move_blocks ioctl finds extents to move in the source file by
searching from the starting block offset of the region to move.
Logically, this is fine. After each extent item is deleted the next
search will find the next extent.
The problem is that deleted items still exist in the item cache. The
next iteration has to skip over all the deleted extents from the start
of the region. This is fine with large extents, but with heavily
fragmented extents this creates a huge amplification of the number of
items to traverse when moving the fragmented extents in a large file.
(It's not quite O(n^2)/2 over the total extents, since deleted items
are purged as we write out the dirty items in each transaction, but
it's still immense.)
The fix is to simply start searching for the next extent after the one
we just moved.
Signed-off-by: Zach Brown <zab@versity.com>
If the _contig_only option isn't set then we try to preallocate aligned
regions of files. The initial implementation naively only allowed one
preallocation attempt in each aligned region. If it got a small
allocation that didn't fill the region then every future allocation
in the region would be a single block.
This changes every preallocation in the region to attempt to fill the
hole in the region that iblock fell in. It uses an extra extent search
(item cache search) to try and avoid thousands of single block
allocations.
Signed-off-by: Zach Brown <zab@versity.com>