scoutfs_data_wait_check_iter() was checking the contiguous region of the
file starting at its pos and extending for iov_iter_count() bytes. The
caller can do that with the previous _data_wait_check() method by
providing the same count that _check_iter() was using.
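Roughly, on the caller side (a sketch; the surrounding variables and
the exact _data_wait_check() signature are assumptions):

    /* was: scoutfs_data_wait_check_iter(inode, pos, iter, sef, op, lock) */
    ret = scoutfs_data_wait_check(inode, pos, iov_iter_count(iter),
                                  sef, op, lock);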
Signed-off-by: Zach Brown <zab@versity.com>
The aio_read and aio_write callbacks are no longer used by newer
kernels, which now use iter-based readers and writers.
We can avoid implementing plain .read and .write as an iter will
be generated when needed for us automatically.
We add a new data_wait_check_iter() function accordingly.
With these methods removed from the kernel, the el8 kernel no
longer uses the extended ops wrapper struct and is now much closer
to upstream. A lot of methods move between inode_dir_operations and
inode_file_operations and the like, so things should look a bit more
structured as a result.
We also need a slightly different data_wait_check() that accounts
for the iter and offset properly.
Signed-off-by: Auke Kok <auke.kok@versity.com>
When we truncate away from a partial block we need to zero its tail that
was past i_size and dirty it so that it's written.
We missed the typical vfs boilerplate of calling block_truncate_page
from setattr->set_size that does this. We need to be a little careful
to pass our file lock down to get_block and then queue the inode for
writeback so it's written out with the transaction. This follows the
pattern in .write_end.
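The shape of the fix, sketched with hypothetical scoutfs helper
names; block_truncate_page() is the stock vfs helper and the file
lock travels down to get_block through the usual per-task plumbing:

    static int truncate_tail_block(struct inode *inode, loff_t new_size)
    {
            int ret;

            /* zero the now-partial tail block past i_size and dirty it */
            ret = block_truncate_page(inode->i_mapping, new_size,
                                      scoutfs_get_block);
            if (ret == 0) {
                    /* written out with the transaction, as .write_end does */
                    scoutfs_inode_queue_writeback(inode);
            }
            return ret;
    }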
Signed-off-by: Zach Brown <zab@versity.com>
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free. This adds support for
returning ENOSPC to client posix allocators as free space gets low.
For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space. The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks. In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing). When an allocating
transaction runs low and the server's low flag is set, we return
ENOSPC.
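The check at transaction entry looks roughly like this (a sketch
with assumed helper names; the real plumbing threads the flag
through all the holders):

    int scoutfs_hold_trans(struct super_block *sb, bool allocating)
    {
            /* hard ENOSPC only for holders allocating new space */
            if (allocating && server_low_flag_set(sb) &&
                trans_alloc_running_low(sb))
                    return -ENOSPC;

            /* ... otherwise wait to hold the transaction as usual ... */
            return 0;
    }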
Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.
For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.
The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the
meta_avail and meta_free allocators.
We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when ENOSPC is
going to be returned for metadata allocations.
We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.
And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.
Signed-off-by: Zach Brown <zab@versity.com>
Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.
Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.
RH-compat: tmpfile support is actually backported by RH into their
3.10 kernel. We need some of their kabi-maintaining wrappers to hook
into it: use a struct inode_operations_wrapper instead of the base
struct inode_operations, and set the S_IOPS_WRAPPER flag in i_flags.
This lets RH's modified vfs_tmpfile() find our tmpfile fn pointer.
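Roughly (a sketch; the wrapper struct and flag are RH's, the field
and function names on our side are assumptions):

    static const struct inode_operations_wrapper scoutfs_dir_iops = {
            .ops = {
                    .lookup = scoutfs_lookup,
                    /* ... the rest of the dir ops ... */
            },
            .tmpfile = scoutfs_tmpfile,
    };

    static void scoutfs_set_dir_ops(struct inode *inode)
    {
            inode->i_op = &scoutfs_dir_iops.ops;
            /* tells RH's vfs_tmpfile() that the wrapper is present */
            inode->i_flags |= S_IOPS_WRAPPER;
    }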
Add a test that covers both creating tmpfiles and moving their
contents into a destination file via MOVE_BLOCKS.
xfstests common/004 now runs because tmpfile is supported.
Signed-off-by: Andy Grover <agrover@versity.com>
Add a relatively constrained ioctl that moves extents between regular
files. This is intended to be used by tasks which combine many existing
files into a much larger file without reading and writing all the file
contents.
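From userspace this looks something like the following (hypothetical
sketch; the struct layout is an assumption, the real ABI lives in
the scoutfs ioctl header):

    static int move_into(int to_fd, int from_fd, __u64 from_off,
                         __u64 len, __u64 to_off)
    {
            struct scoutfs_ioctl_move_blocks mb = {
                    .from_fd  = from_fd,    /* layout assumed */
                    .from_off = from_off,
                    .len      = len,
                    .to_off   = to_off,
            };

            return ioctl(to_fd, SCOUTFS_IOC_MOVE_BLOCKS, &mb);
    }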
Signed-off-by: Zach Brown <zab@versity.com>
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly. That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.
By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.
Most of this change is churn from changing allocator function and struct
names.
File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity. All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions. This now means
that fallocate and especially restoring offline extents can use larger
extents. Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing. The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks. This resulted in a lot of bugs. Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction. We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.
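The double buffering amounts to something like this (assumed shape;
the real structs carry more state):

    struct server_alloc_roots {
            struct scoutfs_alloc_root avail[2];
            struct scoutfs_alloc_root freed[2];
            int active;     /* pair modified by the current transaction */
    };

    static void swap_alloc_roots(struct server_alloc_roots *r)
    {
            /*
             * At commit the active pair becomes the stable pair that
             * the next transaction fills from and drains into.
             */
            r->active ^= 1;
    }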
The server now only moves free extents into client allocators when they
fall below a low threshold. This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for reporting errors to data waiters via a new
SCOUTFS_IOC_DATA_WAIT_ERR ioctl. This allows waiters to return an error
to readers when staging fails.
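Userspace use looks roughly like this (hypothetical sketch; the
field names are assumptions):

    struct scoutfs_ioctl_data_wait_err dwe = {
            .ino    = ino,          /* from the waiting list */
            .offset = off,
            .count  = count,
            .op     = op,
            .err    = EIO,          /* handed to the blocked waiters */
    };

    ret = ioctl(fd, SCOUTFS_IOC_DATA_WAIT_ERR, &dwe);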
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
[zab: renamed to data_wait_err, took ino arg]
Signed-off-by: Zach Brown <zab@versity.com>
File data allocations come from radix allocators which are populated by
the server before each client transaction. It's possible to fully
consume the data allocator within one transaction if the number of dirty
metadata blocks is kept low. This could result in premature ENOSPC.
This was happening to the archive-light-cycle test. If the transactions
performed by previous tests lined up just right then the creation of the
initial test files could see ENOSPC and cause all sorts of nonsense in
the rest of the test, culminating in cmp commands stuck in offline
waits.
This introduces high and low data allocator water marks for
transactions. The server tries to fill data allocators for each
transaction to the high water mark and the client forces the commit of a
transaction if its data allocator falls below the low water mark.
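The client side of the check is roughly (assumed helper and water
mark names; the counter name is the real one from the test results
below):

    if (data_alloc_free_blocks(sb) < SCOUTFS_DATA_ALLOC_LO_WATER) {
            scoutfs_inc_counter(sb, trans_commit_data_alloc_low);
            scoutfs_force_trans_commit(sb);
    }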
The archive-light-cycle test now passes easily and we see the
trans_commit_data_alloc_low counter increasing during the test.
Signed-off-by: Zach Brown <zab@versity.com>
The identifier for data.h's include guard was brought over from an old
file and still had the old name. Update it to reflect its use in data,
not filerw.
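That is (exact identifier assumed):

    #ifndef _SCOUTFS_DATA_H_
    #define _SCOUTFS_DATA_H_

    /* ... declarations ... */

    #endif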
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have the allocators that use radix blocks we can remove all
the code that was using btree items to store free block bitmaps.
Signed-off-by: Zach Brown <zab@versity.com>
Convert metadata block and file data extent allocations to use the radix
allocator.
Most of this is simple transitions between types and calls. The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix. We remove the code and fields that were responsible for adding
uninitialized data and metadata.
The rest of the unused block allocator code is only ifdefed out. It'll
be removed in a separate patch to reduce noise here.
Signed-off-by: Zach Brown <zab@versity.com>
The btree forest item storage doesn't have as much item granular state
as the item cache did. The item cache could tell if a cached item was
populated from persistent storage or was created in memory. It could
simply remove created items rather than leaving behind a deletion item.
The btree forest item storage mechanism can't do this with its
cached btree blocks. It has to create deletion items when deleting
newly created items because it doesn't know whether the item already
exists in the persistent record.
This created a problem with the extent storage we were using. The
individual extent items were stored with a key set to the last logical
block of their extent. As extents grew or shrank they often were
deleted and created at different key values during a transaction. In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent. Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.
Streaming writes became O(n) for every extent operation, which got
out of hand. This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.
For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.
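The round-trip looks roughly like this (all names assumed):

    static int modify_region_extents(struct packed_extent_item *item)
    {
            struct native_extent_map map;
            int ret;

            load_packed_extents(item, &map);   /* persistent -> memory */
            ret = apply_extent_changes(&map);  /* split, merge, flag */
            if (ret == 0)                      /* memory -> persistent */
                    store_packed_extents(&map, item);
            return ret;
    }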
Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items. The client is no longer working with free extent
items managed by the forest; it's working with free block bitmap btrees
directly. It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.
Previously the client and server would exchange extents with network
messages. Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction. The client doesn't need to send free extents back to
the server so we can remove those tasks and RPCs.
The server no longer has to manage free extents. It transfers block
bitmap items between trees around commits. All of its extent
manipulation can be removed.
The item size portion of transaction item counts is removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed sized
segments.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can be used by userspace to restore a file to its
offline state. To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
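The waiting pattern, sketched with assumed names:

    for (;;) {
            ret = scoutfs_lock_inode(sb, mode, inode, &lock);
            if (ret)
                    break;
            if (!extents_offline(inode, iblock, len))
                    break;          /* online: proceed under the lock */
            scoutfs_unlock_inode(sb, inode, lock);
            /* woken whenever the contents could have changed */
            ret = wait_for_data_change(inode, iblock, len);
            if (ret)
                    break;
    }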
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl; it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
Signed-off-by: Zach Brown <zab@versity.com>
Add an fallocate operation.
This changes the possible combinations of flags in extents and makes it
possible to create extents beyond i_size. This will confuse the rest of
the code in a few places and that will be fixed up next.
Signed-off-by: Zach Brown <zab@versity.com>
Store file data mappings and free block ranges in extents instead of in
block mapping items and bitmaps.
This adds the new functionality and refactors the functions that use it.
The old functions are no longer called, and to keep the change small
we only ifdef them out. We'll remove all the dead code in a future
change.
Signed-off-by: Zach Brown <zab@versity.com>
Add cluster lock coverage to scoutfs_data_truncate_items() and plumb the
lock down into the item functions.
Signed-off-by: Zach Brown <zab@versity.com>
Move to static mapping items instead of unbounded extents.
We get more predictable data structures and simpler code but still get
reasonably dense metadata.
We no longer need all the extent code for splitting and merging
extents, testing for overlaps, and all that. The functions that use
the mappings
(get_block, fiemap, truncate) now have a pattern where they decode the
mapping item into an allocated native representation, do their work, and
encode the result back into the dense item.
We do have to grow the largest possible item value to fit the worst case
encoding expansion of random block numbers.
The local allocators are no longer two extents but are instead simple
bitmaps: one for full segments and one for individual blocks. There are
helper functions to free and allocate segments and blocks, with careful
coordination of, for example, freeing a segment once all of its
constituent blocks are free.
_fiemap is refactored a bit to make it more clear what's going on.
There's one function that either merges the next bit with the currently
building extent or fills the current and starts recording from a
non-mergeable additional block. The old loop worked this way but was
implemented with a single squirrelly iteration over the extents.
That's no longer feasible now that we're also iterating over blocks
inside the
mapping items. It's a lot clearer to call out to merge or fill the
fiemap entry.
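The helper looks something like this (assumed names; block-to-byte
conversion for fiemap_fill_next_extent() is elided):

    static int merge_or_fill(struct fiemap_extent_info *fieinfo,
                             struct fiemap_state *st,
                             u64 logical, u64 phys, u32 flags)
    {
            int ret;

            /* contiguous, compatible blocks extend the current extent */
            if (st->len && logical == st->logical + st->len &&
                phys == st->phys + st->len && flags == st->flags) {
                    st->len++;
                    return 0;
            }

            /* otherwise fill the finished extent and start a new one */
            if (st->len) {
                    ret = fiemap_fill_next_extent(fieinfo, st->logical,
                                                  st->phys, st->len,
                                                  st->flags);
                    if (ret)
                            return ret;
            }

            st->logical = logical;
            st->phys = phys;
            st->len = 1;
            st->flags = flags;
            return 0;
    }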
The dirty item reservation counts for using the mappings are reduced
significantly because each modification no longer has to assume that it
might merge with two adjacent contiguous neighbours.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have basic file extents we can add a flag to extents to
track offline extents. We have to initialize and test the flags as we
work with extents. Truncation can be told to leave removed extents
around with no block mapping and the offline bit set. Only staging with
the correct data version can write to the offline regions. Demand
staging isn't implemented yet. Reads from offline extents are treated
like sparse regions.
Truncation is a straightforward iteration over the portions of existing
extents which overlap with the truncated blocks.
Writing to offline extents has to first remove the existing offline
extent before then adding the new allocated extents. The 'changes'
mechanism relied on being able to search the current items to find the
changes that should be made before making any changes. This doesn't
work for finding merge candidates for the new allocated insertion
because the old offline extent change won't have been applied yet. We
replace the change mechanism with straightforward item modification and
unwinding.
The generic block fiemap can't communicate offline extents and iterates
over blocks instead of extents. We add our own fiemap that iterates
over extents and sets the 'UNKNOWN' flag on offline extents.
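The flag mapping is the simple part (our extent flag name is
assumed; FIEMAP_EXTENT_UNKNOWN is the stock vfs flag):

    u32 flags = 0;

    if (extent->flags & SEF_OFFLINE)    /* flag name assumed */
            flags |= FIEMAP_EXTENT_UNKNOWN;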
Signed-off-by: Zach Brown <zab@versity.com>
Our first attempt at storing file data put it in items. This was easy
to implement but won't be acceptable in the long term. The cost of the
power of LSM indexing is compaction overhead. That's acceptable for
fine grained metadata but is totally unacceptable for bulk file data.
This switches to storing file data in separate block allocations which
are referenced by extent items.
The bulk of the change is the mechanics of working with extents. We
have high level callers which add or remove logical extents and then
underlying mechanisms that insert, merge, or split the items that
the extents are stored in.
We have three types of extent items. The primary type maps logical file
regions to physical block extents. The next two store free extents
per-node so that clients don't create lock and LSM contention as they
try to allocate extents.
To fill those per-node free extents we add messages that communicate free
extents in the form of lists of segment allocations from the server.
We don't do any fancy multi-block allocation yet. We only allocate
blocks in get_blocks as writes find unmapped blocks. We do use some
per-task cursors to cache block allocation positions so that these
single block allocations are very likely to merge into larger extents as
tasks stream writes.
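The cursor idea, sketched with assumed names:

    static u64 cursor_alloc_block(struct task_cursor *curs)
    {
            u64 blkno;

            if (curs->blkno && block_is_free(curs->blkno))
                    blkno = curs->blkno;    /* keep streaming forward */
            else
                    blkno = alloc_free_block();

            curs->blkno = blkno + 1;        /* next write lands here */
            return blkno;
    }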
This is just the first chunk of the extent work that's coming. A later
patch adds offline flags and fixes up the change nonsense that seemed
like a good idea here.
The final moving part is that we initiate writeback on all newly
allocated extents before we commit the metadata that references the new
blocks. We do this with our own dirty inode tracking because the high
level vfs methods are unusably slow in some upstream kernels (they walk
all inodes, not just dirty inodes).
Signed-off-by: Zach Brown <zab@versity.com>
Add basic file data support by managing file data items from the page
cache address space callbacks.
Data is read by copying from cached items into page contents in
readpage.
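The read side amounts to something like this (the item copy helper
is an assumed name; the page calls are the stock ones):

    static int scoutfs_readpage(struct file *file, struct page *page)
    {
            int ret;

            /* copy cached item values into the page, zeroing holes */
            ret = copy_data_items_to_page(page->mapping->host, page);
            if (ret == 0)
                    SetPageUptodate(page);
            unlock_page(page);
            return ret;
    }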
Writes create new ephemeral items which reference dirty pages. The
items are deleted once they're written in a transaction or if
invalidatepage removes the dirty page they reference.
There's a lot more to do to remove data copies, avoid compaction bw
overhead, and add support for truncate, o_direct, and mmap.
Signed-off-by: Zach Brown <zab@versity.com>