We had gotten a bit sloppy with the workqueue flags. We needed _UNBOUND
in some workqueues where we wanted concurrency by scheduling across cpus
instead of waiting for the current (very long running) work on a cpu to
finish. We add NON_REENTRANT out of an abundance of caution. It has
gone away in modern kernels and is probably not needed here, but
according to the docs we would want it so we at least document that fact
by using it.
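As a sketch of the difference (the workqueue name and max_active value
here are placeholders, not the names used in the tree):

#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;

static int example_create_wq(void)
{
        /*
         * WQ_UNBOUND lets queued work run on any CPU instead of waiting
         * behind very long running work on the queueing CPU.
         * WQ_NON_REENTRANT still existed in kernels of this era; it was
         * later removed when its behavior became the default.
         */
        example_wq = alloc_workqueue("scoutfs_example",
                                     WQ_UNBOUND | WQ_NON_REENTRANT, 0);
        return example_wq ? 0 : -ENOMEM;
}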
Signed-off-by: Zach Brown <zab@versity.com>
Freed file data extents are tracked in free extent items in each node.
They could only be re-used in the future for file data extent allocation
on that node. Allocations on other nodes or, critically, segment
allocation on the server could never see those free extents. With the
right allocation patterns, particularly allocating on node X and freeing
on node Y, all the free extents can build up on a node and starve other
allocations.
This adds a simple high water mark after which nodes start returning
free extents to the server. From there they can satisfy segment
allocations or be sent to other nodes for file data extent allocation.
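The check itself is tiny; the name and threshold below are made up for
illustration:

#include <linux/types.h>

#define EXAMPLE_FREE_BLOCKS_HI_WATER    (64 * 1024)     /* made-up mark */

/* how many locally tracked free blocks to return to the server */
static u64 example_blocks_over_hi_water(u64 node_free_blocks)
{
        if (node_free_blocks <= EXAMPLE_FREE_BLOCKS_HI_WATER)
                return 0;
        return node_free_blocks - EXAMPLE_FREE_BLOCKS_HI_WATER;
}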
Signed-off-by: Zach Brown <zab@versity.com>
Our simple transaction machinery causes high commit latencies if we let
too much dirty file data accumulate.
Small files have a natural limit on the amount of dirty data because
they have more dirty items per dirty page. They fill up the single
segment sooner and kick off a commit which finds a relatively small
amount of dirty file data.
But large files can reference quite a lot of dirty data with a small
amount of extent items which don't fill up the transaction's segment.
During large streaming writes we can fill up memory with dirty file data
before filling a segment with mapping extent metadata. This can lead to
high commit latencies when memory is full of dirty file pages.
Regularly kicking off background writeback behind streaming write
positions reduces the amount of dirty data that commits will find and
have to write out.
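Roughly the shape of it, where the window size and helper are made up
and only filemap_fdatawrite_range() is the real kernel call:

#include <linux/fs.h>

#define EXAMPLE_WRITEBACK_WINDOW        (8 * 1024 * 1024)       /* made up */

/* start writeback for dirty pages well behind a streaming writer */
static void example_kick_writeback(struct inode *inode, loff_t write_pos)
{
        if (write_pos <= EXAMPLE_WRITEBACK_WINDOW)
                return;

        /* submits the dirty pages in the range, doesn't wait for the IO */
        filemap_fdatawrite_range(inode->i_mapping, 0,
                                 write_pos - EXAMPLE_WRITEBACK_WINDOW);
}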
Signed-off-by: Zach Brown <zab@versity.com>
The previous fallocate and get_block allocators only looked for free
extents larger than the requested allocation size. This prematurely
returns -ENOSPC if a very large allocation is attempted. Some xfstests
stress low free space situations by fallocating almost all the free
space in the volume.
This adds an allocation helper function that finds the biggest free
extent to satisfy an allocation, possibly after trying to get more free
extents from the server. It looks for previous extents in the index of
extents by length. This builds on the previously added item and extent
_prev operations.
Allocators then need to know the size of the allocation they got instead
of assuming they got what they asked for. The server can also return a
smaller extent so it needs to communicate the extent length, not just
its start.
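Roughly the shape of the helper, with invented stand-ins for the real
item, extent, and server calls:

#include <linux/kernel.h>

struct example_free_extent {
        u64 start;
        u64 len;
};

/* invented stand-ins for the real by-length index and server requests */
static int example_free_extent_prev_by_len(struct super_block *sb, u64 len,
                                           struct example_free_extent *ext);
static int example_request_extents_from_server(struct super_block *sb,
                                               u64 count);

static int example_alloc_largest(struct super_block *sb, u64 count,
                                 u64 *start_ret, u64 *len_ret)
{
        struct example_free_extent ext;
        int ret;

        /* _prev from the largest possible length key lands on the
         * biggest free extent we have locally */
        ret = example_free_extent_prev_by_len(sb, U64_MAX, &ext);
        if (ret == -ENOENT || (ret == 0 && ext.len < count)) {
                /* try to get more free extents, then look again */
                ret = example_request_extents_from_server(sb, count);
                if (ret == 0)
                        ret = example_free_extent_prev_by_len(sb, U64_MAX,
                                                              &ext);
        }
        if (ret)
                return ret;

        /* the caller has to use the length we actually found */
        *start_ret = ext.start;
        *len_ret = min(ext.len, count);
        return 0;
}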
Signed-off-by: Zach Brown <zab@versity.com>
Add an extent function for iterating backwards through extents. We add
the wrapper and have the extent IO functions call their storage _prev
functions. Data extent IO can now call the new scoutfs_item_prev().
Signed-off-by: Zach Brown <zab@versity.com>
The addition of fallocate() now means that offline extents can be
unwritten and allocated and that extents can now be found outside of
i_size.
Truncating needs to know about the possible flag combinations, writing
preallocation needs to know to update an existing extent or allocate up
to the next extent, get_block can't map unwritten extents for read,
extent conversion needs to also clear offline, and truncate needs to
drop extents outside i_size even if truncating to the existing file
size.
Signed-off-by: Zach Brown <zab@versity.com>
Add an fallocate operation.
This changes the possible combinations of flags in extents and makes it
possible to create extents beyond i_size. This will confuse the rest of
the code in a few places and that will be fixed up next.
Signed-off-by: Zach Brown <zab@versity.com>
The release ioctl forgot to update the inode item after truncating
online block mappings. This meant that the offline block count update
was lost when the inode was evicted and re-read, leading to inconsistent
offline block counts.
Signed-off-by: Zach Brown <zab@versity.com>
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata. The more reasonable
strategy is to warn about the corruption, act accordingly, and leave
it to corrective measures to resolve it. In this case we
continue returning the error that caused us to try and clean up.
Signed-off-by: Zach Brown <zab@versity.com>
This is no longer used now that we allocate large extents for
concurrently extending files by preallocating unwritten extents.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have extents we can address the fragmentation of concurrent
writes with large preallocated unwritten extents instead of trying to
allocate from disjoint free space with cursors.
First we add support for unwritten extents. Truncate needs to make sure
it doesn't treat truncated unwritten blocks as online just because
they're not offline. If we try to write into them we convert them to
written extents. And fiemap needs to flag them as unwritten and be sure
to check for extents past i_size.
Then we allocate unwritten extents only if we're extending a contiguous
file. We try to preallocate the size of the file and cap it to a meg.
This ends up with a power of two progression of preallocation sizes,
which nicely balances extent sizes and wasted allocation as file sizes
increase.
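The sizing boils down to something like this, assuming 4KB blocks
(names and the constant are illustrative):

#include <linux/kernel.h>

#define EXAMPLE_PREALLOC_CAP_BLOCKS     256     /* a meg of 4KB blocks */

/* called when an extending write appends block 'iblock' to the file */
static u64 example_prealloc_blocks(u64 iblock)
{
        /* preallocate about the current file size, capped at a meg:
         * 1, 2, 4, ... 256 blocks as the file doubles */
        if (iblock == 0)
                return 1;
        return min_t(u64, iblock, EXAMPLE_PREALLOC_CAP_BLOCKS);
}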
We need to be careful to truncate the preallocated regions if the entire
file is released. We take that as an indication that the user doesn't
want the file consuming any more space.
This removes most of the use of the cursor code. It will be completely
removed in a further patch.
Signed-off-by: Zach Brown <zab@versity.com>
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.
We add a client request to allocate an extent of a given length. The
existing segment alloc and free now work with a segment's worth of
blocks.
The server maintains counters of free blocks in the super block instead
of counters of free segments. We maintain an allocation cursor so that
allocation
results tend to cycle through the device. It's stored in the super so
that it is maintained across server instances.
This doesn't remove the now-unused dead code, to keep the commit from
getting too noisy. It'll be removed in a future commit.
Signed-off-by: Zach Brown <zab@versity.com>
Store file data mappings and free block ranges in extents instead of in
block mapping items and bitmaps.
This adds the new functionality and refactors the functions that use it.
The old functions are no longer called; for now we just ifdef them out
to keep the change small. We'll remove all the dead code in a future
change.
Signed-off-by: Zach Brown <zab@versity.com>
Variable length keys lead to having a key struct point to the buffer
that contains the key. With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.
We no longer have a separate generic key buf struct that points to
specific per-type key storage. All items use the key struct and fill
out the appropriate fields. All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.
Each key user now has an init function that fills out its fields. It
looks a lot like the old pattern but we no longer have separate key
storage that
the buf points to.
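For illustration only, with invented fields and constants that don't
match the real key layout, the pattern looks like:

#include <linux/types.h>
#include <linux/string.h>

/* a single fixed-size key struct, fields invented for illustration */
struct example_key {
        __u8    zone;
        __u8    type;
        __be64  first;
        __be64  second;
} __packed;

enum { EXAMPLE_FS_ZONE = 1, EXAMPLE_DIRENT_TYPE = 2 };

/* each key user fills out the same struct with its own init helper */
static void example_init_dirent_key(struct example_key *key, u64 ino, u64 hash)
{
        memset(key, 0, sizeof(*key));
        key->zone = EXAMPLE_FS_ZONE;
        key->type = EXAMPLE_DIRENT_TYPE;
        key->first = cpu_to_be64(ino);
        key->second = cpu_to_be64(hash);
}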
A bunch of code now takes the address of static key storage instead of
managing allocated keys. Conversely, swapping now uses the full keys
instead of pointers to the keys.
We don't need all the functions that worked on the generic key buf
struct because they had different lengths. Copy, clone, length init,
memcpy, all of that goes away.
The item API had some functions that tested the length of keys and
values. The key length tests vanish, and that gets rid of the _same()
call. The _same_min() call only had one user who didn't also test for
the value length being too large. Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.
We no longer have to track the number of key bytes when calculating if
an item population will fit in segments. This removes the key length
from reservations, transactions, and segment writing.
The item cache key querying ioctls no longer have to deal with variable
length keys. They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.
The segment no longer has to store the key length. It stores the key
struct in the item header.
The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct. The SK_ wrappers
that bracketed calls to use preempt-safe per-cpu buffers can turn back
into their normal calls.
Manifest entries are now a fixed size. We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq. They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap. This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.
Signed-off-by: Zach Brown <zab@versity.com>
Originally the item interfaces were written with full support for
vectored keys and values. Callers constructed keys and values made up
of header structs and data buffers. Segments supported much larger
values which could span pages when stored in memory.
But over time we've pulled that support back. Keys are described by a
key struct instead of a multi-element kvec. Values are now much smaller
and don't span pages. The item interfaces still use the kvec arrays but
everyone only uses a single element.
So let's make the world a whole lot less awful by having the item
interfaces only support a single value buffer specified by a kvec. A
bunch of code disappears and the result is much easier to understand.
Signed-off-by: Zach Brown <zab@versity.com>
Every caller of scoutfs_item_lookup_exact() provided a size that matches
the value buffer. Let's remove the redundant arg and use the value
buffer length as the exact size to match.
Signed-off-by: Zach Brown <zab@versity.com>
There were some mistakes in tracking offline blocks.
Online and offline block counts are meant to only refer to actual data
contents. Sparse blocks in an archived file shouldn't be counted as
offline.
But the code was marking unallocated blocks as offline. This could
corrupt the offline block count if a release extended past i_size and
marked the blocks in the mapping item as offline even though they're
past i_size.
We could have clamped the block walking to not go past i_size. But we
still would have had the problem of having offline blocks track sparse
blocks.
Instead we can fix the problem by only marking blocks offline if they
had allocated blocks. This means that sparse regions are never marked
offline and will always read zeros. Now a release that extends past
i_size will not do anything to the unallocated blocks in the mapping
item past i_size and the offline block count will be consistent.
(Also the 'modified' and 'dirty' booleans were redundant, we only need
one of the two.)
Signed-off-by: Zach Brown <zab@versity.com>
The super info's alloc_rwsem protects the local node free segment and
block bitmap items. The truncate code wasn't holding the rwsem, so
it could race with other local node allocator item users and corrupt the
bitmaps. In the best case this could corrupt structures in a way that
triggers EIO. The corrupt items could also create duplicate block allocations
that clobber each other and corrupt data.
Signed-off-by: Zach Brown <zab@versity.com>
We weren't doing anything with the inode blocks field. We weren't even
initializing it, which explains why we'd sometimes see garbage i_blocks
values in scoutfs inodes in segments.
The logical blocks field reflects the contents of the file regardless of
whether it's online or not. It's the sum of our online and offline block
tracking.
So we can initialize it to our persistent online and offline counts and
then keep it in sync as blocks are allocated and freed.
Signed-off-by: Zach Brown <zab@versity.com>
We had an excessive number of layers between scoutfs and the dlm code in
the kernel. We had dlmglue, the scoutfs locks, and task refs. Each
layer had structs that track the lifetime of the layer below it. We
were about to add another layer to hold on to locks just a bit longer so
that we can avoid down conversion and transaction commit storms under
contention.
This collapses all those layers into a simple state machine in lock.c that
manages the mode of dlm locks on behalf of the file system.
The users of the lock interface are mainly unchanged. We did change
from a heavier trylock to a lighter nonblock lock attempt and have to
change the single rare readpage use. Lock fields change so a few
external users of those fields change.
This not only removes a lot of code, it also contains functional
improvements. For example, it can now convert directly to CW locks with
a single lock request instead of having to use two by first converting
to NL.
It introduces the concept of an unlock grace period. Locks won't be
dropped on behalf of other nodes soon after being unlocked, so that tasks
have a chance to batch up work before the other node gets its turn.
This can result in two orders of magnitude improvements in the time it
takes to, say, change a set of xattrs on the same file population from
two nodes concurrently.
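At its core the grace period is a timestamp check; the field and
constant names below are illustrative:

#include <linux/jiffies.h>

#define EXAMPLE_GRACE_JIFFIES   msecs_to_jiffies(10)    /* made-up value */

struct example_lock {
        unsigned long unlocked_at;      /* jiffies at last unlock */
};

/* defer dropping the lock for another node while in the grace period */
static bool example_in_grace_period(struct example_lock *lck)
{
        return time_before(jiffies,
                           lck->unlocked_at + EXAMPLE_GRACE_JIFFIES);
}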
There are significant changes to trace points, counters, and debug files
that follow the implementation changes.
Signed-off-by: Zach Brown <zab@versity.com>
We aren't using the size index. It has runtime and code maintenance
costs that aren't worth paying. Let's remove it.
Removing it from the format and no longer maintaining it are
straightforward.
The bulk of this patch is actually the act of removing it from the index
locking functions. We no longer have to predict the size that will be
stored during the transaction to lock the index items that will be
created during the transaction. A bunch of code to predict the size and
then pass it into locking and transactions goes away. Like other inode
fields we now update the size as it changes.
Signed-off-by: Zach Brown <zab@versity.com>
We weren't setting the new flag in the mapped buffer head. This tells
the caller that the buffer is newly allocated and needs to be zeroed.
Without this we expose unwritten newly allocated block contents.
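A minimal sketch of the fix, showing just the buffer head calls rather
than the actual get_block:

#include <linux/buffer_head.h>

/* map a block into the result bh handed to get_block */
static void example_map_result(struct super_block *sb, struct buffer_head *bh,
                               sector_t blkno, bool newly_allocated)
{
        map_bh(bh, sb, blkno);
        if (newly_allocated)
                /* tells the caller to zero or fully overwrite the block
                 * instead of exposing stale on-disk contents */
                set_buffer_new(bh);
}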
fsx found this almost immediately. With this fixed fsx passes.
Signed-off-by: Zach Brown <zab@versity.com>
Call it scoutfs_inode_index_try_lock_hold since it may fail and unwind
as part of normal (not an error) operation. This lets us re-use the
name in an upcoming patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
scoutfs_item_create() hasn't been working with lock coverage. It
wouldn't return -ENOENT if it didn't have the lock cached. It would
create items outside lock coverage so they wouldn't be invalidated and
re-read if another node modified the item.
Add a lock arg and teach it to populate the cache so that it's correctly
consistent.
Signed-off-by: Zach Brown <zab@versity.com>
Add lock coverage for inode index items.
Sadly, this isn't trivial. We have to predict the value of the indexed
fields before the operation to lock those items. One value in
particular we can't reliably predict: the sequence of the transaction we
enter after locking. Also operations can create an absolute ton of
index item updates -- rename can modify nr_inodes * items_per_inode * 2
items, so maybe 24 today. And these items can be arbitrarily positioned
in the key space.
So to handle all this we add functions to gather the predicted item
values we'll need to lock, sort and lock them all, then pass the
appropriate locks down to the item functions during inode updates.
The trickiest bit of the index locking code is having to retry if the
sequence number changes. Preparing locks has to guess the sequence
number of its upcoming trans and then make item update decisions based
on that. If we enter with a different sequence number then we need
to back off and retry with the correct sequence number (we may find that
we'll need to update the indexed meta seq and need to have it locked).
The use of the functions is straightforward. Sites figure out the
predicted sizes, lock, pass the locks to inode updates, and unlock.
While we're at it we replace the individual item field tracking
variables in the inode info with an array of indexed values. The code
ends up a bit nicer. It also gets rid of the indexed time fields that
were left behind and were unused.
It's worth noting that we're getting exclusive locks on the index
updates. Locking the meta/data seq updates results in complete global
serialization of all changes. We'll need concurrent writer locks to get
concurrency back.
Signed-off-by: Zach Brown <zab@versity.com>
Use per_task storage on the inode to pass locks from high level read and
write lock holders down into the callbacks that operate under the locks
so that the locks can then be passed to the item functions.
Signed-off-by: Zach Brown <zab@versity.com>
Add a full lock argument to scoutfs_update_inode_item() and use it to
pass the lock's end key into item_update(). This'll get changed into
passing the full lock into _update soon.
Signed-off-by: Zach Brown <zab@versity.com>
Add the full lock argument to _item_dirty() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.
This also ropes in scoutfs_dirty_inode_item() which is a thin wrapper
around _item_dirty().
Signed-off-by: Zach Brown <zab@versity.com>
Add the full lock argument to _item_next*() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.
Signed-off-by: Zach Brown <zab@versity.com>
Add cluster lock coverage to scoutfs_data_truncate_items() and plumb the
lock down into the item functions.
Signed-off-by: Zach Brown <zab@versity.com>
Some versions of gcc correctly noticed that bulk_alloc() had a
case where it wouldn't initialize ret if the first segno was 0. This
won't happen because the client response processing returns an error in
this case. So this just shuts up the warning.
Signed-off-by: Zach Brown <zab@versity.com>
Move to static mapping items instead of unbounded extents.
We get more predictable data structures and simpler code but still get
reasonably dense metadata.
We no longer need all the extent code needed to split and merge extents,
test for overlaps, and all that. The functions that use the mappings
(get_block, fiemap, truncate) now have a pattern where they decode the
mapping item into an allocated native representation, do their work, and
encode the result back into the dense item.
We do have to grow the largest possible item value to fit the worst case
encoding expansion of random block numbers.
The local allocators are no longer two extents but are instead simple
bitmaps: one for full segments and one for individual blocks. There are
helper functions to free and allocate segments and blocks, with careful
coordination of, for example, freeing a segment once all of its
constituent blocks are free.
_fiemap is refactored a bit to make it more clear what's going on.
There's one function that either merges the next bit with the currently
building extent or fills the current and starts recording from a
non-mergable additional block. The old loop worked this way but was
implemented with a single squirrely iteration over the extents. This
wasn't feasible now that we're also iterating over blocks inside the
mapping items. It's a lot clearer to call out to merge or fill the
fiemap entry.
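The merge-or-fill helper looks roughly like this, where the cached
extent struct is an invented simplification and fiemap_fill_next_extent()
is the real kernel call:

#include <linux/fs.h>

struct example_fie {
        u64 logical;    /* byte offsets and lengths */
        u64 phys;
        u64 len;
        u32 flags;
};

static int example_merge_or_fill(struct fiemap_extent_info *fieinfo,
                                 struct example_fie *cur, u64 logical,
                                 u64 phys, u64 len, u32 flags)
{
        int ret = 0;

        /* contiguous with the same flags: keep building the entry */
        if (cur->len && flags == cur->flags &&
            logical == cur->logical + cur->len &&
            phys == cur->phys + cur->len) {
                cur->len += len;
                return 0;
        }

        /* otherwise fill the built entry and start recording a new one */
        if (cur->len)
                ret = fiemap_fill_next_extent(fieinfo, cur->logical,
                                              cur->phys, cur->len,
                                              cur->flags);
        cur->logical = logical;
        cur->phys = phys;
        cur->len = len;
        cur->flags = flags;
        return ret;
}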
The dirty item reservation counts for using the mappings are reduced
significantly because each modification no longer has to assume that it
might merge with two adjacent contiguous neighbours.
Signed-off-by: Zach Brown <zab@versity.com>
The item count estimate functions didn't obviously differentiate between
adding to a count and resetting it. Most callers initialized the count
struct to 0 on the stack, incremented their estimate once, and passed it
in. The problem is that those same functions that increment once in
callers are also used in other estimates to build counts based on
multiple operations.
This tripped up the data truncate path. It looped and kept incrementing
its count while truncating a file with lots of extents until the count
got so large that it didn't fit in a segment by itself and blocked
forever.
This cleans up the item count code so that it's much harder to get
wrong. We differentiate between the SIC_*() high level count estimates
that are meant to be passed in to _hold_trans(), and the internal
__count_*() functions which are used to add up the item counts that make
up an aggregate operation.
With this fix the only way to use the count in extent truncation is to
correctly reset the item count for each transaction.
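To illustrate the split with invented names and sizes: the internal
helpers add to a running count so they can be composed, while the high
level estimates always start from zero so a loop can't keep inflating
the count it passes to _hold_trans():

struct example_item_count {
        int items;
        int val_bytes;
};

/* internal: adds one extent modification to a running count */
static void __example_count_extent(struct example_item_count *cnt)
{
        cnt->items += 1;
        cnt->val_bytes += 64;   /* made-up worst case value size */
}

/* high level: one operation's estimate, always starting from zero */
static struct example_item_count example_sic_trunc_extent(void)
{
        struct example_item_count cnt = {0};

        __example_count_extent(&cnt);
        return cnt;
}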
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing __block_write_begin spin when staging writes were called
after a read of an offline region saw an error. It turns out that
the way __block_write_begin iterates through buffer heads on a page will
livelock if b_size is 0.
Our get_block was clearing b_blocknr and b_size before doing anything.
It'd set them when it allocated blocks or found existing mapped
blocks. But it'd leave them 0 on an error and trigger this hang.
So we'll back off and only do the same things to the result bh that
ext2/3 do; presumably that's what's actually supported. We only set
mapped, set or clear new, and set b_size to less than the input b_size.
While we're at it we remove a totally bogus extent flag check that's
done before seeing if the next extent we found even intersects with the
logical block that we're searching for. The extra test is performed
again correctly inside the check for the extents overlapping. It is an
artifact from the days when the "extents" were a single block and didn't
need to check for overlaps.
Signed-off-by: Zach Brown <zab@versity.com>
Without this we return -ESPIPE when a process tries to seek on a regular
file.
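A plausible sketch of the wiring, assuming the generic helper is used
and omitting the rest of the file operations:

#include <linux/fs.h>

static const struct file_operations example_file_fops = {
        .llseek         = generic_file_llseek,
        /* read, write, mmap, etc. omitted from this sketch */
};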
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: adapted to new lock call]
Signed-off-by: Zach Brown <zab@zabbo.net>
With trylock implemented we can add locking in readpage. After that it's
pretty easy to implement our own read/write functions which at this
point are more or less wrapping the kernel helpers in the correct
cluster locking.
Data invalidation is a bit interesting. If the lock we are invalidating
is an inode group lock, we use the lock boundaries to incrementally
search our inode cache. When an inode struct is found, we sync and
(optionally) truncate pages.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: adapted to newer lock call, fixed some error handling]
Signed-off-by: Zach Brown <zab@versity.com>
This is based on Mark Fasheh <mfasheh@versity.com>'s series that
introduced inode refreshing after locking and a trylock for readpage.
Rework the inode locking function so that it's more clearly named and
takes flags and the inode struct.
We have callers that want to lock the logical inode but aren't doing
anything with the vfs inode so we provide that specific entry point.
Signed-off-by: Zach Brown <zab@versity.com>
We move struct ocfs2_lock_res_ops and flags to dlmglue.c so that
locks.c can get access to it. Similarly, we export
ocfs2_lock_res_init_common() so that locks.c can initialize each lockres
before use. Also, free_lock_tree() now has to happen before we shut
down the dlm - this gives dlmglue the opportunity to unlock their
underlying dlm locks before we go off freeing the structures.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Truncation updates extents that intersect with the input range. It
starts with the first block in the range and iterates until it has
searched for all the extents that could cover the range.
Extents are stored in items at their final block location so that we can
use _next to find intersections. Truncation was searching for the next
extent after the full extent that it was still searching for. That
means it was starting the search at the last block in the extent, not
the first. It would miss all the extents that didn't overlap with the
last block it was searching for.
This is fixed by searching from a temporary single block extent at the
start of the search range.
Signed-off-by: Zach Brown <zab@versity.com>
Offline extents weren't being merged because they all had their physical
blkno set to 0 and all the extent calculations didn't treat them
specially. They would only merge if the physical blocks of two extents
were contiguous. Instead of special casing offline extents everywhere
we store them with a physical blkno set to the logical blk_off. This
lets all the current extent calculations work as expected.
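A toy version of the merge test shows why: with blkno stored as the
logical blk_off the usual contiguity check passes for adjacent offline
extents, while a blkno of 0 on every offline extent would fail it. The
struct and test are simplified stand-ins:

#include <linux/types.h>

struct example_extent {
        u64 blk_off;    /* logical start */
        u64 blkno;      /* physical start (== blk_off when offline) */
        u64 blocks;
        u8 flags;
};

static bool example_extents_merge(struct example_extent *left,
                                  struct example_extent *right)
{
        return left->flags == right->flags &&
               left->blk_off + left->blocks == right->blk_off &&
               left->blkno + left->blocks == right->blkno;
}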
Signed-off-by: Zach Brown <zab@versity.com>
Release tries to re-instate extents if it sees an error during release.
Those item manipulations need to be covered by the transaction.
Signed-off-by: Zach Brown <zab@versity.com>
The networking code was really suffering from trying to combine the
client and server processing paths into one file. The code can be a lot
simpler by giving the client and server their own processing paths that
take their different socket lifecycles into account.
The client maintains a single connection. Blocked senders work on the
socket under a sending mutex. The recv path runs in work that can be
canceled after first shutting down the socket.
A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets. Each accepted socket has
a single recv work blocked waiting for requests. That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.
All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server. This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing more work while it was being drained.
Signed-off-by: Zach Brown <zab@versity.com>