Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.
Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.
RH-compat: tmpfile support is actually backported by RH into the 3.10
kernel. We need to use some of their kabi-maintaining wrappers to use
it: use a struct inode_operations_wrapper instead of the base struct
inode_operations, and set the S_IOPS_WRAPPER flag in i_flags. This lets
RH's modified vfs_tmpfile() find our tmpfile fn pointer.
Add a test that covers both creating tmpfiles and moving their
contents into a destination file via MOVE_BLOCKS.
xfstests common/004 now runs because tmpfile is supported.
Signed-off-by: Andy Grover <agrover@versity.com>
Add a new distinguishable return value (ENOBUFS) from the allocator for
when the transaction cannot allocate space. This doesn't mean the
filesystem is full -- opening a new transaction may result in forward
progress.
Alter fallocate and get_blocks code to check for this err val and retry
with a new transaction. Handling actual ENOSPC can still happen, of
course.
Add counter called "alloc_trans_retry" and increment it from both spots.
Signed-off-by: Andy Grover <agrover@versity.com>
[zab@versity.com: fixed up write_begin error paths]
Add a relatively constrained ioctl that moves extents between regular
files. This is intended to be used by tasks which combine many existing
files into a much larger file without reading and writing all the file
contents.
Signed-off-by: Zach Brown <zab@versity.com>
With many concurrent writers we were seeing excessive commits forced
because it thought the data allocator was running low. The transaction
was checking the raw total_len value in the data_avail alloc_root for
the number of free data blocks. But this read wasn't locked, and
allocators could completely remove a large free extent and then
re-insert a slightly smaller free extent as they perform their
allocation. The transaction could see a temporarily very small total_len
and trigger a commit.
Data allocations are serialized by a heavy mutex so we don't want to
have the reader try to use that to see a consistent total_len. Instead
we create a data allocator run-time struct that has a consistent
total_len that is updated after all the extent items are manipulated.
This also gives us a place to put the caller's cached extent so that it
can be included in the total_len; previously it wasn't included in the
free total that the transaction saw.
The file data allocator can then initialize and use this struct instead
of its raw use of the root and cached extent. Then the transaction can
sample its consistent total_len that reflects the root and cached
extent.
A subtle detail is that fallocate can't use _free_data to return an
allocated extent on error to the avail pool. It instead frees into the
data_free pool like normal frees. It doesn't really matter that this
could prematurely drain the avail pool because it's in an error path.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have full precision extents a writer with i_mutex and a page
lock can be modifying large extent items which cover much of the
surrounding pages in the file. Readers can be in a different page with
only the page lock and try to work with extent items as the writer is
deleting and creating them.
We add a per-inode rwsem which just protects file extent item
manipulation. We try to acquire it as close to the item use as possible
in data.c which is the only place we work with file extent items.
This stops rare read corruption we were seeing where get_block in a
reader was racing with extent item deletion in a stager at a further
offset in the file.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly. That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.
By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.
Most of this change is churn from changing allocator function and struct
names.
File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity. All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions. This now means
that fallocate and especially restoring offline extents can use larger
extents. Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing. The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks. This resulted in a lot of bugs. Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction. We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.
The server now only moves free extents into client allocators when they
fall below a low threshold. This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.
Signed-off-by: Zach Brown <zab@versity.com>
Use the new item cache for all the item work in the fs instead of
calling into the forest of btrees. Most of this is mechanical
conversion from the _forest calls to the _item calls. The item cache
no longer supports the kvec argument for describing values so all the
callers pass in the value pointer and length directly.
The item cache doesn't support saving items as they're deleted and later
restoring them from an error unwinding path. There were only two users
of this. Directory entries can easily guarantee that deletion won't
fail by dirtying the items first in the item cache. Xattr updates were
a little trickier. They can combine dirtying, creating, updating, and
deleting to atomically switch between items that describe different
versions of a multi-item value. This also fixed a bug in the srch
xattrs where replacing an xattr would create a new id for the xattr and
leave existing srch items referencing a now deleted id. Replacing now
reuses the old id.
And finally we add back in the locking and transaction item cache
integration.
Signed-off-by: Zach Brown <zab@versity.com>
Introduce different constants for small and large metadata block
sizes.
The small 4KB size is used for the super block, quorum blocks, and as
the granularity of file data block allocation. The larger 64KB size is
used for the radix, btree, and forest bloom metadata block structures.
The bulk of this is obvious transitions from the old single constant to
the appropriate new constant. But there are a few slightly more
involved changes.
The block crc calculation now needs the caller to pass in the size of
the block. The radix function to return free bytes instead returns free
blocks and the caller is responsible for knowing how big its managed
blocks are.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for reporting errors to data waiters via a new
SCOUTFS_IOC_DATA_WAIT_ERR ioctl. This allows waiters to return an error
to readers when staging fails.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
[zab: renamed to data_wait_err, took ino arg]
Signed-off-by: Zach Brown <zab@versity.com>
File data allocations come from radix allocators which are populated by
the server before each client transaction. It's possible to fully
consume the data allocator within one transaction if the number of dirty
metadata blocks is kept low. This could result in premature ENOSPC.
This was happening to the archive-light-cycle test. If the transactions
performed by previous tests lined up just right then the creation of the
initial test files could see ENOSPC and cause all sorts of nonsense in
the rest of the test, culminating in cmp commands stuck in offline
waits.
This introduces high and low data allocator water marks for
transactions. The server tries to fill data allocators for each
transaction to the high water mark and the client forces the commit of a
transaction if its data allocator falls below the low water mark.
The archive-light-cycle test now passes easily and we see the
trans_commit_data_alloc_low counter increasing during the test.
Signed-off-by: Zach Brown <zab@versity.com>
An incorrect warning condition was added as fallocate was implemented.
It tried to warn against trying to read from the staging ioctl. But the
staging boolean is set on the inode when the staging ioctl has the inode
mutex. It protects against writes, but page reading doesn't use the
mutex. It's perfectly acceptable for reads to be attempted while the
staging ioctl is busy. We rely on that so a large read can consume
data as staging writes it.
The warning caused reads to fail while the stager ioctl was working.
Typically this would hit read-ahead and just force sync reads. But it
could hit sync reads and cause EIO.
Signed-off-by: Zach Brown <zab@versity.com>
We miscalculated the length of extents to create when initializing
offline extents for setattr_more. We were clamping the extent length in
each packed extent item by the full size of the offline extent, ignoring
the iblock position that we were starting from.
Signed-off-by: Zach Brown <zab@versity.com>
Don't return -ENOENT from fiemap on a file with no extents. The
operation is supposed to succeed with no extents.
Signed-off-by: Zach Brown <zab@versity.com>
The setattr_more ioctl has its own helper for creating uninitialized
extents when we know that there can't be any other existing extents. We
don't have to worry about freeing blocks they might have referenced.
This helper forgot to actually store the modified extents back into
packed extent items after setting extents offline.
Signed-off-by: Zach Brown <zab@versity.com>
Add a bit more tracing to stage, release, and unwritten extent
conversion so we can get a bit more visibility into the threads staging
and releasing.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have the allocators that use radix blocks we can remove all
the code that was using btree items to store free block bitmaps.
Signed-off-by: Zach Brown <zab@versity.com>
Convert metadata block and file data extent allocations to use the radix
allocator.
Most of this is simple transitions between types and calls. The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix. We remove the code and fields that were responsible for adding
uninitialized data and metadata.
The rest of the unused block allocator code is only ifdefed out. It'll
be removed in a separate patch to reduce noise here.
Signed-off-by: Zach Brown <zab@versity.com>
The btree forest item storage doesn't have as much item granular state
as the item cache did. The item cache could tell if a cached item was
populated from persistent storage or was created in memory. It could
simply remove created items rather than leaving behind a deletion item.
The cached btree blocks in the btree forest item storage mechanism
can't do this. The forest has to create deletion items when deleting
newly created items because it doesn't know whether the item already
exists in the persistent record or not.
This created a problem with the extent storage we were using. The
individual extent items were stored with a key set to the last logical
block of their extent. As extents grew or shrank they often were
deleted and created at different key values during a transaction. In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent. Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.
Streaming writes would pay an O(n) cost for every extent operation. It
got out of hand. This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.
For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.
Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items. The client is no longer working with free extent
items managed by the forest, it's working with free block bitmap btrees
directly. It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.
Previously the client and server would exchange extents with network
messages. Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction. The client doesn't need to send free extents back to
the server so we can remove those tasks and rpcs.
The server no longer has to manage free extents. It transfers block
bitmap items between trees around commits. All of its extent
manipulation can be removed.
The item size portion of transaction item counts is removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed sized
segments.
Signed-off-by: Zach Brown <zab@versity.com>
Convert fs callers to work with the btree forest calls instead of the
lsm item cache calls. This is mostly a mechanical syntax conversion.
The inode dirtying path does now update the item rather than simply
dirtying it.
Signed-off-by: Zach Brown <zab@versity.com>
Use the mount's generated random id in persistent items and the lock
that protects them instead of the assigned node_id.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can be used by userspace to restore a file to its
offline state. To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl, it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
Signed-off-by: Zach Brown <zab@versity.com>
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.
The client code gets some shims to send and receive lock messages to and
from the server. Callers use our lock mode constants instead of the
DLM's.
Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.
The biggest change is in the client lock state machine. Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing. We don't have everything
come through a per-lock work queue. Instead we send requests either
from the blocking lock caller or from a shrink work queue. Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.
The different processing contexts lead to a slightly different lock
life cycle. We refactor and separate allocation and freeing from
tracking and removing locks in data structures. We add a _get and _put
to track active use of locks, and then async references to locks by
holders and requests are tracked separately.
Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time. We do have to do a bit of work to make sure we process back to
back grant responses and invalidation requests from the server.
As of this change the lock setup and destruction paths are a little
wobbly. They'll be shored up as we add lock recovery between the client
and server.
Signed-off-by: Zach Brown <zab@versity.com>
We had gotten a bit sloppy with the workqueue flags. We needed _UNBOUND
in some workqueues where we wanted concurrency by scheduling across cpus
instead of waiting for the current (very long running) work on a cpu to
finish. We add NON_REENTRANT out of an abundance of caution. It has
gone away in modern kernels and is probably not needed here, but
according to the docs we would want it so we at least document that fact
by using it.
Signed-off-by: Zach Brown <zab@versity.com>
Freed file data extents are tracked in free extent items in each node.
They could only be re-used in the future for file data extent allocation
on that node. Allocations on other nodes or, critically, segment
allocation on the server could never see those free extents. With the
right allocation patterns, particularly allocating on node X and freeing
on node Y, all the free extents can build up on a node and starve other
allocations.
This adds a simple high water mark after which nodes start returning
free extents to the server. From there they can satisfy segment
allocations or be sent to other nodes for file data extent allocation.
Signed-off-by: Zach Brown <zab@versity.com>
Our simple transaction machinery causes high commit latencies if we let
too much dirty file data accumulate.
Small files have a natural limit on the amount of dirty data because
they have more dirty items per dirty page. They fill up the single
segment sooner and kick off a commit which finds a relatively small
amount of dirty file data.
But large files can reference quite a lot of dirty data with a small
amount of extent items which don't fill up the transaction's segment.
During large streaming writes we can fill up memory with dirty file data
before filling a segment with mapping extent metadata. This can lead to
high commit latencies when memory is full of dirty file pages.
Regularly kicking off background writeback behind streaming write
positions reduces the amount of dirty data that commits will find and
have to write out.
Signed-off-by: Zach Brown <zab@versity.com>
The previous fallocate and get_block allocators only looked for free
extents larger than the requested allocation size. This prematurely
returned -ENOSPC when a very large allocation was attempted. Some xfstests
stress low free space situations by fallocating almost all the free
space in the volume.
This adds an allocation helper function that finds the biggest free
extent to satisfy an allocation, possibly after trying to get more free
extents from the server. It looks for previous extents in the index of
extents by length. This builds on the previously added item and extent
_prev operations.
Allocators need to then know the size of the allocation they got instead
of assuming they got what they asked for. The server can also return a
smaller extent so it needs to communicate the extent length, not just
its start.
Signed-off-by: Zach Brown <zab@versity.com>
Add an extent function for iterating backwards through extents. We add
the wrapper and have the extent IO functions call their storage _prev
functions. Data extent IO can now call the new scoutfs_item_prev().
Signed-off-by: Zach Brown <zab@versity.com>
The addition of fallocate() now means that offline extents can be
unwritten and allocated and that extents can be found outside of
i_size.
Truncating needs to know about the possible flag combinations, writing
preallocation needs to know to update an existing extent or allocate up
to the next extent, get_block can't map unwritten extents for read,
extent conversion needs to also clear offline, and truncate needs to
drop extents outside i_size even if truncating to the existing file
size.
Signed-off-by: Zach Brown <zab@versity.com>
Add an fallocate operation.
This changes the possible combinations of flags in extents and makes it
possible to create extents beyond i_size. This will confuse the rest of
the code in a few places and that will be fixed up next.
Signed-off-by: Zach Brown <zab@versity.com>
The release ioctl forgot to update the inode item after truncating
online block mappings. This meant that the offline block count update
was lost when the inode was evicted and re-read, leading to inconsistent
offline block counts.
Signed-off-by: Zach Brown <zab@versity.com>
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata. The more reasonable
strategy is to warn about the corruption, act accordingly, and leave it
to corrective measures to resolve it. In this case we continue
returning the error that caused us to try to clean up.
Signed-off-by: Zach Brown <zab@versity.com>
This is no longer used now that we allocate large extents for
concurrently extending files by preallocating unwritten extents.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have extents we can address the fragmentation of concurrent
writes with large preallocated unwritten extents instead of trying to
allocate from disjoint free space with cursors.
First we add support for unwritten extents. Truncate needs to make sure
it doesn't treat truncated unwritten blocks as online just because
they're not offline. If we try to write into them we convert them to
written extents. And fiemap needs to flag them as unwritten and be sure
to check for extents past i_size.
Then we allocate unwritten extents only if we're extending a contiguous
file. We try to preallocate the size of the file and cap it to a meg.
This ends up with a power of two progression of preallocation sizes,
which nicely balances extent sizes and wasted allocation as file sizes
increase.
We need to be careful to truncate the preallocated regions if the entire
file is released. We take that as an indication that the user doesn't
want the file consuming any more space.
This removes most of the use of the cursor code. It will be completely
removed in a further patch.
Signed-off-by: Zach Brown <zab@versity.com>
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.
We add a client request to allocate an extent of a given length. The
existing segment alloc and free now work with a segment's worth of
blocks.
The server maintains counters in the super block of free blocks instead
of free segments. We maintain an allocation cursor so that allocation
results tend to cycle through the device. It's stored in the super so
that it is maintained across server instances.
This doesn't remove unused dead code to keep the commit from getting too
noisy. It'll be removed in a future commit.
Signed-off-by: Zach Brown <zab@versity.com>
Store file data mappings and free block ranges in extents instead of in
block mapping items and bitmaps.
This adds the new functionality and refactors the functions that use it.
The old functions are no longer called and we stop at ifdeffing them out
to keep the change small. We'll remove all the dead code in a future
change.
Signed-off-by: Zach Brown <zab@versity.com>
Variable length keys lead to having a key struct point to the buffer
that contains the key. With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.
We no longer have a separate generic key buf struct that points to
specific per-type key storage. All items use the key struct and fill
out the appropriate fields. All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.
Each key user now has an init function that fills out its fields. It
looks a lot like the old pattern but we no longer have separate key
storage that the buf points to.
A bunch of code now takes the address of static key storage instead of
managing allocated keys. Conversely, swapping now uses the full keys
instead of pointers to the keys.
We don't need all the functions that worked on the generic key buf
struct because they had different lengths. Copy, clone, length init,
memcpy, all of that goes away.
The item API had some functions that tested the length of keys and
values. The key length tests vanish, and that gets rid of the _same()
call. The _same_min() call only had one user who didn't also test for
the value length being too large. Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.
We no longer have to track the number of key bytes when calculating if
an item population will fit in segments. This removes the key length
from reservations, transactions, and segment writing.
The item cache key querying ioctls no longer have to deal with variable
length keys. They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.
The segment no longer has to store the key length. It stores the key
struct in the item header.
The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct. The SK_ wrappers
that bracketed calls to use preempt safe per cpu buffers can turn back
into their normal calls.
Manifest entries are now a fixed size. We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq. They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap. This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.
Signed-off-by: Zach Brown <zab@versity.com>
Originally the item interfaces were written with full support for
vectored keys and values. Callers constructed keys and values made up
of header structs and data buffers. Segments supported much larger
values which could span pages when stored in memory.
But over time we've pulled that support back. Keys are described by a
key struct instead of a multi-element kvec. Values are now much smaller
and don't span pages. The item interfaces still use the kvec arrays but
everyone only uses a single element.
So let's make the world a whole lot less awful by having the item
interfaces support only a single value buffer specified by a kvec. A
bunch of code disappears and the result is much easier to understand.
Signed-off-by: Zach Brown <zab@versity.com>
Every caller of scoutfs_item_lookup_exact() provided a size that matches
the value buffer. Let's remove the redundant arg and use the value
buffer length as the exact size to match.
Signed-off-by: Zach Brown <zab@versity.com>
There were some mistakes in tracking offline blocks.
Online and offline block counts are meant to only refer to actual data
contents. Sparse blocks in an archived file shouldn't be counted as
offline.
But the code was marking unallocated blocks as offline. This could
corrupt the offline block count if a release extended past i_size and
marked the blocks in the mapping item as offline even though they're
past i_size.
We could have clamped the block walking to not go past i_size. But we
still would have had the problem of having offline blocks track sparse
blocks.
Instead we can fix the problem by only marking blocks offline if they
had allocated blocks. This means that sparse regions are never marked
offline and will always read zeros. Now a release that extends past
i_size will not do anything to the unallocated blocks in the mapping
item past i_size and the offline block count will be consistent.
(Also the 'modified' and 'dirty' booleans were redundant, we only need
one of the two.)
Signed-off-by: Zach Brown <zab@versity.com>
The super info's alloc_rwsem protects the local node free segment and
block bitmap items. The truncate code wasn't holding the rwsem so
it could race with other local node allocator item users and corrupt the
bitmaps. In the best case this could corrupt structures that trigger
EIO. The corrupt items could also create duplicate block allocations
that clobber each other and corrupt data.
Signed-off-by: Zach Brown <zab@versity.com>
We weren't doing anything with the inode blocks field. We weren't even
initializing it which explains why we'd sometimes see garbage i_blocks
values in scoutfs inodes in segments.
The logical blocks field reflects the contents of the file regardless of
whether it's online or not. It's the sum of our online and offline block
tracking.
So we can initialize it to our persistent online and offline counts and
then keep it in sync as blocks are allocated and freed.
Signed-off-by: Zach Brown <zab@versity.com>
We had an excessive number of layers between scoutfs and the dlm code in
the kernel. We had dlmglue, the scoutfs locks, and task refs. Each
layer had structs that track the lifetime of the layer below it. We
were about to add another layer to hold on to locks just a bit longer so
that we can avoid down conversion and transaction commit storms under
contention.
This collapses all those layers into a simple state machine in lock.c that
manages the mode of dlm locks on behalf of the file system.
The users of the lock interface are mainly unchanged. We did change
from a heavier trylock to a lighter nonblock lock attempt and have to
change the single rare readpage use. Lock fields change so a few
external users of those fields change.
This not only removes a lot of code it also contains functional
improvements. For example, it can now convert directly to CW locks with
a single lock request instead of having to use two by first converting
to NL.
It introduces the concept of an unlock grace period. Locks won't be
dropped on behalf of other nodes soon after being unlocked so that tasks
have a chance to batch up work before the other node gets a chance.
This can result in two orders of magnitude improvements in the time it
takes to, say, change a set of xattrs on the same file population from
two nodes concurrently.
There are significant changes to trace points, counters, and debug files
that follow the implementation changes.
Signed-off-by: Zach Brown <zab@versity.com>
We aren't using the size index. It has runtime and code maintenance
costs that aren't worth paying. Let's remove it.
Removing it from the format and no longer maintaining it are
straightforward.
The bulk of this patch is actually the act of removing it from the index
locking functions. We no longer have to predict the size that will be
stored during the transaction to lock the index items that will be
created during the transaction. A bunch of code to predict the size and
then pass it into locking and transactions goes away. Like other inode
fields we now update the size as it changes.
Signed-off-by: Zach Brown <zab@versity.com>
We weren't setting the new flag in the mapped buffer head. This tells
the caller that the buffer is newly allocated and needs to be zeroed.
Without this we expose unwritten newly allocated block contents.
fsx found this almost immediately. With this fixed fsx passes.
Signed-off-by: Zach Brown <zab@versity.com>
Call it scoutfs_inode_index_try_lock_hold since it may fail and unwind
as part of normal (not an error) operation. This lets us re-use the
name in an upcoming patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>