We first attempt to allocate our large logically contiguous cached
blocks with physically contiguous pages to minimize the impact on the
TLB. When that fails we fall back to vmalloc()ed blocks. Sadly,
high-order page allocation failure is expected and we forgot to provide
the flag that suppresses the page allocation failure message.
Signed-off-by: Zach Brown <zab@versity.com>
We had a bug where mkfs would set a free data blkno allocator bit past
the end of the device. (Just at it, in fact. Those fenceposts.) Add
some checks at mount to make sure that the allocator blkno ranges in the
super don't have obvious mistakes.
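
As a rough illustration of the kind of mount-time check described above,
here is a minimal userspace sketch. The struct and function names are
hypothetical and are not taken from the actual super block format; the
only point is the inclusive-range fencepost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of an allocator range recorded in the super. */
struct alloc_range {
	uint64_t first_blkno;	/* first block the allocator may hand out */
	uint64_t last_blkno;	/* last block, inclusive */
};

/*
 * Return true if the range is non-empty and lies entirely inside a
 * device of total_blocks blocks.  Valid blknos are 0..total_blocks-1,
 * so a last_blkno equal to total_blocks is exactly the off-by-one
 * that this check is meant to catch.
 */
bool alloc_range_valid(const struct alloc_range *r, uint64_t total_blocks)
{
	assert(r != NULL);

	if (total_blocks == 0)
		return false;
	if (r->first_blkno > r->last_blkno)
		return false;
	return r->last_blkno < total_blocks;
}
```

On a 100 block device a range of 0..99 passes and a range of 0..100 is
rejected.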
Signed-off-by: Zach Brown <zab@versity.com>
Entries in a directory are indexed by the hash of their name. This
introduces a perfectly random access pattern. And this results in a cow
storm as directories get large enough such that the leaf blocks that
store their entries are larger than our commits. Each commit ends up
being full of cowed leaf blocks that contain a single new entry.
The dirent name fingerprints change the dirent key to first start with a
fingerprint of the name. This reduces the scope of hash randomization
from the entire directory to entries with the same fingerprint.
On real customer dir sizes and file names we saw roughly 3x create rate
improvements from being able to create more entries in leaf blocks
within a commit.
Signed-off-by: Zach Brown <zab@versity.com>
The radix allocator no longer uses the block visited bit because it
maintains its own much richer private per-block data stored off the priv
pointer.
Signed-off-by: Zach Brown <zab@versity.com>
This reverts commit 294b6d1f79e6d00ba60e26960c764d10c7f4b8a5.
We had previously seen lock contention between mounts that were either
resolving paths by looking up entries in directories or writing xattrs
in file inodes as they did archiving work.
The previous attempt to avoid this contention was to give each directory
its own inode number allocator which ensured that inodes created for
entries in the directory wouldn't share lock groups with inodes in other
directories.
But this creates the problem of operating on few files per lock for
reasonably small directories. It also creates more server commits as
each new directory gets its inode allocation reservation.
The fix is to have mount-wide separate allocators for directories and
for everything else. This puts directories and files in separate groups
and locks, regardless of directory population.
Signed-off-by: Zach Brown <zab@versity.com>
We had switched away from the radix_tree because we were adding a
_block_move call which couldn't fail. We no longer need that call, so
we can go back to storing cached blocks in the radix tree which can use
RCU lookups.
This revert has some conflict resolution around recent commits to add
the IO_BUSY block flag and the switch to _LG_ blocks.
This reverts commit 10205a5670dd96af350cf481a3336817871a9a5b.
Signed-off-by: Zach Brown <zab@versity.com>
The radix allocator has to be careful to not get lost in recursion
trying to allocate metadata blocks for its dirty radix blocks while
allocating metadata blocks for others.
The first pass had used path data structures to record the references to
all the blocks we'd need to modify to reflect the frees and allocations
performed while dirtying radix blocks. Once it had all the path blocks
it moved the old clean blocks into new dirty locations so that the
dirtying couldn't fail.
This had two very bad performance implications. First, it meant that
trying to read clean versions of dirtied trees would always read the old
blocks again because their clean version had been moved to the dirty
version. Typically this wouldn't happen but the server does exactly
this every time it tries to merge freed blocks back into its avail
allocator. This created a significant IO load on the server. Secondly,
that block cache move not being allowed to fail motivated us to move to
a locked rbtree for the block cache instead of the lockless rcu
radix_tree.
This changes the recursion avoidance to use per-block private metadata
to track every block that we allocate and cow rather than move. Each
dirty block knows its parent ref and the blknos it would clear and set.
If dirtying fails we can walk back through all the blocks we dirtied and
restore their original references before dropping all the dirty blocks
and returning an error. This lets us get rid of the path structure
entirely and results in a much cleaner system.
This change meant tracking free blocks without clearing them as they're
used to satisfy dirty block allocations. The change now has a cursor
that walks the avail metadata tree without modifying it. While building
this it became clear that tracking the first set bits of refs doesn't
provide any value if we're always searching from a cursor. The cursor
ends up providing the same value of avoiding constantly searching empty
initial bits and refs. Maintaining the first metadata was just
overhead.
Signed-off-by: Zach Brown <zab@versity.com>
The forest code has a hint call to give iterators a place to start
reading from before they acquire locks. It was checking all the log
trees but it wasn't checking the main fs tree. This happened to be OK
today because we're not yet merging items from the log trees into the
main fs tree, but we don't want to miss them once we do start merging
the trees.
Signed-off-by: Zach Brown <zab@versity.com>
The forest item operations were reading the super block to find the
roots that it should read items from.
This was easiest to implement to start, but it is too expensive. We
have to find the roots for every newly acquired lock and every call to
walk the inode seq indexes.
To avoid all these reads we first send the current stable versions of
the fs and logs btrees roots along with root grants. Then we add a net
command to get the current stable roots from the server. This is used
to refresh the roots if stale blocks are encountered and on the seq
index queries.
Signed-off-by: Zach Brown <zab@versity.com>
The server fills radix allocators for the client to consume while
allocating during a transaction. The radix merge function used to move
an entire radix block at a time. With larger blocks this becomes much
too coarse and can move way too much in one call.
This moves allocator bits a word at a time and more precisely moves the
amount that the caller asked for.
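
A minimal userspace sketch of moving bits a word at a time while
stopping at the requested count. It works on flat bitmaps rather than
radix blocks, and the function name and signature are assumptions made
for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Move up to 'want' set bits from src to dst.  Both bitmaps have
 * 'nwords' 64bit words.  Each word is updated once, and the final
 * word is only partially consumed if taking all of it would exceed
 * the request.  Returns the number of bits actually moved.
 */
uint64_t move_bits(uint64_t *dst, uint64_t *src, size_t nwords, uint64_t want)
{
	uint64_t moved = 0;

	assert(dst != src);

	for (size_t i = 0; i < nwords && moved < want; i++) {
		uint64_t w = src[i];
		uint64_t take = 0;

		/* Pick set bits from the low end until the request is met. */
		while (w != 0 && moved < want) {
			uint64_t low = w & (~w + 1);	/* isolate lowest set bit */

			take |= low;
			w ^= low;
			moved++;
		}

		src[i] &= ~take;
		dst[i] |= take;
	}

	return moved;
}
```

Asking for 10 bits from a source holding 8 bits in its first word and 4
in its second moves all of the first word and only the two lowest bits
of the second.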
Signed-off-by: Zach Brown <zab@versity.com>
Introduce different constants for small and large metadata block
sizes.
The small 4KB size is used for the super block, quorum blocks, and as
the granularity of file data block allocation. The larger 64KB size is
used for the radix, btree, and forest bloom metadata block structures.
Most of these changes are obvious transitions from the old single constant to
the appropriate new constant. But there are a few more involved
changes, though just barely.
The block crc calculation now needs the caller to pass in the size of
the block. The radix function to return free bytes instead returns free
blocks and the caller is responsible for knowing how big its managed
blocks are.
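
The shape of those two interface changes might look roughly like the
sketch below. The constant and function names are hypothetical, and the
use of CRC-32C is an assumption made only so the example is complete;
the point is that the caller supplies the block size in both cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the sizes come from the description above. */
#define SMALL_BLOCK_SHIFT	12
#define SMALL_BLOCK_SIZE	(1U << SMALL_BLOCK_SHIFT)	/* 4KB */
#define LARGE_BLOCK_SHIFT	16
#define LARGE_BLOCK_SIZE	(1U << LARGE_BLOCK_SHIFT)	/* 64KB */

/*
 * Bitwise CRC-32C over the first 'size' bytes of a block.  The crc
 * no longer assumes a single global block size.
 */
uint32_t block_crc(const void *block, size_t size)
{
	const uint8_t *p = block;
	uint32_t crc = 0xFFFFFFFFu;

	assert(block != NULL || size == 0);

	for (size_t i = 0; i < size; i++) {
		crc ^= p[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
	}

	return ~crc;
}

/* The allocator reports blocks; the caller scales by its own block size. */
uint64_t free_blocks_to_bytes(uint64_t free_blocks, unsigned block_shift)
{
	assert(block_shift < 64);
	return free_blocks << block_shift;
}
```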
Signed-off-by: Zach Brown <zab@versity.com>
It used to take significant effort to create very tall btrees because
they only stored small references to large LSM segments. Now they store
all file system metadata and we can easily create sufficiently large
btrees for testing. We don't need the tiny btree option.
Signed-off-by: Zach Brown <zab@versity.com>
There are no users of these variants of _prev and _next so they can be
removed. Support for them was also dropped in the previous reworking of
the internal structure of the btree blocks.
Signed-off-by: Zach Brown <zab@versity.com>
This btree implementation was first built for the relatively light duty
of indexing segments in the LSM item implementation. We're now using it
as the core metadata index. It's already using a lot of cpu to do its
job with small blocks and it only gets more expensive as the block size
increases. These changes reduce the CPU use of working with the btree
block structures.
We use a balanced binary tree to index items by key in the block. This
gives us rare tree balancing cost on insertion and deletion instead of
the memmove overhead of maintaining a dense array of item offsets sorted
by key. The keys are stored in the item structs, which are stored in an
array at the front of the block so searching for an item uses contiguous
cachelines.
We add a trailing owner offset to values so that we can iterate through
them. This is used to track space freed up by values instead of paying
the memmove cost of keeping all the values at the end of the block. We
occasionally reclaim the fragmented value free space instead of
splitting the block.
Direct item lookups use a small hash table at the end of the block
which maps offsets to items. It uses linear probing and is guaranteed
to have a light load factor so lookups are very likely to only need
a single cache lookup.
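
For readers unfamiliar with the technique, a lightly loaded linear
probing table can be sketched as follows. The layout, sizes, and hash
function here are all assumptions for illustration and do not reflect
the real on-disk block format. The table has twice as many buckets as
items so there is always an empty bucket to terminate a probe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ITEMS	8
#define NR_BUCKETS	16	/* twice MAX_ITEMS keeps the load factor <= 1/2 */

struct item {
	uint64_t key;
	uint32_t val;
};

struct block {
	struct item items[MAX_ITEMS];
	uint8_t nr_items;
	uint8_t buckets[NR_BUCKETS];	/* 0 = empty, else item index + 1 */
};

/* Multiplicative hash; the top 4 bits select one of 16 buckets. */
unsigned bucket_of(uint64_t key)
{
	return (unsigned)((key * UINT64_C(0x9E3779B97F4A7C15)) >> 60);
}

/* Append an item and record it in the first free bucket from its hash. */
int block_insert(struct block *b, uint64_t key, uint32_t val)
{
	unsigned i = bucket_of(key);

	assert(b != NULL);
	if (b->nr_items == MAX_ITEMS)
		return -1;

	while (b->buckets[i] != 0)
		i = (i + 1) % NR_BUCKETS;

	b->items[b->nr_items].key = key;
	b->items[b->nr_items].val = val;
	b->buckets[i] = ++b->nr_items;
	return 0;
}

/* Probe from the hashed bucket until the key or an empty bucket is found. */
const struct item *block_lookup(const struct block *b, uint64_t key)
{
	unsigned i = bucket_of(key);

	while (b->buckets[i] != 0) {
		const struct item *it = &b->items[b->buckets[i] - 1];

		if (it->key == key)
			return it;
		i = (i + 1) % NR_BUCKETS;
	}

	return NULL;
}
```

The sketch does not handle duplicate keys or deletion, which a real
implementation would need to.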
We adjust the watermark for triggering a join from half of a block down
to a quarter. This results in less utilized blocks on average. But it
creates distance between the join and split thresholds so we get less
cpu use from constantly joining and splitting if item populations happen
to hover around the previously shared threshold.
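
The gap between the two thresholds can be sketched with two hypothetical
predicates. This assumes a simple byte count of used space per block,
which is a simplification of whatever accounting the btree really does.

```c
#include <assert.h>
#include <stdbool.h>

/* Split when the incoming item will not fit in the remaining space. */
bool needs_split(unsigned used, unsigned item_bytes, unsigned block_size)
{
	assert(used <= block_size);
	return item_bytes > block_size - used;
}

/* Join with a sibling only once the block is under a quarter full. */
bool needs_join(unsigned used, unsigned block_size)
{
	assert(block_size > 0);
	return used < block_size / 4;
}
```

A full block that splits leaves two blocks that are each about half
full. With the join threshold at a quarter, each must lose roughly
another quarter of a block before a join is considered.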
While shifting the implementation we choose not to add support for some
features that no longer make sense. There are no longer callers of
_before and _after, and having synthetic tests to use small btree blocks
no longer makes sense when we can easily create very tall trees. Both
those btree interfaces and the tiny btree block support will be removed.
Signed-off-by: Zach Brown <zab@versity.com>
The btree currently uses variable length big-endian buffers that are
compared with memcmp() as keys. This is a historical relic of the time
when keys could be very large. We had dirent keys that included the
name and manifest entries that included those fs keys.
But now all the btree callers are jumping through hoops to translate
their fs keys into big-endian btree keys. And the memcmp() of the
keys is showing up in profiles.
This makes the btree take native scoutfs_key structs as its key. The
forest callers which are working with fs keys can just pass their keys
straight through. The server btree callers with their private btrees
get key fields defined for their use instead of having individual
big-endian key structs.
A nice side-effect of this is that splitting parents doesn't have to
assume that a maximal key will be inserted by a child split. We can
have more keys in parents and wider trees.
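
To illustrate the difference from memcmp() on big-endian buffers, here
is a sketch of comparing a native key struct field by field. The struct
below is a hypothetical stand-in and is not the real scoutfs_key layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-size key; fields are compared most significant first. */
struct fs_key {
	uint8_t zone;
	uint8_t type;
	uint64_t first;
	uint64_t second;
};

static int cmp_u64(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}

/* Returns <0, 0, or >0 like memcmp(), using native integer compares. */
int fs_key_compare(const struct fs_key *a, const struct fs_key *b)
{
	int c;

	assert(a && b);

	c = cmp_u64(a->zone, b->zone);
	if (c == 0)
		c = cmp_u64(a->type, b->type);
	if (c == 0)
		c = cmp_u64(a->first, b->first);
	if (c == 0)
		c = cmp_u64(a->second, b->second);
	return c;
}
```

Callers that already hold keys in this form need no translation step
before calling into the btree.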
Signed-off-by: Zach Brown <zab@versity.com>
These were used for constructing arrays of string mappings of key
fields. We don't print keys with symbolic strings anymore so we don't
need to maintain these values anymore.
Signed-off-by: Zach Brown <zab@versity.com>
The lock server maintains some items in btrees in the server. It is
usually called by the server core during a commit so it doesn't need to
worry about managing commits. But the lock recovery timeout code
happens in its own async context. It needs to protect the lock_client
item removals with a commit.
This was causing failures during xfstests that simulate node crashes by
unmounting with dm-flakey. Lock recovery would dirty blocks in the
btree writer outside of a commit. The first server commit holder would
find dirty blocks and throw an assertion indicating that someone
modified blocks without holding a commit.
Signed-off-by: Zach Brown <zab@versity.com>
The calls for holding and applying commits in the server are currently
private. The lock server is a server component that has been separated
out into its own file. Most of the time the server calls it during
commits so the btree changes made in the lock server are protected by
the commits. But there are btree calls in the lock server that happen
outside of calls from the server.
Exporting these calls will let the lock server make all its btree
changes in server commits.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for reporting errors to data waiters via a new
SCOUTFS_IOC_DATA_WAIT_ERR ioctl. This allows waiters to return an error
to readers when staging fails.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
[zab: renamed to data_wait_err, took ino arg]
Signed-off-by: Zach Brown <zab@versity.com>
Item writes are first stored in dirty blocks in the private version of
the mount's log tree. Local readers need to be sure to check the dirty
version of the mount's log tree to make sure that they see the result of
writes. Usually trees are found by walking the log tree items stored in
another btree in the super. The private dirty version of a mount's log
tree hasn't been committed yet and isn't visible in these items.
The forest uses its lock private data to track which lock has seen items
written and so should always check the local dirty log tree when
reading. The intent was to use the per-lock static forest_root for the
log tree to record that it had been marked by a write and was then
always used for reads.
We relied on storing the forest info's rid and testing for a non-zero
forest_root rid as the mechanism for always checking the dirty log root
during read. But we weren't setting the forest info rid as each
transaction opened. It was always 0 so readers never added the dirty
log tree for reading.
The fix is to use the more reliable indication that the log root has
items for us by testing the flag that all the bits have been set. Then
we're also sure to always set the rid/nr of the forest_info record of
our log tree, and the per-lock forest_root copy of it whenever we use
it.
This fixed spurious errors we were seeing as creates tried to read
the item they just wrote as memory reclaim freed locks.
Signed-off-by: Zach Brown <zab@versity.com>
File data allocations come from radix allocators which are populated by
the server before each client transaction. It's possible to fully
consume the data allocator within one transaction if the number of dirty
metadata blocks is kept low. This could result in premature ENOSPC.
This was happening to the archive-light-cycle test. If the transactions
performed by previous tests lined up just right then the creation of the
initial test files could see ENOSPC and cause all sorts of nonsense in
the rest of the test, culminating in cmp commands stuck in offline
waits.
This introduces high and low data allocator water marks for
transactions. The server tries to fill data allocators for each
transaction to the high water mark and the client forces the commit of a
transaction if its data allocator falls below the low water mark.
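
The two sides of the water mark scheme can be sketched as below. The
constant values and function names are hypothetical; the real thresholds
are not stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thresholds, in data blocks. */
#define DATA_ALLOC_HIGH_BLOCKS	8192
#define DATA_ALLOC_LOW_BLOCKS	1024

/* Server side: how many blocks to add to reach the high water mark. */
uint64_t server_fill_blocks(uint64_t client_free)
{
	assert(DATA_ALLOC_LOW_BLOCKS < DATA_ALLOC_HIGH_BLOCKS);

	if (client_free >= DATA_ALLOC_HIGH_BLOCKS)
		return 0;
	return DATA_ALLOC_HIGH_BLOCKS - client_free;
}

/* Client side: force a commit once free space drops below the low mark. */
bool client_should_commit(uint64_t client_free)
{
	return client_free < DATA_ALLOC_LOW_BLOCKS;
}
```

Forcing the commit early gives the server a chance to refill the
allocator before the client runs dry within a single transaction.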
The archive-light-cycle test now passes easily and we see the
trans_commit_data_alloc_low counter increasing during the test.
Signed-off-by: Zach Brown <zab@versity.com>
The identifier for data.h's include guard was brought over from an old
file and still had the old name. Update it to reflect its use in data,
not filerw.
Signed-off-by: Zach Brown <zab@versity.com>
When we added the kernelcompat layer around the old and new readdir
interfaces there was some confusion in the old readdir interface filldir
arguments. We were passing in our scoutfs dent item struct pointer
instead of the filldir callback buf pointer. This prevented readdir
from working in older kernels because filldir would immediately see a
corrupt buf and return an error.
This renames the emit compat macro arguments to make them consistent
with the other calls and readdir now provides the correct pointer to the
emit wrapper.
Signed-off-by: Zach Brown <zab@versity.com>
The radix block next bit search could return a spurious -ENOENT if it
ran out of references in a parent block further down the tree. It needs
to bubble up to try the next ref in its parent so that it keeps
performing a depth-first search of the entire tree.
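
The corrected search behaviour can be sketched with a toy tree. The
fanout, leaf size, and node layout are assumptions chosen to keep the
example small and do not match the real radix blocks. The important
part is that running out of bits in a child is not an error for the
parent; it just moves on to its next ref.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define FANOUT		4
#define LEAF_BITS	8

struct node {
	uint8_t bits;			/* leaf: bitmap of LEAF_BITS bits */
	struct node *refs[FANOUT];	/* parent: children, NULL if empty */
};

/* Number of bits covered by a subtree of the given height (0 = leaf). */
uint64_t subtree_span(int height)
{
	uint64_t span = LEAF_BITS;

	while (height-- > 0)
		span *= FANOUT;
	return span;
}

/*
 * Find the first set bit at or after 'start' within this subtree.
 * Returns 0 and sets *found, or -ENOENT only once every remaining
 * ref in this subtree has been tried.
 */
int next_bit(const struct node *n, int height, uint64_t start, uint64_t *found)
{
	uint64_t child_span;
	uint64_t first;

	assert(n != NULL);

	if (height == 0) {
		for (uint64_t b = start; b < LEAF_BITS; b++) {
			if (n->bits & (1u << b)) {
				*found = b;
				return 0;
			}
		}
		return -ENOENT;
	}

	child_span = subtree_span(height - 1);
	first = start / child_span;

	for (uint64_t i = first; i < FANOUT; i++) {
		/* only the first child starts part way in */
		uint64_t sub = (i == first) ? start % child_span : 0;

		if (n->refs[i] != NULL &&
		    next_bit(n->refs[i], height - 1, sub, found) == 0) {
			*found += i * child_span;
			return 0;
		}
		/* child exhausted or empty: keep going with the next ref */
	}

	return -ENOENT;
}
```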
This led to an assertion being tripped in _radix_merge. Getting an
early -ENOENT caused it to start searching from 0 again. When it's
iterating over a read-only input it could find the same leaf and try to
clear source bits that were already cleared.
Signed-off-by: Zach Brown <zab@versity.com>
Add a bit more detail to the radix merge trace. It was missing the
input block and leaf bit. Also use abbreviations of the fields in the
trace output so that it's slightly less enormous.
Signed-off-by: Zach Brown <zab@versity.com>
The seq portion of radix block references is intended to differentiate
versions of a given block location over time. The current method of
incrementing the existing value as the block is dirtied is risky. It
means that every lineage of a block has the same sequence number
progression. Different trees referencing the same block over time could
get confused. It's more robust to have large random numbers. The
collision window is then evenly distributed over the 64bit space rather
than being bunched up all in the initial seq values.
Signed-off-by: Zach Brown <zab@versity.com>
When we're merging bits that are set in a read-only input tree then we
can't try to merge more bits than exist in the input tree. That'll
cause us to loop around and double-free bits.
Signed-off-by: Zach Brown <zab@versity.com>
We were using bitmap_xor() to set and clear blocks of allocator bits at
a time. bitmap_xor() is a ternary function with two const input
pointers and we were providing the changing destination as a const input
pointer. That doesn't seem wise.
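
One alias-free alternative is to set and clear ranges in place with
dedicated helpers. The sketch below is a userspace illustration with
hypothetical names and is not necessarily what this change does; it only
shows that the destination never needs to be passed as a const input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Set 'nr' bits starting at bit 'start' in a bitmap of 64bit words. */
void bits_set_range(uint64_t *map, unsigned start, unsigned nr)
{
	assert(map != NULL);

	for (unsigned i = start; i < start + nr; i++)
		map[i / 64] |= UINT64_C(1) << (i % 64);
}

/* Clear 'nr' bits starting at bit 'start' in a bitmap of 64bit words. */
void bits_clear_range(uint64_t *map, unsigned start, unsigned nr)
{
	assert(map != NULL);

	for (unsigned i = start; i < start + nr; i++)
		map[i / 64] &= ~(UINT64_C(1) << (i % 64));
}
```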
Signed-off-by: Zach Brown <zab@versity.com>
An incorrect warning condition was added as fallocate was implemented.
It tried to warn against trying to read from the staging ioctl. But the
staging boolean is set on the inode when the staging ioctl has the inode
mutex. It protects against writes, but page reading doesn't use the
mutex. It's perfectly acceptable for reads to be attempted while the
staging ioctl is busy. We rely on this so that a large read can consume
staged data as it is being written.
The warning caused reads to fail while the stager ioctl was working.
Typically this would hit read-ahead and just force sync reads. But it
could hit sync reads and cause EIO.
Signed-off-by: Zach Brown <zab@versity.com>
Add specific error messages for failures that can happen as the server
commits log trees from the client. These are severe enough that we'd
like to know about them.
Signed-off-by: Zach Brown <zab@versity.com>
Back in ancient LSM times these functions to read and write the super
block reused the bio functions that LSM segment IO used. Each IO would
be performed with privately allocated pages and bios.
When we got rid of the LSM code we got rid of the bio functions. It was
quick and easy to transition super read/write to use buffer_heads. This
introduced sharing of the super's buffer_head between readers and
writers. First we saw concurrent readers being confused by the uptodate
bit and added a bunch of complexity to coordinate use of the uptodate
bit.
Now we're seeing the writer copy its super for writing into the buffer
that readers are using, causing crc failures on read. Let's not use
buffer_heads anymore (always good advice).
We added quick block functions to read and write small blocks with
private pages and bios. Use those here to read and write the super so
that readers and writers operate on their own buffers again.
Signed-off-by: Zach Brown <zab@versity.com>
Add two quick functions which perform IO on small fixed size 4K blocks
to or from the caller's buffer with privately allocated pages and bios.
Callers have no interaction with each other. This matches the behaviour
expected by callers of scoutfs_read_super and _write_super.
Signed-off-by: Zach Brown <zab@versity.com>
We miscalculated the length of extents to create when initializing
offline extents for setattr_more. We were clamping the extent length in
each packed extent item by the full size of the offline extent, ignoring
the iblock position that we were starting from.
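
The corrected clamp can be sketched as follows. The function name and
arguments are hypothetical; the point is that the remaining length is
measured from the current iblock rather than from the start of the
offline extent.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Length of the extent to create in the item that starts at 'iblock',
 * for an offline extent covering [ext_start, ext_start + ext_len).
 * 'max_item_blocks' is the most a single packed item may cover.
 */
uint64_t item_extent_len(uint64_t iblock, uint64_t ext_start,
			 uint64_t ext_len, uint64_t max_item_blocks)
{
	uint64_t remaining;

	assert(iblock >= ext_start && iblock - ext_start < ext_len);

	/* clamp by what is left after iblock, not by the full ext_len */
	remaining = ext_len - (iblock - ext_start);

	return remaining < max_item_blocks ? remaining : max_item_blocks;
}
```

For a 10 block extent with items that cover at most 4 blocks, the items
at iblock 0, 4, and 8 get lengths 4, 4, and 2. Clamping by the full
extent size would have given the last item a length of 4 and run past
the end of the extent.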
Signed-off-by: Zach Brown <zab@versity.com>
With the introduction of packed extent items the setattr_more ioctl had
to be careful not to try and dirty all the extent items in one
transaction. But it pulled the extent creation call up too high and was
doing it before some argument checks that were done after the inode was
refreshed by acquiring its lock. This moves the extent creation to be
done after the args are verified for the inode.
Signed-off-by: Zach Brown <zab@versity.com>
Don't return -ENOENT from fiemap on a file with no extents. The
operation is supposed to succeed with no extents.
Signed-off-by: Zach Brown <zab@versity.com>
The setattr_more ioctl has its own helper for creating uninitialized
extents when we know that there can't be any other existing extents. We
don't have to worry about freeing blocks they might have referenced.
This helper forgot to actually store the modified extents back into
packed extent items after setting extents offline.
Signed-off-by: Zach Brown <zab@versity.com>
Add a bit more tracing to stage, release, and unwritten extent
conversion so we can get a bit more visibility into the threads staging
and releasing.
Signed-off-by: Zach Brown <zab@versity.com>
We need to invalidate old stale blocks we encounter when reading old
bloom block references written by other nodes. This is the same
consistency mechanism used by btree blocks.
Signed-off-by: Zach Brown <zab@versity.com>
A quick update of the comment describing the forest's use of the bloom
filter block. It used to be a tree of bloom filter items.
Signed-off-by: Zach Brown <zab@versity.com>
Remove a bunch of unused counters which have accumulated over time as
we've worked on the code and forgotten to remove counters.
Signed-off-by: Zach Brown <zab@versity.com>
Forest item iteration allocates iterator positions for each tree root
it reads from. The postorder destruction of the iterator nodes wasn't
quite right because we were balancing the nodes as they were freed.
That can change parent/child relationships and cause postorder iteration
to skip some nodes, leaking memory. It would have worked if we just
freed the nodes without using rb_erase to balance.
The fix is to actually iterate over the rbnodes while using the destroy
helper which rebalances as it frees.
Signed-off-by: Zach Brown <zab@versity.com>