Prefer named to anonymous enums. This helps readability a little.
Use enum as param type if possible (a couple spots).
Remove unused enum in lock_server.c.
Define enum spbm_flags using shift notation for consistency.
Rename get_file_block()'s "gfb" parameter to "flags" for consistency.
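For reference, the shift-notation style looks roughly like this; the
flag names are placeholders, only the 1 << n form is the point:

  enum spbm_flags {
          SPBM_FLAG_ONE   = 1 << 0,
          SPBM_FLAG_TWO   = 1 << 1,
  };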
Signed-off-by: Andy Grover <agrover@versity.com>
Update the README.md introduction to scoutfs to mention the need for and
use of metadata and data block devices.
Signed-off-by: Zach Brown <zab@versity.com>
Require a second path to metadata bdev be given via mount option.
Verify meta sb matches sb also written to data sb. Change code as needed
in super.c to allow both to be read. Remove check for overlapping
meta and data blknos, since they are now on entirely separate bdevs.
Use meta_bdev for superblock, quorum, and block.c reads and writes.
Signed-off-by: Andy Grover <agrover@versity.com>
Write locks are given an increasing version number as they're granted,
which makes its way into items in the log btrees and is used to find the
most recent version of an item.
The initialization of the lock server's next write_version for granted
locks dates back to the initial prototype of the forest of log btrees.
It is only initialized to zero as the module is loaded. This means that
reloading the module, perhaps by rebooting, resets all the item versions
to 0 and can lead to newly written items being ignored in favour of
older existing items with greater versions from a previous mount.
To fix this we initialize the lock server's write_version to the
greatest of all the item versions in the log btrees. We add a field to
the log_trees struct which records the greatest version, maintained as
we write out items in transactions. These fields are read by the server
as it starts.
Then lock recovery needs to include the write_version so that the
lock_server can be sure to set the next write_version past the greatest
version in the currently granted locks.
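Roughly, with made-up struct and field names, the server's
initialization becomes something like:

  /* seed the next write_version from the greatest version recorded in
   * the log_trees items and in recovered locks */
  static u64 initial_write_version(struct lock_server *server)
  {
          struct log_trees_info *lti;
          u64 vers = server->recovered_write_version;

          list_for_each_entry(lti, &server->log_trees_list, head)
                  vers = max(vers, lti->max_item_vers);

          return vers + 1;
  }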
Signed-off-by: Zach Brown <zab@versity.com>
The log_trees structs store the data that is used by client commits.
The primary struct is communicated over the wire so it includes the rid
and nr that identify the log. The _val struct was stored in btree item
values and was missing the rid and nr because those were stored in the
item's key.
It's madness to duplicate the entire struct just to shave off those two
fields. We can remove the _val struct and store the main struct in item
values, including the rid and nr.
Signed-off-by: Zach Brown <zab@versity.com>
Audit code for structs allocated on the stack without initialization,
or allocated with kmalloc() instead of kzalloc(). A sketch of the
zeroing pattern follows the list.
- avl.c: zero padding in avl_node on insert.
- btree.c: Verify item padding is zero, or WARN_ONCE.
- inode.c: scoutfs_inode contains scoutfs_timespecs, which have padding.
- net.c: zero pad in net header.
- net.h: scoutfs_net_addr has padding, zero it in scoutfs_addr_from_sin().
- xattr.c: scoutfs_xattr has padding, zero it.
- forest.c: item_root in forest_next_hint() appears to either be
assigned-to or unused, so no need to zero it.
- key.h: Ensure padding is zeroed in scoutfs_key_set_{zeros,ones}.
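The general pattern for the fixes above, shown with a hypothetical
struct:

  /* hypothetical struct with a compiler padding hole after 'type' */
  struct example {
          __u8 type;
          __le64 value;
  };

  static int fill_examples(struct example **heap_ex, struct example *out)
  {
          struct example stk;

          /* heap: ask the allocator for zeroed memory */
          *heap_ex = kzalloc(sizeof(**heap_ex), GFP_NOFS);
          if (!*heap_ex)
                  return -ENOMEM;

          /* stack: clear the struct, padding included, before use */
          memset(&stk, 0, sizeof(stk));
          stk.type = 1;
          stk.value = cpu_to_le64(2);
          *out = stk;

          return 0;
  }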
Signed-off-by: Andy Grover <agrover@versity.com>
Instead, explicitly add padding fields, and adjust member ordering to
eliminate compiler-added padding between members and at the end of the
struct (if possible: some structs end in a u8[0] array).
This should prevent unaligned accesses. Not a big deal on x86_64, but
other archs like aarch64 really want this.
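As a sketch of the idea, with made-up struct and field names, members
are ordered from largest to smallest and any remaining hole is named
explicitly so writers can zero it:

  struct example_persistent {
          __le64 big;
          __le32 medium;
          __u8   small;
          __u8   __pad[3];        /* explicit, zeroed by writers */
  };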
Signed-off-by: Andy Grover <agrover@versity.com>
This will ensure that structs, which are internally 8-byte aligned,
remain so when in the item cache.
16-byte alignment doesn't seem necessary, so just do 8.
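A sketch of the rounding, with a made-up helper; ALIGN() is the
kernel's round-up macro:

  /* keep each cached value 8-byte aligned so 64-bit fields in the
   * structs it holds stay naturally aligned */
  static unsigned int place_value(struct page *pg, unsigned int off,
                                  void *val, unsigned int val_len)
  {
          off = ALIGN(off, 8);
          memcpy(page_address(pg) + off, val, val_len);
          return off + val_len;
  }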
Signed-off-by: Andy Grover <agrover@versity.com>
We were using a trailing owner offset to iterate over btree item values
from the back of the block towards the front. We did this to reclaim
fragmented free space in a block to satisfy an allocation instead of
having to split the block, which is expensive mostly because it has to
allocate and free metadata blocks.
In the before times, we used to compact items by sorting items by their
offset, moving them, and then sorting them by their keys again. The
sorting by keys was expensive so we added these owner offsets to be able
to compact without sorting.
But the complexity of maintaining the owner metadata is not worth it.
We can avoid the expensive sorting by keys by allocating a temporary
array of item offsets and sorting just that array by value offset.
That's nice and quick; it was the key comparisons that were expensive.
Then we
can remove the owner offset entirely, as well as the block header final
free region that compaction needed.
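Roughly, with made-up names, the compaction path becomes something like
this (sort() is the kernel's lib/sort.c helper):

  static int cmp_off_desc(const void *a, const void *b)
  {
          const u16 *x = a, *y = b;

          /* descending by value offset so values pack from the back */
          return (int)*y - (int)*x;
  }

  static void compact_values(u16 *val_offs, int nr_items)
  {
          sort(val_offs, nr_items, sizeof(val_offs[0]), cmp_off_desc, NULL);

          /* walk val_offs, memmove() each value to its new packed
           * offset at the back of the block, and update the owning
           * item's value offset as we go */
  }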
And we also don't compact as often in the modern era because we do the
bulk of our work in the item cache instead of in the btree, and we've
changed the split/merge/compaction heuristics to avoid constantly
splitting/merging/compacting when an item population happens to hover
right around a shared threshold.
Signed-off-by: Zach Brown <zab@versity.com>
Remove the old superblock fields which were used to track free blocks
found in the radix allocators. We now walk all the allocators when we
need to know the free totals, rather than trying to keep fields in sync.
Signed-off-by: Zach Brown <zab@versity.com>
Before the introduction of the AVL tree to sort btree items, the items
were sorted by sorting a small packed array of offsets. The final
offset in that array pointed to the item in the block with the greatest
key.
With the move to sorting items in an AVL tree by nodes embedded in item
structs, we now don't have the array of offsets and instead have a dense
array of items. Creation and deletion of items always works with the
final item in the array.
last_item() used to return the item with the greatest key by returning
the item pointed to by the final entry in the sorted offset array.
After the change it returned the final entry in the item array, which
was what creation and deletion wanted, but that was no longer the item
with the greatest key.
But splitting and joining still used last_item() to find the item in
the block with the greatest key when updating references to blocks in
parents. Since the introduction of the AVL tree, splitting and joining
have been corrupting the tree by setting parent block reference keys
from whatever item happened to be at the end of the array, not the item
with the greatest key.
The extent code recently pushed hard enough to hit this by working with
relatively random extent items in the core allocation btrees.
Eventually the parent block reference keys got out of sync and we'd fail
to find items by descending into the wrong children when looking for
them. Extent deletion hit this during allocation, returned -ENOENT, and
the allocator turned that into -ENOSPC.
With this fixed we can repeatedly create and delete millions of files
with heavily fragmented extents on a tiny metadata device. Eventually it
actually runs out of space instead of spuriously returning ENOSPC in a
matter of minutes.
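The distinction being fixed, as a sketch with made-up names and
pointer-based AVL nodes for clarity (the real nodes store offsets):

  struct avl_node {
          struct avl_node *left;
          struct avl_node *right;
  };

  struct block_item {
          struct avl_node avl;
          /* key, value offset, ... */
  };

  /* the greatest key lives at the rightmost AVL node, not at the end
   * of the item array */
  static struct block_item *greatest_key_item(struct avl_node *node)
  {
          if (!node)
                  return NULL;
          while (node->right)
                  node = node->right;
          return container_of(node, struct block_item, avl);
  }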
Signed-off-by: Zach Brown <zab@versity.com>
With the introduction of incremental srch file compaction we added some
fields to the srch_compact struct to record the position of compaction
in each file. This increased the size of the struct past the limit the
btree places on the size of item values.
We decrease the number of files per compaction from 8 to 4 to cut the
size of the srch_compact struct in half. This compacts twice as often,
but still relatively infrequently, and it uses half the space for srch
files waiting to hit the compaction threshold.
Signed-off-by: Zach Brown <zab@versity.com>
Previously the srch compaction work would output the entire compacted
file and delete the input files in one atomic commit. The server would
send the input files and an allocator to the client, and the client
would send back an output file and an allocator that included the
deletion of the input files. The server would merge in the allocator
and replace the input file items with the output file item.
Doing it this way required giving an enormous allocation pool to the
client in a radix, which would deal with recursive operations
(allocating from and freeing to the radix that is being modified). We
no longer have the radix allocator, and we use single block avail/free
lists instead of recursively modifying the btrees with free extent
items. The compaction RPC needs to work with a finite amount of
allocator resources that can be stored in an alloc list block.
The compaction work now does a fixed amount of work per iteration, and
a single compaction operation spans multiple iterations.
A single compaction struct is now sent between the client and server in
the get_compact and commit_compact messages. The client records any
partial progress in the struct. The server writes that position into
PENDING items. It first searches for pending items to give to clients
before searching for files to start a new compaction operation.
The compact struct has flags to indicate whether the output file is
being written or the input files are being deleted. The server manages
the flags and sets the input file deletion flag only once the result of
the compaction has been reflected in the btree items which record srch
files.
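A rough sketch of the shape of the shared struct, with hypothetical
field and flag names; the point is the per-file progress and the phase
flags the server manages:

  struct srch_compact {
          __le64 flags;           /* which phase the server is driving */
          struct {
                  __le64 blk;     /* partial progress within this input */
                  __le64 pos;
          } in[4];                /* fixed number of input srch files */
          /* output file reference, allocator list heads, ... */
  };

  #define SRCH_COMPACT_FLAG_WRITE  (1ULL << 0)  /* writing the output file */
  #define SRCH_COMPACT_FLAG_DELETE (1ULL << 1)  /* deleting the input files */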
We added the progress fields to the compaction struct, making it even
bigger than it already was, so we take the time to allocate them rather
than declaring them on the stack.
It's worth mentioning that each operation now taking a reasonably
bounded amount of time will make it feasible to decide that it has
failed and needs to be fenced.
Signed-off-by: Zach Brown <zab@versity.com>
The total_{meta,data}_blocks scoutfs_super_block fields initialized by
mkfs aren't visible to userspace anywhere. Add them to statfs_more so
that tools can get the totals (and use them for df, in this particular
case).
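From userspace this looks roughly like the following; the ioctl and
struct field names here are guesses, only the total_{meta,data}_blocks
values come from the super block:

  struct scoutfs_ioctl_statfs_more sfm = {0};

  if (ioctl(fd, SCOUTFS_IOC_STATFS_MORE, &sfm) == 0)
          printf("%llu meta and %llu data blocks total\n",
                 (unsigned long long)sfm.total_meta_blocks,
                 (unsigned long long)sfm.total_data_blocks);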
Signed-off-by: Zach Brown <zab@versity.com>
Remove the statfs RPC from the client and server now that we're using
allocator iteration to calculate free blocks.
Signed-off-by: Zach Brown <zab@versity.com>
Use alloc_foreach to count the free blocks in all the allocators instead
of sending an RPC to the server. We cache the results so that constant
df calls don't generate a constant stream of IO.
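A sketch of the caching, with made-up names; the idea is just to reuse
the last walk's totals until they age out:

  static int cached_free_blocks(struct super_block *sb, u64 *meta, u64 *data)
  {
          struct sbi *sbi = sb->s_fs_info;
          int ret = 0;

          if (time_after(jiffies, sbi->free_expiry)) {
                  /* walk the allocators with alloc_foreach */
                  ret = count_free_blocks(sb, &sbi->cached_meta,
                                          &sbi->cached_data);
                  if (ret == 0)
                          sbi->free_expiry = jiffies + 10 * HZ; /* arbitrary */
          }
          if (ret == 0) {
                  *meta = sbi->cached_meta;
                  *data = sbi->cached_data;
          }
          return ret;
  }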
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl which copies details of each persistent allocator to
userspace. This will be used by a scoutfs command to give information
about the allocators in the system.
Signed-off-by: Zach Brown <zab@versity.com>
Add an alloc call which reads all the persistent allocators and calls a
callback for each. This is going to be used to calculate free blocks
in clients for df, and in an ioctl to give a more detailed view of
allocators.
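The shape of the call, with a guessed signature; callers provide a
callback that sees each persistent allocator's identity and totals:

  /* called once per persistent allocator */
  typedef int (*alloc_foreach_cb_t)(struct super_block *sb, void *arg,
                                    int type, u64 rid, u64 free_blocks);

  /* df support just sums the free blocks it is shown */
  static int sum_free(struct super_block *sb, void *arg, int type,
                      u64 rid, u64 free_blocks)
  {
          u64 *total = arg;

          *total += free_blocks;
          return 0;
  }

  /* err = scoutfs_alloc_foreach(sb, sum_free, &total); */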
Signed-off-by: Zach Brown <zab@versity.com>
The algorithm for choosing the split key assumed that there were
multiple items in the page. That wasn't always true and it could result
in choosing the first item as the split key, which could end up
decrementing the left page's end key before its start key.
We've since added compaction to the paths that split pages so we now
guarantee that we have at least two items in the page being split. With
that we can be sure to use the second item's key and ensure that we're
never creating invalid keys for the pages created by the split.
Signed-off-by: Zach Brown <zab@versity.com>
The tests for the various page range intersections were out of order.
The edge overlap case could trigger before the bisection case and we'd
fail to remove the initial items in the page. That would leave items
before the start key which would later be used as a midpoint for a
split, causing all kinds of chaos.
Rework the cases so that the overlap cases are last. The unique bisect
case will be caught before we can mistake it for an edge overlap case.
We also minimize the number of comparisons we calculate by storing the
handful that all the cases need.
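A sketch of the reordered tests, with guessed names, for a page caching
[start_key, end_key] and a removed range [start, end] that are known to
intersect:

  int ss = scoutfs_key_compare(start, &pg->start_key);
  int ee = scoutfs_key_compare(end, &pg->end_key);

  if (ss <= 0 && ee >= 0) {
          /* covers the whole page: every item goes */
  } else if (ss > 0 && ee < 0) {
          /* strictly inside: the unique bisection case, caught before
           * it can be mistaken for an edge overlap */
  } else if (ss <= 0) {
          /* overlaps the front edge: remove the leading items */
  } else {
          /* overlaps the back edge: remove the trailing items */
  }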
Signed-off-by: Zach Brown <zab@versity.com>
The first pass of the item cache didn't try to reclaim freed space at
all. It would leave behind very sparse pages, the oldest of which
would be reclaimed by memory pressure.
While this worked, it created much more stress on the system than is
necessary. Splitting a page with one key also makes it hard to
calculate the boundaries of the split pages, given that the start and
end keys could be the single item.
This adds a header field which tracks the free space in item cache
pages. Free space is created before the alloc offset by removing items
from the rbtree, and also by shrinking item values when updating or
deleting items.
If we try to split a page with sufficient free space to insert the
largest possible item then we compact the page instead of splitting it.
We copy the items into the front of an unused page and swap the pages.
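A sketch of the decision at the split point, with made-up names and a
hypothetical largest-item constant:

  /* reuse the page when compaction would make room for the largest
   * possible item; only split when it truly can't fit */
  if (pg->free_bytes >= MAX_ITEM_BYTES)
          compact_page(cinf, pg, spare);
  else
          split_page(cinf, pg, spare);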
Signed-off-by: Zach Brown <zab@versity.com>
Add a quick function that walks the rbtree and makes sure it doesn't see
any obvious key errors. This is far too expensive to use regularly but
it's handy to have around and add calls to when debugging.
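A minimal sketch of the check, with made-up item and field names:

  /* walk items in order and warn if keys ever go backwards or escape
   * the page's key range */
  static void verify_page_items(struct cached_page *pg)
  {
          struct cached_item *item, *prev = NULL;
          struct rb_node *node;

          for (node = rb_first(&pg->item_root); node; node = rb_next(node)) {
                  item = rb_entry(node, struct cached_item, node);

                  WARN_ON_ONCE(scoutfs_key_compare(&item->key,
                                                   &pg->start_key) < 0);
                  WARN_ON_ONCE(scoutfs_key_compare(&item->key,
                                                   &pg->end_key) > 0);
                  WARN_ON_ONCE(prev && scoutfs_key_compare(&prev->key,
                                                           &item->key) >= 0);
                  prev = item;
          }
  }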
Signed-off-by: Zach Brown <zab@versity.com>
The xattr item stream is constructed from a large contiguous region
that contains the struct header, the key, and the value. The value
can be larger than a page so kmalloc is likely to fail as the system
gets fragmented.
Our recent move to the item cache added a significant source of page
allocation churn which moved the system towards fragmentation much more
quickly and was causing high-order allocation failures in testing.
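The usual remedy for this kind of failure, and presumably the direction
here, is to fall back to vmalloc for large buffers; a sketch with
made-up length names:

  /* kvmalloc() tries kmalloc and falls back to vmalloc when high-order
   * pages are scarce (a real fs path also has to mind GFP_NOFS rules) */
  buf = kvmalloc(sizeof(struct scoutfs_xattr) + name_len + val_len,
                 GFP_KERNEL);
  if (!buf)
          return -ENOMEM;
  /* ... build and use the xattr item stream ... */
  kvfree(buf);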
Signed-off-by: Zach Brown <zab@versity.com>
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly. That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.
By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.
Most of this change is churn from changing allocator function and struct
names.
File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity. All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions. This now means
that fallocate and especially restoring offline extents can use larger
extents. Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing. The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks. This resulted in a lot of bugs. Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction. We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.
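The double buffering amounts to something like this hypothetical
sketch:

  struct server_alloc {
          /* allocations are served from the set made stable by the
           * previous commit; this commit's allocations and frees
           * modify the other set, and 'active' flips as each commit
           * lands */
          struct alloc_root avail[2];
          struct alloc_root freed[2];
          int active;
  };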
The server now only moves free extents into client allocators when they
fall below a low threshold. This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add an allocator which uses btree items to store extents. Both the
client and server will use this for btree blocks, the client will use it
for srch blocks and data extents, and the server will move extents
between the core fs allocator btree roots and the clients' roots.
Signed-off-by: Zach Brown <zab@versity.com>
Add infrastructure for working with extents. Callers provide callbacks
which operate on their extent storage while this code performs the
fiddly splitting and merging of extents. This layer doesn't have any
persistent structures itself; it only operates on native structs in
memory.
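A sketch of the callback interface, with made-up names; callers supply
storage operations and this layer handles splitting and merging of
neighbouring extents:

  struct ext_extent {
          u64 start;
          u64 len;
          u64 map;        /* mapped location, if any */
          u8 flags;
  };

  struct ext_ops {
          int (*next)(struct super_block *sb, void *arg, u64 start,
                      u64 len, struct ext_extent *found);
          int (*insert)(struct super_block *sb, void *arg, u64 start,
                        u64 len, u64 map, u8 flags);
          int (*remove)(struct super_block *sb, void *arg, u64 start,
                        u64 len, u64 map, u8 flags);
  };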
Signed-off-by: Zach Brown <zab@versity.com>
The percpu_counter library merges the per-cpu counters with a shared
count when the per-cpu counter gets larger than a certain value. The
default is very small, so we often end up taking a shared lock to update
the count. Use a larger batch so that we take the lock less often.
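A sketch with an illustrative batch size; depending on kernel version
the call is percpu_counter_add_batch() or __percpu_counter_add():

  #define PCPU_BATCH      (1 << 10)

  /* the per-cpu delta only folds into the shared count, under the
   * counter's lock, once it exceeds the batch */
  percpu_counter_add_batch(&counters->items, 1, PCPU_BATCH);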
Signed-off-by: Zach Brown <zab@versity.com>
Now that the item cache is bearing the load of high frequency item
calls, we can remove all the item granular work that the forest was
trying to do. The item cache amortizes the cost of the forest so its
remaining methods can go straight to the btrees and don't need
complicated state to reduce the overhead of item calls.
Signed-off-by: Zach Brown <zab@versity.com>
Use the new item cache for all the item work in the fs instead of
calling into the forest of btrees. Most of this is mechanical
conversion from the _forest calls to the _item calls. The item cache
no longer supports the kvec argument for describing values so all the
callers pass in the value pointer and length directly.
The item cache doesn't support saving items as they're deleted and later
restoring them from an error unwinding path. There were only two users
of this. Directory entries can easily guarantee that deletion won't
fail by dirtying the items first in the item cache. Xattr updates were
a little trickier. They can combine dirtying, creating, updating, and
deleting to atomically switch between items that describe different
versions of a multi-item value. This also fixed a bug in the srch
xattrs where replacing an xattr would create a new id for the xattr and
leave existing srch items referencing a now deleted id. Replacing now
reuses the old id.
And finally we add back in the locking and transaction item cache
integration.
Signed-off-by: Zach Brown <zab@versity.com>
Add an item cache between fs callers and the forest of btrees. Calling
out to the btrees for every item operation was far too expensive. This
gives us a flexible in-memory structure for working with items that
isn't bound by the constraints of persistent block IO. We can stream
large groups of items to and from the btrees relatively rarely and then
use efficient kernel memory structures for more frequent item
operations.
This adds the infrastructure, nothing is calling it yet.
Signed-off-by: Zach Brown <zab@versity.com>
Add forest calls that the item cache will use. It needs to read all the
items in the leaf blocks of the forest btrees which could contain the key,
write dirty items to the log btree, and dirty bits in the bloom block as
items are dirtied.
Signed-off-by: Zach Brown <zab@versity.com>
Add btree calls to call a callback for all items in a leaf, and to
insert a list of items into their leaf blocks. These will be used by
the item cache to populate the cache and to write dirty items into dirty
btree blocks.
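Guessed shapes of the two calls:

  /* call cb for every item in the leaf block that contains key */
  typedef int (*btree_item_cb_t)(struct super_block *sb, void *arg,
                                 struct scoutfs_key *key, void *val,
                                 int val_len);
  int scoutfs_btree_read_items(struct super_block *sb,
                               struct scoutfs_btree_root *root,
                               struct scoutfs_key *key,
                               btree_item_cb_t cb, void *arg);

  /* insert a sorted list of dirty items into their dirty leaf blocks */
  int scoutfs_btree_insert_list(struct super_block *sb,
                                struct scoutfs_btree_root *root,
                                struct list_head *items);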
Signed-off-by: Zach Brown <zab@versity.com>
The current btree walk recorded the start and end of child subtrees as
it walked, and it could give the caller the next key to iterate towards
after the block it returned. Future methods want to get at the key
bounds of child subtrees, so we add a key range struct that all walk
callers provide and which the walk fills with all the interesting keys
it calculates.
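A sketch of the range struct, with hypothetical field names:

  /* filled by the walk for every caller */
  struct btree_walk_key_range {
          struct scoutfs_key start;       /* smallest key the block covers */
          struct scoutfs_key end;         /* largest key the block covers */
          struct scoutfs_key iter_next;   /* next key to iterate towards */
  };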
Signed-off-by: Zach Brown <zab@versity.com>
Btree traversal doesn't split a block if it has room for the caller's
item. Extract this test into a function so that an upcoming btree call
can test that each of multiple insertions into a leaf will fit.
Signed-off-by: Zach Brown <zab@versity.com>
Remove the last remnants of the indexed xattrs which used fs items.
This makes the significant change of renumbering the key zones so I
wanted it in its own commit.
Signed-off-by: Zach Brown <zab@versity.com>
In a merge where the input and source trees are the same, the input
block can be an initial pre-cow version of the dirty source block.
Dirtying blocks in the change will clear allocations in the dirty source
block but they will remain in the pre-cow input block. The merge can
then set these blocks in the dst, even though they were also used by
allocation, because they're still set in the pre-cow input block.
This fix is clumsy, but minimal and specific to this problem. A more
thorough fix is being worked on which introduces additional staging
allocator trees and should stop calls from modifying the currently
active avail or free trees.
Signed-off-by: Zach Brown <zab@versity.com>
Lock invalidation has to make sure that changes are visible to future
readers. It was syncing if the current transaction was dirty. This was
never optimal, but it wasn't catastrophic when concurrent invalidation
work could all block on one sync in progress.
With the move to a single invalidation worker serially invalidating
locks it became unacceptable. Invalidation happening in the presence of
writers would constantly sync the current transaction while very old
unused write locks were invalidated. Their changes had long since been
committed in previous transactions.
We add a lock field to remember the transaction sequence which could
have been dirtied under the lock. If that transaction has already been
committed by the time we invalidate the lock, it doesn't have to sync.
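A sketch of the check, with made-up names:

  /* while dirtying items under the lock */
  lock->dirty_trans_seq = scoutfs_trans_current_seq(sb);

  /* ... later, in the invalidation worker ... */
  if (lock->dirty_trans_seq > scoutfs_trans_committed_seq(sb))
          ret = scoutfs_trans_sync(sb);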
Signed-off-by: Zach Brown <zab@versity.com>