Commit Graph

907 Commits

Andy Grover
cf278f5fa0 scoutfs: Tidy some enum usage
Prefer named to anonymous enums. This helps readability a little.

Use enum as param type if possible (a couple spots).

Remove unused enum in lock_server.c.

Define enum spbm_flags using shift notation for consistency.

Rename get_file_block()'s "gfb" parameter to "flags" for consistency.

Signed-off-by: Andy Grover <agrover@versity.com>
2020-11-30 13:35:44 -08:00
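
A note on the shift notation mentioned above: this is the usual kernel
idiom for flag enums. A minimal sketch with made-up names (not the
actual scoutfs spbm_flags values):

    #include <linux/types.h>

    /* illustrative flag enum in shift notation; names are hypothetical */
    enum spbm_flags {
            SPBM_FLAG_A = 1 << 0,
            SPBM_FLAG_B = 1 << 1,
            SPBM_FLAG_C = 1 << 2,
    };

    /* a named enum can also serve directly as a parameter type */
    static inline bool spbm_flag_is_set(unsigned int flags, enum spbm_flags flag)
    {
            return (flags & flag) != 0;
    }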
Andy Grover
73333af364 scoutfs: Use enum for lock mode
Signed-off-by: Andy Grover <agrover@versity.com>
2020-11-30 13:35:44 -08:00
Zach Brown
2f3d1c395e scoutfs: show metadev_path in sysfs/mount_options
We forgot to add metadev_path to the options that are found in the
mount_options sysfs directory.

Signed-off-by: Zach Brown <zab@versity.com>
2020-11-24 14:02:02 -08:00
Zach Brown
222e5f1b9d scoutfs: convert endian in SCOUTFS_IS_META_BDEV
We missed that flags is le64.

Signed-off-by: Zach Brown <zab@versity.com>
2020-11-24 14:02:02 -08:00
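
For context, the general shape of such a fix: an on-disk __le64 flags
field has to be converted with le64_to_cpu() before a bit is tested.
The struct and flag names below are illustrative, not the scoutfs
definitions:

    #include <linux/types.h>
    #include <asm/byteorder.h>

    struct example_super {
            __le64 flags;           /* little-endian on disk */
    };

    #define EXAMPLE_FLAG_IS_META_BDEV       (1ULL << 0)

    /* convert from le64 to cpu order before testing the bit; testing the
     * raw bytes happens to work on little-endian machines, which is how
     * a missing conversion can go unnoticed */
    static inline bool example_is_meta_bdev(struct example_super *super)
    {
            return (le64_to_cpu(super->flags) & EXAMPLE_FLAG_IS_META_BDEV) != 0;
    }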
Zach Brown
08eb75c508 scoutfs: update README.md for metadev_path
Update the README.md introduction to scoutfs to mention the need for and
use of metadata and data block devices.

Signed-off-by: Zach Brown <zab@versity.com>
2020-11-19 11:41:20 -08:00
Andy Grover
9f151fde92 scoutfs: Use separate block devices for metadata and data
Require that a second path to the metadata bdev be given via a mount
option.

Verify that the superblock on the meta device matches the superblock
also written to the data device. Change code as needed in super.c to
allow both to be read. Remove the check for overlapping meta and data
blknos, since they are now on entirely separate bdevs.

Use meta_bdev for superblock, quorum, and block.c reads and writes.

Signed-off-by: Andy Grover <agrover@versity.com>
2020-11-19 11:41:20 -08:00
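
A rough sketch of opening a second device named by a mount option,
using the stock blkdev_get_by_path()/blkdev_put() helpers. This is only
an outline of the approach; the function and option handling here are
assumptions, not the scoutfs code:

    #include <linux/fs.h>
    #include <linux/blkdev.h>
    #include <linux/err.h>

    /* open the metadata device named by the metadev_path mount option,
     * holding it exclusively on behalf of this super block */
    static struct block_device *example_open_meta_bdev(struct super_block *sb,
                                                       const char *metadev_path)
    {
            struct block_device *meta_bdev;

            meta_bdev = blkdev_get_by_path(metadev_path,
                                           FMODE_READ | FMODE_WRITE | FMODE_EXCL,
                                           sb);
            if (IS_ERR(meta_bdev))
                    return meta_bdev;

            /* the commit also verifies that the super block read from the
             * meta device matches the one written to the data device */
            return meta_bdev;
    }

    /* released later with blkdev_put(meta_bdev,
     * FMODE_READ | FMODE_WRITE | FMODE_EXCL) */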
Zach Brown
ff532eba75 scoutfs: recover max lock write_version
Write locks are given an increasing version number as they're granted
which makes its way into items in the log btrees and is used to find the
most recent version of an item.

The initialization of the lock server's next write_version for granted
locks dates back to the initial prototype of the forest of log btrees.
It is initialized to zero only when the module is loaded.  This means that
reloading the module, perhaps by rebooting, resets all the item versions
to 0 and can lead to newly written items being ignored in favour of
older existing items with greater versions from a previous mount.

To fix this we initialize the lock server's write_version to the
greatest of all the versions in items in log btrees.  We add a field to
the log_trees struct which records the greatest version, maintained as
we write out items in transactions.  These fields are read by the
server as it starts.

Then lock recovery needs to include the write_version so that the
lock_server can be sure to set the next write_version past the greatest
version in the currently granted locks.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-30 11:14:10 -07:00
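
A minimal sketch of the recovery described above, assuming a
greatest-version field carried in each log_trees item (field and
function names are illustrative):

    #include <linux/types.h>
    #include <asm/byteorder.h>

    /* illustrative log_trees record carrying the new greatest-version field */
    struct example_log_trees {
            __le64 max_item_vers;
            /* ... btree roots, rid, nr, ... */
    };

    /* as the server starts it scans all log_trees items and makes sure the
     * next write_version it grants is past every version already written */
    static void example_bump_write_version(u64 *next_write_version,
                                           struct example_log_trees *lt)
    {
            u64 vers = le64_to_cpu(lt->max_item_vers);

            if (vers >= *next_write_version)
                    *next_write_version = vers + 1;
    }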
Zach Brown
736d9d7df8 scoutfs: remove struct scoutfs_log_trees_val
The log_trees structs store the data that is used by client commits.
The primary struct is communicated over the wire so it includes the rid
and nr that identify the log.  The _val struct was stored in btree item
values and was missing the rid and nr because those were stored in the
item's key.

It's madness to duplicate the entire struct just to shave off those two
fields.  We can remove the _val struct and store the main struct in item
values, including the rid and nr.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-30 11:14:10 -07:00
Andy Grover
e6228ead73 scoutfs: Ensure padding in structs remains zeroed
Audit code for structs allocated on stack without initialization, or
using kmalloc() instead of kzalloc().

- avl.c: zero padding in avl_node on insert.
- btree.c: Verify item padding is zero, or WARN_ONCE.
- inode.c: scoutfs_inode contains scoutfs_timespecs, which have padding.
- net.c: zero pad in net header.
- net.h: scoutfs_net_addr has padding, zero it in scoutfs_addr_from_sin().
- xattr.c: scoutfs_xattr has padding, zero it.
- forest.c: item_root in forest_next_hint() appears to either be
    assigned-to or unused, so no need to zero it.
- key.h: Ensure padding is zeroed in scoutfs_key_set_{zeros,ones}

Signed-off-by: Andy Grover <agrover@versity.com>
2020-10-29 14:15:33 -07:00
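
The typical shapes of these fixes, sketched generically (struct and
field names are made up): zero stack structs with memset() and allocate
heap structs with kzalloc(), so compiler-inserted padding never carries
uninitialized bytes to disk or the wire.

    #include <linux/types.h>
    #include <linux/slab.h>
    #include <linux/string.h>

    struct example_with_padding {
            u64 a;
            u32 b;
            /* the compiler inserts 4 bytes of padding here; it must be
             * zeroed if the struct is written to disk or the network */
            u64 c;
    };

    static int example_fill(void)
    {
            struct example_with_padding on_stack;
            struct example_with_padding *heap;

            /* zero the whole struct, including the padding */
            memset(&on_stack, 0, sizeof(on_stack));

            /* kzalloc() rather than kmalloc() for heap allocations */
            heap = kzalloc(sizeof(*heap), GFP_NOFS);
            if (!heap)
                    return -ENOMEM;

            /* ... fill in fields and write out ... */
            kfree(heap);
            return 0;
    }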
Andy Grover
13438c8f5d scoutfs: Remove struct scoutfs_betimespec
Unused.

Signed-off-by: Andy Grover <agrover@versity.com>
2020-10-29 14:15:33 -07:00
Andy Grover
d9d9b65f14 scoutfs: remove __packed from all struct definitions
Instead, explicitly add padding fields, and adjust member ordering to
eliminate compiler-added padding between members and at the end of the
struct (where possible: some structs end in a u8[0] array).

This should prevent unaligned accesses. Not a big deal on x86_64, but
other archs like aarch64 really want this.

Signed-off-by: Andy Grover <agrover@versity.com>
2020-10-29 14:15:33 -07:00
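
The before/after of that change looks roughly like this (struct and
field names are invented for illustration):

    #include <linux/types.h>
    #include <linux/compiler.h>

    /* before: __packed keeps the layout tight but allows members to land
     * on unaligned offsets, which aarch64 and friends dislike */
    struct example_old {
            __le64 blkno;
            __le32 len;
            __u8   type;
    } __packed;

    /* after: order members largest first and make the tail padding
     * explicit, so the natural layout matches the on-disk layout and
     * every member sits on its natural alignment */
    struct example_new {
            __le64 blkno;
            __le32 len;
            __u8   type;
            __u8   __pad[3];
    };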
Andy Grover
5e1c8586cc scoutfs: ensure btree values end on 8-byte-alignment boundary
Round val_len up to BTREE_VALUE_ALIGN (8), to keep mid_free_len aligned.

Signed-off-by: Andy Grover <agrover@versity.com>
2020-10-29 14:15:33 -07:00
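
Rounding up to the alignment boundary is a one-liner with the kernel's
ALIGN() helper; a sketch, with BTREE_VALUE_ALIGN taken from the commit
message:

    #include <linux/kernel.h>

    #define BTREE_VALUE_ALIGN 8

    /* round the stored value length up so mid_free_len stays 8-aligned */
    static inline unsigned int example_btree_val_bytes(unsigned int val_len)
    {
            return ALIGN(val_len, BTREE_VALUE_ALIGN);
    }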
Andy Grover
68d7a2e2cb scoutfs: align items in item cache to 8 bytes
This ensures that structs which are internally 8-byte aligned remain
so when in the item cache.

16-byte alignment doesn't seem to be needed, so just do 8.

Signed-off-by: Andy Grover <agrover@versity.com>
2020-10-29 14:15:33 -07:00
Andy Grover
87cb971630 scoutfs: fix hash compiler warnings
Signed-off-by: Andy Grover <agrover@versity.com>
2020-10-29 14:15:33 -07:00
Zach Brown
dc47ec65e4 scoutfs: remove btree value owner footer offset
We were using a trailing owner offset to iterate over btree item values
from the back of the block towards the front.  We did this to reclaim
fragmented free space in a block to satisfy an allocation instead of
having to split the block, which is expensive mostly because it has to
allocate and free metadata blocks.

In the before times, we used to compact items by sorting items by their
offset, moving them, and then sorting them by their keys again.  The
sorting by keys was expensive so we added these owner offsets to be able
to compact without sorting.

But the complexity of maintaining the owner metadata is not worth it.
We can avoid the expensive sorting by keys by allocating a temporary
array of item offsets and sorting only it by the value offset.  That's
nice and quick, it was the key comparisons that were expensive.  Then we
can remove the owner offset entirely, as well as the block header final
free region that compaction needed.

And we also don't compact as often in the modern era because we do the
bulk of our work in the item cache instead of in the btree, and we've
changed the split/merge/compaction heuristics to avoid constantly
splitting, merging, and compacting when an item population happens to
hover right around a shared threshold.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-29 14:15:33 -07:00
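
The replacement compaction approach can be sketched with the kernel's
sort() helper: build a small temporary array of item offsets and sort
it by value offset only, with no key comparisons. Everything below is
illustrative, not the scoutfs btree code:

    #include <linux/types.h>
    #include <linux/sort.h>

    /* temporary per-compaction reference to an item and its value */
    struct example_val_ref {
            u16 item_off;   /* offset of the item header in the block */
            u16 val_off;    /* offset of the item's value in the block */
    };

    /* compare by value offset only; cheap compared to key comparisons */
    static int example_cmp_val_off(const void *a, const void *b)
    {
            const struct example_val_ref *va = a;
            const struct example_val_ref *vb = b;

            return (int)va->val_off - (int)vb->val_off;
    }

    static void example_sort_by_val_off(struct example_val_ref *refs,
                                        unsigned int nr)
    {
            sort(refs, nr, sizeof(refs[0]), example_cmp_val_off, NULL);
    }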
Zach Brown
dbea353b92 scoutfs: bring back sort_priv
Bring back sort_priv; we have a need for sorting with a caller argument.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-29 14:15:33 -07:00
Zach Brown
2e7053497e scoutfs: remove free_*_blocks super fields
Remove the old superblock fields which were used to track free blocks
found in the radix allocators.  We now walk all the allocators when we
need to know the free totals, rather than trying to keep fields in sync.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
735c2c6905 scoutfs: fix btree split/join setting parent keys
Before the introduction of the AVL tree to sort btree items, the items
were sorted by sorting a small packed array of offsets.  The final
offset in that array pointed to the item in the block with the greatest
key.

With the move to sorting items in an AVL tree by nodes embedded in item
structs, we now don't have the array of offsets and instead have a dense
array of items.  Creation and deletion of items always works with the
final item in the array.

last_item() used to return the item with the greatest key by returning
the item pointed to by the final entry in the sorted offset array.
After the move to the AVL tree it returned the final entry in the item
array, which is fine for creation and deletion but is no longer the
item with the greatest key.

But splitting and joining still used last_item() to find the item in
the block with the greatest key for updating references to blocks in
parents.  Since the introduction of the AVL tree, splitting and joining
have been corrupting the tree by setting parent block reference keys to
whatever item happened to be at the end of the array, not the item with
the greatest key.

The extent code recently pushed hard enough to hit this by working with
relatively random extent items in the core allocation btrees.
Eventually the parent block reference keys got out of sync and we'd fail
to find items by descending into the wrong children when looking for
them.  Extent deletion hit this during allocation, returned -ENOENT, and
the allocator turned that into -ENOSPC.

With this fixed we can repeatedly create and delete millions of files
with heavily fragmented extents on a tiny metadata device.  Eventually
it actually runs out of space instead of spuriously returning ENOSPC in
a matter of minutes.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
a848477e64 scoutfs: remove unused packed extents
We use full data extent items now; we don't need the packed extent
structures.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
b094b18618 scoutfs: compact fewer srch files each time
With the introduction of incremental srch file compaction we added some
fields to the srch_compact struct to record the position of compaction
in each file.  This increased the size of the struct past the limit the
btree places on the size of item values.

We decrease the number of files per compaction from 8 to 4 to cut the
size of the srch_compact struct in half.  This compacts twice as often,
but still relatively infrequently, and it uses half the space for srch
files waiting to hit the compaction threshold.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
7a3749d591 scoutfs: incremental srch compaction
Previously the srch compaction work would output the entire compacted
file and delete the input files in one atomic commit.  The server would
send the input files and an allocator to the client, and the client
would send back an output file and an allocator that included the
deletion of the input files.  The server would merge in the allocator
and replace the input file items with the output file item.

Doing it this way required giving an enormous allocation pool to the
client in a radix, which would deal with recursive operations
(allocating from and freeing to the radix that is being modified).  We
no longer have the radix allocator, and we use single block avail/free
lists instead of recursively modifying the btrees with free extent
items.  The compaction RPC needs to work with a finite amount of
allocator resources that can be stored in an alloc list block.

The compaction work now does a fixed amount of work and a compaction
operation spans multiple work iterations.

A single compaction struct is now sent between the client and server in
the get_compact and commit_compact messages.  The client records any
partial progress in the struct.  The server writes that position into
PENDING items.  It first searches for pending items to give to clients
before searching for files to start a new compaction operation.

The compact struct has flags to indicate whether the output file is
being written or the input files are being deleted.  The server manages
the flags and sets the input file deletion flag only once the result of
the compaction has been reflected in the btree items which record srch
files.

We added the progress fields to the compaction struct, making it even
bigger than it already was, so we take the time to allocate them rather
than declaring them on the stack.

It's worth mentioning that each operation now takes a reasonably
bounded amount of time, which will make it feasible to decide that it
has failed and needs to be fenced.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
d589881855 scoutfs: add tot m/d device blocks to statfs_more
The total_{meta,data}_blocks scoutfs_super_block fields initialized by
mkfs aren't visible to userspace anywhere.  Add them to statfs_more so
that tools can get the totals (and use them for df, in this particular
case).

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
2073a672a0 scoutfs: remove unused statfs RPC
Remove the statfs RPC from the client and server now that we're using
allocator iteration to calculate free blocks.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
33374d8fe6 scoutfs: get statfs free blocks with alloc_foreach
Use alloc_foreach to count the free blocks in all the allocators instead
of sending an RPC to the server.  We cache the results so that constant
df calls don't generate a constant stream of IO.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
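
A minimal sketch of that caching, assuming a simple jiffies-based
expiry (the struct, field names, and one-second lifetime are
assumptions, not the scoutfs code):

    #include <linux/types.h>
    #include <linux/jiffies.h>
    #include <linux/mutex.h>

    struct example_free_cache {
            struct mutex mutex;
            unsigned long expires;          /* jiffies after which we recount */
            u64 meta_free;
            u64 data_free;
    };

    static int example_cached_free(struct example_free_cache *c,
                                   int (*recount)(u64 *meta, u64 *data),
                                   u64 *meta_free, u64 *data_free)
    {
            int ret = 0;

            mutex_lock(&c->mutex);
            if (time_after(jiffies, c->expires)) {
                    ret = recount(&c->meta_free, &c->data_free);
                    if (ret == 0)
                            c->expires = jiffies + HZ;      /* cache ~1s */
            }
            *meta_free = c->meta_free;
            *data_free = c->data_free;
            mutex_unlock(&c->mutex);
            return ret;
    }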
Zach Brown
3d790b24d5 scoutfs: add alloc_detail ioctl
Add an ioctl which copies details of each persistent allocator to
userspace.  This will be used by a scoutfs command to give information
about the allocators in the system.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
fb66372988 scoutfs: add alloc foreach cb iterator
Add an alloc call which reads all the persistent allocators and calls a
callback for each.  This is going to be used to calculate free blocks
in clients for df, and in an ioctl to give a more detailed view of
allocators.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
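
The shape of such an iterator is a callback typedef plus a walk that
invokes it once per persistent allocator; here's a hedged sketch of how
a caller like the statfs path might consume it (all names are
assumptions):

    #include <linux/types.h>

    /* callback invoked once per persistent allocator */
    typedef int (*example_alloc_foreach_cb_t)(void *arg, bool is_meta,
                                              u64 total_blocks, u64 free_blocks);

    /* example consumer: sum up free blocks for statfs/df */
    struct example_free_totals {
            u64 meta_free;
            u64 data_free;
    };

    static int example_count_free_cb(void *arg, bool is_meta,
                                     u64 total_blocks, u64 free_blocks)
    {
            struct example_free_totals *tot = arg;

            if (is_meta)
                    tot->meta_free += free_blocks;
            else
                    tot->data_free += free_blocks;
            return 0;
    }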
Zach Brown
8bf4c078df scoutfs: fix item cache page split key choice
The algorithm for choosing the split key assumed that there were
multiple items in the page.  That wasn't always true and it could result
in choosing the first item as the split key, which could end up
decrementing the left page's end key to before its start key.

We've since added compaction to the paths that split pages so we now
guarantee that we have at least two items in the page being split.  With
that we can be sure to use the second item's key and ensure that we're
never creating invalid keys for the pages created by the split.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
27bc0ef095 scoutfs: fix item cache page trim
The tests for the various page range intersections were out of order.
The edge overlap case could trigger before the bisection case and we'd
fail to remove the initial items in the page.  That would leave items
before the start key which would later be used as a midpoint for a
split, causing all kinds of chaos.

Rework the cases so that the overlap cases are last.  The unique bisect
case will be caught before we can mistake it for an edge overlap case.
And minimize the number of comparisons we calculate by storing the
handful that all the cases need.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
c4663ea1a1 scoutfs: compact items in item cache pages
The first pass of the item cache didn't try to reclaim freed space at
all.  It would leave behind very sparse pages, the oldest of which
would be reclaimed by memory pressure.

While this worked, it created much more stress on the system than is
necessary.  Splitting a page with one key also makes it hard to
calculate the boundaries of the split pages, given that the start and
end keys could be the single item.

This adds a header field which tracks the free space in item cache
pages.  Free space is created before the alloc offset by removing items
from the rbtree, but also from shrinking item values when updating or
deleting items.

If we try to split a page with sufficient free space to insert the
largest possible item then we compact the page instead of splitting it.
We copy the items into the front of an unused page and swap the pages.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
e347ca3606 scoutfs: add unused item page rbtree verification
Add a quick function that walks the rbtree and makes sure it doesn't see
any obvious key errors.  This is far too expensive to use regularly but
it's handy to have around and add calls to when debugging.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
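
A verification walk like that is typically a straight
rb_first()/rb_next() loop asserting that keys strictly increase; a
generic sketch with an assumed item struct and a u64 key:

    #include <linux/types.h>
    #include <linux/rbtree.h>
    #include <linux/bug.h>

    struct example_item {
            struct rb_node node;
            u64 key;
    };

    /* walk every node and warn on out-of-order keys; far too expensive
     * for regular use, but handy to call from suspect paths while
     * debugging */
    static void example_verify_item_order(struct rb_root *root)
    {
            struct rb_node *n;
            struct example_item *prev = NULL;

            for (n = rb_first(root); n; n = rb_next(n)) {
                    struct example_item *item =
                            rb_entry(n, struct example_item, node);

                    WARN_ON_ONCE(prev && prev->key >= item->key);
                    prev = item;
            }
    }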
Zach Brown
005cf99f42 scoutfs: use vmalloc for high order xattr allocs
The xattr item stream is constructed from a large contiguous region
that contains the struct header, the key, and the value.  The value
can be larger than a page so kmalloc is likely to fail as the system
gets fragmented.

Our recent move to the item cache added a significant source of page
allocation churn which moved the system towards fragmentation much more
quickly and was causing high-order allocation failures in testing.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
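
The usual pattern for this kind of fix: try kmalloc() for small buffers
and fall back to vmalloc() for the larger ones, freeing with kvfree()
which handles either. A sketch under those assumptions, not the exact
scoutfs change:

    #include <linux/slab.h>
    #include <linux/vmalloc.h>
    #include <linux/mm.h>

    /* the xattr buffer holds the struct header, name, and value in one
     * contiguous region, which can exceed a page */
    static void *example_alloc_xattr_buf(size_t bytes)
    {
            void *buf = NULL;

            if (bytes <= PAGE_SIZE)
                    buf = kmalloc(bytes, GFP_NOFS);
            if (!buf)
                    buf = vmalloc(bytes);
            return buf;
    }

    static void example_free_xattr_buf(void *buf)
    {
            kvfree(buf);
    }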
Zach Brown
c61175e796 scoutfs: remove unused radix code
Remove the radix allocator that was added as we experimented with packed
extent items.  It didn't work out.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
e60f4e7082 scoutfs: use full extents for data and alloc
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly.  That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.

By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.

Most of this change is churn from changing allocator function and struct
names.

File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity.  All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions.  This now means
that fallocate and especially restoring offline extents can use larger
extents.  Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.

The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing.  The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks.  This resulted in a lot of bugs.  Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction.  We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.

The server now only moves free extents into client allocators when they
fall below a low threshold.  This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
8f946aa478 scoutfs: add btree item extent allocator
Add an allocator which uses btree items to store extents.  Both the
client and server will use this for btree blocks, the client will use it
for srch blocks and data extents, and the server will move extents
between the core fs allocator btree roots and the clients' roots.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
b605407c29 scoutfs: add extent layer
Add infrastructure for working with extents.  Callers provide callbacks
which operate on their extent storage while this code performs the
fiddly splitting and merging of extents.  This layer doesn't have any
persistent structures itself; it only operates on native structs in
memory.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
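
The split between the generic layer and its callers can be pictured as
a native extent struct plus a table of callbacks that the layer uses to
read and update the caller's storage. All names below are assumptions
about the shape of the interface, not the actual scoutfs definitions:

    #include <linux/types.h>

    /* native in-memory extent the layer manipulates */
    struct example_extent {
            u64 start;
            u64 len;
            u64 map;        /* mapped physical start, or 0 */
    };

    /* callbacks supplied by callers (file data mappings, allocators, ...);
     * the layer does the fiddly splitting and merging and calls back into
     * the caller's storage to find neighbours and record results */
    struct example_extent_ops {
            int (*next)(void *arg, u64 start, u64 len,
                        struct example_extent *ext);
            int (*insert)(void *arg, struct example_extent *ext);
            int (*remove)(void *arg, struct example_extent *ext);
    };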
Zach Brown
b28acdf904 scoutfs: use larger percpu_counter batch
The percpu_counter library merges the per-cpu counters with a shared
count when the per-cpu counter gets larger than a certain value.  The
default is very small, so we often end up taking a shared lock to update
the count.  Use a larger batch so that we take the lock less often.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
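
The batch shows up in the percpu_counter API as the third argument to
percpu_counter_add_batch() (older kernels spell it
__percpu_counter_add()); a sketch of using a larger batch, with an
assumed value:

    #include <linux/percpu_counter.h>

    /* each cpu accumulates up to this much before folding its count into
     * the shared total under the counter's lock */
    #define EXAMPLE_COUNTER_BATCH 1024

    static void example_count_add(struct percpu_counter *counter, s64 amount)
    {
            percpu_counter_add_batch(counter, amount, EXAMPLE_COUNTER_BATCH);
    }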
Zach Brown
ae97ffd6fc scoutfs: remove unused kvec.h
We've removed the last use of kvecs to describe item values.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
12067e99ab scoutfs: remove item granular work from forest
Now that the item cache is bearing the load of high frequency item
calls, we can remove all the item granular work that the forest was
trying to do.  The item cache amortizes the cost of the forest so its
remaining methods can go straight to the btrees and don't need
complicated state to reduce the overhead of item calls.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
6bacd95aea scoutfs: fs uses item cache instead of forest
Use the new item cache for all the item work in the fs instead of
calling into the forest of btrees.  Most of this is mechanical
conversion from the _forest calls to the _item calls.  The item cache
no longer supports the kvec argument for describing values so all the
callers pass in the value pointer and length directly.

The item cache doesn't support saving items as they're deleted and later
restoring them from an error unwinding path.  There were only two users
of this.  Directory entries can easily guarantee that deletion won't
fail by dirtying the items first in the item cache.  Xattr updates were
a little trickier.  They can combine dirtying, creating, updating, and
deleting to atomically switch between items that describe different
versions of a multi-item value.  This also fixed a bug in the srch
xattrs where replacing an xattr would create a new id for the xattr and
leave existing srch items referencing a now deleted id.  Replacing now
reuses the old id.

And finally we add back in the locking and transaction item cache
integration.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
45e594396f scoutfs: add an item cache above the btrees
Add an item cache between fs callers and the forest of btrees.  Calling
out to the btrees for every item operation was far too expensive.  This
gives us a flexible in-memory structure for working with items that
isn't bound by the constraints of persistent block IO.  We can stream
large groups of items to and from the btrees relatively rarely and then
use efficient kernel memory structures for the more frequent item
operations.

This adds the infrastructure, nothing is calling it yet.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
b1757a061e scoutfs: add forest methods for item cache
Add forest calls that the item cache will use.  It needs to read all the
items in the leaf blocks of the forest btrees which could contain the key,
write dirty items to the log btree, and dirty bits in the bloom block as
items are dirtied.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
1a994137f4 scoutfs: add btree methods for item cache
Add btree calls to call a callback for all items in a leaf, and to
insert a list of items into their leaf blocks.  These will be used by
the item cache to populate the cache and to write dirty items into dirty
btree blocks.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
57af2bd34b scoutfs: give btree walk callers more keys
The current btree walk recorded the start and end of child subtrees as
it walked, and it could give the caller the next key to iterate towards
after the block it returned.  Future methods want to get at the key
bounds of child subtrees, so we add a key range struct that all walk
callers provide and fill it with all the interesting keys calculated
during the walk.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
9e975dffe1 scoutfs: refactor btree split condition
Btree traversal doesn't split a block if it has room for the caller's
item.  Extract this test into a function so that an upcoming btree call
can test that each of multiple insertions into a leaf will fit.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
d440056e6f scoutfs: remove unused xattr index code
Remove the last remnants of the indexed xattrs which used fs items.
This makes the significant change of renumbering the key zones so I
wanted it in its own commit.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
d1e62a43c9 scoutfs: fix leaking alloc bits in merge
In a merge where the input and source trees are the same, the input
block can be an initial pre-cow version of the dirty source block.
Dirtying blocks in the change will clear allocations in the dirty source
block but they will remain in the pre-cow input block.  The merge can
then set these blocks in the dst, even though they were also used by
allocation, because they're still set in the pre-cow input block.

This fix is clumsy, but minimal and specific to this problem.  A more
thorough fix is being worked on which introduces more staging
allocator trees and should stop calls that modify the currently active
avail or free trees.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
289caeb353 scoutfs: trace leaf_bit of modified radix bits
Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
ba879b977a scoutfs: expand radix merge tracing
Add a trace event for entering _radix_merge() and rename the current
per-merge trace event.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
5c6b263d97 scoutfs: trace radix bit ops before assertions
Trace operations before they can trigger assertions so we can see the
violating operation in the traces.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
ca6b7f1e6d scoutfs: lock invalidate only syncs dirty
Lock invalidation has to make sure that changes are visible to future
readers.  It was syncing if the current transaction was dirty.  This was
never optimal, but it wasn't catastrophic when concurrent invalidation
work could all block on one sync in progress.

With the move to a single invalidation worker serially invalidating
locks it became unacceptable.  Invalidation happening in the presence of
writers would constantly sync the current transaction while very old
unused write locks were invalidated.  Their changes had long since been
committed in previous transactions.

We add a lock field to remember the sequence of the transaction which
could have been dirtied under the lock.  If that transaction has
already been committed by the time we invalidate the lock, it doesn't
have to sync.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
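
The resulting decision is tiny: compare the transaction sequence
recorded in the lock with the last committed sequence, and only sync
when the lock's transaction hasn't been committed yet. A sketch with
assumed names:

    #include <linux/types.h>

    /* only sync if the transaction that could have dirtied items under
     * this lock hasn't been committed yet */
    static inline bool example_invalidate_needs_sync(u64 lock_dirty_seq,
                                                     u64 committed_seq)
    {
            return lock_dirty_seq > committed_seq;
    }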