Commit Graph

24 Commits

Author SHA1 Message Date
Zach Brown
ef2daf8857 Make data preallocation tunable
Add mount options for the size of preallocation and whether or not it
should be restricted to extending writes.  Disabling the default
restriction to extending (streaming) writes lets it preallocate in
aligned regions of the preallocation size when they contain no extents.
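
As a rough illustration (not the actual scoutfs code; the option names,
fields, and helper below are assumptions), the decision might look
something like this:

#include <linux/types.h>

/*
 * Hypothetical sketch of a preallocation decision driven by two mount
 * options; all names and fields are illustrative.
 */
struct prealloc_opts {
        u64 size;               /* preallocation size in blocks */
        bool extend_only;       /* restrict preallocation to extending writes */
};

static u64 prealloc_blocks(struct prealloc_opts *opts, u64 iblock,
                           u64 size_blocks, bool region_empty)
{
        /* default: only preallocate for writes extending past i_size */
        if (opts->extend_only)
                return iblock >= size_blocks ? opts->size : 1;

        /* otherwise fill the aligned prealloc-sized region if it's empty */
        if (region_empty)
                return opts->size - (iblock % opts->size);

        return 1;
}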

Signed-off-by: Zach Brown <zab@versity.com>
2022-10-14 14:03:35 -07:00
Zach Brown
233fbb39f3 Limit alloc_move per-call allocator consumption
Recently scoutfs_alloc_move() was changed to try to limit the number of
metadata blocks it could allocate or free.  The intent was to stop
concurrent holders of a transaction from fully consuming the available
allocator for the transaction.

The limiting logic was a bit off.  It stopped when the allocator had the
caller's limit remaining, not when it had consumed the caller's limit.
This is overly permissive and could still allow concurrent callers to
consume the allocator.  It was also triggering warning messages when a
call consumed more than its allowed budget while holding a transaction.

Unfortunately, we don't have per-caller tracking of allocator resource
consumption.  The best we can do is sample the allocators as we start
and return if they drop by the caller's limit.  This is overly
conservative in that it charges any consumption by concurrent callers
to every caller.

This isn't perfect but it makes the failure case less likely and the
impact shouldn't be significant.  We don't often have a lot of
concurrency and the limits are larger than callers will typically
consume.
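
A tiny sketch of the difference; the names and the sampled counter are
assumptions, not the real interfaces:

#include <linux/types.h>

/* old: stop while 'limit' blocks remain, regardless of what we consumed */
static bool stop_moving_old(u64 avail_now, u64 limit)
{
        return avail_now <= limit;
}

/* new: stop once our sampled consumption since entry reaches 'limit' */
static bool stop_moving_new(u64 avail_at_entry, u64 avail_now, u64 limit)
{
        return avail_at_entry - avail_now >= limit;
}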

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-29 11:25:01 -07:00
Zach Brown
198d3cda32 Add scoutfs_alloc_meta_low_since()
Add scoutfs_alloc_meta_low_since() to test if the metadata avail or
freed resources have been used by a given amount since a previous
snapshot.
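
A minimal sketch of such a test, assuming a caller-held snapshot of the
avail and freed counters (the real structure and signature may differ):

#include <linux/types.h>

struct meta_snap {
        u64 avail;
        u64 freed;
};

/* true once avail or freed has dropped by at least 'nr' since the snapshot */
static bool meta_low_since(struct meta_snap *snap, u64 avail_now,
                           u64 freed_now, u64 nr)
{
        return (snap->avail - avail_now >= nr) ||
               (snap->freed - freed_now >= nr);
}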

Signed-off-by: Zach Brown <zab@versity.com>
2022-07-29 11:24:10 -07:00
Zach Brown
0d4bf83da3 Reclaim log_trees alloc roots in multiple commits
Client log_trees allocator btrees can build up quite a number of
extents.  In the right circumstances, moving fragmented extents can
require dirtying a large number of paths to leaf blocks in the core
allocator btrees.  It might not be possible to dirty all the blocks
necessary to move all the extents in one commit.

This reworks the extent motion so that it can be performed in multiple
commits if the meta allocator for the commit runs out while it is moving
extents.  It's a minimal fix with as little disruption to the ordering
of commits and locking as possible.  It simply bubbles up an error when
the allocators run out and retries functions that can already be retried
in other circumstances.
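
A rough sketch of the shape of that retry; all of the structures and
helpers below are placeholders for the real server code:

#include <linux/errno.h>
#include <linux/types.h>

struct server_info;
struct alloc_root;

/* placeholder helpers standing in for the real server and allocator calls */
int move_extents(struct server_info *server, struct alloc_root *root);
int commit_transaction(struct server_info *server);
bool alloc_root_is_empty(struct alloc_root *root);

static int reclaim_alloc_root(struct server_info *server, struct alloc_root *root)
{
        int ret;

        while (!alloc_root_is_empty(root)) {
                ret = move_extents(server, root);
                if (ret == -ENOSPC)
                        ret = 0;        /* allocator ran out, finish next commit */
                if (ret < 0)
                        return ret;

                ret = commit_transaction(server);
                if (ret < 0)
                        return ret;
        }

        return 0;
}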

Signed-off-by: Zach Brown <zab@versity.com>
2022-06-08 11:53:53 -07:00
Zach Brown
96ad8dd510 Add scoutfs_alloc_meta_remaining
Add a helper function that gives the caller the number of blocks remaining in
the first list block that's used for meta allocation and freeing.
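
The idea, sketched with made-up field names (the real list block format
differs):

#include <linux/kernel.h>
#include <linux/types.h>

/*
 * Blocks remaining in the first list block is the smaller of the entries
 * still available to allocate and the slots left for recording frees.
 */
static u32 meta_remaining(u32 avail_entries, u32 freed_entries, u32 max_entries)
{
        return min(avail_entries, max_entries - freed_entries);
}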

Signed-off-by: Zach Brown <zab@versity.com>
2022-04-01 15:21:44 -07:00
Zach Brown
a53d6d1a8e Add scoutfs_alloc_foreach_super which takes super
Add an alloc_foreach variant which uses the caller's super to walk the
allocators rather than always reading it off the device.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
6d0694f1b0 Add resize_devices ioctl and scoutfs command
Add a scoutfs command that uses an ioctl to send a request to the server
to safely use a device that has grown.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:26:32 -07:00
Zach Brown
73bf916182 Return ENOSPC as space gets low
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole, and we use COW transactions
so we need to be able to allocate in order to free.  This adds support
for returning ENOSPC to client POSIX allocators as free space gets low.

For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space.  The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks.  In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing).  When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.
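
A hedged sketch of that client-side decision; the flag and helper names
here are assumptions, not the real transaction code:

#include <linux/errno.h>
#include <linux/types.h>

struct client_info;

/* placeholder helpers for the sketch */
bool meta_alloc_low(struct client_info *client);
bool low_flag_set(struct client_info *client);
int acquire_hold(struct client_info *client);
void commit_and_wait(struct client_info *client);

/*
 * Only allocating holders see the hard -ENOSPC, and only once the local
 * allocator is low and the server has set its low flag; otherwise we
 * force a commit to ask the server for a refill.
 */
static int enter_transaction(struct client_info *client, bool allocating)
{
        for (;;) {
                if (!meta_alloc_low(client))
                        return acquire_hold(client);

                if (allocating && low_flag_set(client))
                        return -ENOSPC;

                commit_and_wait(client);        /* wait for a commit to cycle */
        }
}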

Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.

For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.

The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.

We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when ENOSPC is
going to be returned for metadata allocations.

We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.

And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
9c2122f7de Add server btree merge processing
This adds the server processing side of the btree merge functionality.
The client isn't yet sending the log_merge messages so no merging will
be performed.

The bulk of the work happens as the server processes a get_log_merge
message to build a merge request for the client.  It starts a log merge
if one isn't in flight.  If one is in flight it checks to see if it
should be spliced and maybe finished.  In the common case it finds the
next range to be merged and sends the request to the client to process.
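
A loose sketch of that decision flow; every structure and helper below
is a placeholder rather than the real handler:

#include <linux/err.h>
#include <linux/types.h>

struct server_info;
struct merge_request;
struct log_merge;

/* placeholder helpers for the sketch */
struct log_merge *current_log_merge(struct server_info *server);
struct log_merge *start_log_merge(struct server_info *server);
bool merge_ranges_done(struct log_merge *merge);
int splice_and_finish(struct server_info *server, struct log_merge *merge);
int fill_next_range(struct log_merge *merge, struct merge_request *req);

static int server_get_log_merge(struct server_info *server,
                                struct merge_request *req)
{
        struct log_merge *merge = current_log_merge(server);

        if (!merge) {
                /* nothing in flight, start a new log merge */
                merge = start_log_merge(server);
                if (IS_ERR(merge))
                        return PTR_ERR(merge);
        } else if (merge_ranges_done(merge)) {
                /* all ranges merged: splice the results and maybe finish */
                return splice_and_finish(server, merge);
        }

        /* common case: hand the next unmerged range to the client */
        return fill_next_range(merge, req);
}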

The commit_log_merge handler is the completion side of that request.  If
the request failed then we unwind its resources based on the stored
request item.  If it succeeds we record it in an item for get_log_merge
processing to splice eventually.

Then we modify two existing server code paths.

First, get_log_tree doesn't just create or use a single existing log
btree for a client mount.  If the existing log btree is large enough it
sets its finalized flag and advances the nr to use a new log btree.
That makes the old finalized log btree available for merging.

Then we need to be a bit more careful when reclaiming the open log btree
for a client.  We can't use next to find the only open log btree; instead
we use prev to find the last one and make sure that it isn't already
finalized.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
52c2a465db Add zone awareness to scoutfs_alloc_move()
Add parameters so that scoutfs_alloc_move() can first search for source
extents in specified zones.  It uses relatively cheap searches through
the order items to find extents that intersect with the regions
described by the zone bitmaps.
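
A minimal sketch of the search order, with the zone parameters and
helpers as assumptions:

#include <linux/errno.h>
#include <linux/types.h>

struct alloc_root;
struct free_extent;

/* placeholder order-item searches for the sketch */
int largest_extent_in_zones(struct alloc_root *src, unsigned long *zone_bits,
                            u64 zone_blocks, struct free_extent *ext);
int largest_extent(struct alloc_root *src, struct free_extent *ext);

/* prefer an extent intersecting the caller's zones, else take any extent */
static int find_move_source(struct alloc_root *src, unsigned long *zone_bits,
                            u64 zone_blocks, struct free_extent *ext)
{
        int ret;

        ret = largest_extent_in_zones(src, zone_bits, zone_blocks, ext);
        if (ret != -ENOENT)
                return ret;

        return largest_extent(src, ext);
}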

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
bc4975fad4 Add scoutfs_alloc_extents_cb()
Add an allocator call that invokes a callback for each of the extents
stored in btree items in an allocator root.
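
The callback shape might look roughly like this; only the idea of one
call per free extent comes from the commit, the prototype itself is an
assumption:

#include <linux/types.h>

/* one call per free extent found in the allocator root's btree items */
typedef int (*alloc_extent_cb_t)(u64 start, u64 len, void *arg);

struct extent_total {
        u64 blocks;
};

static int total_extents(u64 start, u64 len, void *arg)
{
        struct extent_total *tot = arg;

        tot->blocks += len;     /* e.g. sum free space across extents */
        return 0;               /* returning non-zero would stop the walk */
}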

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
c470c1c9f6 Allow read-mostly _alloc_meta_low
Each transaction hold makes multiple calls to _alloc_meta_low to see if
the transaction should be committed to refill allocators before the
caller's hold is acquired and they can dirty blocks in the transaction.

_alloc_meta_low was using a spinlock to sample the allocator list_head
blocks to determine if there was space available.  The lock and unlock
stores were creating significant cacheline contention.

The _alloc_meta_low calls are higher frequency than allocations.  We can
use a seqlock to have exclusive writers and allow concurrent
_alloc_meta_low readers who retry if a writer intervenes.
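
A minimal sketch of the pattern using the kernel's seqlock API; the
counter fields stand in for the allocator's list block state rather
than the real scoutfs structures:

#include <linux/seqlock.h>
#include <linux/types.h>

struct meta_counts {
        seqlock_t seqlock;
        u64 avail;      /* blocks available to allocate */
        u64 freed;      /* room left to record frees */
};

/* read-mostly path: no stores, so readers don't bounce the cacheline */
static bool meta_low(struct meta_counts *mc, u64 nr)
{
        unsigned int seq;
        u64 avail, freed;

        do {
                seq = read_seqbegin(&mc->seqlock);
                avail = mc->avail;
                freed = mc->freed;
        } while (read_seqretry(&mc->seqlock, seq));     /* writer intervened */

        return avail < nr || freed < nr;
}

/* writers were already exclusive; the seqlock just versions their updates */
static void meta_update(struct meta_counts *mc, u64 avail, u64 freed)
{
        write_seqlock(&mc->seqlock);
        mc->avail = avail;
        mc->freed = freed;
        write_sequnlock(&mc->seqlock);
}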

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-04 11:39:30 -08:00
Zach Brown
fc003a5038 Consistently sample data alloc total_len
With many concurrent writers we were seeing excessive commits forced
because the transaction thought the data allocator was running low.  The
transaction was checking the raw total_len value in the data_avail
alloc_root for the number of free data blocks.  But this read wasn't
locked, and allocators could completely remove a large free extent and
then re-insert a slightly smaller free extent as they perform their
allocation.  The transaction could see a temporarily very small total_len
and trigger a commit.

Data allocations are serialized by a heavy mutex, so we don't want to
have the reader try to use that to see a consistent total_len.  Instead
we create a data allocator run-time struct that has a consistent
total_len that is updated after all the extent items are manipulated.
This also gives us a place to put the caller's cached extent so that it
can be included in the total_len; previously it wasn't included in the
free total that the transaction saw.
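
A sketch of that run-time struct with assumed names; the property that
matters is that total_len is only written after the extent items and
cached extent have settled, so the unlocked reader always sees a
consistent count:

#include <linux/compiler.h>
#include <linux/mutex.h>
#include <linux/types.h>

struct alloc_root;              /* persistent free extent items, placeholder */

struct free_extent {            /* assumed shape of the cached extent */
        u64 start;
        u64 len;
};

struct data_alloc {
        struct mutex mutex;     /* serializes data allocations */
        struct alloc_root *root;
        struct free_extent cached;
        u64 total_len;          /* root + cached, updated last */
};

/* transaction code can sample this without taking the heavy mutex */
static u64 data_alloc_free_blocks(struct data_alloc *da)
{
        return READ_ONCE(da->total_len);
}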

The file data allocator can then initialize and use this struct instead
of its raw use of the root and cached extent.  Then the transaction can
sample its consistent total_len that reflects the root and cached
extent.

A subtle detail is that fallocate can't use _free_data to return an
allocated extent on error to the avail pool.  It instead frees into the
data_free pool like normal frees.  It doesn't really matter that this
could prematurely drain the avail pool because it's in an error path.

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-06 09:25:32 -08:00
Zach Brown
9375b9d3b7 scoutfs: commit while enough meta for dirty items
Dirty items in a client transaction are stored in OS pages.  When the
transaction is committed each item is stored in its position in a dirty
btree block in the client's existing log btree.  Allocators are refilled
between transaction commits so a given commit must have sufficient meta
allocator space (avail blocks and unused freed entries) for all the
btree blocks that are dirtied.

The number of btree blocks that are written, thus the number of cow
allocations and frees, depends on the number of blocks in the log btree
and the distribution of dirty items amongst those blocks.  In a typical
load items will be near each other and many dirty items in smaller
kernel pages will be stored in fewer larger btree blocks.

But with the right circumstances, the ratio of dirty pages to dirty
blocks can be much smaller.  With a very large directory and random
entry renames you can easily have 1 btree block dirtied for every page
of dirty items.

Our existing meta allocator fill targets and the number of dirty item
cache pages we allowed did not properly take this into account.  It was
possible (and, it turned out, relatively easy to test for with a huge
directory and random renames) to run out of meta avail blocks while
storing dirty items in dirtied btree blocks.

This rebalances our targets and thresholds to make it more likely that
we'll have enough allocator resources to commit dirty items.  Instead of
having an arbitrary limit on the number of dirty item cache pages, we
require that a given number of dirty item cache pages have a given
number of allocator blocks available.
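
In other words, something like the following, where the ratio is an
illustrative constant rather than the one the code actually uses:

#include <linux/types.h>

#define BLOCKS_PER_DIRTY_PAGE   2       /* assumed ratio for the sketch */

/* commit once the meta allocator can no longer cover the dirty pages */
static bool must_commit(u64 dirty_item_pages, u64 meta_blocks_avail)
{
        return meta_blocks_avail < dirty_item_pages * BLOCKS_PER_DIRTY_PAGE;
}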

We require a decent number of available blocks for each dirty page, so
we increase the server's target number of blocks to give the client so
that it can still build large transactions.

This code is conservative and should not be a problem in practice, but
it's theoretically possible to build a log btree and set of dirty items
that would dirty more blocks than this code assumes.  We will probably
revisit this as we add proper support for ENOSPC.

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-02 09:25:13 -08:00
Zach Brown
a5d9ac5514 scoutfs: rework scoutfs_alloc_meta_low, takes arg
Previously, scoutfs_alloc_meta_lo_thresh() returned true when a small
static number of metadata blocks were either available to allocate or
had space for freeing.  This didn't make a lot of sense as the correct
number depends on how many allocations each caller will make during
their atomic transaction.

Rework the call to take an argument for the number of avail or freed
blocks available to test.  This first pass just uses the existing
number; we'll get to the callers later.

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-02 09:25:13 -08:00
Zach Brown
fb66372988 scoutfs: add alloc foreach cb iterator
Add an alloc call which reads all the persistent allocators and calls a
callback for each.  This is going to be used to calculate free blocks
in clients for df, and in an ioctl to give a more detailed view of
allocators.
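
A hypothetical df-style caller might just total what the callback hands
it; the callback arguments here are assumptions:

#include <linux/types.h>

struct df_totals {
        u64 meta_free;
        u64 data_free;
};

/* called once per persistent allocator with its free block count */
static int add_alloc_free(void *arg, bool meta, u64 free_blocks)
{
        struct df_totals *df = arg;

        if (meta)
                df->meta_free += free_blocks;
        else
                df->data_free += free_blocks;
        return 0;
}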

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
e60f4e7082 scoutfs: use full extents for data and alloc
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly.  That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.

By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.

Most of this change is churn from changing allocator function and struct
names.

File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity.  All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions.  This now means
that fallocate and especially restoring offline extents can use larger
extents.  Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.

The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing.  The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks.  This resulted in a lot of bugs.  Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction.  We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.

The server now only moves free extents into client allocators when they
fall below a low threshold.  This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
8f946aa478 scoutfs: add btree item extent allocator
Add an allocator which uses btree items to store extents.  Both the
client and server will use this for btree blocks, the client will use it
for srch blocks and data extents, and the server will move extents
between the core fs allocator btree roots and the clients' roots.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
1b3645db8b scoutfs: remove dead server allocator code
Remove the bitmap segno allocator code that the server used to use to
manage allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
ff5a094833 scoutfs: store allocator regions in btree
Convert the segment allocator to store its free region bitmaps in the
btree.

This is a very straightforward mechanical transformation.  We split the
allocator region into a big-endian index key and the bitmap value
payload.  We're careful to operate on aligned copies of the bitmaps so
that they're long aligned.
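
Roughly the layout and the aligned-copy step described above, with
illustrative names and sizes:

#include <linux/string.h>
#include <linux/types.h>

#define REGION_BITS     4096    /* illustrative bits per region */

/* the region index sorts as a btree key, so it's stored big-endian */
struct region_key {
        __be64 index;
};

/* item values aren't guaranteed to be long-aligned, so copy before bitops */
static void load_region_bitmap(const void *value, unsigned long *aligned)
{
        memcpy(aligned, value, REGION_BITS / 8);
}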

We can remove all the funky functions that were needed when writing the
ring.  All we're left with is a call to apply the pending allocations to
dirty btree blocks before writing the btree.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Zach Brown
5e0e9ac12e Move to much simpler manifest/alloc storage
Using the treap to be able to incrementally read and write the manifest
and allocation storage from all nodes wasn't quite ready for prime time.
The biggest problem is that invalidating cached nodes which are the
target of native pointers, either for consistency or memory pressure, is
problematic.  This was getting in the way of adding shared support as
readers and writers try to use as much of their treap caches as they
can.  There were other serious problems that we'd run into eventually:
memory pressure from duplicate caching in native nodes and the page
cache, small IOs from reading a page at a time, the risk of
pathologically imbalanced treaps, and the ring being corrupted if the
migration balancing doesn't work (the model assumed you could always
dirty an individual node in a transaction, but you actually have to
dirty all of its parents in each new transaction).

Let's back off to a much simpler mechanism while we build the rest of
the system around it.  We can revisit aggressively optimizing this when
it's our worst problem.

We'll store the indexes that the manifest server needs in simple
preallocated rings with log entries.   The server has to read the index
in its entirety into a native rbtree before it can work on it.  We won't
access the physical ring from mounts anymore, they'll send messages to
the server.

The ring callers are now working with a pinned tree in memory so the
interface can be a bit simpler.  By storing the indexes in their own
rings, the code and write path become a lot simpler: we have an IO
submission path for each index instead of "dirtying" calls per index and
then a writing call.

All this is much more robust and much less likely to get in our way as
we stand up the rest of the system around it.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
6516ce7d57 Report free blocks in statfs
Our statfs callback was still using the old buddy allocator.

We add a free segments field to the super and have it track the number
of free segments in the allocator.  We then use that to calculate the
number of free blocks for statfs.
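
The arithmetic amounts to something like this; the blocks-per-segment
constant and field names in the super are assumptions:

#include <linux/statfs.h>
#include <linux/types.h>

#define SEGMENT_BLOCKS  2048    /* illustrative blocks per segment */

/* translate the super's free segment count into statfs free blocks */
static void fill_free_blocks(struct kstatfs *kst, u64 free_segs, u64 total_blocks)
{
        kst->f_blocks = total_blocks;
        kst->f_bfree = free_segs * SEGMENT_BLOCKS;
        kst->f_bavail = kst->f_bfree;
}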

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
db9f2be728 Switch to indexed manifest using treap ring
The first pass manifest and allocator storage used a simple ring log
that was entirely replayed into memory to be used.  That risked the
manifest being too large to fit in memory, especially with large keys
and large volumes.

So we move to using an indexed persistent structure that can be read on
demand and cached.  We use a treap of byte-referenced nodes stored in a
circular ring.

The code interface is modeled a bit on the in-memory rbtree interface,
except that we can get IO errors and have to manage allocation, so we
return data pointers to the item payload instead of item structs and we
can return errors.

The manifest and allocator are converted over and the old ring code is
removed entirely.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
c4954eb6f4 Add initial LSM write implementation
Add all the core structural components to be able to modify metadata.  We
modify items in fs write operations, track dirty items in the cache,
allocate free segment block regions, stream dirty items into segments,
write out the segments, update the manifest to reference the written
segments, and write out a new ring that has the new manifest.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:42:30 -07:00