Commit Graph

25 Commits

Author SHA1 Message Date
Auke Kok
1d150da3f0 Use page->lru instead of page->list
With v3.14-rc1-10-g34bf6ef94a83, page->list is removed.  Instead,
use the union member ->lru.
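
A sketch of the shape of the change (the surrounding list code is
illustrative):

    /* before: page->list, removed in 34bf6ef94a83 */
    list_add_tail(&page->list, &pages);

    /* after: the ->lru union member serves the same role */
    list_add_tail(&page->lru, &pages);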

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Zach Brown
2e2ccb6f61 Allow replaying srch file rotation
When a client no longer needs to append to a srch file, for whatever
reason, we move the reference from the log_trees item into a specific
srch file btree item in the server's srch file tracking btree.

Zeroing the log_trees item and inserting the server's btree item are
done in a server commit and should be written atomically.

But commit_log_trees had an error handling case that could leave the
newly inserted item dirty in memory without zeroing the srch file
reference in the existing log_trees item.  Future attempts to rotate the
file reference, perhaps by retrying the commit or by reclaiming the
client's rid, would get EEXIST and fail.

This fixes the error handling path to ensure that we'll keep the dirty
srch file btree item and log_trees item in sync.  The desynced items can
still exist in the world, so we'll tolerate getting EEXIST on insertion.
After enough time has passed, or if repair zeroed the duplicate
reference, we could remove this special case from insertion.
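
A minimal sketch of the tolerant insertion, with illustrative btree
call and variable names rather than the exact scoutfs code:

    ret = scoutfs_btree_insert(sb, &srch_root, &key, &sfl, sizeof(sfl));
    if (ret == -EEXIST) {
        /* an old desynced commit already inserted this reference */
        ret = 0;
    }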

Signed-off-by: Zach Brown <zab@versity.com>
2023-01-17 14:33:27 -08:00
Zach Brown
fff07ce19c Use stale block read retrying helper
Transition from manual checking for persistent ESTALE to the shared
helper that we just added.  This should not change behavior.
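
The helper wraps the retry loop that callers had been open-coding;
roughly, with hypothetical helper names:

    init_estale_retry(&retry);                    /* hypothetical */
    do {
        ret = read_block_ref(sb, ref, &bl);       /* hypothetical */
    } while (ret == -ESTALE && retry_estale(&retry));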

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
b477604339 Don't clobber srch compact errors
The srch compaction worker is supposed to wait a bit before attempting
another compaction when the compaction it just finished failed.

Unfortunately, it clobbered the errors it got during compaction with the
result of sending the commit to the server with the error flag.  If the
commit is successful then it thinks there were no errors and immediately
re-queues itself to try the next compaction.

If the error is persistent, as it was with a bug in how we merged log
files with a single page's worth of entries, then we can spin
indefinitely: getting an error, clobbering the error with the commit
result, and immediately queueing our work to do it all over again.

This fix preserves existing errors when getting the result of the
commit and will correctly back off.  If we get persistent merge errors,
at least they won't consume significant resources.  We add a counter
for the commit errors so we can get some visibility if this happens.
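
The fix follows the usual pattern of only letting the commit result
overwrite a zero ret (a sketch; names are illustrative):

    err = commit_compact_result(sb, sc, ret);     /* illustrative */
    if (ret == 0)
        ret = err;
    /* only a fully successful pass immediately re-queues the work */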

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
75f9aabe75 Allow compacting logs down to a single page
The k-way merge function at the core of the srch file entry merging had
some bookkeeping math (calculating the number of parents) that couldn't
handle merging a single incoming entry stream, so it threw a warning and
returned an error.  In refusing to handle that case, it was assuming
that the caller was trying to merge down a single log file, which
doesn't make any sense.

But in the case of multiple small unsorted logs we can absolutely end
up with all of their entries stored in one sorted page: a single sorted
input page that merges multiple log files.  The merge function is also
the path that writes to the output file, so we absolutely need to
handle this case.

We now calculate the number of parents more carefully, clamping it to
one parent where deriving it from the number of inputs would otherwise
give "(roundup(1) -> 1) - 1 == 0".  The warning and error are relaxed
to only refuse merging nothing at all.
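
A sketch of the corrected bookkeeping, assuming the parent count is
derived by rounding the input count up to a power of two:

    if (WARN_ON_ONCE(nr == 0))      /* only refuse merging nothing */
        return -EINVAL;
    /* roundup_pow_of_two(1) - 1 would be 0; clamp to one parent */
    nr_parents = max_t(unsigned int, roundup_pow_of_two(nr) - 1, 1);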

The test triggers this case by putting single search entries in the log
files for mounts and unmounting them to force rotation of the mount log
files into mergable rotated log files.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
d5eec7d001 Fix uninitialized srch ret that won't happen
More recent gcc notices that ret in delete_files can be used
uninitialized if nr is 0, while missing that we never call delete_files
in that case.  Seems worth fixing regardless.
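
The fix is simply initializing ret; a sketch with the arguments and
body trimmed:

    static int delete_files(struct super_block *sb, unsigned int nr)
    {
        int ret = 0;    /* gcc can't see that callers ensure nr > 0 */
        unsigned int i;

        for (i = 0; i < nr; i++)
            ret = delete_one_file(sb, i);    /* hypothetical */
        return ret;
    }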

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-13 14:41:07 -07:00
Zach Brown
28759f3269 Rotate srch files as log trees items are reclaimed
The log merging work deletes log trees items once their item roots are
merged back into the fs root.  Those deleted items could still have
populated srch files that would be lost.  We force rotation of the srch
files in the items as they're reclaimed to turn them into rotated srch
files that can be compacted.
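
A sketch of the hook in reclaim, with illustrative field and function
names:

    /* before reclaiming the log_trees item, preserve its srch log */
    if (lt.srch_file.ref.blkno) {
        ret = rotate_srch_file(sb, &super->srch_root, &lt.srch_file);
        if (ret < 0)
            goto out;
    }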

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:37:45 -07:00
Zach Brown
1259f899a3 srch compaction needs to prepare alloc for commit
The srch client compaction work initializes allocators, dirties blocks,
and writes them out as its transaction.  It forgot to call the
pre-commit allocator prepare function.

The prepare function drops block references used by the meta allocator
during the transaction.  Skipping it leaked block references, which kept
blocks from being freed by the shrinker under memory pressure.
Eventually memory was full of leaked blocks and the shrinker walked all
of them looking for blocks to free, resulting in an effective livelock
that ground the system to a crawl.
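
A sketch of the missing call in the compaction transaction (function
names approximate the pattern and may not match the code exactly):

    ret = scoutfs_alloc_prepare_commit(sb, &alloc, &wri);
    if (ret == 0)
        ret = scoutfs_block_writer_write(sb, &wri);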

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-01 13:04:40 -07:00
Zach Brown
6237f0adc5 Add _block_dirty_ref to dirty blocks in one place
To create dirty blocks in memory each block type caller currently gets a
reference on a created block and then dirties it.  The reference it gets
could be an existing cached block that stale readers are currently
using.  This creates a problem with our block consistency protocol where
writers can dirty and modify cached blocks that readers are currently
reading in memory, leading to read corruption.

This commit is the first step in addressing that problem.  We add a
scoutfs_block_dirty_ref() call which returns a reference to a dirtied
block from the block core in one call.  We're only changing the callers
in this patch but we'll be reworking the dirtying mechanism in an
upcoming patch to avoid corrupting readers.
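
The shape of the change at call sites, as a sketch (the exact
signature may differ):

    /* before: two steps that could dirty a block readers still hold */
    ret = block_read_ref(sb, ref, &bl);          /* illustrative */
    if (ret == 0)
        block_mark_dirty(sb, bl);                /* illustrative */

    /* after: one call returns a reference to an already-dirty block */
    ret = scoutfs_block_dirty_ref(sb, ref, &bl);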

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-01 09:49:17 -08:00
Zach Brown
0969a94bfc Check one block_ref struct in block core
Each of the different block types had a reading function that read a
block and then checked their reference struct for their block type.

This gets rid of each block reference type and has a single block_ref
type which is then checked by a single ref reading function in the block
core.  By putting ref checking in the core we no longer have to export
checking the block header crc, verifying headers, invalidating blocks,
or even reading raw blocks themselves.  Everyone reads refs and leaves
the checking up to the core.
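
The single shared reference type looks roughly like this (field names
are assumed from the description):

    /* one ref for every block type: location plus expected version */
    struct scoutfs_block_ref {
        __le64 blkno;   /* where the block lives */
        __le64 seq;     /* sequence checked by the core on read */
    };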

The changes don't have a significant functional effect.  This is mostly
just changing types and moving code around.  (There are some changes to
visible counters.)

This shares code, which is nice, but this is putting the block reference
checking in one place in the block core so that in a few patches we can
fix problems with writers dirtying blocks that are being read.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-01 09:49:17 -08:00
Zach Brown
d39268bbc1 Fix spurious EIO from scoutfs_srch_get_compact
scoutfs_srch_get_compact() is building up a compaction request which has
a list of srch files to read and sort and write into a new srch file.
It finds input files by searching for a sufficient number of similar
files: first any unsorted log files and then sorted log files that are
around the same size.

It finds the files by using btree next on the srch zone, which has key
types for unsorted srch log files and sorted srch files, but also for
pending and busy compaction items.

It was being far too cute about iterating over different key types.  It
was trying to adapt to whatever next key it found and was making
assumptions about the order of key types.  It didn't notice that the
pending and busy key types followed the log and sorted types, so it
would generate EIO when it ran into them and found that their value
length didn't match what it was expecting.

Rework the next item ref parsing so that it returns -ENOENT if it gets
an unexpected key type, then move on to the next key type when checking
for -ENOENT.
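
A sketch of the reworked iteration, with illustrative type and helper
names:

    ret = srch_next_ref(sb, &key, SCOUTFS_SRCH_LOG_TYPE, &sfl);
    if (ret == -ENOENT) {
        /* walked past log files into other types; try sorted files */
        ret = srch_next_ref(sb, &key, SCOUTFS_SRCH_SORTED_TYPE, &sfl);
    }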

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
18aee0ebbd scoutfs: fix lost entries in resumed srch compact
Compacting very large srch files can use up all of a given operation's
metadata allocator.  When this happens we record the compaction's
position in the srch files in the pending item.

We could lose entries when this happens because the kway_next callback
would advance the srch file position as it read entries and put them in
the tournament tree leaves, not as it put them in the output file.
We'd resume from the entries that came after those in the tournament
leaves, losing the entries that were still sitting in the leaves.

This refactors the kway merge callbacks to differentiate between getting
entries at the position and advancing the positions.  We initialize the
tournament leaves by getting entries at the positions and only advance
the position as entries leave the tournament tree and are either stored
in the output srch files or are dropped.
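
The refactored callbacks separate reading an entry from consuming it,
roughly (a sketch with assumed names):

    struct kway_ops {
        /* return the entry at the current position, don't consume */
        struct scoutfs_srch_entry *(*peek)(void *arg);
        /* consume the entry once it has left the tournament tree */
        int (*advance)(void *arg);
    };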

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
c35f1ff324 scoutfs: inc end when search xattrs retries
In the rare case that searching for xattrs only finds deletions within
its window it retries the search past the window.  The end entry is
inclusive and is the last entry that can be returned.  When retrying the
search we need to start from the entry after that to ensure forward
progress.
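
In sketch form, using the entry increment helper added in a sibling
commit:

    /* end is inclusive; resume the next search one entry past it */
    start = end;
    srch_entry_inc(&start);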

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
6770a31683 scoutfs: consistently trim srch entry range
We have to limit the number of srch entries that we'll track while
performing a search for all the inodes that contain xattrs that match
the search hash value.

As we hit the limit on the number of entries to track we have to drop
entries.  As we drop entries we can't return any inodes for entries
past the dropped entries.  We were updating the end point of the search
as we dropped entries past the tracked set, but we weren't updating the
search end point if we dropped the last currently tracked entry.

And we were setting the end point to the dropped entry, not to the
entry before it.  This could lead us to spuriously return deleted
entries if we drop the creation entry and then allow tracking its
deletion later.

This fixes both those problems.  We now properly set the end point to
just before the dropped entry for all entries that we drop.
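
A sketch of the consistent trimming, using the entry decrement helper
added in a sibling commit:

    /* nothing at or past a dropped entry can be returned; trim the
     * search end to just before it, whether the drop was past the
     * tracked set or was the last tracked entry itself */
    end = dropped;
    srch_entry_dec(&end);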

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
9395360324 scoutfs: add srch entry inc/dec
We're going to need to increment and decrement srch entries in coming
fixes.
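
A sketch of the increment, treating the entry fields as one wide
number with carry (native u64 fields assumed for brevity):

    struct scoutfs_srch_entry { u64 hash, ino, id; };   /* sketch */

    static void srch_entry_inc(struct scoutfs_srch_entry *se)
    {
        /* carry from the least significant field upward */
        if (++se->id != 0)
            return;
        if (++se->ino != 0)
            return;
        se->hash++;
    }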

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
7c5823ad12 scoutfs: drop duplicate compacted srch entries
The k-way merge used by srch file compaction only dropped the second
entry in a pair of duplicate entries.  Duplicate entries are both
supposed to be removed so that entries for removed xattrs don't take up
space in the files.

This both drops the second entry and removes the first encoded entry.
As we encode entries we remember their starting offset and the previous
entry that they were encoded from.  When we hit a duplicate entry
we undo the encoding of the previous entry.

This only works within srch file blocks.  We can still have duplicate
entries that span blocks, but that's unlikely and relatively harmless.
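
A sketch of the rollback, with illustrative bookkeeping names:

    if (have_prev && srch_entry_cmp(&se, &prev) == 0) {
        /* duplicate pair: rewind over the first encoded copy and
         * don't encode the second */
        enc_off = prev_off;
        prev = before_prev;
        continue;
    }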

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
560c91a0e4 scoutfs: fix binary search for sorted srch block
The search_xattrs ioctl looks for srch entries in srch files that map
the caller's hashed xattr name to inodes.  As it searches it maintains a
range of entries that it is looking for.  When it searches sorted srch
files for entries it first performs a binary search for the start of the
range and then iterates over the blocks until it reaches the end of its
range.

The binary search for the start of the range was a bit wrong.  If the
start of the range was less than all the blocks then the binary search
could wrap the left index, try to get a file block at a negative index,
and return an error for the search.

This is relatively hard to hit in practice.  You have to search for
the xattr name with the smallest hashed value and have a sorted srch
file that's just the right size so that blk offset 0 is the last block
compared in the binary search, which sets the right index to -1.  With
lots of xattrs, or sorted files of a different length, the search works
fine.

This fixes the binary search so that it specifically records the first
block offset that intersects with the range and tests that the left and
right offsets haven't been inverted.  Now that we're not breaking out of
the binary search loop we can more obviously put each block reference
that we get.
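
The fixed search in sketch form; the block comparison helper is
hypothetical:

    s64 left = 0, right = last_blk, blk = -1;

    while (left <= right) {
        s64 mid = left + ((right - left) >> 1);

        if (block_before_range_start(mid, start)) {  /* hypothetical */
            left = mid + 1;
        } else {
            blk = mid;        /* first block known to intersect */
            right = mid - 1;  /* can reach -1 without wrapping left */
        }
    }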

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
4647a6ccb2 scoutfs: fix srch btree iref puts
The srch code was putting btree item refs outside of success paths.
That's harmless, but refs only need to be put when btree ops return
success and have set the reference.
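
The resulting pattern, sketched (signatures approximate):

    ret = scoutfs_btree_next(sb, root, &key, &iref);
    if (ret == 0) {
        memcpy(&sfl, iref.val, sizeof(sfl));   /* use the item */
        scoutfs_btree_put_iref(&iref);         /* put only on success */
    }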

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
ae286bf837 scoutfs: update srch _alloc_meta_low callers
The srch system checks that it has allocator space while deleting srch
files and while merging them and dirtying output blocks.  Update the
callers to check for the correct number of avail or freed blocks that
they need between each check.
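
Each caller now tests for what one pass of its loop actually consumes,
along these lines (the constant is illustrative):

    /* stop when a full deletion pass might not fit in the allocator */
    if (scoutfs_alloc_meta_low(sb, alloc, SRCH_DELETE_LOW_BLOCKS))
        break;   /* commit what we have and resume next transaction */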

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-02 09:25:13 -08:00
Zach Brown
a5d9ac5514 scoutfs: rework scoutfs_alloc_meta_low, takes arg
Previously, scoutfs_alloc_meta_lo_thresh() returned true when a small
static number of metadata blocks were either available to allocate or
had space for freeing.  This didn't make a lot of sense as the correct
number depends on how many allocations each caller will make during
their atomic transaction.

Rework the call to take an argument for the number of avail or freed
blocks available to test.  This first pass just uses the existing
number; we'll get to the callers next.
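
The reworked call, roughly:

    /* true if fewer than nr blocks can be allocated or freed */
    bool scoutfs_alloc_meta_low(struct super_block *sb,
                                struct scoutfs_alloc *alloc, u32 nr);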

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-02 09:25:13 -08:00
Andy Grover
cf278f5fa0 scoutfs: Tidy some enum usage
Prefer named to anonymous enums. This helps readability a little.

Use enum as param type if possible (a couple spots).

Remove unused enum in lock_server.c.

Define enum spbm_flags using shift notation for consistency.

Rename get_file_block()'s "gfb" parameter to "flags" for consistency.
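
For example, the shift notation for the flags enum (member names are
illustrative):

    enum spbm_flags {
        SPBM_FLAG_EXAMPLE_A = (1 << 0),
        SPBM_FLAG_EXAMPLE_B = (1 << 1),
    };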

Signed-off-by: Andy Grover <agrover@versity.com>
2020-11-30 13:35:44 -08:00
Zach Brown
736d9d7df8 scoutfs: remove struct scoutfs_log_trees_val
The log_trees structs store the data that is used by client commits.
The primary struct is communicated over the wire so it includes the rid
and nr that identify the log.  The _val struct was stored in btree item
values and was missing the rid and nr because those were stored in the
item's key.

It's madness to duplicate the entire struct just to shave off those two
fields.  We can remove the _val struct and store the main struct in item
values, including the rid and nr.
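
After the change one struct serves both uses, shaped roughly like this
(other fields elided):

    struct scoutfs_log_trees {
        /* item and bloom roots, allocator heads, etc. */
        __le64 rid;    /* client identity, now stored in values too */
        __le64 nr;     /* log number, likewise */
    };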

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-30 11:14:10 -07:00
Zach Brown
7a3749d591 scoutfs: incremental srch compaction
Previously the srch compaction work would output the entire compacted
file and delete the input files in one atomic commit.  The server would
send the input files and an allocator to the client, and the client
would send back an output file and an allocator that included the
deletion of the input files.  The server would merge in the allocator
and replace the input file items with the output file item.

Doing it this way required giving an enormous allocation pool to the
client in a radix, which would deal with recursive operations
(allocating from and freeing to the radix that is being modified).  We
no longer have the radix allocator, and we use single block avail/free
lists instead of recursively modifying the btrees with free extent
items.  The compaction RPC needs to work with a finite amount of
allocator resources that can be stored in an alloc list block.

The compaction work now does a fixed amount of work and a compaction
operation spans multiple work iterations.

A single compaction struct is now sent between the client and server in
the get_compact and commit_compact messages.  The client records any
partial progress in the struct.  The server writes that position into
PENDING items.  It first searches for pending items to give to clients
before searching for files to start a new compaction operation.

The compact struct has flags to indicate whether the output file is
being written or the input files are being deleted.  The server manages
the flags and sets the input file deletion flag only once the result of
the compaction has been reflected in the btree items which record srch
files.

We added the progress fields to the compaction struct, making it even
bigger than it already was, so we take the time to allocate them rather
than declaring them on the stack.
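
The struct that shuttles between get_compact and commit_compact looks
roughly like this sketch, with fields assumed from the description:

    struct scoutfs_srch_compact {
        u8 flags;             /* writing output vs deleting inputs */
        u8 nr;                /* number of input files */
        struct {
            struct scoutfs_srch_file sfl;
            u64 blk;          /* resume position: block... */
            u64 pos;          /* ...and offset within it */
        } in[MAX_COMPACT_INPUTS];     /* illustrative bound */
        struct scoutfs_srch_file out;
        /* plus the allocator list heads for the finite allocator */
    };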

It's worth mentioning that each operation now taking a reasonably
bounded amount of time will make it feasible to decide that it has
failed and needs to be fenced.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
e60f4e7082 scoutfs: use full extents for data and alloc
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly.  That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.

By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.

Most of this change is churn from changing allocator function and struct
names.

File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity.  All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions.  This now means
that fallocate and especially restoring offline extents can use larger
extents.  Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
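
Data operations call the extent core through a small table of
callbacks, along these lines (a sketch; real signatures may differ):

    struct scoutfs_ext_ops {
        int (*next)(struct super_block *sb, void *arg, u64 start,
                    u64 len, struct scoutfs_extent *ext);
        int (*insert)(struct super_block *sb, void *arg, u64 start,
                      u64 len, u64 map, u8 flags);
        int (*remove)(struct super_block *sb, void *arg, u64 start,
                      u64 len);
    };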

The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing.  The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks.  This resulted in a lot of bugs.  Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction.  We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.

The server now only moves free extents into client allocators when they
fall below a low threshold.  This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
f8e1812288 scoutfs: add srch infrastructure
This introduces the srch mechanism that we'll use to accelerate finding
files based on the presence of a given named xattr.  This is an
optimized version of the initial prototype that was using locked btree
items for .indx. xattrs.

This is built around specific compressed data structures, having the
operation cost match the reality of orders of magnitude more writers
than readers, and adopting a relaxed locking model.  Combining all of
this, maintaining the xattrs no longer tanks creation rates while still
providing excellent search latencies, given that searches are defined
as rare and relatively expensive.

The core data type is the srch entry which maps a hashed name to an
inode number.  Mounts can append entries to the end of unsorted log
files during their transaction.  The server tracks these files and
rotates them into a list of files as they get large enough.  Mounts have
compaction work that regularly asks the server for a set of files to
read and combine into a single sorted output file.  The server only
initiates compactions when it sees a number of files of roughly the same
size.  Searches then walk all the committed srch files, both log files
and sorted compacted files, looking for entries that associate an xattr
name with an inode number.
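
The core entry, roughly (field names assumed; on-disk encoding
elided):

    /* maps a hashed xattr name to an inode; duplicates cancel out */
    struct scoutfs_srch_entry {
        __le64 hash;   /* hash of the xattr name */
        __le64 ino;    /* inode that had the xattr */
        __le64 id;     /* orders create/delete instances */
    };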

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00