Commit Graph

18 Commits

Author SHA1 Message Date
Auke Kok
5e2009f939 Avoid double counting deltas from non-input finalized log trees.
Readers currently accumulate all finalized log tree deltas into
a single bucket for deciding whether they are already in fs_root
or not, but finalized trees that aren't inputs to a current merge
will have higher seqs, and thus we may be double applying deltas
already merged into fs_root.

To distinguish, scoutfs_totl_merge_contribute() needs to know the
merge status item seq.  We change wkic's get_roots() from using the
SCOUTFS_NET_CMD_GET_ROOTS RPC to reading the superblock directly.
This is needed because totl merge resolution has to use the same data
as the btree roots it is operating on, thus we can't grab it from a
SCOUTFS_NET_CMD_GET_ROOTS packet - it likely is different.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-04-10 13:50:21 -07:00
Auke Kok
8bdc20af21 Rename/reword FINALIZED to MERGE_INPUT.
These mislabeled members and enums were clearly not describing
the actual data being handled and obfuscating the intent of
avoiding mixing merge input items with non-merge input items.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2026-04-10 13:50:21 -07:00
Zach Brown
5f156b7a36 Add scoutfs_forest_read_items_roots
Add a forest item reading interface that lets the caller specify the net
roots instead of always getting them from a network request.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
4b87045447 Pre-declare scoutfs_lock in forest.h
Definitions in forest.h use lock pointers.  Pre-declare the struct so it
doesn't break inclusion without lock.h, following current practice in
the header.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-25 15:11:20 -07:00
Zach Brown
95ed36f9d3 Maintain inode count in super and log trees
Add a count of used inodes to the super block and a change in the inode
count to the log_trees struct.   Client transactions track the change in
inode count as they create and delete inodes.   The log_trees delta is
added to the count in the super as finalized log_trees are deleted.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
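
The accounting described in the commit above can be pictured with a small C sketch. The struct and function names here (`super_sketch`, `log_trees_sketch`, `track_inode_change`, `fold_inode_delta`) are hypothetical stand-ins for illustration and are not the scoutfs structures or API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the scoutfs on-disk structs. */
struct super_sketch {
	uint64_t inode_count;		/* total used inodes */
};

struct log_trees_sketch {
	int64_t inode_count_delta;	/* net creates minus deletes */
};

/* A client transaction records each inode create (+1) or delete (-1). */
static void track_inode_change(struct log_trees_sketch *lt, int created)
{
	lt->inode_count_delta += created ? 1 : -1;
}

/* Fold the delta into the super as a finalized log_trees is deleted. */
static void fold_inode_delta(struct super_sketch *super,
			     const struct log_trees_sketch *lt)
{
	/* a negative delta must not take the total below zero */
	assert(lt->inode_count_delta >= 0 ||
	       super->inode_count >= (uint64_t)-lt->inode_count_delta);
	super->inode_count += (uint64_t)lt->inode_count_delta;
}
```

Keeping a signed delta per log tree means clients never touch the shared total directly; only the fold step at deletion time modifies the super.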
Zach Brown
b9a0f1709f Add xattr .totl. tag
Add the .totl. xattr tag.  When the tag is set the end of the name
specifies a total name with 3 encoded u64s separated by dots.  The value
of the xattr is a u64 that is added to the named total.   An ioctl is
added to read the totals.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-13 14:41:07 -07:00
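
As a rough illustration of the naming scheme in the commit above, the following C sketch pulls three u64s off the end of an xattr name. Decimal encoding is an assumption (the commit only says "encoded"), and `parse_totl_name` is a hypothetical helper, not a scoutfs function.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the last three dot-separated fields of an xattr name as u64s.
 * Decimal encoding is an assumption made for this sketch.
 * Returns 0 on success or -EINVAL if the name doesn't end in ".a.b.c".
 */
static int parse_totl_name(const char *name, uint64_t out[3])
{
	const char *end = name + strlen(name);	/* one past the current field */
	int i;

	for (i = 2; i >= 0; i--) {
		const char *p = end;
		char *stop;
		unsigned long long v;

		/* walk back to the character after the preceding dot */
		while (p > name && p[-1] != '.')
			p--;
		/* reject a missing dot, an empty field, or a non-digit start */
		if (p == name || p == end || !isdigit((unsigned char)*p))
			return -EINVAL;

		errno = 0;
		v = strtoull(p, &stop, 10);
		if (errno != 0 || stop != end)
			return -EINVAL;

		out[i] = (uint64_t)v;
		end = p - 1;	/* the dot that preceded this field */
	}

	return 0;
}
```

With a made-up name such as `scoutfs.totl.usage.1.20.300`, this sketch would fill `out` with 1, 20 and 300; a name with fewer than three numeric trailing fields is rejected.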
Zach Brown
a59fd5865d Add seq and flags to btree items
The fs log btrees have values that start with a header that stores the
item's seq and flags.  There's a lot of sketchy code that manipulates
the value header as items are passed around.

This adds the seq and flags as core item fields in the btree.   They're
only set by the interfaces that are used to store fs items: _insert_list
and _merge.  The rest of the btree items that use the main interface
don't work with the fields.

This was done to help delta items discover when logged items have been
merged before the finalized log btrees are deleted, and the code ends up
being quite a bit cleaner.

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-09 14:44:55 -07:00
Zach Brown
d6bed7181f Remove almost all interruptible waits
As subsystems were built I tended to use interruptible waits in the hope
that we'd let users break out of most waits.

The reality is that we have significant code paths that have trouble
unwinding.  Final inode deletion during iput->evict in a task is a good
example.  It's madness to have a pending signal turn an inode deletion
from an efficient inline operation to a deferred background orphan inode
scan deletion.

It also happens that golang built pre-emptive thread scheduling around
signals.  Under load we see a surprising amount of signal spam and it
has created surprising error cases which would have otherwise been fine.

This changes waits to expect that IOs (including network commands) will
complete reasonably promptly.  We remove all interruptible waits with
the notable exception of breaking out of a pending mount.  That requires
shuffling setup around a little bit so that the first network message we
wait for is the lock for getting the root inode.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-30 13:22:42 -07:00
Zach Brown
9db3b475c0 Stop log merge work earlier during unmount
The forest log merge work calls into the client to send commit requests
to the server.  The forest is usually destroyed relatively late in the
sequence and can still be running after the client is destroyed.

Adding a _forest_stop call lets us stop the log merging work
before the client is destroyed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00
Zach Brown
65c39e5f97 Item seq is max of trans and lock write_seq
Rename the item version to seq and set it to the max of the transaction
seq and the lock's write_seq.  This lets btree item merging choose a seq
at which all dirty items written in future commits must have greater
seqs.  It can drop the seqs from items written to the fs tree during
btree merging knowing that there aren't any older items out in
transactions that could be mistaken for newer items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
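
The rule in the commit above reduces to a single comparison. A minimal sketch, with hypothetical function names that are not taken from scoutfs:

```c
#include <stdint.h>

/* An item's seq is the greater of the transaction seq and the lock's write_seq. */
static uint64_t item_seq(uint64_t trans_seq, uint64_t lock_write_seq)
{
	return trans_seq > lock_write_seq ? trans_seq : lock_write_seq;
}

/* When two versions of the same item exist, the greater seq is the newer one. */
static int is_newer(uint64_t seq_a, uint64_t seq_b)
{
	return seq_a > seq_b;
}
```

Because every dirty item takes the max of both sources, a merge can pick a seq and rely on all later-written items comparing as newer.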
Zach Brown
ff532eba75 scoutfs: recover max lock write_version
Write locks are given an increasing version number as they're granted
which makes its way into items in the log btrees and is used to find the
most recent version of an item.

The initialization of the lock server's next write_version for granted
locks dates back to the initial prototype of the forest of log btrees.
It is only initialized to zero as the module is loaded.  This means that
reloading the module, perhaps by rebooting, resets all the item versions
to 0 and can lead to newly written items being ignored in favour of
older existing items with greater versions from a previous mount.

To fix this we initialize the lock server's write_version to the
greatest of all the versions in items in log btrees.  We add a field to
the log_trees struct which records the greatest version which is
maintained as we write out items in transactions.  These are read by the
server as it starts.

Then lock recovery needs to include the write_version so that the
lock_server can be sure to set the next write_version past the greatest
version in the currently granted locks.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-30 11:14:10 -07:00
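
The recovery step described above amounts to taking a maximum over two sources. The sketch below assumes the per-log_trees maxima and the recovered lock versions are available as plain arrays; `greatest_write_version` is a hypothetical helper, and the "greatest plus one" choice for the next granted version is an assumption about what "past the greatest version" means.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the greatest version seen in either the log_trees records or
 * the locks reported during recovery.  Returns 0 if both are empty.
 */
static uint64_t greatest_write_version(const uint64_t *log_trees_max,
				       size_t nr_trees,
				       const uint64_t *lock_versions,
				       size_t nr_locks)
{
	uint64_t max = 0;
	size_t i;

	for (i = 0; i < nr_trees; i++)
		if (log_trees_max[i] > max)
			max = log_trees_max[i];

	for (i = 0; i < nr_locks; i++)
		if (lock_versions[i] > max)
			max = lock_versions[i];

	return max;
}

/* Assumed policy: the next granted write lock gets greatest + 1. */
static uint64_t next_write_version(uint64_t greatest)
{
	return greatest + 1;
}
```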
Zach Brown
e60f4e7082 scoutfs: use full extents for data and alloc
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly.  That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.

By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.

Most of this change is churn from changing allocator function and struct
names.

File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity.  All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions.  This now means
that fallocate and especially restoring offline extents can use larger
extents.  Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.

The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing.  The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks.  This resulted in a lot of bugs.  Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction.  We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.

The server now only moves free extents into client allocators when they
fall below a low threshold.  This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
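
The low-threshold refill policy mentioned at the end of the commit above might look something like the following. This is only a sketch: the `low` and `target` parameters and the `refill_blocks` name are assumptions for illustration, and the commit does not say how much the server moves once the threshold is crossed.

```c
#include <stdint.h>

/*
 * Decide how many free blocks to move into a client's allocator.
 * Nothing moves while the client is at or above the low threshold;
 * below it, top up toward a target, limited by what the server has.
 */
static uint64_t refill_blocks(uint64_t client_free, uint64_t server_free,
			      uint64_t low, uint64_t target)
{
	uint64_t want;

	if (client_free >= low)
		return 0;

	want = target > client_free ? target - client_free : 0;
	return want < server_free ? want : server_free;
}
```

Refilling only below a threshold means most commits leave the client's allocator roots untouched by the server, which is the reduction in shared modification the commit describes.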
Zach Brown
12067e99ab scoutfs: remove item granular work from forest
Now that the item cache is bearing the load of high frequency item
calls, we can remove all the item granular work that the forest was
trying to do.  The item cache amortizes the cost of the forest so its
remaining methods can go straight to the btrees and don't need
complicated state to reduce the overhead of item calls.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
b1757a061e scoutfs: add forest methods for item cache
Add forest calls that the item cache will use.  It needs to read all the
items in the leaf blocks of forest btree which could contain the key,
write dirty items to the log btree, and dirty bits in the bloom block as
items are dirtied.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
Zach Brown
f8e1812288 scoutfs: add srch infrastructure
This introduces the srch mechanism that we'll use to accelerate finding
files based on the presence of a given named xattr.  This is an
optimized version of the initial prototype that was using locked btree
items for .indx. xattrs.

This is built around specific compressed data structures, having the
operation cost match the reality of orders of magnitude more writers
than readers, and adopting a relaxed locking model.  Combine all of this
and maintaining the xattrs no longer tanks creation rates while
maintaining excellent search latencies, given that searches are defined
as rare and relatively expensive.

The core data type is the srch entry which maps a hashed name to an
inode number.  Mounts can append entries to the end of unsorted log
files during their transaction.  The server tracks these files and
rotates them into a list of files as they get large enough.  Mounts have
compaction work that regularly asks the server for a set of files to
read and combine into a single sorted output file.  The server only
initiates compactions when it sees a number of files of roughly the same
size.  Searches then walk all the committed srch files, both log files
and sorted compacted files, looking for entries that associate an xattr
name with an inode number.

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00
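
To make the compaction step above concrete, here is a minimal C sketch that merges two already-sorted runs of entries into one sorted output. The `srch_entry_sketch` layout and the function names are hypothetical; the real entries are stored in compressed form, and unsorted log files would need to be sorted before a merge like this.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of an entry: hashed xattr name -> inode. */
struct srch_entry_sketch {
	uint64_t hash;
	uint64_t ino;
};

/* Order entries by hash, then by inode number. */
static int entry_cmp(const struct srch_entry_sketch *a,
		     const struct srch_entry_sketch *b)
{
	if (a->hash != b->hash)
		return a->hash < b->hash ? -1 : 1;
	if (a->ino != b->ino)
		return a->ino < b->ino ? -1 : 1;
	return 0;
}

/*
 * Merge two sorted inputs into out, which must have room for
 * na + nb entries.  Returns the number of entries written.
 */
static size_t compact_two(const struct srch_entry_sketch *a, size_t na,
			  const struct srch_entry_sketch *b, size_t nb,
			  struct srch_entry_sketch *out)
{
	size_t i = 0, j = 0, n = 0;

	while (i < na && j < nb) {
		if (entry_cmp(&a[i], &b[j]) <= 0)
			out[n++] = a[i++];
		else
			out[n++] = b[j++];
	}
	while (i < na)
		out[n++] = a[i++];
	while (j < nb)
		out[n++] = b[j++];

	return n;
}
```

A search for one xattr name would then hash the name and look for matching entries in each sorted file.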
Zach Brown
85142dcadf scoutfs: use radix allocator
Convert metadata block and file data extent allocations to use the radix
allocator.

Most of this is simple transitions between types and calls.  The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix.  We remove the code and fields that were responsible for adding
uninitialized data and metadata.

The rest of the unused block allocator code is only ifdefed out.  It'll
be removed in a separate patch to reduce noise here.

Signed-off-by: Zach Brown <zab@versity.com>
2020-02-25 12:03:46 -08:00
Zach Brown
dee9fbcf66 scoutfs: use packed extents and bitmaps
The btree forest item storage doesn't have as much item granular state
as the item cache did.  The item cache could tell if a cached item was
populated from persistent storage or was created in memory.  It could
simply remove created items rather than leaving behind a deletion item.

The cached btree blocks in the btree forest item storage mechanism can't
do this.  It has to create deletion items when deleting newly created
items because it doesn't know if the item already exists in the
persistent record or not.

This created a problem with the extent storage we were using.  The
individual extent items were stored with a key set to the last logical
block of their extent.  As extents grew or shrank they often were
deleted and created at different key values during a transaction.  In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent.  Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.

Streaming writes would operate on O(n) for every extent operation.  It
got to be out of hand.  This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.

For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.

Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items.  The client is no longer working with free extent
items managed by the forest, it's working with free block bitmap btrees
directly.  It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.

Previously the client and server would exchange extents with network
messages.  Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction.  The client doesn't need to send free extents back to
the server so we can remove those tasks and rpcs.

The server no longer has to manage free extents.  It transfers block
bitmap items between trees around commits.   All of its extent
manipulation can be removed.

The item size portion of transaction item counts are removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed sized
segments.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
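
The idea of tracking free blocks in fixed size bitmap items, as described in the commit above, can be sketched as follows. The item size (`SKETCH_BITS_PER_ITEM`), the struct layout and the helper names are all assumptions made for illustration; they are not the scoutfs format.

```c
#include <stdint.h>

#define SKETCH_BITS_PER_ITEM 1024	/* hypothetical; not the scoutfs value */

/* One fixed size item covers a fixed, aligned range of block numbers. */
struct bitmap_item_sketch {
	uint64_t base;					/* first block covered */
	uint64_t bits[SKETCH_BITS_PER_ITEM / 64];	/* set bit == free block */
};

/* The item covering a block has a stable key derived from the block number. */
static uint64_t item_base(uint64_t blkno)
{
	return blkno - (blkno % SKETCH_BITS_PER_ITEM);
}

static void set_free(struct bitmap_item_sketch *it, uint64_t blkno)
{
	uint64_t bit = blkno - it->base;

	it->bits[bit / 64] |= UINT64_C(1) << (bit % 64);
}

static void clear_free(struct bitmap_item_sketch *it, uint64_t blkno)
{
	uint64_t bit = blkno - it->base;

	it->bits[bit / 64] &= ~(UINT64_C(1) << (bit % 64));
}

static int is_free(const struct bitmap_item_sketch *it, uint64_t blkno)
{
	uint64_t bit = blkno - it->base;

	return (it->bits[bit / 64] >> (bit % 64)) & 1;
}
```

Because the item key depends only on the block number, allocating and freeing blocks modifies an existing item in place instead of deleting and recreating items at new keys, which is what left the stream of deletion items behind with the earlier per-extent items.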
Zach Brown
858dad1d51 scoutfs: add forest subsystem
The forest code presents a consistent item interface that's implemented
on top of a forest of persistent btrees.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00