Convert metadata block and file data extent allocations to use the radix
allocator.
Most of this is simple transitions between types and calls. The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix. We remove the code and fields that were responsible for adding
uninitialized data and metadata.
The rest of the unused block allocator code is only ifdefed out. It'll
be removed in a separate patch to reduce noise here.
Signed-off-by: Zach Brown <zab@versity.com>
Add the allocator that uses bits stored in the leaves of a cow radix.
It'll replace two metadata and data allocators that were previously
storing allocation bitmap fragments in btree items.
Signed-off-by: Zach Brown <zab@versity.com>
Add a call to move a block's location in the cache without failure. The
radix allocator is going to use this to dirty radix blocks while making
atomic changes to multiple paths through multiple radix trees.
Signed-off-by: Zach Brown <zab@versity.com>
Switch the block cache from indexing blocks in a radix tree to using an
rbtree. We lose the RCU lookups but we gain being able to move blocks
around in the cache without allocation failure. And we no longer have
the problem of not being able to index large blocks with a 32bit long
radix key.
Signed-off-by: Zach Brown <zab@versity.com>
Add functions for callers to maintain a visited bit in cached blocks.
The radix allocator is going to use this to count the number of clean
blocks it sees across paths through the radix which can share parent
blocks.
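As a rough illustration of why a visited bit helps, here is a minimal Python model, not the kernel code. The `Block` class, its fields, and `count_clean()` are hypothetical names invented for this sketch. It counts each clean block once even when several paths share it.

```python
class Block:
    """Hypothetical stand-in for a cached block."""
    def __init__(self, blkno, dirty=False):
        self.blkno = blkno
        self.dirty = dirty
        self.visited = False


def count_clean(paths):
    """Count distinct clean blocks across paths that may share parents."""
    seen = []
    count = 0
    for path in paths:
        for blk in path:
            if blk.visited:
                continue          # already counted via another path
            blk.visited = True
            seen.append(blk)
            if not blk.dirty:
                count += 1
    # Clear the bits so the next walk starts fresh.
    for blk in seen:
        blk.visited = False
    return count
```

Two paths that share a clean root and a dirty parent, each ending in its own clean leaf, count three clean blocks rather than four.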
Signed-off-by: Zach Brown <zab@versity.com>
The bloom block reading code forgot to test if the read block was stale.
It would trust whatever it read. Now the read performed while building
up the roots to use can return stale and be retried.
Signed-off-by: Zach Brown <zab@versity.com>
Update the summary of the benefit we get from concurrent per-mount
commits. Instead of describing it specifically in terms of LSM we
abstract it out a bit to make it also true of writing per-mount log
btrees.
Signed-off-by: Zach Brown <zab@versity.com>
The forest item iterator was missing items. Picture the following
search pattern:
- find a candidate item to return in a root
- ignore a greater candidate to return in another root
- find the first candidate item's deletion in another root
The problem was that finding the deletion item didn't reset the notion
that we'd found a key. The next item from the second root was never
used because the found key wasn't reset and that root had already
searched past the found key.
The core architectural problem is that iteration can't examine each item
only once given that keys and deletions can be randomly distributed
across the roots.
The most efficient way to solve the problem is to really sort the
iteration positions in each root and then walk those in order. We
get the right answer and pay some data structure overhead to perform
the minimum number of btree searches.
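The idea of sorting per-root positions can be sketched with a small Python model. This is an illustration under assumptions, not the forest code: roots are plain sorted lists, a value of `None` stands for a deletion item, and a per-item `version` (hypothetical here) decides which root's copy of a key wins.

```python
import heapq


def forest_iter(roots):
    """Yield (key, value) for live items across roots in key order.

    Each root is a key-sorted list of (key, version, value) tuples; a
    value of None marks a deletion item.
    """
    heap = []
    for r, items in enumerate(roots):
        if items:
            key, version, _ = items[0]
            # Order positions by key, then newest version first.
            heapq.heappush(heap, (key, -version, r, 0))

    last_key = None
    while heap:
        key, _, r, i = heapq.heappop(heap)
        value = roots[r][i][2]
        # Advance this root's position and put it back in sorted order.
        if i + 1 < len(roots[r]):
            nkey, nver, _ = roots[r][i + 1]
            heapq.heappush(heap, (nkey, -nver, r, i + 1))
        if key == last_key:
            continue              # an older copy of a key already handled
        last_key = key
        if value is not None:
            yield key, value
```

With the search pattern described above (an item in one root, a greater item in a second, and the first item's deletion in a third), the walk skips the deleted key and still returns the item from the second root, because every root's position is consumed strictly in key order.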
Signed-off-by: Zach Brown <zab@versity.com>
As we shut down, the transaction tries to destroy any remaining dirty
blocks in its writer context. The block writer context was only
initialized by the client as it asked the server for the log trees.
This makes sure the writer is always initialized.
Signed-off-by: Zach Brown <zab@versity.com>
The xattr code had a static definition of the largest part item that it
would create. Change it to be a function of the largest fs item
value that can be created and clean up the code a bit in the process.
Signed-off-by: Zach Brown <zab@versity.com>
We are no longer storing individual extents in items from multiple
places and indexed in multiple ways. We can remove this extent support
code.
Signed-off-by: Zach Brown <zab@versity.com>
The btree forest item storage doesn't have as much item granular state
as the item cache did. The item cache could tell if a cached item was
populated from persistent storage or was created in memory. It could
simply remove created items rather than leaving behind a deletion item.
The cached btree blocks in the btree forest item storage mechanism can't
do this. It has to create deletion items when deleting newly created
items because it doesn't know if the item already exists in the
persistent record or not.
This created a problem with the extent storage we were using. The
individual extent items were stored with a key set to the last logical
block of their extent. As extents grew or shrank they often were
deleted and created at different key values during a transaction. In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent. Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.
Streaming writes would do O(n) work for every extent operation, which
got out of hand. This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.
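To make the cost concrete, here is a toy Python model of the old scheme. It is only a sketch: the dict-based log, `streaming_write()`, and `find_extent()` are hypothetical, and the real trees are btrees rather than dicts. It grows one extent a block at a time and counts how many deletion items a search has to skip.

```python
def streaming_write(nr_blocks):
    """Grow one extent a block at a time in a log that can't drop items.

    Items are keyed by the extent's last logical block. Returns a dict of
    key -> value where None marks a deletion item.
    """
    log = {}
    for last in range(nr_blocks):
        if last > 0:
            log[last - 1] = None      # the old key becomes a deletion item
        log[last] = (0, last + 1)     # (start, length) stored at the new key
    return log


def find_extent(log, block):
    """Search forward from block for a live extent; count items skipped."""
    skipped = 0
    for key in sorted(k for k in log if k >= block):
        if log[key] is None:
            skipped += 1
            continue
        return log[key], skipped
    return None, skipped
```

After writing n blocks, a lookup of block 0 skips n - 1 deletion items before it reaches the single live extent, which is the linear cost described above.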
For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.
Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items. The client is no longer working with free extent
items managed by the forest, it's working with free block bitmap btrees
directly. It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.
Previously the client and server would exchange extents with network
messages. Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction. The client doesn't need to send free extents back to
the server so we can remove those tasks and rpcs.
The server no longer has to manage free extents. It transfers block
bitmap items between trees around commits. All of its extent
manipulation can be removed.
The item size portion of transaction item counts is removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed sized
segments.
Signed-off-by: Zach Brown <zab@versity.com>
We need a way to compare two items in different log btrees and learn
which is the most recent. Each time we grant a new write lock we give
it a larger write version. Items store the version of the lock they're
written under. Readers can now easily see which item is newer.
This is a trivial initial implementation which is not consistent across
unmount or server failover. We'll need to recover the greatest
write_version from locks during recovery and from log trees as the
server starts up.
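A minimal Python sketch of the scheme, assuming an in-memory counter; `VersionSource` and `newest()` are hypothetical names, not the lock code's interface:

```python
import itertools


class VersionSource:
    """Hands out increasing write versions, one per granted write lock."""
    def __init__(self):
        # In-memory only: the sequence restarts with the process, which
        # mirrors the limitation described above.
        self._next = itertools.count(1)

    def grant_write_lock(self):
        return next(self._next)


def newest(item_a, item_b):
    """Return whichever (version, value) item was written under the later lock."""
    return item_a if item_a[0] > item_b[0] else item_b
```

A reader that finds the same key in two log btrees keeps the item whose stored version is larger.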
Signed-off-by: Zach Brown <zab@versity.com>
Add a btree_update variant which will insert the item if a previous
item wasn't found instead of returning -ENOENT. This saves callers from
having to look up before updating to discover if they should call
_create or _update.
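The difference between the two calls can be sketched in Python with a sorted list standing in for the btree. `ToyTree` and its `force()` method are hypothetical names for this illustration, and `KeyError` plays the role of -ENOENT.

```python
import bisect


class ToyTree:
    """Sorted-key store standing in for a btree; not the real interface."""
    def __init__(self):
        self.keys = []
        self.vals = []

    def _find(self, key):
        i = bisect.bisect_left(self.keys, key)
        return i, i < len(self.keys) and self.keys[i] == key

    def update(self, key, val):
        """Overwrite an existing item; fail if it is absent."""
        i, found = self._find(key)
        if not found:
            raise KeyError(key)       # the -ENOENT case
        self.vals[i] = val

    def force(self, key, val):
        """Update in place, or insert if absent: one search either way."""
        i, found = self._find(key)
        if found:
            self.vals[i] = val
        else:
            self.keys.insert(i, key)
            self.vals.insert(i, val)
```

A caller that only wants the item to hold a given value calls `force()` once instead of looking the key up and then choosing between create and update.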
Signed-off-by: Zach Brown <zab@versity.com>
Use simple buffer_heads to read and write the super. After getting rid
of the lsm code this would be the last user of our bio helpers. With
this converted we can remove the bio helpers along with the rest of the
lsm code.
Signed-off-by: Zach Brown <zab@versity.com>
Transaction commit now has to ask the forest to write the btrees
instead of writing dirty items in segments. It
also determines if holds fit in the dirty transaction by looking at
dirty btree blocks instead of item counts.
Locking no longer has to invalidate a private item cache because the
forest paths use the btree block cache where inconsistency is discovered
and invalidated as blocks are read.
Signed-off-by: Zach Brown <zab@versity.com>
Convert fs callers to work with the btree forest calls instead of the
lsm item cache calls. This is mostly a mechanical syntax conversion.
The inode dirtying path does now update the item rather than simply
dirtying it.
Signed-off-by: Zach Brown <zab@versity.com>
The forest code presents a consistent item interface that's implemented
on top of a forest of persistent btrees.
Signed-off-by: Zach Brown <zab@versity.com>
Add a simple start of a command that the client will use to commit its
dirty trees. This'll be expanded in the future to include more trees
and block allocation.
Signed-off-by: Zach Brown <zab@versity.com>
Teach the server to maintain and use its block allocator and writer
contexts when operating on its btrees.
The manifest tree operations aren't updated because they're about to be
removed.
Signed-off-by: Zach Brown <zab@versity.com>
Convert the btree to use our block cache, block allocation, and the
caller's explicit dirty block tracking writer context instead of the
ring. This is in preparation for the btree forest format where there
are concurrent multiple writers of independent dynamically sized btrees
instead of only the server writing one btree with a fixed maximum size.
All the machinery for tracking dirty blocks in the ring and migrating is
no longer needed.
Signed-off-by: Zach Brown <zab@versity.com>
Previous versions of the system had a simple block cache. This brings
it back with support for blocks that are larger than page size, a more
efficient LRU, and an explicit writer context.
Signed-off-by: Zach Brown <zab@versity.com>
The btree block header had some aggressively small values that limited
the largest block size that could be supported. Use larger 32bit values
so that we can support larger block sizes.
Signed-off-by: Zach Brown <zab@versity.com>
It turns out that the sorting performed by btree block item compaction
was pretty expensive. It's cheaper to keep the items packed at the end
of the block by moving earlier items towards the back of the block as
interior items are deleted. When the items are always packed at the end
of the block we no longer need to track fragmented free space and can
remove the 'free_reclaim' btree block field.
This brought the bulk empty file create rate up by about 20%.
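The packing scheme can be modelled in a few lines of Python. This is a sketch under assumptions: `ToyBlock` is hypothetical, it keeps the item index in a dict rather than in block headers, and it only shows how deleting an interior item slides lower-offset items toward the end so that free space stays contiguous.

```python
class ToyBlock:
    """Item values live packed against the end of a fixed-size buffer."""
    def __init__(self, size):
        self.buf = bytearray(size)
        self.free_end = size          # values occupy buf[free_end:]
        self.items = {}               # key -> (offset, length)

    def insert(self, key, val):
        off = self.free_end - len(val)
        if off < 0:
            raise ValueError("block full")
        self.buf[off:off + len(val)] = val
        self.items[key] = (off, len(val))
        self.free_end = off

    def delete(self, key):
        off, length = self.items.pop(key)
        # Slide everything stored at lower offsets over the hole.
        self.buf[self.free_end + length:off + length] = self.buf[self.free_end:off]
        for k, (o, l) in list(self.items.items()):
            if o < off:
                self.items[k] = (o + length, l)
        self.free_end += length

    def get(self, key):
        off, length = self.items[key]
        return bytes(self.buf[off:off + length])
```

Because the free region is always the single range before `free_end`, there is no fragmented space to account for, which is why a field like 'free_reclaim' is no longer needed.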
Signed-off-by: Zach Brown <zab@versity.com>
The previous _btree_update implementation required that the new value be
the same length as the old value. This change allows a new updated item
to be a different length. It performs the btree walk assuming that the
item will be larger so that there's room for the difference. It
doesn't search for the size of the existing item so it doesn't know if
the new item is smaller. It might leave the dirty leaf under the low
water mark, which is fine.
Signed-off-by: Zach Brown <zab@versity.com>
The item code had a manual comparison of lock modes when testing if a
given access was protected by a held lock. Let's offer a proper
interface from the lock code.
Signed-off-by: Zach Brown <zab@versity.com>
The modern upstream kernel has a ->iterate() readdir file_operations
method which takes a context and calls dir_emit(). We add some
kernelcompat helpers to juggle the various function definitions, types,
and arguments to support both the old ->readdir(filldir) and the new
->iterate(ctx) interfaces.
Signed-off-by: Zach Brown <zab@versity.com>
Usually lock_free() is called as users finish using a lock and when its
state shows that it is idle and won't be freed out from under another
use.
During shutdown we manually call lock_free() on all locks because
shutdown promises that there will be no more lock users, including
networking callbacks. But there is a case where network requests can be
pending and we shut down before waiting for their reply. This trips
BUG_ON assertions in lock_free() that would otherwise catch unsafe calls
of lock_free().
This is easiest to reproduce by interrupting a mount (which is waiting
on a lock to read the root inode).
The fix is to update each lock's state during shutdown to reflect the
promise made by shutdown. Requests aren't actually pending because
we've shut down networking before getting here.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs f.000000.r.200d94 error: Unknown or malformed option, "server_address=192.168.31.220"
Should be server_addr, fix it.
Cc: Zach Brown <zab@versity.com>
Fixes: 10c32("scoutfs: update README.md for server_addr")
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Update the instructions for starting up a system with the quorum count
mkfs option and server_addr mount option.
Signed-off-by: Zach Brown <zab@versity.com>
In a previous commit ("1bd094f scoutfs: migrate dirty btree blocks
during wrap") we fixed a bug where we wouldn't migrate blocks from the
old half of the ring because they were already dirty in memory. The fix
accidentally introduced the case where we wouldn't dirty blocks when
migrating if they were already in the current half.
We always have to dirty parent blocks when migrating because we might
need to modify them to reference the new location of child blocks that
are migrated. This bug meant that we'd modify clean blocks in memory
which would never make it to the persistent copy. The system could
survive as long as it never read that block back from its persistent
location. To see the corruption you'd either need tall btrees to be
shared between mounts or you'd need one mount to evict its clean
(actually modified) cached btree block under memory pressure and then
try to read it back.
Signed-off-by: Zach Brown <zab@versity.com>
It's possible to trigger stale segment reads during compaction. This
shouldn't be possible during regular operation because the server
protects the input segments while the compaction is pending. Stale
segment reads can only happen to client reads which aren't serialized
with segment allocation and writes.
Warn if we see a stale segment read during compaction. It means that we
either have a bug in the server or someone armed a stale segment read
trigger that hit compaction.
Signed-off-by: Zach Brown <zab@versity.com>
Lockdep gets angry when we try to destroy an accepted conn workqueue
from within work in a listening conn's workqueue. It doesn't recognize
that they have a hierarchical relationship that maintains a consistent
order and we can't get at the workqueue lockdep_map to set subclasses.
We add a destroy workqueue which will have its own class.
Signed-off-by: Zach Brown <zab@versity.com>
Lock recovery is perfectly normal if a server is unmounted and another
is elected to take its place. Turn the lock recovery message into an
info message instead of a warning and add another info message when lock
recovery is complete.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can fill a user struct with file system info. We're
going to use this to find the fsid and rid of a mount.
Signed-off-by: Zach Brown <zab@versity.com>
Use the client's rid in networking instead of the node_id.
The node_id no longer has to be allocated by the server and sent in the
greeting. Instead the client sends its rid to the server in its greeting.
The server then uses the client's announced rid just like it used to use
the node_id. It's used to record clients in the btree and to
identify clients in sending and receive processing.
The use of the rid in networking calls makes its way to locking and
compaction which now use the rid to identify clients instead of the
node_id.
Signed-off-by: Zach Brown <zab@versity.com>
The current quorum voting implementation had some rough edges that
increased the complexity of the system and introduced undesirable
failure modes. We can keep the same basic pattern but move
functionality around a few places, and rethink the quorum voting, to end
up with a meaningfully simpler system.
The motivation for this work was to remove the need to provide a
uniq_name option for every mount instance.
The first big change is to remove the idea of static configuration slots
for mounts. This removes the use of uniq_name. Mounts now simply have
a server_addr mount option instead of using their uniq_name to find
their address in the configuration.
The server can't check the configuration to see if a given connected
client's name is found in the quorum config. Clients can set a flag in
their sent greeting which indicates that they're a voter. This removes
the uniq_name from the greeting and mounted client records.
Without a static configuration mounts no longer have dedicated block
locations to write to. We increase the size of the region of quorum
blocks and have voters simply write to a random block. Overwriting vote
blocks is OK because we move from heartbeating design patterns to a
protocol strongly based on raft's election. We're using quorum blocks
to communicate votes instead of network messages and overwriting blocks
is analogous to lossy networks dropping vote messages in the raft
election protocol.
We were using the dedicated per-mount quorum blocks to track mounts that
had been elected and needed to be fenced. We no longer have that
storage so instead we add the idea of an election log that is stored in
every voting block. Readers merge the logs from all the blocks they
read and write the resulting merged log in their block.
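A small Python sketch of the merge-then-write pattern follows. The block layout, the `(term, mount)` log entries, and the function names are all assumptions made for illustration; the on-disk format is not described here.

```python
import random


def merge_logs(blocks):
    """Union the election logs from every block that could be read."""
    merged = set()
    for blk in blocks:
        if blk is not None:
            merged |= blk["log"]
    return merged


def write_block(blocks, writer, new_entries=(), rng=random):
    """Merge all visible logs, add our entries, write to a random block."""
    log = merge_logs(blocks) | set(new_entries)
    slot = rng.randrange(len(blocks))
    blocks[slot] = {"writer": writer, "log": log}
    return slot
```

Because the writer merges before it writes, overwriting another voter's block does not lose the log entries that block carried at the time it was read.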
With no static quorum configuration we no longer have to worry about the
complexity of changing the slot configurations while they're in use.
The only persistent configuration is the number of votes a candidate
needs to be elected by a quorum.
It was a mistake to use quorum voting blocks to communicate state
between the server and the quorum voters. We can easily move the
unmount_barrier, server address, and fencing state from the quorum
blocks into the super block. The server no longer needs the quorum
election info struct to be able to later write its quorum block. It
instead writes a few fields in the super. There's only one place where
clients need to look to find out who they should connect to or if they
can finish unmount.
Signed-off-by: Zach Brown <zab@versity.com>