Like a lot of places in the server, get_log_trees() doesn't have the
tools it needs to safely unwind partial changes in the face of an error.
In the worst case, it can have moved extents from the mount's log_trees
item into the server's main data allocator. The dirty data allocator
reference is in the super block so it can be written later. The dirty
log_trees reference is on stack, though, so it will be thrown away on
error. This ends up duplicating extents in the persistent structures
because they're written in the new dirty allocator but still remain in
the unwritten source log_trees allocator.
This change makes it harder for that to happen. It dirties the
log_trees item and always tries to update it so that the dirty blocks
are consistent if they're later written out. If we do get an error
updating
the item we throw an assertion. It's not great, but it matches other
similar circumstances in other parts of the server.
Signed-off-by: Zach Brown <zab@versity.com>
Client log_trees allocator btrees can build up quite a number of
extents. In the right circumstances moving fragmented extents can
require dirtying a large number of paths to leaf blocks in the core
allocator btrees. It might not be possible to dirty all the blocks
necessary to
move all the extents in one commit.
This reworks the extent motion so that it can be performed in multiple
commits if the meta allocator for the commit runs out while it is moving
extents. It's a minimal fix with as little disruption to the ordering
of commits and locking as possible. It simply bubbles up an error when
the allocators run out and retries functions that can already be retried
in other circumstances.
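The retry shape can be sketched in userspace C; `move_some`, `move_all`,
and the fixed budget model are illustrative stand-ins for the scoutfs
extent movers, not the real functions:

```c
#include <errno.h>

/* Sketch of bubbling up an allocator-exhaustion error and retrying in
 * the next commit; names and the budget model are illustrative. */
static int move_some(int *remaining, int budget)
{
	while (*remaining > 0) {
		if (budget-- == 0)
			return -ENOSPC;	/* commit's meta alloc ran out */
		(*remaining)--;		/* moved one extent */
	}
	return 0;
}

static int move_all(int remaining, int budget, int *commits)
{
	int ret;

	*commits = 0;
	do {
		ret = move_some(&remaining, budget);
		(*commits)++;		/* each pass runs in its own commit */
	} while (ret == -ENOSPC);
	return ret;
}
```

The caller simply retries the whole operation when the mover reports
that the commit ran out, matching the existing retry paths.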
Signed-off-by: Zach Brown <zab@versity.com>
We're seeing allocator motion during get_log_trees dirty quite a lot of
blocks, which makes sense. Let's continue to up the budget. If we
still need significantly larger budgets we'll want to look into capping
the dirty block use of the allocator extent movers which will mean
changing callers to support partial progress.
Signed-off-by: Zach Brown <zab@versity.com>
After we've merged a log btree back into the main fs tree we kick off
work to free all its blocks. This would fully fill the transaction's
free blocks list before stopping to apply the commit.
Consuming the entire free list makes it hard to have concurrent holders
of a commit who also want to free things. This changes the log btree
block freeing to limit itself to a fraction of the budget that each
holder gets. That coarse limit avoids having to precisely account
for the allocations and frees made while modifying the freeing item,
while still freeing many blocks per commit.
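As a sketch, with made-up numbers and names rather than the real
scoutfs budget accounting, capping each pass at a per-holder fraction
of the commit budget still frees everything in a bounded number of
commits:

```c
/* Each holder may free at most budget/holders blocks per commit; the
 * freeing work resumes in later commits until nothing remains. */
#define COMMIT_BUDGET	64
#define MAX_HOLDERS	4

static int free_commits_needed(int nr_blocks)
{
	int per_holder = COMMIT_BUDGET / MAX_HOLDERS;
	int commits = 0;

	while (nr_blocks > 0) {
		int n = nr_blocks < per_holder ? nr_blocks : per_holder;

		nr_blocks -= n;	/* freed under our fraction, commit applies */
		commits++;
	}
	return commits;
}
```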
Signed-off-by: Zach Brown <zab@versity.com>
Server commits use an allocator that has a limited number of available
metadata blocks and entries in a list for freed blocks. The allocator
is refilled between commits. Holders mustn't fully consume the
allocator during the commit, and that tended to work out because server
commit holders commit before sending responses. We'd tend to commit
frequently
enough that we'd get a chance to refill the allocators before they were
consumed.
But there was no mechanism to ensure that this would be the case.
Enough concurrent server holders were able to fully consume the
allocators before committing. This caused scoutfs_meta_alloc and _free
to return errors, leading the server to fail in the worst cases.
This changes the server commit tracking to use more robust structures
which limit the number of concurrent holders so that the allocators
aren't exhausted. The commit_users struct stops holders from making
progress once the allocators don't have room for more holders. It also
lets us stop future holders from making progress once the commit work
has been queued. The previous cute use of a rwsem didn't allow for
either of these protections.
We don't have precise tracking of each holder's allocation consumption
so we don't try and reserve blocks for each holder. Instead we have a
maximum consumption per holder and make sure that all the holders can't
consume the allocators if they all use their full limit.
All of this requires the holding code paths to be well behaved and not
use more than the per-hold limit. We add some debugging code to print
the stacks of holders that were active when the total holder limit was
exceeded. This is the motivation for having state in the holders. We
can record some data at the time their hold started that'll make it a
little easier to track down which of the holders exceeded their limit.
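A rough userspace model of the idea, assuming a `commit_users` with a
simple counter (the real struct, its debugging state, and its locking
are not shown):

```c
/* Limit concurrent holders so worst-case per-holder consumption can't
 * exhaust the commit's allocator; the numbers are illustrative. */
#define ALLOC_BLOCKS	1024
#define PER_HOLD_LIMIT	64
#define MAX_HOLDERS	(ALLOC_BLOCKS / PER_HOLD_LIMIT)

struct commit_users {
	int nr_holders;
	int commit_queued;	/* stops future holders once commit work queues */
};

static int commit_hold(struct commit_users *cu)
{
	if (cu->commit_queued || cu->nr_holders >= MAX_HOLDERS)
		return -1;	/* caller waits for the commit to cycle */
	cu->nr_holders++;
	return 0;
}

static void commit_release(struct commit_users *cu)
{
	cu->nr_holders--;
}
```

The invariant is that MAX_HOLDERS * PER_HOLD_LIMIT never exceeds the
blocks available to a commit.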
Signed-off-by: Zach Brown <zab@versity.com>
There was a brief time where we exported the ability to hold and apply
commits outside of the main server code. That wasn't a great idea, and
the few users have since been reworked to not require directly
manipulating server transactions, so we can reduce risk and make these
functions private again.
Signed-off-by: Zach Brown <zab@versity.com>
Quorum members will try to elect a new leader when they don't receive
heartbeats from the currently elected leader. This timeout is short to
encourage restoring service promptly.
Heartbeats are sent from the quorum worker thread and are delayed while
it synchronously starts up the server, which includes fencing previous
servers. If fence requests take too long then heartbeats will be
delayed long enough for remaining quorum members to elect a new leader
while the recently elected server is still busy fencing.
To fix this we decouple server startup from the quorum main thread.
Server starting and stopping becomes asynchronous so the quorum thread
is able to send heartbeats while the server work is off starting up and
fencing.
The server used to call into quorum to clear a flag as it exited. We
remove that mechanism and have the server maintain a running status that
quorum can query.
We add some state to the quorum work to track the asynchronous state of
the server. This lets the quorum protocol change roles immediately as
needed while remembering that there is a server running that needs to be
acted on.
The server used to also call into quorum to update quorum blocks. This
is a read-modify-write operation that has to be serialized. Now that we
have both the server starting up and the quorum work running they both
can't perform these read-modify-write cycles. Instead we have the
quorum work own all the block updates and it queries the server status
to determine when it should update the quorum block to indicate that the
server has fenced or shut down.
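A minimal sketch of the direction, assuming a simple running status
that the quorum work polls instead of being called into (the states and
helpers are illustrative, not the real scoutfs interface):

```c
#include <stdatomic.h>

/* Quorum queries the server's status rather than having the server
 * call into quorum as it exits. */
enum server_status { SERVER_DOWN, SERVER_STARTING, SERVER_UP, SERVER_STOPPING };

static _Atomic int server_status = SERVER_DOWN;

static void server_set_status(enum server_status st)
{
	atomic_store(&server_status, st);
}

/* Called from the quorum work loop to decide when to update its own
 * quorum block, e.g. recording that the server has shut down. */
static int server_is_down(void)
{
	return atomic_load(&server_status) == SERVER_DOWN;
}
```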
Signed-off-by: Zach Brown <zab@versity.com>
The mount options code is some of the oldest in the tree and is weirdly
split between options.c and super.c. This cleans up the options code,
moves it all to options.c, and reworks it to be more in line with the
modern subsystem convention of storing state in an allocated info
struct.
Rather than putting the parsed options in the super for everyone to
directly reference we put them in the private options info struct and
add a locked read function. This will let us add sysfs files to change
mount options while safely serializing with readers.
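The locked read pattern might look roughly like this in userspace C (a
sketch; `mount_options`, `options_read`, and the mutex are illustrative,
and the kernel code would use its own locking primitives):

```c
#include <pthread.h>
#include <string.h>

/* Parsed options live in a private struct; readers take a locked copy
 * so options can later be changed via sysfs without racing readers. */
struct mount_options {
	int quorum_slot_nr;
};

static struct mount_options options;
static pthread_mutex_t options_lock = PTHREAD_MUTEX_INITIALIZER;

static void options_read(struct mount_options *copy)
{
	pthread_mutex_lock(&options_lock);
	memcpy(copy, &options, sizeof(*copy));	/* stable snapshot */
	pthread_mutex_unlock(&options_lock);
}

static void options_set_slot(int nr)
{
	pthread_mutex_lock(&options_lock);
	options.quorum_slot_nr = nr;
	pthread_mutex_unlock(&options_lock);
}
```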
All the users of mount options that used to directly reference the
parsed struct now call the read function to get a copy. They're all
small local changes except for quorum which saves a static copy of the
quorum slot number because it references it in so many places and relies
on it not changing.
Finally, we remove the empty debugfs "options" directory.
Signed-off-by: Zach Brown <zab@versity.com>
The server's log merge complete request handler was considering the
absence of the client's original request as a failure. Unfortunately,
this case is possible if a previous server successfully completed the
client's request but the response was lost because it stopped for
whatever reason.
The failure was being logged as a hard error to the console which was
causing tests to occasionally fail during server failover that hit just
as the log merge completion was being processed.
The error was being sent to the client as a response, we just need to
silence the message for these expected but rare errors.
We also fix the related case where the server printed the even more
harsh WARN_ON if there was a next original request but it wasn't the one
we expected to find from our requesting client.
Signed-off-by: Zach Brown <zab@versity.com>
This reverts commit 61ad844891.
This fix was trying to ensure that lock recovery response handling
can't run after farewell calls reclaim_rid() by jumping through a bunch
of hoops to tear down locking state as the first farewell request
arrived.
It introduced a very slippery use after free during shutdown. It appears
that it was from drain_workqueue() previously being able to stop
chaining work. That's no longer possible when you're trying to drain
two workqueues that can queue work in each other.
We found a much clearer way to solve the problem so we can toss this.
Signed-off-by: Zach Brown <zab@versity.com>
We recently found that the server can send a farewell response and try
to tear down a client's lock state while it is still in lock recovery
with the client. The lock recovery response could add a lock
for the client after farewell's reclaim_rid() had decided the client was
gone forever and torn down its locks.
This left a lock in the lock server that wasn't associated with any
clients and so could never be invalidated. Attempts to acquire
conflicting locks with it would hang forever, which we saw as hangs in
testing with lots of unmounting.
We tried to fix it by serializing incoming request handling and
forcefully clobbering the client's lock state as we first got
the farewell request. That went very badly.
This takes another approach of trying to explicitly wait for lock
recovery to finish before sending farewell responses. It's more in
line with the overall pattern of having the client be up and functional
until farewell tears it down.
With this in place we can revert the other attempted fix that was
causing so many problems.
Signed-off-by: Zach Brown <zab@versity.com>
The server's little set_shutting_down() helper accidentally used a read
barrier instead of a write barrier.
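In C11 terms the bug is using acquire (read) ordering where release
(write) ordering is needed when publishing the flag. A userspace
analogue of the corrected pairing, with illustrative names:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool shutting_down;

/* The setter publishes the flag, so it needs a write barrier
 * (release semantics), not a read barrier. */
static void set_shutting_down(void)
{
	atomic_store_explicit(&shutting_down, true, memory_order_release);
}

/* The waiter pairs with it using a read barrier (acquire). */
static bool is_shutting_down(void)
{
	return atomic_load_explicit(&shutting_down, memory_order_acquire);
}
```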
Signed-off-by: Zach Brown <zab@versity.com>
Tear down client lock server state and set a boolean so that
there is no race between client/server processing lock recovery
at the same time as farewell.
Currently there is a bug where, if the server and clients are
unmounted, work from the client is processed out of order, which leaves
behind a server_lock for a RID that no longer exists.
In order to fix this we need to serialize SCOUTFS_NET_CMD_FAREWELL
in recv_worker.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
Only BUG_ON for inconsistency, not for commit errors or failure to
delete the original request.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
In scoutfs_server_worker we do not properly handle cleanup after
_block_writer_init and alloc_init. On error paths, if either of those
was initialized, we can call alloc_prepare_commit or writer_forget_all
to ensure we drop the block references and clear the dirty status of
all the blocks in the writer.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
In order to safely free blocks we need to first dirty the work. This
allows the work to resume later on without a double free.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
While checking in on some other code I noticed that we have lingering
allocator and writer contexts over in the lock server. The lock server
used to manage its own client state and recovery. We've since moved
that into shared recov functionality in the server. The lock server no
longer manipulates its own btrees and doesn't need these unused
references to the server's contexts.
Signed-off-by: Zach Brown <zab@versity.com>
There are a few bad corner cases in the state machine that governs how
client transactions are opened, modified, and committed.
The worst problem is on the server side. All server request handlers
need to cope with resent requests without causing bad side effects.
Both get_log_trees and commit_log_trees would try to fully process
resent requests. _get_log_trees() looks safe because it works with the
log_trees that was stored previously. _commit_log_trees() is not safe
because it can rotate out the srch log file referenced by the sent
log_trees every time it's processed. This could create extra srch
entries which would delete the first instance of entries. Worse still,
by injecting the same block structure into the system multiple times it
ends up causing multiple frees of the blocks that make up the srch file.
The client side problems are slightly different, but related. There
aren't strong constraints which guarantee that we'll only send a commit
request after a get request succeeds. In crazy circumstances the
commit request in the write worker could come before the first get in
mount succeeds. Far worse is that we can send multiple commit requests
for one transaction if it changes as we get errors during multiple
queued write attempts, particularly if we get errors from get_log_trees
after having successfully committed.
This hardens all these paths to ensure a strict sequence of
get_log_trees, transaction modification, and commit_log_trees.
On the server we add *_trans_seq fields to the log_trees struct so that
both get_ and commit_ can see that they've already prepared a commit to
send or have already committed the incoming commit, respectively. We
can use the get_trans_seq field as the trans_seq of the open transaction
and get rid of the entire separate mechanism we used to have for
tracking open trans seqs in the clients. We can get the same info by
walking the log_trees and looking at their *_trans_seq fields.
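The idempotency check might be sketched like this (a model only; the
real handlers persist the fields in the log_trees item and run under
server commits):

```c
/* Resent get_ and commit_log_trees requests are recognized by
 * comparing the *_trans_seq fields; names model the description. */
struct log_trees_model {
	unsigned long long get_trans_seq;
	unsigned long long commit_trans_seq;
};

static unsigned long long model_next_seq = 1;

static unsigned long long model_get(struct log_trees_model *lt)
{
	if (lt->get_trans_seq > lt->commit_trans_seq)
		return lt->get_trans_seq;	/* resend: already prepared */
	lt->get_trans_seq = model_next_seq++;	/* open a new transaction */
	return lt->get_trans_seq;
}

static int model_commit(struct log_trees_model *lt, unsigned long long seq)
{
	if (lt->commit_trans_seq >= seq)
		return 0;			/* resend: already committed */
	lt->commit_trans_seq = seq;		/* process the commit once */
	return 1;
}
```

A resent get returns the already-prepared seq, and a resent commit is
recognized and skipped rather than processed a second time.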
In the client we have the write worker immediately return success if
mount hasn't opened the first transaction. Then we don't have the
worker return to allow further modification until it has gotten success
from get_log_trees.
Signed-off-by: Zach Brown <zab@versity.com>
Our statfs implementation had clients reading the super block and using
the next free inode number to guess how many inodes there might be. We
are very aggressive with giving directories private pools of inode
numbers to allocate from. They're often not used at all, creating huge
gaps in allocated inode numbers. The ratio of the average number of
allocations per directory to the batch size given to each directory is
the factor that the used inode count can be off by.
Now that we have a precise count of active inodes we can use that to
return accurate counts of inodes in the files fields in the statfs
struct. We still don't have static inode allocation so the fields don't
make a ton of sense. We fake the total to give a reasonable estimate
of the total files that doesn't change, while the free count is
calculated from the correct count of used inodes.
While we're at it we add a request to get the summed fields that the
server can cheaply discover in cache rather than having the client
always perform read IOs.
Signed-off-by: Zach Brown <zab@versity.com>
Add a count of used inodes to the super block and a change in the inode
count to the log_trees struct. Client transactions track the change in
inode count as they create and delete inodes. The log_trees delta is
added to the count in the super as finalized log_trees are deleted.
Signed-off-by: Zach Brown <zab@versity.com>
We had previously started on a relatively simple notion of an
interoperability version which wasn't quite right. This fleshes out
support for a more functional format version. The super blocks have a
single version that defines behaviour of the running system. The code
supports a range of versions and we add some initial interfaces for
updating the version while the system is offline. All of this together
should let us safely change the underlying format over time.
Signed-off-by: Zach Brown <zab@versity.com>
As the server comes up it needs to fence any previous servers before it
assumes exclusive access to the device. If fencing fails it can leave
fence requests behind. The error path for these very early failures
didn't shut down fencing so we'd have lingering fence requests span the
life cycle of server startup and shutdown. The next time the server
starts up in this mount it can try to create the fence request again,
get an error because a lingering one already exists, and immediately
shut down.
The result is that fencing errors that hit that initial attempt during
server startup can become persistent fencing errors for the lifetime of
that mount, preventing it from ever successfully starting the server.
Moving the fence stop call so that all exit error paths hit it
consistently cleans up fence requests and avoids this problem. The next server
instance will get a chance to process the fence request again. It might
well hit the same error, but at least it gets a chance.
Signed-off-by: Zach Brown <zab@versity.com>
In some of the allocation paths there are goto statements
that end up calling kfree(). That is fine, but in cases
where the pointer is not initially set to NULL we
might have undefined behavior. kfree() on a NULL pointer
does nothing, so essentially these changes should not
change behavior, but they clarify the code paths.
Signed-off-by: Bryant Duffy-Ly <bduffyly@versity.com>
We have a problem where items can appear to go backwards in time because
of the way we choose which log btrees to finalize and merge.
Because we don't have versions in items in the fs_root, and even might
not have items at all if they were deleted, we always assume items in
log btrees are newer than items in the fs root.
This creates the requirement that we can't merge a log btree if it has
items that are also present in older versions in other log btrees which
are not being merged. The unmerged old item in the log btree would take
precedence over the newer merged item in the fs root.
We weren't enforcing this requirement at all. We used the max_item_seq
to ensure that all items were older than the current stable seq but that
says nothing about the relationship between older items in the finalized
and active log btrees. Nothing at all stops an active btree from having
an old version of a newer item that is present in another mount's
finalized log btree.
To reliably fix this we create a strict item seq discontinuity between
all the finalized merge inputs and all the active log btrees. Once any
log btree is naturally finalized the server forces all the clients to
group up and finalize all their open log btrees. A merge operation can
then safely operate on all the finalized trees before any new trees are
given to clients who would start using increasing item seqs.
Signed-off-by: Zach Brown <zab@versity.com>
The server doesn't give us much to go on when it gets an error handling
requests to work with log trees from the client. This adds a lot of
specific error messages so we can get a better understanding of
failures.
Signed-off-by: Zach Brown <zab@versity.com>
We were trusting the rid in the log trees struct that the client sent.
Compare it to our recorded rid on the connection and fail if the client
sent the wrong rid.
Signed-off-by: Zach Brown <zab@versity.com>
server_get_log_trees() sets the low flag in a mount's meta_avail
allocator, triggering enospc for any space consuming allocations in the
mount, if the server's global meta_avail pool falls below the reserved
block count. Before each server transaction opens we swap the global
meta_avail and meta_freed allocators to ensure that the transaction has
at least the reserved count of blocks available.
This creates a risk of premature enospc as the global meta_avail pool
drains and swaps to the larger meta_freed. The pool can be close to the
reserved count, perhaps at it exactly. _get_log_trees can fill the
client's mount, even a little, and drop the global meta_avail total
under the reserved count, triggering enospc, even though meta_freed
could have had quite a lot of blocks.
The fix is to ensure that the global meta_avail has 2x the reserved
count, swapping if it falls under that. This ensures that a server
transaction can consume an entire reserved count and still have enough
to avoid triggering enospc.
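The rule reduces to a small check; a sketch with illustrative names:

```c
struct meta_pools {
	unsigned long long avail;	/* global meta_avail total */
	unsigned long long freed;	/* global meta_freed total */
};

/* Swap when avail falls under 2x the reserved count so a server
 * transaction can consume a full reserved count without dipping
 * into reserves and triggering premature enospc. */
static int maybe_swap(struct meta_pools *p, unsigned long long reserved)
{
	unsigned long long tmp;

	if (p->avail >= 2 * reserved)
		return 0;
	tmp = p->avail;
	p->avail = p->freed;
	p->freed = tmp;
	return 1;
}
```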
This fixes a scattering of rare premature enospc returns that were
hitting during tests. It was rare for meta_avail to fall just at the
reserved count and for get_log_trees to have to refill the client
allocator, but it happened.
Signed-off-by: Zach Brown <zab@versity.com>
Add a scoutfs command that uses an ioctl to send a request to the server
to safely use a device that has grown.
Signed-off-by: Zach Brown <zab@versity.com>
As subsystems were built I tended to use interruptible waits in the hope
that we'd let users break out of most waits.
The reality is that we have significant code paths that have trouble
unwinding. Final inode deletion during iput->evict in a task is a good
example. It's madness to have a pending signal turn an inode deletion
from an efficient inline operation to a deferred background orphan inode
scan deletion.
It also happens that golang built pre-emptive thread scheduling around
signals. Under load we see a surprising amount of signal spam and it
has created surprising error cases which would have otherwise been fine.
This changes waits to expect that IOs (including network commands) will
complete reasonably promptly. We remove all interruptible waits with
the notable exception of breaking out of a pending mount. That requires
shuffling setup around a little bit so that the first network message we
wait for is the lock for getting the root inode.
Signed-off-by: Zach Brown <zab@versity.com>
We recently fixed problems sending omap responses to originating clients
which can race with the clients disconnecting. We need to handle the
requests sent to clients on behalf of an origination request in exactly
the same way. The send can race with the client being evicted. The
send is cleaned up after the race, which is safely ignored because the
client's rid has been removed from the server's request tracking.
Signed-off-by: Zach Brown <zab@versity.com>
The quorum service shuts down if it sees errors that mean that it can't
do its job.
This is mostly fatal errors gathering resources at startup or runtime IO
errors but it was also shutting down if server startup fails. That's
not quite right. This should be treated like the server shutting down
on errors. Quorum needs to stay around to participate in electing the
next server.
Fence timeouts could trigger this. A quorum mount could crash, the
next server without a fence script could have a fence request time out
and shut down, and now the third remaining server is left to indefinitely
send vote requests into the void.
With this fixed, continuing that example, the quorum service in the
second mount remains to elect the third server with a working fence
script after the second server shuts down after its fence request times
out.
Signed-off-by: Zach Brown <zab@versity.com>
The omap message lifecycle is a little different than the server's usual
handling that sends a response from the request handler. The response
is sent long after the initial receive handler, which pinned the
connection to the client. It's fine for the response to be dropped.
The main server request handler handled this case but other response
senders didn't. Put this error handling in the server response sender
itself so that all callers are covered.
Signed-off-by: Zach Brown <zab@versity.com>
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free. This adds support for
returning ENOSPC to client posix allocators as free space gets low.
For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space. The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks. In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing). When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.
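The client-side decision reduces to a predicate like the following
sketch (the names and exact conditions are illustrative):

```c
#include <errno.h>

/* Only allocating transactions see ENOSPC, and only when the local
 * allocator is low and the server has set the low flag. */
static int enter_trans(int allocating, int alloc_low, int server_low)
{
	if (allocating && alloc_low && server_low)
		return -ENOSPC;
	return 0;	/* frees and overwrites always make progress */
}
```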
Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.
For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.
The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.
We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when enospc is
going to be returned for metadata allocations.
We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.
And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.
Signed-off-by: Zach Brown <zab@versity.com>
The log merging work deletes log trees items once their item roots are
merged back into the fs root. Those deleted items could still have
populated srch files that would be lost. We force rotation of the srch
files in the items as they're reclaimed to turn them into rotated srch
files that can be compacted.
Signed-off-by: Zach Brown <zab@versity.com>
This adds the server processing side of the btree merge functionality.
The client isn't yet sending the log_merge messages so no merging will
be performed.
The bulk of the work happens as the server processes a get_log_merge
message to build a merge request for the client. It starts a log merge
if one isn't in flight. If one is in flight it checks to see if it
should be spliced and maybe finished. In the common case it finds the
next range to be merged and sends the request to the client to process.
The commit_log_merge handler is the completion side of that request. If
the request failed then we unwind its resources based on the stored
request item. If it succeeds we record it in an item for get_
processing to splice eventually.
Then we modify two existing server code paths.
First, get_log_tree doesn't just create or use a single existing log
btree for a client mount. If the existing log btree is large enough it
sets its finalized flag and advances the nr to use a new log btree.
That makes the old finalized log btree available for merging.
Then we need to be a bit more careful when reclaiming the open log btree
for a client. We can't use next to find the only open log btree; we use
prev to find the last and make sure that it isn't already finalized.
Signed-off-by: Zach Brown <zab@versity.com>
Extract part of the get_last_seq handler into a call that finds the last
stable client transaction seq. Log merging needs this to determine a
cutoff for stable items in log btrees.
Signed-off-by: Zach Brown <zab@versity.com>
Rename the item version to seq and set it to the max of the transaction
seq and the lock's write_seq. This lets btree item merging choose a seq
at which all dirty items written in future commits must have greater
seqs. It can drop the seqs from items written to the fs tree during
btree merging knowing that there aren't any older items out in
transactions that could be mistaken for newer items.
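The seq assignment itself is just a max; sketched with illustrative
names:

```c
/* A dirty item's seq is the max of the open transaction's seq and the
 * covering lock's write_seq, so all dirty items written in future
 * commits must have greater seqs. */
static unsigned long long item_seq(unsigned long long trans_seq,
				   unsigned long long write_seq)
{
	return trans_seq > write_seq ? trans_seq : write_seq;
}
```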
Signed-off-by: Zach Brown <zab@versity.com>
Rename the write_version lock field to write_seq and get it from the
core seq in the super block.
We're doing this to create a relationship between a client transaction's
seq and a lock's write_seq. New transactions will have a greater seq
than all previously granted write locks and new write locks will have a
greater seq than all open transactions. This will be used to resolve
ambiguities in item merging as transaction seqs are written out of order
and write locks span transactions.
Signed-off-by: Zach Brown <zab@versity.com>
Get the next seq for a client transaction from the core seq in the super
block. Remove its specific next_trans_seq field.
While making this change we switch to only using le64 in the network
message payloads, the rest of the processing now uses natural u64s.
Signed-off-by: Zach Brown <zab@versity.com>
Add a new seq field to the super block which will be the source of all
incremented seqs throughout the system. We give out incremented seqs to
callers with an atomic64_t in memory which is synced back to the super
block as we commit transactions in the server.
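A userspace analogue of the seq source (the kernel code would use an
atomic64_t; the names here are illustrative):

```c
#include <stdatomic.h>

static atomic_ullong core_seq = 1;	/* in-memory seq source */
static unsigned long long super_seq;	/* synced back at commit */

static unsigned long long get_next_seq(void)
{
	return atomic_fetch_add(&core_seq, 1);
}

/* A server commit writes the current seq back to the super block so
 * seqs keep increasing across restarts. */
static void sync_seq_to_super(void)
{
	super_seq = atomic_load(&core_seq);
}
```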
Signed-off-by: Zach Brown <zab@versity.com>
When we moved to the current allocator we fixed up the server commit
path to initialize the pair of allocators as a commit is finished rather
than before it starts. This removed all the error cases from
hold_commit. Remove the error handling from hold_commit calls to make
the system just a bit simpler.
Signed-off-by: Zach Brown <zab@versity.com>
The core quorum work loop assumes that it has exclusive access to its
slot's quorum block. It uniquely marks blocks it writes and verifies
the marks on read to discover if another mount has written to its slot
under the assumption that this must be a configuration error that put
two mounts in the same slot.
But the design of the leader bit in the block violates the invariant
that each slot's block is only written by that slot's mount. As the
server comes up and fences previous leaders it writes to their blocks
to clear their leader bits.
The final hole in the design is that because we're fencing mounts, not
slots, each slot can have two mounts in play. An active mount can be
using the slot and there can still be a persistent record of a previous
mount in the slot that crashed that needs to be fenced.
All this comes together to have the server fence an old mount in a slot
while a new mount is coming up. The new mount sees the mark change and
freaks out and stops participating in quorum.
The fix is to rework the quorum blocks so that each slot only writes to
its own block. Instead of the server writing to each fenced mount's
slot, it writes a fence event to its block once all previous mounts have
been fenced. We add a bit of bookkeeping so that the server can
discover when all block leader fence operations have completed. Each
event gets its own term so we can compare events to discover live
servers.
We get rid of the write marks and instead have an event that is written
as a quorum agent starts up and is then checked on every read to make
sure it still matches.
Signed-off-by: Zach Brown <zab@versity.com>
Add the peername of the client's connected socket to its mounted_client
item as it mounts. If the client doesn't recover then fencing can use
the IP to find the host to fence.
Signed-off-by: Zach Brown <zab@versity.com>
I saw a confusing hang that looked like a lack of ordering between
a waker setting shutting_down and a wait event testing it after
being woken up. Let's see if more barriers help.
Signed-off-by: Zach Brown <zab@versity.com>
The server is responsible for calling the fencing subsystem. It is the
source of fencing requests as it decides that previous mounts are
unresponsive. It is responsible for reclaiming resources for fenced
mounts and freeing their associated fence request.
Signed-off-by: Zach Brown <zab@versity.com>
Add super_ops->umount_begin so that we can implement a forced unmount
which tries to avoid issuing any more network or storage ops. It can
return errors and lose unsynchronized data.
Signed-off-by: Zach Brown <zab@versity.com>
Add the data_alloc_zone_blocks volume option. This changes the
behaviour of the server to try and give mounts free data extents which
fall in exclusive fixed-size zones.
We add the field to the scoutfs_volume_options struct and add it to the
set_volopt server handler which enforces constraints on the size of the
zones.
We then add fields to the log_trees struct which records the size of the
zones and sets bits for the zones that contain free extents in the
data_avail allocator root. The get_log_trees handler is changed to read
all the zone bitmaps from all the items, pass those bitmaps in to
_alloc_move to direct data allocations, and finally update the bitmaps
in the log_trees items to cover the newly allocated extents. The
log_trees data_alloc_zone fields are cleared as the mount's logs are
reclaimed to indicate that the mount is no longer writing to the zone.
The policy mechanism of finding free extents based on the bitmaps is
implemented down in _data_alloc_move().
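The bitmap bookkeeping amounts to mapping free extents to zone bits; a
sketch, assuming an illustrative 64-zone cap and made-up names:

```c
/* Set a bit for each fixed-size zone that a free data extent touches;
 * the zone count cap and names are illustrative. */
static unsigned long long extent_zone_bits(unsigned long long start,
					   unsigned long long len,
					   unsigned long long zone_blocks)
{
	unsigned long long first = start / zone_blocks;
	unsigned long long last = (start + len - 1) / zone_blocks;
	unsigned long long bits = 0;
	unsigned long long z;

	for (z = first; z <= last && z < 64; z++)
		bits |= 1ULL << z;
	return bits;
}
```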
Signed-off-by: Zach Brown <zab@versity.com>
Add parameters so that scoutfs_alloc_move() can first search for source
extents in specified zones. It uses relatively cheap searches through
the order items to find extents that intersect with the regions
described by the zone bitmaps.
Signed-off-by: Zach Brown <zab@versity.com>
Introduce global volume options. They're stored in the superblock and
can be seen in sysfs files that use network commands to get and
set the options on the server.
Signed-off-by: Zach Brown <zab@versity.com>
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount. This can delete inodes from
under other mounts which have opened the inode before it was unlinked on
another mount.
We fix this by adding cached inode tracking. Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.
This makes the two fast paths, opening and closing linked files and
deleting a file that was unlinked locally, pay only the moderate cost
of maintaining the bitmap locally and getting the open map once per
lock group. Removing many files in a group will only lock and get the
open map once per group.
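The per-group bitmap check might be modeled like this (the group size
and names are assumptions, not the real scoutfs open map format):

```c
#define GROUP_BITS 64ULL	/* inodes per lock-group bitmap */

static unsigned long long ino_group(unsigned long long ino)
{
	return ino / GROUP_BITS;
}

static unsigned long long ino_bit(unsigned long long ino)
{
	return 1ULL << (ino % GROUP_BITS);
}

/* The server ORs together the other mounts' maps for the group; final
 * deletion is safe only when no other mount has the inode cached. */
static int can_delete_inode(unsigned long long ino,
			    unsigned long long others_map)
{
	return !(others_map & ino_bit(ino));
}
```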
Signed-off-by: Zach Brown <zab@versity.com>