Each of the different block types had a reading function that read a
block and then checked its reference struct for its block type.
This gets rid of each block reference type and has a single block_ref
type which is then checked by a single ref reading function in the block
core. By putting ref checking in the core we no longer have to export
checking the block header crc, verifying headers, invalidating blocks,
or even reading raw blocks themselves. Everyone reads refs and leaves
the checking up to the core.
The changes don't have a significant functional effect. This is mostly
just changing types and moving code around. (There are some changes to
visible counters.)
This shares code, which is nice, but the real point is to put the block
reference checking in one place in the block core so that in a few
patches we can fix problems with writers dirtying blocks that are being
read.
Signed-off-by: Zach Brown <zab@versity.com>
The block cache wasn't safely handling the race between readers walking
the rcu radix_tree and the shrinker walking the LRU list. A reader
could get a reference
to a block that had been removed from the radix and was queued for
freeing. It'd clobber the free's llist_head union member by putting the
block back on the lru and both the read and free would crash as they
each corrupted each other's memory. We rarely saw this in heavy load
testing.
The fix is to clean up the use of rcu, refcounting, and freeing.
First, we get rid of the LRU list. Now we don't have to worry about
resolving racing accesses of blocks between two independent structures.
Instead of the shrinker walking the LRU list, we can mark blocks on access
such that shrinking can walk all blocks randomly and expect to quickly
find candidates to shrink.
To make it easier to concurrently walk all the blocks we switch to the
rhashtable instead of the radix tree. It also has nice per-bucket
locking so we can get rid of the global lock that protected the LRU list
and radix insertion. (And it isn't limited to 'long' keys so we can get
rid of the check for max meta blknos that couldn't be cached.)
Now we need to tighten up when read can get a reference and when shrink
can remove blocks. We have presence in the hash table hold a refcount
but we make it a magic high bit in the refcount so that it can be
differentiated from other references. Now lookup can atomically get a
reference to blocks that are in the hash table, and shrinking can
atomically remove blocks when it is the only other reference.
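A minimal sketch of that refcount scheme, with made-up names and an
arbitrary bit choice rather than the actual scoutfs code, could look
like:

    #include <linux/atomic.h>

    /* illustrative: this bit is held while the block is in the rhashtable */
    #define BLOCK_REF_HASHED        (1 << 30)

    /* lookup: only take a reference while the block is still hashed */
    static bool block_get_if_hashed(atomic_t *refcount)
    {
            int old, new;

            do {
                    old = atomic_read(refcount);
                    if (!(old & BLOCK_REF_HASHED))
                            return false;   /* mid-removal, caller retries lookup */
                    new = old + 1;
            } while (atomic_cmpxchg(refcount, old, new) != old);

            return true;
    }

    /* shrink: remove only when the hashed bit is the sole remaining reference */
    static bool block_remove_if_idle(atomic_t *refcount)
    {
            return atomic_cmpxchg(refcount, BLOCK_REF_HASHED, 0) == BLOCK_REF_HASHED;
    }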
We also clean up freeing a bit. It has to wait for the rcu grace period
to ensure that no other rcu readers can reference the blocks it's
freeing. It has to iterate over the list with _safe because it's
freeing as it goes.
Interestingly, when reworking the shrinker I noticed that we weren't
scaling the nr_to_scan from the pages we returned in previous shrink
calls back to blocks. We now divide the input from pages back into
blocks.
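As a rough sketch of the scaling (block_size is an assumed parameter
here, not the real plumbing):

    #include <linux/shrinker.h>
    #include <linux/types.h>

    static unsigned long scan_target_blocks(struct shrink_control *sc,
                                            u32 block_size)
    {
            unsigned long pages_per_block = block_size >> PAGE_SHIFT;

            /* e.g. 64KB blocks with 4KB pages: divide the page count by 16 */
            return sc->nr_to_scan / pages_per_block;
    }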
Signed-off-by: Zach Brown <zab@versity.com>
Previously the quorum configuration specified the number of votes
needed to elect the leader. This was an excessive amount of freedom in
the configuration of the cluster, creating all sorts of problems that
had to be designed around.
Most acutely, though, it required a probabilistic mechanism for mounts
to persistently record that they're starting a server so that future
servers could find and possibly fence them. They would write to a lot
of quorum blocks and trust that it was unlikely that future servers
would overwrite all of their written blocks. Overwriting was always
possible, which would be bad enough, but it also required so much IO
that we had to use long election timeouts to avoid spurious fencing.
These longer timeouts had already gone wrong on some storage
configurations, leading to hung mounts.
To fix this and other problems we see coming, like live membership
changes, we now specifically configure the number and identity of mounts
which will be participating in quorum voting. With specific identities,
mounts now have a corresponding specific block they can write to and
which future servers can read from to see if they're still running.
We change the quorum config in the super block from a single
quorum_count to an array of quorum slots which specify the address of
the mount that is assigned to that slot. The mount argument to specify
a quorum voter changes from "server_addr=$addr" to "quorum_slot_nr=$nr"
which specifies the mount's slot. The slot's address is used for UDP
election messages and TCP server connections.
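An illustrative sketch of the slot-based config (field names and sizes
are assumptions, not the real scoutfs on-disk format):

    #define QUORUM_MAX_SLOTS        15      /* illustrative */

    struct quorum_slot {
            __le32 addr;    /* IPv4 address of the mount assigned to this slot */
            __le16 port;
            __le16 flags;   /* e.g. marks the slot as in use */
    };

    struct quorum_config {
            struct quorum_slot slots[QUORUM_MAX_SLOTS];
    };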
Now that we have specifically configured unique IP addresses for all
the quorum members, we can use UDP messages to send and receive the
vote messages in the raft protocol to elect a leader. The quorum code
doesn't have to read and write disk block votes and is now a more
reasonable core loop that waits for either received network messages or
timeouts to advance the raft election state machine.
The quorum blocks are now used for slots to store their persistent raft
term and to set their leader state. We have event fields in the block
to record the timestamp of the most recent interesting events that
happened to the slot.
Now that raft doesn't use IO, we can leave the quorum election work
running in the background. The raft work in the quorum members is
always running so we can use a much more typical raft implementation
with heartbeats. Critically, this decouples the client and election
life cycles. Quorum is always running and is responsible for starting
and stopping the server. The client repeatedly tries to connect to a
server; it has nothing to do with deciding to participate in quorum.
Finally, we add a quorum/status sysfs file which shows the state of the
quorum raft protocol in a member mount and has the last messages that
were sent to or received from the other members.
Signed-off-by: Zach Brown <zab@versity.com>
Add a new distinguishable return value (ENOBUFS) from the allocator for
when the transaction cannot alloc space. This doesn't mean the
filesystem is full -- opening a new transaction may result in forward
progress.
Alter fallocate and get_blocks code to check for this err val and retry
with a new transaction. Handling actual ENOSPC can still happen, of
course.
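The retry pattern, sketched with placeholder helpers rather than the
real scoutfs entry points, looks roughly like:

    static int alloc_with_retry(struct super_block *sb, u64 count)
    {
            int ret;

            do {
                    ret = hold_transaction(sb);             /* placeholder */
                    if (ret)
                            return ret;

                    ret = alloc_data_blocks(sb, count);     /* placeholder */
                    release_transaction(sb);

                    /* -ENOBUFS means this transaction ran out of allocator
                     * space, not that the fs is full, so retry in a new
                     * transaction; a real -ENOSPC goes back to the caller */
            } while (ret == -ENOBUFS);

            return ret;
    }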
Add counter called "alloc_trans_retry" and increment it from both spots.
Signed-off-by: Andy Grover <agrover@versity.com>
[zab@versity.com: fixed up write_begin error paths]
We were using a trailing owner offset to iterate over btree item values
from the back of the block towards the front. We did this to reclaim
fragmented free space in a block to satisfy an allocation instead of
having to split the block, which is expensive mostly because it has to
allocate and free metadata blocks.
In the before times, we used to compact items by sorting items by their
offset, moving them, and then sorting them by their keys again. The
sorting by keys was expensive so we added these owner offsets to be able
to compact without sorting.
But the complexity of maintaining the owner metadata is not worth it.
We can avoid the expensive sorting by keys by allocating a temporary
array of item offsets and sorting only it by the value offset. That's
nice and quick, it was the key comparisons that were expensive. Then we
can remove the owner offset entirely, as well as the block header final
free region that compaction needed.
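A rough sketch of that compaction, with made-up types: sort a small
scratch array of item/value offsets by value offset and repack values
in that order, with no key comparisons needed.

    #include <linux/sort.h>
    #include <linux/types.h>

    struct off_pair {
            u16 item_off;   /* offset of the item header in the block */
            u16 val_off;    /* offset of the item's value in the block */
    };

    static int cmp_val_off(const void *a, const void *b)
    {
            const struct off_pair *pa = a;
            const struct off_pair *pb = b;

            return (int)pa->val_off - (int)pb->val_off;
    }

    /* fill pairs[] from the block's item headers, then sort cheaply by offset */
    static void sort_by_val_off(struct off_pair *pairs, unsigned int nr)
    {
            sort(pairs, nr, sizeof(pairs[0]), cmp_val_off, NULL);
    }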
And we also don't compact as often in the modern era because we do the
bulk of our work in the item cache instead of in the btree, and we've
changed the split/merge/compaction heuristics to avoid constantly
splitting/merging/compacting when an item population happens to hover
right around a shared threshold.
Signed-off-by: Zach Brown <zab@versity.com>
Previously the srch compaction work would output the entire compacted
file and delete the input files in one atomic commit. The server would
send the input files and an allocator to the client, and the client
would send back an output file and an allocator that included the
deletion of the input files. The server would merge in the allocator
and replace the input file items with the output file item.
Doing it this way required giving an enormous allocation pool to the
client in a radix, which would deal with recursive operations
(allocating from and freeing to the radix that is being modified). We
no longer have the radix allocator, and we use single block avail/free
lists instead of recursively modifying the btrees with free extent
items. The compaction RPC needs to work with a finite amount of
allocator resources that can be stored in an alloc list block.
The compaction work now does a fixed amount of work and a compaction
operation spans multiple work iterations.
A single compaction struct is now sent between the client and server in
the get_compact and commit_compact messages. The client records any
partial progress in the struct. The server writes that position into
PENDING items. It first searches for pending items to give to clients
before searching for files to start a new compaction operation.
The compact struct has flags to indicate whether the output file is
being written or the input files are being deleted. The server manages
the flags and sets the input file deletion flag only once the result of
the compaction has been reflected in the btree items which record srch
files.
We added the progress fields to the compaction struct, making it even
bigger than it already was, so we take the time to allocate them rather
than declaring them on the stack.
It's worth mentioning that each operation now takes a reasonably bounded
amount of time, which will make it feasible to decide that it has failed
and needs to be fenced.
Signed-off-by: Zach Brown <zab@versity.com>
Use alloc_foreach to count the free blocks in all the allocators instead
of sending an RPC to the server. We cache the results so that constant
df calls don't generate a constant stream of IO.
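A hedged sketch of the caching, with made-up field names and an
arbitrary expiry interval:

    static u64 cached_free_blocks(struct super_block *sb,
                                  struct count_cache *cache)
    {
            if (time_after(jiffies, cache->expires)) {
                    /* walk the allocators with alloc_foreach-style iteration */
                    cache->free_blocks = count_free_blocks(sb);     /* placeholder */
                    cache->expires = jiffies + msecs_to_jiffies(500);
            }

            return cache->free_blocks;
    }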
Signed-off-by: Zach Brown <zab@versity.com>
The first pass of the item cache didn't try to reclaim freed space at
all. It would leave behind very sparse pages, the oldest of which
would be reclaimed by memory pressure.
While this worked, it created much more stress on the system than is
necessary. Splitting a page with one key also makes it hard to
calculate the boundaries of the split pages, given that the start and
end keys could be the single item.
This adds a header field which tracks the free space in item cache
pages. Free space is created before the alloc offset by removing items
from the rbtree, but also from shrinking item values when updating or
deleting items.
If we try to split a page with sufficient free space to insert the
largest possible item then we compact the page instead of splitting it.
We copy the items into the front of an unused page and swap the pages.
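The decision amounts to something like the following, with placeholder
names and threshold:

    /* compact in place when the page could still hold the largest item */
    if (pg->free_bytes >= LARGEST_ITEM_BYTES) {
            /* copy live items into the front of a fresh page, swap the pages */
            compact_page(cache, pg);
    } else {
            split_page(cache, pg);
    }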
Signed-off-by: Zach Brown <zab@versity.com>
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly. That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.
By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.
Most of this change is churn from changing allocator function and struct
names.
File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity. All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions. This now means
that fallocate and especially restoring offline extents can use larger
extents. Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing. The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks. This resulted in a lot of bugs. Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction. We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.
The server now only moves free extents into client allocators when they
fall below a low threshold. This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add an allocator which uses btree items to store extents. Both the
client and server will use this for btree blocks, the client will use it
for srch blocks and data extents, and the server will move extents
between the core fs allocator btree roots and the clients' roots.
Signed-off-by: Zach Brown <zab@versity.com>
Add infrastructure for working with extents. Callers provide callbacks
which operate on their extent storage while this code performs the
fiddly splitting and merging of extents. This layer doesn't have any
persistent structures itself; it only operates on native structs in
memory.
Signed-off-by: Zach Brown <zab@versity.com>
The percpu_counter library merges the per-cpu counters with a shared
count when the per-cpu counter gets larger than a certain value. The
default is very small, so we often end up taking a shared lock to update
the count. Use a larger batch so that we take the lock less often.
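In kernels that provide percpu_counter_add_batch(), a minimal sketch
looks like this; the batch value is only illustrative:

    #include <linux/percpu_counter.h>

    #define COUNT_BATCH     1024    /* illustrative, larger than the tiny default */

    static void counter_add(struct percpu_counter *counter, s64 delta)
    {
            /* per-cpu deltas fold into the shared count (under its lock)
             * far less often with a larger batch */
            percpu_counter_add_batch(counter, delta, COUNT_BATCH);
    }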
Signed-off-by: Zach Brown <zab@versity.com>
Now that the item cache is bearing the load of high frequency item
calls, we can remove all the item granular work that the forest was
trying to do. The item cache amortizes the cost of the forest so its
remaining methods can go straight to the btrees and don't need
complicated state to reduce the overhead of item calls.
Signed-off-by: Zach Brown <zab@versity.com>
Add an item cache between fs callers and the forest of btrees. Calling
out to the btrees for every item operation was far too expensive. This
gives us a flexible in-memory structure for working with items that
isn't bound by the constraints of persistent block IO. We can rarely
stream large groups of items to and from the btrees and then use
efficient kernel memory structures for more frequent item operations.
This adds the infrastructure, nothing is calling it yet.
Signed-off-by: Zach Brown <zab@versity.com>
Add forest calls that the item cache will use. It needs to read all the
items in the leaf blocks of the forest btrees which could contain the
key, write dirty items to the log btree, and set bits in the bloom
block as items are dirtied.
Signed-off-by: Zach Brown <zab@versity.com>
In a merge where the input and source trees are the same, the input
block can be an initial pre-cow version of the dirty source block.
Dirtying blocks in the change will clear allocations in the dirty source
block but they will remain in the pre-cow input block. The merge can
then set these blocks in the dst, even though they were also used by
allocation, because they're still set in the pre-cow input block.
This fix is clumsy, but minimal and specific to this problem. A more
thorough fix is being worked on which introduces more staging allocator
trees and should stop calls from modifying the currently active avail
or free trees.
Signed-off-by: Zach Brown <zab@versity.com>
Lock invalidation has to make sure that changes are visible to future
readers. It was syncing if the current transaction is dirty. This was
never optimal, but it wasn't catastrophic when concurrent invalidation
work could all block on one sync in progress.
With the move to a single invalidation worker serially invalidating
locks it became unacceptable. Invalidation happening in the presence of
writers would constantly sync the current transaction while very old
unused write locks were invalidated. Their changes had long since been
committed in previous transactions.
We add a lock field to remember the transaction sequence which could
have been dirtied under the lock. If that transaction has already been
committed by the time we invalidate the lock it doesn't have to sync.
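A hedged sketch of the check, with illustrative field and helper names:

    static int maybe_sync_for_invalidation(struct super_block *sb,
                                           struct held_lock *lock)
    {
            /* only sync if the seq that could have been dirtied under
             * this lock hasn't already been committed */
            if (lock->dirty_trans_seq > last_committed_seq(sb))     /* placeholders */
                    return sync_current_transaction(sb);

            return 0;
    }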
Signed-off-by: Zach Brown <zab@versity.com>
The client lock network message processing callbacks were built to
simply perform the processing work for the message in the networking
work context that it was called in. This particularly makes sense for
invalidation because it has to interact with other components that
require blocking contexts (syncing commits, invalidating inodes,
truncating pages, etc).
The problem is that these messages are per-lock. With the right
workloads we can use all the capacity for executing work just in lock
invalidation work. There is no more work execution available for other
network processing. Critically, the blocked invalidation work is
waiting for the commit thread to get its network responses before
invalidation can make forward progress. I was easily reproducing
deadlocks by leaving behind a lot of locks and then triggering a flood
of invalidation requests on behalf of shrinking due to memory pressure.
The fix is to put locks on lists and have a small fixed number of work
contexts process all the locks pending for each message type. The
network callbacks don't block, they just put the lock on the list and
queue the work that will walk the lists. Invalidation now blocks one
work context, not the number of incoming requests.
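An illustrative sketch of the non-blocking callback side, with made-up
names:

    static void recv_invalidate_request(struct lock_info *linfo,
                                        struct held_lock *lck)
    {
            spin_lock(&linfo->lock);
            if (list_empty(&lck->invalidate_entry))
                    list_add_tail(&lck->invalidate_entry, &linfo->invalidate_list);
            spin_unlock(&linfo->lock);

            /* one worker walks the list and does the blocking invalidation */
            queue_work(linfo->workq, &linfo->invalidate_work);
    }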
There were some wait conditions in work that used to use the lock workq.
Other paths that change those conditions now have to know to queue the
work specifically, not just wake tasks which included blocked work
executors.
The other subtle impact of the change is that we can no longer rely on
networking to shut down message processing work that was happening in
its callbacks. We have to specifically stop our work queues in
_shutdown.
Signed-off-by: Zach Brown <zab@versity.com>
This introduces the srch mechanism that we'll use to accelerate finding
files based on the presence of a given named xattr. This is an
optimized version of the initial prototype that was using locked btree
items for .indx. xattrs.
This is built around specific compressed data structures, matching the
operation cost to the reality of orders of magnitude more writers than
readers, and adopting a relaxed locking model. With all of this
combined, maintaining the xattrs no longer tanks creation rates while
still delivering excellent search latencies, given that searches are
defined as rare and relatively expensive.
The core data type is the srch entry which maps a hashed name to an
inode number. Mounts can append entries to the end of unsorted log
files during their transaction. The server tracks these files and
rotates them into a list of files as they get large enough. Mounts have
compaction work that regularly asks the server for a set of files to
read and combine into a single sorted output file. The server only
initiates compactions when it sees a number of files of roughly the same
size. Searches then walk all the committed srch files, both log files
and sorted compacted files, looking for entries that associate an xattr
name with an inode number.
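Logically an entry carries little more than the hash and the inode; the
following is only a sketch of the contents, not the compressed on-disk
encoding:

    struct srch_entry_sketch {
            __le64 hash;    /* hash of the xattr name */
            __le64 ino;     /* inode that has the xattr */
    };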
Signed-off-by: Zach Brown <zab@versity.com>
The radix allocator has to be careful to not get lost in recursion
trying to allocate metadata blocks for its dirty radix blocks while
allocating metadata blocks for others.
The first pass had used path data structures to record the references to
all the blocks we'd need to modify to reflect the frees and allocations
performed while dirtying radix blocks. Once it had all the path blocks
it moved the old clean blocks into new dirty locations so that the
dirtying couldn't fail.
This had two very bad performance implications. First, it meant that
trying to read clean versions of dirtied trees would always read the old
blocks again because their clean version had been moved to the dirty
version. Typically this wouldn't happen but the server does exactly
this every time it tries to merge freed blocks back into its avail
allocator. This created a significant IO load on the server. Secondly,
that block cache move not being allowed to fail motivated us to move to
a locked rbtree for the block cache instead of the lockless rcu
radix_tree.
This changes the recursion avoidance to use per-block private metadata
to track every block that we allocate and cow rather than move. Each
dirty block knows its parent ref and the blknos it would clear and set.
If dirtying fails we can walk back through all the blocks we dirty and
restore their original references before dropping all the dirty blocks
and returning an error. This lets us get rid of the path structure
entirely and results in a much cleaner system.
This change meant tracking free blocks without clearing them as they're
used to satisfy dirty block allocations. The code now uses a cursor
that walks the avail metadata tree without modifying it. While building
this it became clear that tracking the first set bits of refs doesn't
provide any value if we're always searching from a cursor. The cursor
ends up providing the same benefit of avoiding constantly searching
empty initial bits and refs. Maintaining the first metadata was just
overhead.
Signed-off-by: Zach Brown <zab@versity.com>
The forest item operations were reading the super block to find the
roots that they should read items from.
This was easiest to implement to start, but it is too expensive. We
have to find the roots for every newly acquired lock and every call to
walk the inode seq indexes.
To avoid all these reads we first send the current stable versions of
the fs and logs btrees roots along with root grants. Then we add a net
command to get the current stable roots from the server. This is used
to refresh the roots if stale blocks are encountered and on the seq
index queries.
Signed-off-by: Zach Brown <zab@versity.com>
File data allocations come from radix allocators which are populated by
the server before each client transaction. It's possible to fully
consume the data allocator within one transaction if the number of dirty
metadata blocks is kept low. This could result in premature ENOSPC.
This was happening to the archive-light-cycle test. If the transactions
performed by previous tests lined up just right then the creation of the
initial test files could see ENOSPC and cause all sorts of nonsense in
the rest of the test, culminating in cmp commands stuck in offline
waits.
This introduces high and low data allocator water marks for
transactions. The server tries to fill data allocators for each
transaction to the high water mark and the client forces the commit of a
transaction if its data allocator falls below the low water mark.
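Sketched with placeholder names and thresholds, the two checks look
roughly like:

    /* server: fill the client's data allocator back up to the high
     * water mark before its next transaction */
    static void server_fill_data_alloc(struct server_info *server,
                                       struct data_alloc *da)
    {
            if (data_alloc_free(da) < DATA_ALLOC_HIGH_WATER)
                    move_free_extents(server, da,
                                      DATA_ALLOC_HIGH_WATER - data_alloc_free(da));
    }

    /* client: force a commit (counted by trans_commit_data_alloc_low)
     * once the allocator falls below the low water mark mid-transaction */
    static bool client_should_force_commit(struct data_alloc *da)
    {
            return data_alloc_free(da) < DATA_ALLOC_LOW_WATER;
    }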
The archive-light-cycle test now passes easily and we see the
trans_commit_data_alloc_low counter increasing during the test.
Signed-off-by: Zach Brown <zab@versity.com>
Remove a bunch of unused counters which have accumulated over time as
we've worked on the code and forgotten to remove counters.
Signed-off-by: Zach Brown <zab@versity.com>
The btree forest item storage doesn't have as much item granular state
as the item cache did. The item cache could tell if a cached item was
populated from persistent storage or was created in memory. It could
simply remove created items rather than leaving behind a deletion item.
The cached btree blocks in the btree forest item storage mechanism can't
do this. It has to create deletion items when deleting newly created
items because it doesn't know if the item already exists in the
persistent record or not.
This created a problem with the extent storage we were using. The
individual extent items were stored with a key set to the last logical
block of their extent. As extents grew or shrank they often were
deleted and created at different key values during a transaction. In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent. Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.
Streaming writes ended up doing O(n) work for every extent operation.
It got to be out of hand. This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.
For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.
Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items. The client is no longer working with free extent
items managed by the forest, it's working with free block bitmap btrees
directly. It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.
Previously the client and server would exchange extents with network
messages. Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction. The client doesn't need to send free extents back to
the server so we can remove those tasks and rpcs.
The server no longer has to manage free extents. It transfers block
bitmap items between trees around commits. All of its extent
manipulation can be removed.
The item size portion of transaction item counts is removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed sized
segments.
Signed-off-by: Zach Brown <zab@versity.com>
Transaction commit now has to ask the forest to write the btrees during
a transaction commit instead of writing dirty items in segments. It
also determines if holds fit in the dirty transaction by looking at
dirty btree blocks instead of item counts.
Locking no longer has to invalidate a private item cache because the
forest paths use the btree block cache where inconsistency is discovered
and invalidated as blocks are read.
Signed-off-by: Zach Brown <zab@versity.com>
Previous versions of the system had a simple block cache. This brings
it back with support for blocks that are larger than page size, a more
efficient LRU, and an explicit writer context.
Signed-off-by: Zach Brown <zab@versity.com>
The current quorum voting implementation had some rough edges that
increased the complexity of the system and introduced undesirable
failure modes. We can keep the same basic pattern but move
functionality around a few places, and rethink the quorum voting, to end
up with a meaningfully simpler system.
The motivation for this work was to remove the need to provide a
uniq_name option for every mount instance.
The first big change is to remove the idea of static configuration slots
for mounts. This removes the use of uniq_name. Mounts now simply have
a server_addr mount option instead of using their uniq_name to find
their address in the configuration.
The server can no longer check the configuration to see if a given
connected client's name is found in the quorum config. Instead, clients
set a flag in
their sent greeting which indicates that they're a voter. This removes
the uniq_name from the greeting and mounted client records.
Without a static configuration mounts no longer have dedicated block
locations to write to. We increase the size of the region of quorum
blocks and have voters simply write to a random block. Overwriting vote
blocks is OK because we move from heartbeating design patterns to a
protocol strongly based on raft's election. We're using quorum blocks
to communicate votes instead of network messages and overwriting blocks
is analogous to lossy networks dropping vote messages in the raft
election protocol.
We were using the dedicated per-mount quorum blocks to track mounts that
had been elected and needed to be fenced. We no longer have that
storage so instead we add the idea of an election log that is stored in
every voting block. Readers merge the logs from all the blocks they
read and write the resulting merged log in their block.
With no static quorum configuration we no longer have to worry about the
complexity of changing the slot configurations while they're in use.
The only persistent configuration is the number of votes a candidate
needs to be elected by a quorum.
It was a mistake to use quorum voting blocks to communicate state
between the server and the quorum voters. We can easily move the
unmount_barrier, server address, and fencing state from the quorum
blocks into the super block. The server no longer needs the quorum
election info struct to be able to later write its quorum block. It
instead writes a few fields in the super. There's only one place where
clients need to look to find out who they should connect to or if they
can finish unmount.
Signed-off-by: Zach Brown <zab@versity.com>
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO. As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.
This implements lock recovery by having the lock service recover locks
from clients as it starts up.
First the lock service stores records of connected clients in a btree
off the super block. Records are added as the server receives their
greeting and are removed as the server receives their farewell.
Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.
We add lock recover request and response messages that are used to
communicate locks from the clients to the server.
Signed-off-by: Zach Brown <zab@versity.com>
The current networking code has loose reliability guarantees. If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection. The client resends
requests but no responses are resent. A client's requests could be
processed twice on the same server. The server throws away disconnected
client state.
This was fine, sort of, for the simple requests we had implemented so
far. It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.
This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.
The server keeps track of disconnected clients and restores state if the
same client reconnects. This required some work around the greetings so
that clients and servers can recognize each other. Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.
Now that connections between the client and server are preserved we can
resend responses across reconnection. We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
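An illustrative sketch of the receive-side check, with assumed struct
and field names:

    static bool should_process_recv(struct conn_seqs *seqs, u64 msg_seq)
    {
            if (msg_seq <= seqs->recv_seq)
                    return false;   /* duplicate resent across a reconnect */

            /* recv_seq is echoed back to the sender so it can free its
             * copies of responses up to and including this sequence */
            seqs->recv_seq = msg_seq;
            return true;
    }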
When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.
This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.
Signed-off-by: Zach Brown <zab@versity.com>
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.
The client code gets some shims to send and receive lock messages to and
from the server. Callers use our lock mode constants instead of the
DLM's.
Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.
The biggest change is in the client lock state machine. Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing. We don't have everything
come through a per-lock work queue. Instead we send requests either
from the blocking lock caller or from a shrink work queue. Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.
The different processing contexts leads to a slightly different lock
life cycle. We refactor and separate allocation and freeing from
tracking and removing locks in data structures. We add a _get and _put
to track active use of locks and then async references to locks by
holders and requests are tracked separately.
Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time. We do have to do a bit of work to make sure we process back to
back grant responses and invalidation requests from the server.
As of this change the lock setup and destruction paths are a little
wobbly. They'll be shored up as we add lock recovery between the client
and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quorum election implementation. The mounts that can participate
in the election are specified in a quorum config array in the super
block. Each configured participant is assigned a preallocated block
that it can write to.
All mounts read the quorum blocks to find the member who was elected the
leader and should be running the server. The voting mounts loop reading
voting blocks and writing their vote block until someone is elected with
a majority.
Nothing calls this code yet; this adds the initial implementation and
format.
Signed-off-by: Zach Brown <zab@versity.com>
Currently compaction is only performed by one thread running in the
server. Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.
This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server. This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.
The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight. It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.
A server thread still coordinates which segments are compacted. The
search for a candidate compaction operation is largely unchanged. It
now has to deal with being unable to process a compaction because its
segments are busy. We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests. If there are none at the level we move up to the next level.
The server will only issue a given number of compaction requests to a
client at a time. When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
If a client disconnects the server forgets the compactions it had sent
to that client. If those compactions still need to be processed they'll
be sent to the next client.
The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes. This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.
The server needs to block as it does work for compaction in the
notify_up and response callbacks. We move them out from under spin
locks.
The server needs to clean up allocated segnos for a compaction request
that fails. We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.
Signed-off-by: Zach Brown <zab@versity.com>
It was a bit of an overreach to try and limit duplicate request
processing in the network layer. It introduced acks and the necessity
to resync last_processed_id on reconnect.
In testing compaction requests we saw that request processing stopped if
a client reconnected to a new server. The new server sent low request
ids which the client dropped because they were lower than the ids it got
from the last server. To fix this we'd need to add smarts to reset
ids when connecting to new servers but not existing servers.
In thinking about this, though, there's a bigger problem. Duplicate
request processing protection only works up in memory in the networking
connections. If the server makes persistent changes, then crashes, the
client will resend the request to the new server. It will need to
discover that the persistent changes have already been made.
So while we protected duplicate network request processing between nodes
that reconnected, we didn't protect duplicate persistent side-effects
of request processing when reconnecting to a new server. Once you see
that the request implementations have to take this into account then
duplicate request delivery becomes a simpler instance of this same case
and will be taken care of already. There's no need to implement the
complexity of protecting duplicate delivery between running nodes.
This removes the last_processed_id on the server. It removes resending
of responses and acks. Now that ids can be processed out of order we
remove the special known ID of greeting commands. They can be processed
as usual. When there's only request and response packets we can
differentiate them with a flag instead of a u8 message type.
Signed-off-by: Zach Brown <zab@versity.com>
We had fields in the segment header for the crc but weren't using them.
This calculates the crc on write and verifies it on read. The crc
covers the used bytes in the segment as indicated by the total_bytes
field.
Signed-off-by: Zach Brown <zab@versity.com>
The client and server networking code was a bit too rudimentary.
The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to. We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.
This refactors sending and receiving in both the client and server code
into shared networking code. It's built around a connection struct that
then holds the message state. Both peers on the connection can send
requests and send responses.
The existing code only retransmitted requests down newly established
connections. Requests could be processed twice.
This adds robust reliability guarantees. Requests are resent until
their response is received. Requests are only processed once by a given
peer, regardless of the connection's transport socket. Responses are
reliably resent until acknowledged.
This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal. A following commit will remove all
the unused code.
Signed-off-by: Zach Brown <zab@versity.com>
Add an extent function for iterating backwards through extents. We add
the wrapper and have the extent IO functions call their storage _prev
functions. Data extent IO can now call the new scoutfs_item_prev().
Signed-off-by: Zach Brown <zab@versity.com>
Add an fallocate operation.
This changes the possible combinations of flags in extents and makes it
possible to create extents beyond i_size. This will confuse the rest of
the code in a few places and that will be fixed up next.
Signed-off-by: Zach Brown <zab@versity.com>
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata. The more reasonable
strategy is to warn about the corruption, act accordingly, and leave it
to corrective measures to resolve it. In this case we
continue returning the error that caused us to try and clean up.
Signed-off-by: Zach Brown <zab@versity.com>
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.
We add a client request to allocate an extent of a given length. The
existing segment alloc and free now work with a segment's worth of
blocks.
The server maintains counters in the super block of free blocks instead
of free segments. We maintain an allocation cursor so that allocation
results tend to cycle through the device. It's stored in the super so
that it is maintained across server instances.
This doesn't remove unused dead code to keep the commit from getting too
noisy. It'll be removed in a future commit.
Signed-off-by: Zach Brown <zab@versity.com>
Add a file of extent functions that callers will use to manipulate and
store extents in different persistent formats.
Signed-off-by: Zach Brown <zab@versity.com>