Write locks are given an increasing version number as they're granted
which makes its way into items in the log btrees and is used to find the
most recent version of an item.
The initialization of the lock server's next write_version for granted
locks dates back to the initial prototype of the forest of log btrees.
It is only initialized to zero as the module is loaded. This means that
reloading the module, perhaps by rebooting, resets all the item versions
to 0 and can lead to newly written items being ignored in favour of
older existing items with greater versions from a previous mount.
To fix this we initialize the lock server's write_version to the
greatest of all the versions in items in log btrees. We add a field to
the log_trees struct which records the greatest version which is
maintained as we write out items in transactions. These are read by the
server as it starts.
Then lock recovery needs to include the write_version so that the
lock_server can be sure to set the next write_version past the greatest
version in the currently granted locks.
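The seeding can be sketched in userspace C (the struct and function names here are illustrative, not the real scoutfs identifiers): the server's next write_version must exceed both the greatest version recorded in the log btrees and the greatest version in recovered locks.

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical names; the real scoutfs structures differ */
struct log_trees_sample { uint64_t max_item_vers; };
struct recovered_lock { uint64_t write_version; };

/* the next write_version must be greater than every version already
 * stored in log btree items and every version in recovered locks */
static uint64_t init_next_write_version(const struct log_trees_sample *lt,
                                        int nr_lt,
                                        const struct recovered_lock *lk,
                                        int nr_lk)
{
        uint64_t max = 0;
        int i;

        for (i = 0; i < nr_lt; i++)
                if (lt[i].max_item_vers > max)
                        max = lt[i].max_item_vers;
        for (i = 0; i < nr_lk; i++)
                if (lk[i].write_version > max)
                        max = lk[i].write_version;

        return max + 1;
}
```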
Signed-off-by: Zach Brown <zab@versity.com>
The log_trees structs store the data that is used by client commits.
The primary struct is communicated over the wire so it includes the rid
and nr that identify the log. The _val struct was stored in btree item
values and was missing the rid and nr because those were stored in the
item's key.
It's madness to duplicate the entire struct just to shave off those two
fields. We can remove the _val struct and store the main struct in item
values, including the rid and nr.
Signed-off-by: Zach Brown <zab@versity.com>
Previously the srch compaction work would output the entire compacted
file and delete the input files in one atomic commit. The server would
send the input files and an allocator to the client, and the client
would send back an output file and an allocator that included the
deletion of the input files. The server would merge in the allocator
and replace the input file items with the output file item.
Doing it this way required giving an enormous allocation pool to the
client in a radix, which would deal with recursive operations
(allocating from and freeing to the radix that is being modified). We
no longer have the radix allocator, and we use single block avail/free
lists instead of recursively modifying the btrees with free extent
items. The compaction RPC needs to work with a finite amount of
allocator resources that can be stored in an alloc list block.
The compaction work now does a fixed amount of work and a compaction
operation spans multiple work iterations.
A single compaction struct is now sent between the client and server in
the get_compact and commit_compact messages. The client records any
partial progress in the struct. The server writes that position into
PENDING items. It first searches for pending items to give to clients
before searching for files to start a new compaction operation.
The compact struct has flags to indicate whether the output file is
being written or the input files are being deleted. The server manages
the flags and sets the input file deletion flag only once the result of
the compaction has been reflected in the btree items which record srch
files.
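A minimal model of the flag handling might look like the following (the flag names and fields are assumptions for illustration, not the real on-wire format):

```c
#include <assert.h>
#include <stdint.h>

/* illustrative only; real scoutfs flag names and fields differ */
#define COMPACT_FL_WRITING_OUTPUT  (1u << 0)
#define COMPACT_FL_DELETING_INPUT  (1u << 1)

struct compact_progress {
        uint64_t pos;           /* resume position recorded by the client */
        uint32_t flags;         /* managed by the server */
};

/* the server only sets the input-deletion flag once the compaction
 * result is reflected in the btree items that record srch files */
static void server_advance(struct compact_progress *cp, int output_committed)
{
        if ((cp->flags & COMPACT_FL_WRITING_OUTPUT) && output_committed) {
                cp->flags &= ~COMPACT_FL_WRITING_OUTPUT;
                cp->flags |= COMPACT_FL_DELETING_INPUT;
        }
}
```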
We added the progress fields to the compaction struct, making it even
bigger than it already was, so we now allocate compaction structs
rather than declaring them on the stack.
It's worth mentioning that each operation now takes a reasonably
bounded amount of time, which will make it feasible to decide that it
has failed and needs to be fenced.
Signed-off-by: Zach Brown <zab@versity.com>
Remove the statfs RPC from the client and server now that we're using
allocator iteration to calculate free blocks.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly. That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.
By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.
Most of this change is churn from changing allocator function and struct
names.
File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity. All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions. This now means
that fallocate and especially restoring offline extents can use larger
extents. Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing. The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks. This resulted in a lot of bugs. Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction. We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.
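The double buffering can be sketched as a pair of roots with an active index that flips at each commit. This is a simplified userspace model, not the real server structures:

```c
#include <assert.h>
#include <stdint.h>

/* minimal model: the current transaction dirties one root while the
 * server fills and drains the stable root from the previous one */
struct alloc_root { uint64_t blocks; };

struct double_buf {
        struct alloc_root roots[2];
        int active;             /* index dirtied by the current commit */
};

static struct alloc_root *stable_root(struct double_buf *db)
{
        return &db->roots[db->active ^ 1];
}

static void commit_swap(struct double_buf *db)
{
        db->active ^= 1;        /* last commit's dirty root becomes stable */
}
```

Keeping modifications off the stable root is what avoids the recursive allocate-from-the-root-being-modified problem described above.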
The server now only moves free extents into client allocators when they
fall below a low threshold. This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.
Signed-off-by: Zach Brown <zab@versity.com>
While checking for lost server commit holds, I noticed that the
advance_seq request path had obviously incorrect unwinding after getting
an error. Fix it up so that it always unlocks and applies its commit.
Signed-off-by: Zach Brown <zab@versity.com>
This introduces the srch mechanism that we'll use to accelerate finding
files based on the presence of a given named xattr. This is an
optimized version of the initial prototype that was using locked btree
items for .indx. xattrs.
This is built around specific compressed data structures, having the
operation cost match the reality of orders of magnitude more writers
than readers, and adopting a relaxed locking model. Combining all of
this, maintaining the xattrs no longer tanks creation rates while
searches retain excellent latencies, given that searches are defined as
rare and relatively expensive.
The core data type is the srch entry which maps a hashed name to an
inode number. Mounts can append entries to the end of unsorted log
files during their transaction. The server tracks these files and
rotates them into a list of files as they get large enough. Mounts have
compaction work that regularly asks the server for a set of files to
read and combine into a single sorted output file. The server only
initiates compactions when it sees a number of files of roughly the same
size. Searches then walk all the committed srch files, both log files
and sorted compacted files, looking for entries that associate an xattr
name with an inode number.
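A toy version of the entry format and an unsorted log file search could look like this (the field names are assumptions, not the real srch item layout):

```c
#include <assert.h>
#include <stdint.h>

/* a srch entry maps a hashed xattr name to an inode number */
struct srch_entry {
        uint64_t hash;
        uint64_t ino;
};

/* writers append entries to unsorted log files during their
 * transaction; searches have to scan those files in full */
static int log_search(const struct srch_entry *ents, int nr,
                      uint64_t hash, uint64_t *inos, int max)
{
        int found = 0, i;

        for (i = 0; i < nr && found < max; i++)
                if (ents[i].hash == hash)
                        inos[found++] = ents[i].ino;
        return found;
}
```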
Signed-off-by: Zach Brown <zab@versity.com>
The get_fs_roots rpc and server interfaces were built around individual
roots. Rebuild them around passing a struct so that we can add roots
without impacting the current users.
Signed-off-by: Zach Brown <zab@versity.com>
The conversion of the super block metadata block counters to units of
large metadata blocks forgot to scale back to the small block size when
filling out the block count fields in the statfs rpc. This resulted in
the free and total metadata use being off by the factor of large to
small block size (default of ~16x at the moment).
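The missing step amounts to multiplying large-block counts by the size ratio before filling in the statfs fields; a sketch with the default sizes:

```c
#include <assert.h>
#include <stdint.h>

/* super counters are in large metadata blocks; statfs reports small
 * blocks, so counts are scaled by the ratio (16 with the defaults) */
#define SMALL_BLOCK_SIZE   4096u
#define LARGE_BLOCK_SIZE   65536u
#define LG_PER_SM          (LARGE_BLOCK_SIZE / SMALL_BLOCK_SIZE)

static uint64_t statfs_small_blocks(uint64_t large_blocks)
{
        return large_blocks * LG_PER_SM;
}
```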
Signed-off-by: Zach Brown <zab@versity.com>
The forest item operations were reading the super block to find the
roots that they should read items from.
This was easiest to implement to start, but it is too expensive. We
have to find the roots for every newly acquired lock and every call to
walk the inode seq indexes.
To avoid all these reads we first send the current stable versions of
the fs and logs btrees roots along with root grants. Then we add a net
command to get the current stable roots from the server. This is used
to refresh the roots if stale blocks are encountered and on the seq
index queries.
Signed-off-by: Zach Brown <zab@versity.com>
Introduce different constants for small and large metadata block
sizes.
The small 4KB size is used for the super block, quorum blocks, and as
the granularity of file data block allocation. The larger 64KB size is
used for the radix, btree, and forest bloom metadata block structures.
The bulk of this is obvious transitions from the old single constant to
the appropriate new constant. But there are a few more involved
changes, though just barely.
The block crc calculation now needs the caller to pass in the size of
the block. The radix function that returned free bytes now returns
free blocks, and the caller is responsible for knowing how big its
managed blocks are.
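A userspace sketch of a checksum helper that takes the block size from the caller; a toy additive sum stands in for the real crc:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* with two block sizes the helper can no longer assume a single
 * constant; callers pass the size of the block they're checking */
struct block_hdr { uint32_t crc; /* ... */ };

static uint32_t block_sum(const void *blk, size_t size)
{
        const uint8_t *p = blk;
        uint32_t sum = 0;
        size_t i;

        /* skip the crc field itself at the start of the block */
        for (i = sizeof(uint32_t); i < size; i++)
                sum += p[i];
        return sum;
}
```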
Signed-off-by: Zach Brown <zab@versity.com>
The btree currently uses variable length big-endian buffers that are
compared with memcmp() as keys. This is a historical relic of the time
when keys could be very large. We had dirent keys that included the
name and manifest entries that included those fs keys.
But now all the btree callers are jumping through hoops to translate
their fs keys into big-endian btree keys. And the memcmp() of the
keys is showing up in profiles.
This makes the btree take native scoutfs_key structs as its key. The
forest callers which are working with fs keys can just pass their keys
straight through. The server btree callers with their private btrees
get key fields defined for their use instead of having individual
big-endian key structs.
A nice side-effect of this is that splitting parents doesn't have to
assume that a maximal key will be inserted by a child split. We can
have more keys in parents and wider trees.
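A field-by-field compare over a native key struct avoids both the big-endian translation and the memcmp() that showed up in profiles. The fields below are simplified stand-ins for the real scoutfs_key:

```c
#include <assert.h>
#include <stdint.h>

/* simplified stand-in for the native scoutfs_key struct */
struct key { uint64_t zone, ino, type, off; };

/* compare native fields in precedence order; no byte swapping needed */
static int key_cmp(const struct key *a, const struct key *b)
{
        if (a->zone != b->zone)
                return a->zone < b->zone ? -1 : 1;
        if (a->ino != b->ino)
                return a->ino < b->ino ? -1 : 1;
        if (a->type != b->type)
                return a->type < b->type ? -1 : 1;
        if (a->off != b->off)
                return a->off < b->off ? -1 : 1;
        return 0;
}
```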
Signed-off-by: Zach Brown <zab@versity.com>
The calls for holding and applying commits in the server are currently
private. The lock server is a server component that has been separated
out into its own file. Most of the time the server calls it during
commits so the btree changes made in the lock server are protected by
the commits. But there are btree calls in the lock server that happen
outside of calls from the server.
Exporting these calls will let the lock server make all its btree
changes in server commits.
Signed-off-by: Zach Brown <zab@versity.com>
File data allocations come from radix allocators which are populated by
the server before each client transaction. It's possible to fully
consume the data allocator within one transaction if the number of dirty
metadata blocks is kept low. This could result in premature ENOSPC.
This was happening to the archive-light-cycle test. If the transactions
performed by previous tests lined up just right then the creation of the
initial test files could see ENOSPC and cause all sorts of nonsense in
the rest of the test, culminating in cmp commands stuck in offline
waits.
This introduces high and low data allocator water marks for
transactions. The server tries to fill data allocators for each
transaction to the high water mark and the client forces the commit of a
transaction if its data allocator falls below the low water mark.
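The watermark behavior can be modeled with two thresholds; the values here are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* illustrative thresholds; the real values differ */
#define DATA_ALLOC_HIGH 1024u
#define DATA_ALLOC_LOW   128u

/* server: top the client's data allocator up to the high water mark */
static uint64_t server_fill(uint64_t avail)
{
        return avail >= DATA_ALLOC_HIGH ? 0 : DATA_ALLOC_HIGH - avail;
}

/* client: force a commit when the allocator drops below the low mark */
static int client_should_commit(uint64_t avail)
{
        return avail < DATA_ALLOC_LOW;
}
```

Forcing the commit lets the server refill the allocator before the transaction can hit a premature ENOSPC.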
The archive-light-cycle test now passes easily and we see the
trans_commit_data_alloc_low counter increasing during the test.
Signed-off-by: Zach Brown <zab@versity.com>
Add specific error messages for failures that can happen as the server
commits log trees from the client. These are severe enough that we'd
like to know about them.
Signed-off-by: Zach Brown <zab@versity.com>
The first pass at the radix allocator wasn't paying a lot of attention
to the allocation cursors.
This more carefully manages them. They're only advanced after
allocating. Previously the metadata alloc cursor was advanced as it
searched through leaves that it might allocate from. We test for
wrapping past the specific final allocatable bit, rather than the limit
of what the radix height can store. This required pushing knowledge of
metadata or data allocs down through some of the code paths.
Signed-off-by: Zach Brown <zab@versity.com>
Reclaim freed metadata blocks in the server by merging the stable freed
tree into the allocator as a commit opens and we can trust that the
stable version of the freed allocator in the super is a strict subset of
the allocator's dirty freed tree.
Signed-off-by: Zach Brown <zab@versity.com>
Server processing paths had open coded management of holding and
applying transactions. Refactor that into hold_commit() and
apply_commit() helpers. It makes the code a whole lot clearer and gives
us a place in hold_commit() to add code that needs to be run before
anything is modified in a commit on the server.
Signed-off-by: Zach Brown <zab@versity.com>
The server now consistently reclaims free space in client allocator
radix trees. It merges the client's freed trees as the client
opens a new transaction. And it reclaims all the client's trees
when it is removed.
Signed-off-by: Zach Brown <zab@versity.com>
The removal of extent allocators in the server removed the tracking of
total free blocks in the system as extents were allocated and freed.
This restores tracking of total free blocks by observing the difference
in each allocator's sm_total count as a new version is stored during a
commit on the server.
We change the single free_blocks counter in the super to separate counts
of free metadata and data blocks to reflect the metadata and data
allocators. The statfs net command is updated.
Signed-off-by: Zach Brown <zab@versity.com>
Convert metadata block and file data extent allocations to use the radix
allocator.
Most of this is simple transitions between types and calls. The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix. We remove the code and fields that were responsible for adding
uninitialized data and metadata.
The rest of the unused block allocator code is only ifdefed out. It'll
be removed in a separate patch to reduce noise here.
Signed-off-by: Zach Brown <zab@versity.com>
The btree forest item storage doesn't have as much item granular state
as the item cache did. The item cache could tell if a cached item was
populated from persistent storage or was created in memory. It could
simply remove created items rather than leaving behind a deletion item.
The btree forest item storage mechanism, with its cached btree blocks,
can't do this. It has to create deletion items when deleting newly
created items because it doesn't know whether the item already exists
in the persistent record.
This created a problem with the extent storage we were using. The
individual extent items were stored with a key set to the last logical
block of their extent. As extents grew or shrank they often were
deleted and created at different key values during a transaction. In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent. Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.
Streaming writes would perform O(n) work for every extent operation.
It got out of hand. This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.
For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.
Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items. The client is no longer working with free extent
items managed by the forest, it's working with free block bitmap btrees
directly. It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.
Previously the client and server would exchange extents with network
messages. Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction. The client doesn't need to send free extents back to
the server so we can remove those tasks and rpcs.
The server no longer has to manage free extents. It transfers block
bitmap items between trees around commits. All of its extent
manipulation can be removed.
The item size portion of transaction item counts are removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed sized
segments.
Signed-off-by: Zach Brown <zab@versity.com>
Add a simple start of a command that the client will use to commit its
dirty trees. This'll be expanded in the future to include more trees
and block allocation.
Signed-off-by: Zach Brown <zab@versity.com>
Teach the server to maintain and use its block allocator and writer
contexts when operating on its btrees.
The manifest tree operations aren't updated because they're about to be
removed.
Signed-off-by: Zach Brown <zab@versity.com>
Use the client's rid in networking instead of the node_id.
The node_id no longer has to be allocated by the server and sent in the
greeting. Instead the client sends its rid to the server in its
greeting.
The server then uses the client's announced rid just like it used to use
its node_id. It's used to record clients in the btree and to
identify clients in send and receive processing.
The use of the rid in networking calls makes its way to locking and
compaction which now use the rid to identify clients instead of the
node_id.
Signed-off-by: Zach Brown <zab@versity.com>
The current quorum voting implementation had some rough edges that
increased the complexity of the system and introduced undesirable
failure modes. We can keep the same basic pattern but move
functionality around a few places, and rethink the quorum voting, to end
up with a meaningfully simpler system.
The motivation for this work was to remove the need to provide a
uniq_name option for every mount instance.
The first big change is to remove the idea of static configuration slots
for mounts. This removes the use of uniq_name. Mounts now simply have
a server_addr mount option instead of using their uniq_name to find
their address in the configuration.
The server can't check the configuration to see if a given connected
client's name is found in the quorum config. Clients can set a flag in
their sent greeting which indicates that they're a voter. This removes
the uniq_name from the greeting and mounted client records.
Without a static configuration mounts no longer have dedicated block
locations to write to. We increase the size of the region of quorum
blocks and have voters simply write to a random block. Overwriting vote
blocks is OK because we move from heartbeating design patterns to a
protocol strongly based on raft's election. We're using quorum blocks
to communicate votes instead of network messages and overwriting blocks
is analogous to lossy networks dropping vote messages in the raft
election protocol.
We were using the dedicated per-mount quorum blocks to track mounts that
had been elected and needed to be fenced. We no longer have that
storage so instead we add the idea of an election log that is stored in
every voting block. Readers merge the logs from all the blocks they
read and write the resulting merged log in their block.
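Merging election logs can be sketched as a union keyed by term; the entry layout below is an assumption for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* illustrative entry; the real log records elected leaders that may
 * still need to be fenced */
struct elect_ent { uint64_t term; uint64_t rid; };

/* union of two logs, keeping one entry per distinct term; a tiny
 * O(n*m) scan is enough for a sketch */
static int merge_logs(const struct elect_ent *a, int na,
                      const struct elect_ent *b, int nb,
                      struct elect_ent *out, int max)
{
        int n = 0, i, j, dup;

        for (i = 0; i < na && n < max; i++)
                out[n++] = a[i];
        for (j = 0; j < nb && n < max; j++) {
                dup = 0;
                for (i = 0; i < na; i++)
                        if (a[i].term == b[j].term)
                                dup = 1;
                if (!dup)
                        out[n++] = b[j];
        }
        return n;
}
```

Because every reader writes the merged result back, log entries survive individual blocks being overwritten by later random-block votes.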
With no static quorum configuration we no longer have to worry about the
complexity of changing the slot configurations while they're in use.
The only persistent configuration is the number of votes a candidate
needs to be elected by a quorum.
It was a mistake to use quorum voting blocks to communicate state
between the server and the quorum voters. We can easily move the
unmount_barrier, server address, and fencing state from the quorum
blocks into the super block. The server no longer needs the quorum
election info struct to be able to later write its quorum block. It
instead writes a few fields in the super. There's only one place where
clients need to look to find out who they should connect to or if they
can finish unmount.
Signed-off-by: Zach Brown <zab@versity.com>
The pattern of advancing and writing a "dirty super" comes from the time
when the format had two persistent super blocks. One was kept in memory
and modified as changes were made. Advancing it changed which of the
two supers would be eventually written.
This no longer makes sense now that we only have one super block.
Remove the idea of advancing and writing an implicit dirty super block
that's stored in the super block info. Instead use a single
scoutfs_write_super() which takes the super block struct to write.
We still store and increment the hdr.gen in the super block. It used to
be used to tell which of the two super blocks are more recent, now it is
just some information that can tell us something about the life of the
super block.
Signed-off-by: Zach Brown <zab@versity.com>
An elected leader writes a quorum block showing that it's elected before
it assumes exclusive access to the device and starts bringing up the
server. This lets another later elected leader find and fence it if
something happens.
Other mounts were trying to connect to the server once this elected
quorum block was written and before the server was listening. They'd
get connection refused, decide to elect a new leader, and try to fence
the server that's still running.
Now, they should have tried much harder to connect to the elected leader
instead of taking a single failed attempt as fatal. But that's a
problem for another day that involves more work in balancing timeouts
and retries.
But mounts should not try to connect to the server until it's
listening. That's easy to signal by adding a simple listening flag to
the quorum block. Now mounts will only try to connect once they see the
listening flag and don't see these racy refused connections.
Signed-off-by: Zach Brown <zab@versity.com>
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount. We can't
let unmounting clients leave the remaining mounted clients without
quorum.
The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests. It only sends responses to voting
mounts while quorum remains or once all the voting clients are trying
to unmount.
We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to reestablish quorum.
The commit introduces and maintains the unmount_barrier field in the
quorum blocks. It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.
The commit then has the clients send their unique name to the server
who stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.
Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shutdown and re-established. This also makes it easier to
make global decisions based on the count of pending farewell requests.
Signed-off-by: Zach Brown <zab@versity.com>
We were relying on a cute (and probably broken) trick of defining
pointers to unaligned base types with __packed. Modern versions of gcc
warn about this.
Instead we either directly access unaligned types with get_ and
put_unaligned, or we copy unaligned data into aligned copies before
working with it.
Signed-off-by: Zach Brown <zab@versity.com>
Currently the server tracks the outstanding transaction sequence numbers
that clients have open in a simple list in memory. It's not properly
cleaned up if a client unmounts, and a new server that takes over
after a crash won't know about open transaction sequence numbers.
This stores open transaction sequence numbers in a shared persistent
btree instead of in memory. It removes tracking for clients as they
send their farewell during unmount. A new server that starts up will
see existing entries for clients that were created by old servers.
This fixes a bug where a client who unmounts could leave behind a
pending sequence number that would never be cleaned up and would
indefinitely limit the visibility of index items that came after it.
Signed-off-by: Zach Brown <zab@versity.com>
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO. As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.
This implements lock recovery by having the lock service recover locks
from clients as it starts up.
First the lock service stores records of connected clients in a btree
off the super block. Records are added as the server receives their
greeting and are removed as the server receives their farewell.
Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.
We add lock recover request and response messages that are used to
communicate locks from the clients to the server.
Signed-off-by: Zach Brown <zab@versity.com>
The current networking code has loose reliability guarantees. If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection. The client resends
requests but no responses are resent. A client's requests could be
processed twice on the same server. The server throws away disconnected
client state.
This was fine, sort of, for the simple requests we had implemented so
far. It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.
This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.
The server keeps track of disconnected clients and restores state if the
same client reconnects. This required some work around the greetings so
that clients and servers can recognize each other. Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.
Now that connections between the client and server are preserved we can
resend responses across reconnection. We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
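The duplicate suppression reduces to tracking the greatest sequence number processed from the peer; a minimal model, not the real net code:

```c
#include <assert.h>
#include <stdint.h>

/* each side tags outgoing messages with an increasing seq and
 * remembers the greatest seq it has processed from its peer */
struct recv_state { uint64_t greatest_seen; };

static int should_process(struct recv_state *rs, uint64_t seq)
{
        if (seq <= rs->greatest_seen)
                return 0;       /* duplicate resent across reconnect */
        rs->greatest_seen = seq;
        return 1;
}
```

Sending the greatest-seen seq back to the peer is what lets it free responses that are known to have been received.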
When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.
This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.
Signed-off-by: Zach Brown <zab@versity.com>
The network greeting exchange was mistakenly using the global super
block magic number instead of the per-volume fsid to identify the
volumes that the endpoints are working with. This prevented the check
from doing its only job: to fail when clients in one volume try to
connect to a server in another.
Signed-off-by: Zach Brown <zab@versity.com>
Currently all mounts try to get a dlm lock which gives them exclusive
access to become the server for the filesystem. That isn't going to
work if we're moving to locking provided by the server.
This uses quorum election to determine who should run the server. We
switch from long running server work blocked trying to get a lock to
calls which start and stop the server.
Signed-off-by: Zach Brown <zab@versity.com>
Add the core lock server code for providing a lock service from our
server. The lock messages are wired up but nothing calls them.
Signed-off-by: Zach Brown <zab@versity.com>
The server forgot to initialize ret to 0 and might return
undefined errnos if a client asked it to free zero extents.
Signed-off-by: Zach Brown <zab@versity.com>
Currently compaction is only performed by one thread running in the
server. Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.
This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server. This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.
The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight. It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.
A server thread still coordinates which segments are compacted. The
search for a candidate compaction operation is largely unchanged. It
now has to deal with being unable to process a compaction because its
segments are busy. We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests. If there are none at the level we move up to the next level.
The server will only issue a given number of compaction requests to a
client at a time. When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
If a client disconnects the server forgets the compactions it had sent
to that client. If those compactions still need to be processed they'll
be sent to the next client.
The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes. This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.
The server needs to block as it does work for compaction in the
notify_up and response callbacks. We move them out from under spin
locks.
The server needs to clean up allocated segnos for a compaction request
that fails. We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.
Signed-off-by: Zach Brown <zab@versity.com>
We had gotten a bit sloppy with the workqueue flags. We needed _UNBOUND
in some workqueues where we wanted concurrency by scheduling across cpus
instead of waiting for the current (very long running) work on a cpu to
finish. We add NON_REENTRANT out of an abundance of caution. It has
gone away in modern kernels and is probably not needed here, but
according to the docs we would want it so we at least document that fact
by using it.
Signed-off-by: Zach Brown <zab@versity.com>
This extends the notify up and down calls to let the server keep track
of connected clients.
It adds the notion of per-connection info that is allocated for each
connection. It's passed to the notification callbacks so that callers
can have per-client storage without having to manage allocations in the
callbacks.
It adds the node_id argument to the notification callbacks to indicate
if the call is for the listening socket itself or an accepted client
connection on that listening socket.
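The shape of the per-connection info and the node_id argument can be modeled in user-space C; the struct layout, names, and the node_id convention shown in the comment are assumptions for illustration:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: the net layer allocates info_size bytes of
 * per-connection storage and hands it to the notify callback along
 * with a node_id, so callers get per-client state without managing
 * allocations themselves.  Here node_id 0 stands in for the listening
 * socket itself and nonzero for an accepted client connection. */
typedef void (*notify_t)(void *info, unsigned long node_id, int up);

struct model_conn {
	void *info;
	unsigned long node_id;
	notify_t notify;
};

static struct model_conn *model_accept(size_t info_size,
				       unsigned long node_id,
				       notify_t notify)
{
	struct model_conn *conn = calloc(1, sizeof(*conn));

	if (!conn)
		return NULL;

	/* callers never allocate this; the net layer owns it */
	conn->info = calloc(1, info_size);
	conn->node_id = node_id;
	conn->notify = notify;
	if (conn->notify)
		conn->notify(conn->info, conn->node_id, 1);
	return conn;
}
```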
Signed-off-by: Zach Brown <zab@versity.com>
The current sending interfaces only send a message to the peer of a
given connection. For the server to send to a specific connected client
it'd have to track connections itself and send to them.
This adds a sending interface that uses the node_id to send to a
specific connected client. The conn argument is the listening socket
and its accepted sockets are searched for the destination node_id.
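The lookup over accepted sockets can be sketched as a simple list search; the structs and names here are invented for the sketch, and 0/-1 stand in for success and the real error return:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: the listening connection keeps a list of its
 * accepted connections, and sending to a node_id searches that list
 * rather than making callers track connections themselves. */
struct accepted {
	unsigned long node_id;
	struct accepted *next;
};

struct listener {
	struct accepted *accepted;
};

/* Returns 0 if a connection for node_id was found (where the real
 * code would queue the message on it), or -1 if the client isn't
 * currently connected. */
static int send_to_node(struct listener *l, unsigned long node_id)
{
	struct accepted *acc;

	for (acc = l->accepted; acc; acc = acc->next) {
		if (acc->node_id == node_id)
			return 0;
	}

	return -1;
}
```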
Signed-off-by: Zach Brown <zab@versity.com>
Today node_ids are randomly assigned. This risks failure in random
number generation and still allows collisions.
Switch to assigning strictly advancing node_ids on the server during the
initial connection greeting message exchange. This simplifies the
system and allows us to derive information from the relative values of
node_ids in the system.
To do this we refactor the greeting code from internal to the net layer
to proper client and server request and response processing. This lets
the server manage persistent node_id storage and allows the client to
wait for a node_id during mount.
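The assignment itself is just a strictly advancing counter handed out during the greeting; a minimal sketch, with made-up names and ignoring the persistence the real server does:

```c
#include <assert.h>

/* Hypothetical model of assigning strictly advancing node_ids during
 * the greeting exchange.  The real server persists the next value so
 * ids keep advancing across restarts; here it's just a counter. */
struct id_server {
	unsigned long next_node_id;
};

static unsigned long assign_node_id(struct id_server *srv)
{
	return srv->next_node_id++;
}
```

Because ids strictly advance, a greater node_id always means a later greeting, which is the kind of relative ordering information the system can now derive.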
Now that net_connect is sync in the client we don't need the notify_up
callback anymore. The client can perform those duties when the connect
returns.
The net code still has to snoop on request and response processing to
see when the greetings have been exchanged and allow messages to flow.
Signed-off-by: Zach Brown <zab@versity.com>
The free_blocks counter in the super is meant to track the number of
total blocks in the primary free extent index. Callers of extent
manipulation were trying to keep it in sync with the extents.
Segment allocation was allocating extents manually using a cursor and
forgot to update free_blocks. Segment freeing then freed the segment as
an extent, which did update free_blocks. The counter accumulated free
blocks over time, eventually exceeding the total block count and
causing df to report negative usage.
This updates the free_blocks count in server extent io which is the only
place we update the extent items themselves. This ensures that we'll
keep the count in sync with the extent items. Callers don't have to
worry about it.
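The invariant is simply that the one path which modifies extent items also adjusts the counter; a trivial user-space sketch with invented names:

```c
#include <assert.h>

/* Hypothetical model: the server extent io path is the single place
 * that adjusts the super's free_blocks, so the counter always matches
 * the extent items and callers never touch it. */
struct model_super {
	unsigned long long free_blocks;
};

static void extent_io_insert(struct model_super *super,
			     unsigned long long len)
{
	super->free_blocks += len;
}

static void extent_io_remove(struct model_super *super,
			     unsigned long long len)
{
	super->free_blocks -= len;
}
```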
Signed-off-by: Zach Brown <zab@versity.com>
The previous commit added shared networking code and disabled the old
unused code. This removes all that unused client and server code that
was refactored to become the shared networking code.
Signed-off-by: Zach Brown <zab@versity.com>
The client and server networking code was a bit too rudimentary.
The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to. We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.
This refactors sending and receiving in both the client and server code
into shared networking code. It's built around a connection struct that
then holds the message state. Both peers on the connection can send
requests and send responses.
The existing code only retransmitted requests down newly established
connections. Requests could be processed twice.
This adds robust reliability guarantees. Requests are resent until
their response is received. Requests are only processed once by a given
peer, regardless of the connection's transport socket. Responses are
reliably resent until acknowledged.
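The process-once guarantee can be modeled with increasing request ids; this is a simplified sketch (the real code tracks more state per peer), and all names are made up:

```c
#include <assert.h>

/* Hypothetical model: request ids are assigned in increasing order and
 * resent until acknowledged, so a peer can drop any id at or below the
 * greatest id it has already processed, regardless of which transport
 * socket carried the retransmission. */
struct peer_recv {
	unsigned long greatest_processed_id;
};

/* Returns 1 if the request should be processed, 0 if it's a duplicate
 * retransmission that was already processed. */
static int should_process(struct peer_recv *peer, unsigned long id)
{
	if (id <= peer->greatest_processed_id)
		return 0;
	peer->greatest_processed_id = id;
	return 1;
}
```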
This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal. A following commit will remove all
the unused code.
Signed-off-by: Zach Brown <zab@versity.com>
Freed file data extents are tracked in free extent items in each node.
They could only be re-used in the future for file data extent allocation
on that node. Allocations on other nodes or, critically, segment
allocation on the server could never see those free extents. With the
right allocation patterns, particularly allocating on node X and freeing
on node Y, all the free extents can build up on a node and starve other
allocations.
This adds a simple high water mark after which nodes start returning
free extents to the server. From there they can satisfy segment
allocations or be sent to other nodes for file data extent allocation.
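The high water mark policy reduces to a simple threshold check; the threshold value and function name here are invented for illustration:

```c
#include <assert.h>

/* Hypothetical model of the high water mark: once a node's count of
 * free extent blocks exceeds the mark, the excess is returned to the
 * server where it can satisfy segment allocations or other nodes.
 * The threshold value is made up. */
#define FREE_HIGH_WATER 1024

/* Returns how many free blocks the node should send back to the
 * server after a free pushes it past the mark. */
static unsigned long blocks_to_return(unsigned long node_free_blocks)
{
	if (node_free_blocks <= FREE_HIGH_WATER)
		return 0;
	return node_free_blocks - FREE_HIGH_WATER;
}
```

Keeping free extents below the mark on each node bounds how much free space any one node can strand, which is exactly the starvation the commit describes.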
Signed-off-by: Zach Brown <zab@versity.com>
The code that works with the super block had drifted a bit. We still
had two super blocks from an old design and we weren't doing anything
with the crc.
Move to only using one super block at a fixed blkno and store and verify
its crc field by sharing code with the btree block checksumming.
Signed-off-by: Zach Brown <zab@versity.com>
The server send_reply interface is confusing. It uses errors to shut
down the connection, but a client hitting ENOSPC needs to be
communicated through the message reply payload instead.
The segno allocation server processing needs to set the segno to 0 so
that the client gets it and translates that into -ENOSPC.
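The client-side translation is a one-line check; a sketch with invented names, using segno 0 as the in-payload "no space" signal the commit describes:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model: rather than shutting down the connection on
 * error, the server replies with segno 0 and the client translates
 * that into -ENOSPC for its caller. */
static int client_alloc_segno(unsigned long long reply_segno,
			      unsigned long long *segno)
{
	if (reply_segno == 0)
		return -ENOSPC;

	*segno = reply_segno;
	return 0;
}
```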
Signed-off-by: Zach Brown <zab@versity.com>