Like a lot of places in the server, get_log_trees() doesn't have the
tools it needs to safely unwind partial changes in the face of an error.
In the worst case, it can have moved extents from the mount's log_trees
item into the server's main data allocator. The dirty data allocator
reference is in the super block so it can be written later. The dirty
log_trees reference is on stack, though, so it will be thrown away on
error. This ends up duplicating extents in the persistent structures
because they're written in the new dirty allocator but still remain in
the unwritten source log_trees allocator.
This change makes it harder for that to happen. It dirties the
log_trees item and always tries to update it so that the dirty blocks
are consistent if they're later written out. If we do get an error
updating
the item we throw an assertion. It's not great, but it matches other
similar circumstances in other parts of the server.
Signed-off-by: Zach Brown <zab@versity.com>
Client log_trees allocator btrees can build up quite a number of
extents. In the right circumstances moving fragmented extents can
require dirtying a large number of paths to leaf blocks in the core
allocator btrees. It might not be possible to dirty all the blocks
necessary to
move all the extents in one commit.
This reworks the extent motion so that it can be performed in multiple
commits if the meta allocator for the commit runs out while it is moving
extents. It's a minimal fix with as little disruption to the ordering
of commits and locking as possible. It simply bubbles up an error when
the allocators run out and retries functions that can already be retried
in other circumstances.
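The retry shape can be sketched in userspace C; `move_some`, `move_all`,
and the fixed budget model are illustrative stand-ins for the scoutfs
extent movers, not the real functions:

```c
#include <errno.h>

/* Sketch of bubbling up an allocator-exhaustion error and retrying in
 * the next commit; names and the budget model are illustrative. */
static int move_some(int *remaining, int budget)
{
	while (*remaining > 0) {
		if (budget-- == 0)
			return -ENOSPC;	/* commit's meta alloc ran out */
		(*remaining)--;		/* moved one extent */
	}
	return 0;
}

static int move_all(int remaining, int budget, int *commits)
{
	int ret;

	*commits = 0;
	do {
		ret = move_some(&remaining, budget);
		(*commits)++;		/* each pass runs in its own commit */
	} while (ret == -ENOSPC);
	return ret;
}
```

The caller simply retries the whole operation when the mover reports
that the commit ran out, matching the existing retry paths.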
Signed-off-by: Zach Brown <zab@versity.com>
We're seeing allocator motion during get_log_trees dirty quite a lot of
blocks, which makes sense. Let's continue to up the budget. If we
still need significantly larger budgets we'll want to look into capping
the dirty block use of the allocator extent movers which will mean
changing callers to support partial progress.
Signed-off-by: Zach Brown <zab@versity.com>
After we've merged a log btree back into the main fs tree we kick off
work to free all its blocks. This would fully fill the transaction's
free blocks list before stopping to apply the commit.
Consuming the entire free list makes it hard to have concurrent holders
of a commit who also want to free things. This changes the log btree
block freeing to limit itself to a fraction of the budget that each
holder gets. That coarse limit avoids having to precisely account
for the allocations and frees made while modifying the freeing item,
while still freeing many blocks per commit.
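As a sketch, with made-up numbers and names rather than the real
scoutfs budget accounting, capping each pass at a per-holder fraction
of the commit budget still frees everything in a bounded number of
commits:

```c
/* Each holder may free at most budget/holders blocks per commit; the
 * freeing work resumes in later commits until nothing remains. */
#define COMMIT_BUDGET	64
#define MAX_HOLDERS	4

static int free_commits_needed(int nr_blocks)
{
	int per_holder = COMMIT_BUDGET / MAX_HOLDERS;
	int commits = 0;

	while (nr_blocks > 0) {
		int n = nr_blocks < per_holder ? nr_blocks : per_holder;

		nr_blocks -= n;	/* freed under our fraction, commit applies */
		commits++;
	}
	return commits;
}
```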
Signed-off-by: Zach Brown <zab@versity.com>
Server commits use an allocator that has a limited number of available
metadata blocks and entries in a list for freed blocks. The allocator
is refilled between commits. Holders mustn't fully consume the
allocator during the commit, and that tended to work out because server
commit holders commit before sending responses. We'd tend to commit
frequently
enough that we'd get a chance to refill the allocators before they were
consumed.
But there was no mechanism to ensure that this would be the case.
Enough concurrent server holders were able to fully consume the
allocators before committing. This caused scoutfs_meta_alloc and _free
to return errors, leading the server to fail in the worst cases.
This changes the server commit tracking to use more robust structures
which limit the number of concurrent holders so that the allocators
aren't exhausted. The commit_users struct stops holders from making
progress once the allocators don't have room for more holders. It also
lets us stop future holders from making progress once the commit work
has been queued. The previous cute use of a rwsem didn't allow for
either of these protections.
We don't have precise tracking of each holder's allocation consumption
so we don't try and reserve blocks for each holder. Instead we have a
maximum consumption per holder and make sure that all the holders can't
consume the allocators if they all use their full limit.
All of this requires the holding code paths to be well behaved and not
use more than the per-hold limit. We add some debugging code to print
the stacks of holders that were active when the total holder limit was
exceeded. This is the motivation for having state in the holders. We
can record some data at the time their hold started that'll make it a
little easier to track down which of the holders exceeded their limit.
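A rough userspace model of the idea, assuming a `commit_users` with a
simple counter (the real struct, its debugging state, and its locking
are not shown):

```c
/* Limit concurrent holders so worst-case per-holder consumption can't
 * exhaust the commit's allocator; the numbers are illustrative. */
#define ALLOC_BLOCKS	1024
#define PER_HOLD_LIMIT	64
#define MAX_HOLDERS	(ALLOC_BLOCKS / PER_HOLD_LIMIT)

struct commit_users {
	int nr_holders;
	int commit_queued;	/* stops future holders once commit work queues */
};

static int commit_hold(struct commit_users *cu)
{
	if (cu->commit_queued || cu->nr_holders >= MAX_HOLDERS)
		return -1;	/* caller waits for the commit to cycle */
	cu->nr_holders++;
	return 0;
}

static void commit_release(struct commit_users *cu)
{
	cu->nr_holders--;
}
```

The invariant is that MAX_HOLDERS * PER_HOLD_LIMIT never exceeds the
blocks available to a commit.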
Signed-off-by: Zach Brown <zab@versity.com>
There was a brief time where we exported the ability to hold and apply
commits outside of the main server code. That wasn't a great idea, and
the few users have since been reworked to not require directly
manipulating server transactions, so we can reduce risk and make these
functions private again.
Signed-off-by: Zach Brown <zab@versity.com>
Quorum members will try to elect a new leader when they don't receive
heartbeats from the currently elected leader. This timeout is short to
encourage restoring service promptly.
Heartbeats are sent from the quorum worker thread and are delayed while
it synchronously starts up the server, which includes fencing previous
servers. If fence requests take too long then heartbeats will be
delayed long enough for remaining quorum members to elect a new leader
while the recently elected server is still busy fencing.
To fix this we decouple server startup from the quorum main thread.
Server starting and stopping becomes asynchronous so the quorum thread
is able to send heartbeats while the server work is off starting up and
fencing.
The server used to call into quorum to clear a flag as it exited. We
remove that mechanism and have the server maintain a running status that
quorum can query.
We add some state to the quorum work to track the asynchronous state of
the server. This lets the quorum protocol change roles immediately as
needed while remembering that there is a server running that needs to be
acted on.
The server used to also call into quorum to update quorum blocks. This
is a read-modify-write operation that has to be serialized. Now that we
have both the server starting up and the quorum work running they both
can't perform these read-modify-write cycles. Instead we have the
quorum work own all the block updates and it queries the server status
to determine when it should update the quorum block to indicate that the
server has fenced or shut down.
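A minimal sketch of the direction, assuming a simple running status
that the quorum work polls instead of being called into (the states and
helpers are illustrative, not the real scoutfs interface):

```c
#include <stdatomic.h>

/* Quorum queries the server's status rather than having the server
 * call into quorum as it exits. */
enum server_status { SERVER_DOWN, SERVER_STARTING, SERVER_UP, SERVER_STOPPING };

static _Atomic int server_status = SERVER_DOWN;

static void server_set_status(enum server_status st)
{
	atomic_store(&server_status, st);
}

/* Called from the quorum work loop to decide when to update its own
 * quorum block, e.g. recording that the server has shut down. */
static int server_is_down(void)
{
	return atomic_load(&server_status) == SERVER_DOWN;
}
```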
Signed-off-by: Zach Brown <zab@versity.com>
The mount options code is some of the oldest in the tree and is weirdly
split between options.c and super.c. This cleans up the options code,
moves it all to options.c, and reworks it to be more in line with the
modern subsystem convention of storing state in an allocated info
struct.
Rather than putting the parsed options in the super for everyone to
directly reference we put them in the private options info struct and
add a locked read function. This will let us add sysfs files to change
mount options while safely serializing with readers.
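The locked read pattern might look roughly like this in userspace C (a
sketch; `mount_options`, `options_read`, and the mutex are illustrative,
and the kernel code would use its own locking primitives):

```c
#include <pthread.h>
#include <string.h>

/* Parsed options live in a private struct; readers take a locked copy
 * so options can later be changed via sysfs without racing readers. */
struct mount_options {
	int quorum_slot_nr;
};

static struct mount_options options;
static pthread_mutex_t options_lock = PTHREAD_MUTEX_INITIALIZER;

static void options_read(struct mount_options *copy)
{
	pthread_mutex_lock(&options_lock);
	memcpy(copy, &options, sizeof(*copy));	/* stable snapshot */
	pthread_mutex_unlock(&options_lock);
}

static void options_set_slot(int nr)
{
	pthread_mutex_lock(&options_lock);
	options.quorum_slot_nr = nr;
	pthread_mutex_unlock(&options_lock);
}
```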
All the users of mount options that used to directly reference the
parsed struct now call the read function to get a copy. They're all
small local changes except for quorum which saves a static copy of the
quorum slot number because it references it in so many places and relies
on it not changing.
Finally, we remove the empty debugfs "options" directory.
Signed-off-by: Zach Brown <zab@versity.com>
The server's log merge complete request handler was considering the
absence of the client's original request as a failure. Unfortunately,
this case is possible if a previous server successfully completed the
client's request but the response was lost because it stopped for
whatever reason.
The failure was being logged as a hard error to the console which was
causing tests to occasionally fail during server failover that hit just
as the log merge completion was being processed.
The error was being sent to the client as a response, we just need to
silence the message for these expected but rare errors.
We also fix the related case where the server printed the even more
harsh WARN_ON if there was a next original request but it wasn't the one
we expected to find from our requesting client.
Signed-off-by: Zach Brown <zab@versity.com>
This reverts commit 61ad844891.
This fix was trying to ensure that lock recovery response handling
can't run after farewell calls reclaim_rid() by jumping through a bunch
of hoops to tear down locking state as the first farewell request
arrived.
It introduced a very slippery use after free during shutdown. It appears
that it was from drain_workqueue() previously being able to stop
chaining work. That's no longer possible when you're trying to drain
two workqueues that can queue work in each other.
We found a much clearer way to solve the problem so we can toss this.
Signed-off-by: Zach Brown <zab@versity.com>
We recently found that the server can send a farewell response and try
to tear down a client's lock state while it is still in lock recovery
with the client. The lock recovery response could add a lock
for the client after farewell's reclaim_rid() had decided the client was
gone forever and torn down its locks.
This left a lock in the lock server that wasn't associated with any
clients and so could never be invalidated. Attempts to acquire
conflicting locks with it would hang forever, which we saw as hangs in
testing with lots of unmounting.
We tried to fix it by serializing incoming request handling and
forcefully clobbering the client's lock state as we first got
the farewell request. That went very badly.
This takes another approach of trying to explicitly wait for lock
recovery to finish before sending farewell responses. It's more in
line with the overall pattern of having the client be up and functional
until farewell tears it down.
With this in place we can revert the other attempted fix that was
causing so many problems.
Signed-off-by: Zach Brown <zab@versity.com>
The server's little set_shutting_down() helper accidentally used a read
barrier instead of a write barrier.
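In C11 terms the bug is using acquire (read) ordering where release
(write) ordering is needed when publishing the flag. A userspace
analogue of the corrected pairing, with illustrative names:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool shutting_down;

/* The setter publishes the flag, so it needs a write barrier
 * (release semantics), not a read barrier. */
static void set_shutting_down(void)
{
	atomic_store_explicit(&shutting_down, true, memory_order_release);
}

/* The waiter pairs with it using a read barrier (acquire). */
static bool is_shutting_down(void)
{
	return atomic_load_explicit(&shutting_down, memory_order_acquire);
}
```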
Signed-off-by: Zach Brown <zab@versity.com>
Tear down client lock server state and set a boolean so that
there is no race between client/server processing lock recovery
at the same time as farewell.
Currently there is a bug where, if the server and clients are
unmounted, work from the client is processed out of order, which leaves
behind a server_lock for a RID that no longer exists.
In order to fix this we need to serialize SCOUTFS_NET_CMD_FAREWELL
in recv_worker.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
Only BUG_ON for inconsistency, not for commit errors or failure to
delete the original request.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
In scoutfs_server_worker we do not properly handle cleanup after
_block_writer_init and alloc_init. On error paths, if either of those
was initialized, we can call alloc_prepare_commit or writer_forget_all
to ensure we drop the block references and clear the dirty status of
all the blocks in the writer.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
In order to safely free blocks we need to first dirty the work. This
allows the work to resume later on without a double free.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
While checking in on some other code I noticed that we have lingering
allocator and writer contexts over in the lock server. The lock server
used to manage its own client state and recovery. We've since moved
that into shared recov functionality in the server. The lock server no
longer manipulates its own btrees and doesn't need these unused
references to the server's contexts.
Signed-off-by: Zach Brown <zab@versity.com>
There are a few bad corner cases in the state machine that governs how
client transactions are opened, modified, and committed.
The worst problem is on the server side. All server request handlers
need to cope with resent requests without causing bad side effects.
Both get_log_trees and commit_log_trees would try to fully process
resent requests. _get_log_trees() looks safe because it works with the
log_trees that was stored previously. _commit_log_trees() is not safe
because it can rotate out the srch log file referenced by the sent
log_trees every time it's processed. This could create extra srch
entries which would delete the first instance of entries. Worse still,
by injecting the same block structure into the system multiple times it
ends up causing multiple frees of the blocks that make up the srch file.
The client side problems are slightly different, but related. There
aren't strong constraints which guarantee that we'll only send a commit
request after a get request succeeds. In crazy circumstances the
commit request in the write worker could come before the first get in
mount succeeds. Far worse is that we can send multiple commit requests
for one transaction if it changes as we get errors during multiple
queued write attempts, particularly if we get errors from get_log_trees
after having successfully committed.
This hardens all these paths to ensure a strict sequence of
get_log_trees, transaction modification, and commit_log_trees.
On the server we add *_trans_seq fields to the log_trees struct so that
both get_ and commit_ can see that they've already prepared a commit to
send or have already committed the incoming commit, respectively. We
can use the get_trans_seq field as the trans_seq of the open transaction
and get rid of the entire separate mechanism we used to have for
tracking open trans seqs in the clients. We can get the same info by
walking the log_trees and looking at their *_trans_seq fields.
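The idempotency check might be sketched like this (a model only; the
real handlers persist the fields in the log_trees item and run under
server commits):

```c
/* Resent get_ and commit_log_trees requests are recognized by
 * comparing the *_trans_seq fields; names model the description. */
struct log_trees_model {
	unsigned long long get_trans_seq;
	unsigned long long commit_trans_seq;
};

static unsigned long long model_next_seq = 1;

static unsigned long long model_get(struct log_trees_model *lt)
{
	if (lt->get_trans_seq > lt->commit_trans_seq)
		return lt->get_trans_seq;	/* resend: already prepared */
	lt->get_trans_seq = model_next_seq++;	/* open a new transaction */
	return lt->get_trans_seq;
}

static int model_commit(struct log_trees_model *lt, unsigned long long seq)
{
	if (lt->commit_trans_seq >= seq)
		return 0;			/* resend: already committed */
	lt->commit_trans_seq = seq;		/* process the commit once */
	return 1;
}
```

A resent get returns the already-prepared seq, and a resent commit is
recognized and skipped rather than processed a second time.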
In the client we have the write worker immediately return success if
mount hasn't opened the first transaction. Then we don't have the
worker return to allow further modification until it has gotten success
from get_log_trees.
Signed-off-by: Zach Brown <zab@versity.com>
Our statfs implementation had clients reading the super block and using
the next free inode number to guess how many inodes there might be. We
are very aggressive with giving directories private pools of inode
numbers to allocate from. They're often not used at all, creating huge
gaps in allocated inode numbers. The ratio of the average number of
allocations per directory to the batch size given to each directory is
the factor that the used inode count can be off by.
Now that we have a precise count of active inodes we can use that to
return accurate counts of inodes in the files fields in the statfs
struct. We still don't have static inode allocation so the fields don't
make a ton of sense. We fake the total to give a reasonable estimate
of the total files that doesn't change, while the free count is
calculated from the correct count of used inodes.
While we're at it we add a request to get the summed fields that the
server can cheaply discover in cache rather than having the client
always perform read IOs.
Signed-off-by: Zach Brown <zab@versity.com>
Add a count of used inodes to the super block and a change in the inode
count to the log_trees struct. Client transactions track the change in
inode count as they create and delete inodes. The log_trees delta is
added to the count in the super as finalized log_trees are deleted.
Signed-off-by: Zach Brown <zab@versity.com>
We had previously started on a relatively simple notion of an
interoperability version which wasn't quite right. This fleshes out
support for a more functional format version. The super blocks have a
single version that defines behaviour of the running system. The code
supports a range of versions and we add some initial interfaces for
updating the version while the system is offline. All of this together
should let us safely change the underlying format over time.
Signed-off-by: Zach Brown <zab@versity.com>
As the server comes up it needs to fence any previous servers before it
assumes exclusive access to the device. If fencing fails it can leave
fence requests behind. The error path for these very early failures
didn't shut down fencing so we'd have lingering fence requests span the
life cycle of server startup and shutdown. The next time the server
starts up in this mount it can try to create the fence request again,
get an error because a lingering one already exists, and immediately
shut down.
The result is that fencing errors that hit that initial attempt during
server startup can become persistent fencing errors for the lifetime of
that mount, preventing it from ever successfully starting the server.
Moving the fence stop call so that all exit error paths hit it
consistently cleans up fence requests and avoids this problem. The next server
instance will get a chance to process the fence request again. It might
well hit the same error, but at least it gets a chance.
Signed-off-by: Zach Brown <zab@versity.com>
In some of the allocation paths there are goto statements
that end up calling kfree(). That is fine, but in cases
where the pointer is not initially set to NULL we
might have undefined behavior. kfree() on a NULL pointer
does nothing, so essentially these changes should not
change behavior, but they clarify the code paths.
Signed-off-by: Bryant Duffy-Ly <bduffyly@versity.com>
We have a problem where items can appear to go backwards in time because
of the way we choose which log btrees to finalize and merge.
Because we don't have versions in items in the fs_root, and even might
not have items at all if they were deleted, we always assume items in
log btrees are newer than items in the fs root.
This creates the requirement that we can't merge a log btree if it has
items that are also present in older versions in other log btrees which
are not being merged. The unmerged old item in the log btree would take
precedence over the newer merged item in the fs root.
We weren't enforcing this requirement at all. We used the max_item_seq
to ensure that all items were older than the current stable seq but that
says nothing about the relationship between older items in the finalized
and active log btrees. Nothing at all stops an active btree from having
an old version of a newer item that is present in another mount's
finalized log btree.
To reliably fix this we create a strict item seq discontinuity between
all the finalized merge inputs and all the active log btrees. Once any
log btree is naturally finalized the server forces all the clients to
group up and finalize all their open log btrees. A merge operation can
then safely operate on all the finalized trees before any new trees are
given to clients who would start using increasing item seqs.
Signed-off-by: Zach Brown <zab@versity.com>
The server doesn't give us much to go on when it gets an error handling
requests to work with log trees from the client. This adds a lot of
specific error messages so we can get a better understanding of
failures.
Signed-off-by: Zach Brown <zab@versity.com>
We were trusting the rid in the log trees struct that the client sent.
Compare it to our recorded rid on the connection and fail if the client
sent the wrong rid.
Signed-off-by: Zach Brown <zab@versity.com>
server_get_log_trees() sets the low flag in a mount's meta_avail
allocator, triggering enospc for any space consuming allocations in the
mount, if the server's global meta_avail pool falls below the reserved
block count. Before each server transaction opens we swap the global
meta_avail and meta_freed allocators to ensure that the transaction has
at least the reserved count of blocks available.
This creates a risk of premature enospc as the global meta_avail pool
drains and swaps to the larger meta_freed. The pool can be close to the
reserved count, perhaps at it exactly. _get_log_trees can fill the
client's mount, even a little, and drop the global meta_avail total
under the reserved count, triggering enospc, even though meta_freed
could have had quite a lot of blocks.
The fix is to ensure that the global meta_avail has 2x the reserved
count, swapping if it falls under that. This ensures that a server
transaction can consume an entire reserved count and still have enough
to avoid triggering enospc.
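The rule reduces to a small check; a sketch with illustrative names:

```c
struct meta_pools {
	unsigned long long avail;	/* global meta_avail total */
	unsigned long long freed;	/* global meta_freed total */
};

/* Swap when avail falls under 2x the reserved count so a server
 * transaction can consume a full reserved count without dipping
 * into reserves and triggering premature enospc. */
static int maybe_swap(struct meta_pools *p, unsigned long long reserved)
{
	unsigned long long tmp;

	if (p->avail >= 2 * reserved)
		return 0;
	tmp = p->avail;
	p->avail = p->freed;
	p->freed = tmp;
	return 1;
}
```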
This fixes a scattering of rare premature enospc returns that were
hitting during tests. It was rare for meta_avail to fall just at the
reserved count and for get_log_trees to have to refill the client
allocator, but it happened.
Signed-off-by: Zach Brown <zab@versity.com>
Add a scoutfs command that uses an ioctl to send a request to the server
to safely use a device that has grown.
Signed-off-by: Zach Brown <zab@versity.com>
As subsystems were built I tended to use interruptible waits in the hope
that we'd let users break out of most waits.
The reality is that we have significant code paths that have trouble
unwinding. Final inode deletion during iput->evict in a task is a good
example. It's madness to have a pending signal turn an inode deletion
from an efficient inline operation to a deferred background orphan inode
scan deletion.
It also happens that golang built pre-emptive thread scheduling around
signals. Under load we see a surprising amount of signal spam and it
has created surprising error cases which would have otherwise been fine.
This changes waits to expect that IOs (including network commands) will
complete reasonably promptly. We remove all interruptible waits with
the notable exception of breaking out of a pending mount. That requires
shuffling setup around a little bit so that the first network message we
wait for is the lock for getting the root inode.
Signed-off-by: Zach Brown <zab@versity.com>
We recently fixed problems sending omap responses to originating clients
which can race with the clients disconnecting. We need to handle the
requests sent to clients on behalf of an origination request in exactly
the same way. The send can race with the client being evicted. The
send is cleaned up after the race, which is safely ignored because the
client's rid has been removed from the server's request tracking.
Signed-off-by: Zach Brown <zab@versity.com>
The quorum service shuts down if it sees errors that mean that it can't
do its job.
This is mostly fatal errors gathering resources at startup or runtime IO
errors but it was also shutting down if server startup fails. That's
not quite right. This should be treated like the server shutting down
on errors. Quorum needs to stay around to participate in electing the
next server.
Fence timeouts could trigger this. A quorum mount could crash, the
next server without a fence script could have a fence request time out
and shut down, and now the third remaining server is left to indefinitely
send vote requests into the void.
With this fixed, continuing that example, the quorum service in the
second mount remains to elect the third server with a working fence
script after the second server shuts down after its fence request times
out.
Signed-off-by: Zach Brown <zab@versity.com>
The omap message lifecycle is a little different than the server's usual
handling that sends a response from the request handler. The response
is sent long after the initial receive handler, which pinned the
connection to the client. It's fine for the response to be dropped.
The main server request handler handled this case but other response
senders didn't. Put this error handling in the server response sender
itself so that all callers are covered.
Signed-off-by: Zach Brown <zab@versity.com>
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free. This adds support for
returning ENOSPC to client posix allocators as free space gets low.
For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space. The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks. In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing). When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.
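The client-side decision reduces to a predicate like the following
sketch (the names and exact conditions are illustrative):

```c
#include <errno.h>

/* Only allocating transactions see ENOSPC, and only when the local
 * allocator is low and the server has set the low flag. */
static int enter_trans(int allocating, int alloc_low, int server_low)
{
	if (allocating && alloc_low && server_low)
		return -ENOSPC;
	return 0;	/* frees and overwrites always make progress */
}
```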
Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.
For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.
The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.
We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when enospc is
going to be returned for metadata allocations.
We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.
And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.
Signed-off-by: Zach Brown <zab@versity.com>
The log merging work deletes log trees items once their item roots are
merged back into the fs root. Those deleted items could still have
populated srch files that would be lost. We force rotation of the srch
files in the items as they're reclaimed to turn them into rotated srch
files that can be compacted.
Signed-off-by: Zach Brown <zab@versity.com>
This adds the server processing side of the btree merge functionality.
The client isn't yet sending the log_merge messages so no merging will
be performed.
The bulk of the work happens as the server processes a get_log_merge
message to build a merge request for the client. It starts a log merge
if one isn't in flight. If one is in flight it checks to see if it
should be spliced and maybe finished. In the common case it finds the
next range to be merged and sends the request to the client to process.
The commit_log_merge handler is the completion side of that request. If
the request failed then we unwind its resources based on the stored
request item. If it succeeds we record it in an item for get_
processing to splice eventually.
Then we modify two existing server code paths.
First, get_log_tree doesn't just create or use a single existing log
btree for a client mount. If the existing log btree is large enough it
sets its finalized flag and advances the nr to use a new log btree.
That makes the old finalized log btree available for merging.
Then we need to be a bit more careful when reclaiming the open log btree
for a client. We can't use next to find the only open log btree; we use
prev to find the last and make sure that it isn't already finalized.
Signed-off-by: Zach Brown <zab@versity.com>
Extract part of the get_last_seq handler into a call that finds the last
stable client transaction seq. Log merging needs this to determine a
cutoff for stable items in log btrees.
Signed-off-by: Zach Brown <zab@versity.com>
Rename the item version to seq and set it to the max of the transaction
seq and the lock's write_seq. This lets btree item merging choose a seq
at which all dirty items written in future commits must have greater
seqs. It can drop the seqs from items written to the fs tree during
btree merging knowing that there aren't any older items out in
transactions that could be mistaken for newer items.
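The seq assignment itself is just a max; sketched with illustrative
names:

```c
/* A dirty item's seq is the max of the open transaction's seq and the
 * covering lock's write_seq, so all dirty items written in future
 * commits must have greater seqs. */
static unsigned long long item_seq(unsigned long long trans_seq,
				   unsigned long long write_seq)
{
	return trans_seq > write_seq ? trans_seq : write_seq;
}
```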
Signed-off-by: Zach Brown <zab@versity.com>
Rename the write_version lock field to write_seq and get it from the
core seq in the super block.
We're doing this to create a relationship between a client transaction's
seq and a lock's write_seq. New transactions will have a greater seq
than all previously granted write locks and new write locks will have a
greater seq than all open transactions. This will be used to resolve
ambiguities in item merging as transaction seqs are written out of order
and write locks span transactions.
Signed-off-by: Zach Brown <zab@versity.com>
Get the next seq for a client transaction from the core seq in the super
block. Remove its specific next_trans_seq field.
While making this change we switch to only using le64 in the network
message payloads, the rest of the processing now uses natural u64s.
Signed-off-by: Zach Brown <zab@versity.com>
Add a new seq field to the super block which will be the source of all
incremented seqs throughout the system. We give out incremented seqs to
callers with an atomic64_t in memory which is synced back to the super
block as we commit transactions in the server.
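A userspace analogue of the seq source (the kernel code would use an
atomic64_t; the names here are illustrative):

```c
#include <stdatomic.h>

static atomic_ullong core_seq = 1;	/* in-memory seq source */
static unsigned long long super_seq;	/* synced back at commit */

static unsigned long long get_next_seq(void)
{
	return atomic_fetch_add(&core_seq, 1);
}

/* A server commit writes the current seq back to the super block so
 * seqs keep increasing across restarts. */
static void sync_seq_to_super(void)
{
	super_seq = atomic_load(&core_seq);
}
```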
Signed-off-by: Zach Brown <zab@versity.com>
When we moved to the current allocator we fixed up the server commit
path to initialize the pair of allocators as a commit is finished rather
than before it starts. This removed all the error cases from
hold_commit. Remove the error handling from hold_commit calls to make
the system just a bit simpler.
Signed-off-by: Zach Brown <zab@versity.com>
The core quorum work loop assumes that it has exclusive access to its
slot's quorum block. It uniquely marks blocks it writes and verifies
the marks on read to discover if another mount has written to its slot
under the assumption that this must be a configuration error that put
two mounts in the same slot.
But the design of the leader bit in the block violates the invariant
that each slot's block is only written by that slot's mount. As the
server comes up and fences previous leaders it writes to their blocks
to clear their leader bits.
The final hole in the design is that because we're fencing mounts, not
slots, each slot can have two mounts in play. An active mount can be
using the slot and there can still be a persistent record of a previous
mount in the slot that crashed that needs to be fenced.
All this comes together to have the server fence an old mount in a slot
while a new mount is coming up. The new mount sees the mark change and
freaks out and stops participating in quorum.
The fix is to rework the quorum blocks so that each slot only writes to
its own block. Instead of the server writing to each fenced mount's
slot, it writes a fence event to its block once all previous mounts have
been fenced. We add a bit of bookkeeping so that the server can
discover when all block leader fence operations have completed. Each
event gets its own term so we can compare events to discover live
servers.
We get rid of the write marks and instead have an event that is written
as a quorum agent starts up and is then checked on every read to make
sure it still matches.
Signed-off-by: Zach Brown <zab@versity.com>
Add the peername of the client's connected socket to its mounted_client
item as it mounts. If the client doesn't recover then fencing can use
the IP to find the host to fence.
Signed-off-by: Zach Brown <zab@versity.com>
I saw a confusing hang that looked like a lack of ordering between
a waker setting shutting_down and a wait event testing it after
being woken up. Let's see if more barriers help.
Signed-off-by: Zach Brown <zab@versity.com>
The server is responsible for calling the fencing subsystem. It is the
source of fencing requests as it decides that previous mounts are
unresponsive. It is responsible for reclaiming resources for fenced
mounts and freeing their associated fence request.
Signed-off-by: Zach Brown <zab@versity.com>
Add super_ops->umount_begin so that we can implement a forced unmount
which tries to avoid issuing any more network or storage ops. It can
return errors and lose unsynchronized data.
Signed-off-by: Zach Brown <zab@versity.com>
Add the data_alloc_zone_blocks volume option. This changes the
behaviour of the server to try and give mounts free data extents which
fall in exclusive fixed-size zones.
We add the field to the scoutfs_volume_options struct and add it to the
set_volopt server handler which enforces constraints on the size of the
zones.
We then add fields to the log_trees struct which records the size of the
zones and sets bits for the zones that contain free extents in the
data_avail allocator root. The get_log_trees handler is changed to read
all the zone bitmaps from all the items, pass those bitmaps in to
_alloc_move to direct data allocations, and finally update the bitmaps
in the log_trees items to cover the newly allocated extents. The
log_trees data_alloc_zone fields are cleared as the mount's logs are
reclaimed to indicate that the mount is no longer writing to the zone.
The policy mechanism of finding free extents based on the bitmaps is
implemented down in _data_alloc_move().
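The bitmap bookkeeping amounts to mapping free extents to zone bits; a
sketch, assuming an illustrative 64-zone cap and made-up names:

```c
/* Set a bit for each fixed-size zone that a free data extent touches;
 * the zone count cap and names are illustrative. */
static unsigned long long extent_zone_bits(unsigned long long start,
					   unsigned long long len,
					   unsigned long long zone_blocks)
{
	unsigned long long first = start / zone_blocks;
	unsigned long long last = (start + len - 1) / zone_blocks;
	unsigned long long bits = 0;
	unsigned long long z;

	for (z = first; z <= last && z < 64; z++)
		bits |= 1ULL << z;
	return bits;
}
```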
Signed-off-by: Zach Brown <zab@versity.com>
Add parameters so that scoutfs_alloc_move() can first search for source
extents in specified zones. It uses relatively cheap searches through
the order items to find extents that intersect with the regions
described by the zone bitmaps.
Signed-off-by: Zach Brown <zab@versity.com>
Introduce global volume options. They're stored in the superblock and
can be seen in sysfs files that use network commands to get and
set the options on the server.
Signed-off-by: Zach Brown <zab@versity.com>
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount. This can delete inodes from
under other mounts which have opened the inode before it was unlinked on
another mount.
We fix this by adding cached inode tracking. Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.
This makes the two fast paths, opening and closing linked files and
deleting a file that was unlinked locally, pay only the moderate cost
of maintaining the bitmap locally and getting the open map once per
lock group. Removing many files in a group will only lock and get the
open map once per group.
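The per-group bitmap check might be modeled like this (the group size
and names are assumptions, not the real scoutfs open map format):

```c
#define GROUP_BITS 64ULL	/* inodes per lock-group bitmap */

static unsigned long long ino_group(unsigned long long ino)
{
	return ino / GROUP_BITS;
}

static unsigned long long ino_bit(unsigned long long ino)
{
	return 1ULL << (ino % GROUP_BITS);
}

/* The server ORs together the other mounts' maps for the group; final
 * deletion is safe only when no other mount has the inode cached. */
static int can_delete_inode(unsigned long long ino,
			    unsigned long long others_map)
{
	return !(others_map & ino_bit(ino));
}
```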
Signed-off-by: Zach Brown <zab@versity.com>