Commit Graph

20 Commits

Zach Brown
04660dbfee scoutfs: add scoutfs_extent_prev()
Add an extent function for iterating backwards through extents.  We add
the wrapper and have the extent IO functions call their storage _prev
functions.  Data extent IO can now call the new scoutfs_item_prev().

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
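
The commit above describes a thin dispatch wrapper.  A minimal sketch of
the shape such a wrapper could take, with purely hypothetical struct and
function names (not the actual scoutfs interfaces):

    #include <linux/types.h>

    /* hypothetical extent and storage ops, for illustration only */
    struct ext {
            u64 start;
            u64 len;
    };

    struct extent_io_ops {
            int (*next)(void *arg, struct ext *ext);
            int (*prev)(void *arg, struct ext *ext);
    };

    /* mirror of the existing _next wrapper: hand off to the storage
     * backend's own reverse iterator (items, btree, etc) */
    static int extent_prev(struct extent_io_ops *ops, void *arg,
                           struct ext *ext)
    {
            return ops->prev(arg, ext);
    }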
Zach Brown
9c74f2011d scoutfs: add server work tracing
Add some server workqueue and work tracing to chase down the destruction
of an active workqueue.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
41c29c48dd scoutfs: add extent corruption cases
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata.  The more reasonable
strategy is to warn about the corruption, act accordingly, and leave
it to corrective measures to resolve it.  In this case we continue
returning the error that caused us to try to clean up.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
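
As a rough illustration of the warn-and-return strategy described above
(the helper name and message text are assumptions, not the actual
scoutfs code):

    #include <linux/kernel.h>
    #include <linux/fs.h>

    /* hypothetical: cleanup after a failed extent operation hit
     * inconsistent metadata; warn instead of panicking and keep
     * returning the error that got us here */
    static int extent_cleanup_failed(struct super_block *sb, int err,
                                     int cleanup_err)
    {
            if (cleanup_err)
                    WARN_ONCE(1, "scoutfs: inconsistent extent metadata during cleanup (err %d)\n",
                              cleanup_err);

            return err;  /* the original failure, not the cleanup failure */
    }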
Zach Brown
1b3645db8b scoutfs: remove dead server allocator code
Remove the bitmap segno allocator code that the server used to use to
manage allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
c01a715852 scoutfs: use extents in the server allocator
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.

We add a client request to allocate an extent of a given length.  The
existing segment alloc and free now work with a segment's worth of
blocks.

The server maintains counters of free blocks in the super block instead
of free segments.  We maintain an allocation cursor so that allocation
results tend to cycle through the device.  It's stored in the super so
that it is maintained across server instances.

This doesn't remove the now-unused dead code, so as to keep the commit
from getting too noisy.  It'll be removed in a future commit.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
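
A simplified sketch of the cursor behaviour the message describes:
allocation searches at or after the cursor stored in the super, wraps
around once, then advances the cursor and the free block count.  All
struct and helper names here are assumptions:

    /* hypothetical cursor-based extent allocation on the server */
    static int alloc_extent(struct server_info *server, u64 len, u64 *start)
    {
            struct my_super *super = &server->super;
            int ret;

            /* find a free extent item at or after the cursor */
            ret = find_free_extent(server, super->alloc_cursor, len, start);
            if (ret == -ENOSPC)
                    /* wrap around to the start of the device once */
                    ret = find_free_extent(server, 0, len, start);
            if (ret)
                    return ret;

            remove_free_extent(server, *start, len);
            super->free_blocks -= len;           /* counted in blocks now */
            super->alloc_cursor = *start + len;  /* persists across servers */
            return 0;
    }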
Zach Brown
f3007f10ca scoutfs: shut down server on commit errors
We hadn't yet implemented any error handling in the server when commits
fail.

Commit errors are serious and we take them as a sign that something has
gone horribly wrong.  This patch prints commit error warnings to the
console and shuts down.  Clients will try to reconnect and resend their
requests.

The hope is that another server will be able to make progress.  But this
same node could become the server again and it could well be that the
errors are persistent.

The next steps are to implement server startup backoff, client retry
backoff, and hard failure policies.

Signed-off-by: Zach Brown <zab@versity.com>
2018-05-01 11:48:19 -07:00
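
The policy in the message amounts to something like the following
sketch (all names are assumed for illustration):

    /* hypothetical: a failed commit is treated as fatal to this server */
    static void commit_done(struct server_info *server, int ret)
    {
            if (ret) {
                    printk(KERN_ERR "scoutfs: server commit failed, err %d; shutting down server\n",
                           ret);
                    /* clients will reconnect and resend their requests,
                     * hopefully to a healthier server */
                    queue_work(server->wq, &server->shutdown_work);
            }
    }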
Zach Brown
24cc5cc296 scoutfs: lock manifest root request
The manifest root request processing samples the stable_manifest_root in
the server info.  The stable_manifest_root is updated after a
commit has succeeded.

The read of stable_manifest_root in request processing was locking the
manifest.  The update during commit doesn't lock the manifest so these
paths were racing.  The race is very tight, a few cpu stores, but it
could in theory give a client a malformed root that could be
misinterpreted as corruption.

Add a seqcount around the store of the stable manifest root during
commit and its load during request processing.  This ensures that
clients always get a consistent manifest root.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-27 09:06:35 -07:00
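
The store/load pairing described above is the standard seqcount
pattern; a minimal sketch, assuming field names like stable_seqcount
and a single writer (the commit path):

    #include <linux/seqlock.h>

    /* writer: publish the new stable manifest root after a commit */
    static void set_stable_root(struct server_info *server,
                                struct manifest_root *root)
    {
            write_seqcount_begin(&server->stable_seqcount);
            server->stable_manifest_root = *root;
            write_seqcount_end(&server->stable_seqcount);
    }

    /* reader: request processing retries until it sees a consistent copy */
    static void get_stable_root(struct server_info *server,
                                struct manifest_root *root)
    {
            unsigned int seq;

            do {
                    seq = read_seqcount_begin(&server->stable_seqcount);
                    *root = server->stable_manifest_root;
            } while (read_seqcount_retry(&server->stable_seqcount, seq));
    }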
Zach Brown
8061a5cd28 scoutfs: add server bind warning
Emit an error message if the server fails to bind.  It can mean that
the address is misconfigured.  But we might still be able to bind later
if the address becomes available, so we don't hard error.  We only emit
the message once for a series of failures.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-13 15:49:14 -07:00
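
A sketch of the warn-once behaviour (the bind_warned flag and the
message text are assumptions):

    /* hypothetical: only the first bind failure in a series is logged */
    static int server_bind(struct server_info *server, struct sockaddr *addr,
                           int addrlen)
    {
            int ret = kernel_bind(server->listen_sock, addr, addrlen);

            if (ret) {
                    if (!server->bind_warned) {
                            printk(KERN_ERR "scoutfs: server failed to bind, err %d; will keep retrying\n",
                                   ret);
                            server->bind_warned = true;
                    }
                    return ret;  /* soft failure: the address may show up later */
            }

            server->bind_warned = false;
            return 0;
    }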
Zach Brown
9148f24aa2 scoutfs: use single small key struct
Variable length keys lead to having a key struct point to the buffer
that contains the key.  With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.

We no longer have a separate generic key buf struct that points to
specific per-type key storage.  All items use the key struct and fill
out the appropriate fields.  All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer a difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.

Each key user now has an init function that fills out its fields.  It
looks a lot like the old pattern but we no longer have separate key
storage that the buf points to.

A bunch of code now takes the address of static key storage instead of
managing allocated keys.  Conversely, swapping now uses the full keys
instead of pointers to the keys.

We don't need all the functions that worked on the generic key buf
struct because they had different lengths.  Copy, clone, length init,
memcpy, all of that goes away.

The item API had some functions that tested the length of keys and
values.  The key length tests vanish, and that gets rid of the _same()
call.  The _same_min() call only had one user who didn't also test for
the value length being too large.  Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.

We no longer have to track the number of key bytes when calculating if
an item population will fit in segments.  This removes the key length
from reservations, transactions, and segment writing.

The item cache key querying ioctls no longer have to deal with variable
length keys.  They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.

The segment no longer has to store the key length.  It stores the key
struct in the item header.

The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct.  The SK_ wrappers
that bracketed calls to use preempt-safe per-cpu buffers can turn back
into their normal calls.

Manifest entries are now a fixed size.  We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq.  They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap.  This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.

Signed-off-by: Zach Brown <zab@versity.com>
2018-04-04 09:15:27 -05:00
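
A rough sketch of the kind of fixed-size universal key struct and
per-type init function described above (the layout, field names, and
type values are illustrative assumptions, not the real scoutfs key
format):

    #include <linux/types.h>
    #include <linux/string.h>

    /* hypothetical fixed-size key: every item type fills the same struct */
    struct my_key {
            __u8    zone;
            __u8    type;
            __le64  first;
            __le64  second;
            __le64  third;
    } __packed;

    /* per-type init fills the fields directly; there's no longer separate
     * key storage for a generic key buf to point at */
    static void init_dirent_key(struct my_key *key, u64 dir_ino, u64 hash)
    {
            memset(key, 0, sizeof(*key));
            key->type = 1;                       /* illustrative type value */
            key->first = cpu_to_le64(dir_ino);
            key->second = cpu_to_le64(hash);
    }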
Zach Brown
c76c6582f0 scoutfs: release server conn under mutex
I was rarely seeing null derefs during unmount.  The per-mount listening
scoutfs_server_func() was seeing null sock->ops as it called
kernel_sock_shutdown() to shutdown the connected client sockets.
sock_release() sets the ops to null.  We're not supposed to use a socket
after we call it.

The per-connection scoutfs_server_recv_func() calls sock_release() as it
tears down its connection.  But it does this before it removes the
connection from the listener's list.  There's a brief window where the
connection's socket has been released but is still visible on the list.
If the listener tries to shutdown during this time it will crash.

Hitting this window depends on scheduling races during unmount.  The
unmount path has the client close its connection to the server then the
server closes all its connected clients.  If the local mount is the
server then it will have recv work see an error as the client
disconnects and it will be racing to shut down the connection with the
listening thread during unmount.

I think I only saw this in my guests because they're running slower
debug kernels on my slower laptop.  The window of vulnerability while
the released socket is on the list is longer.

The fix is to release the socket while we hold the mutex and are
removing the connection from the list.  A released socket is never
visible on the list.

While we're at it don't use list_for_each_entry_safe() to iterate over
the connection list.  We're not modifying it.  This is a lingering
artifact from previous versions of the server code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-22 14:27:01 -08:00
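
The fix amounts to making the release and the list removal one step
with respect to the listener, roughly like this sketch (struct and
field names are assumed):

    /* hypothetical: a released socket is never visible on the list */
    static void destroy_conn(struct server_info *server, struct conn *conn)
    {
            mutex_lock(&server->mutex);
            list_del_init(&conn->head);
            sock_release(conn->sock);       /* released while off the list */
            conn->sock = NULL;
            mutex_unlock(&server->mutex);

            kfree(conn);
    }

    /* the listener walks the list under the same mutex, so it only ever
     * sees sockets that are still safe to shut down */
    static void shutdown_conns(struct server_info *server)
    {
            struct conn *conn;

            mutex_lock(&server->mutex);
            list_for_each_entry(conn, &server->conn_list, head)
                    kernel_sock_shutdown(conn->sock, SHUT_RDWR);
            mutex_unlock(&server->mutex);
    }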
Zach Brown
f52dc28322 scoutfs: simplify lock use of kernel dlm
We had an excessive number of layers between scoutfs and the dlm code in
the kernel.  We had dlmglue, the scoutfs locks, and task refs.  Each
layer had structs that track the lifetime of the layer below it.  We
were about to add another layer to hold on to locks just a bit longer so
that we can avoid down conversion and transaction commit storms under
contention.

This collapses all those layers into a simple state machine in lock.c
that
manages the mode of dlm locks on behalf of the file system.

The users of the lock interface are mainly unchanged.  We did change
from a heavier trylock to a lighter nonblock lock attempt and have to
change the single rare readpage use.  Lock fields change so a few
external users of those fields change.

This not only removes a lot of code, it also contains functional
improvements.  For example, it can now convert directly to CW locks with
a single lock request instead of having to use two by first converting
to NL.

It introduces the concept of an unlock grace period.  Locks won't be
dropped on behalf of other nodes soon after being unlocked so that tasks
have a chance to batch up work before the other node gets a chance.
This can result in two orders of magnitude improvements in the time it
takes to, say, change a set of xattrs on the same file population from
two nodes concurrently.

There are significant changes to trace points, counters, and debug files
that follow the implementation changes.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-14 15:00:17 -08:00
Zach Brown
4ff1e3020f scoutfs: allocate inode numbers per directory
Having an inode number allocation pool in the super block meant that all
allocations across the mount are interleaved.  This means that
concurrent file creation in different directories will create
overlapping inode numbers.  This leads to lock contention as reasonable
work loads will tend to distribute work by directories.

The easy fix is to have per-directory inode number allocation pools.  We
take the opportunity to clean up the network request so that the caller
gets the allocation instead of having it be fed back in via a weird
callback.

Signed-off-by: Zach Brown <zab@versity.com>
2018-02-09 17:58:19 -08:00
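
A simplified sketch of a per-directory pool of the kind described
(struct layout and the batch request helper are assumptions):

    /* hypothetical: each directory caches a batch of inode numbers */
    struct ino_pool {
            u64 next_ino;
            u64 nr;
    };

    static int alloc_ino(struct inode *dir, struct ino_pool *pool, u64 *ino)
    {
            int ret;

            if (pool->nr == 0) {
                    /* ask the server for a fresh batch; the reply comes
                     * back to the caller rather than through a callback */
                    ret = request_ino_batch(dir->i_sb, &pool->next_ino,
                                            &pool->nr);
                    if (ret)
                            return ret;
            }

            *ino = pool->next_ino++;
            pool->nr--;
            return 0;
    }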
Zach Brown
ec91a4375f scoutfs: unlock the server listen lock
Turns out the server wasn't explicitly unlocking the listen lock!  This
ended up working because we only shut down an active server on unmount
and unmount will tear down the lock space which will drop the still held
listen lock.

That's just dumb.

But it also forced using an awkward lock flag to avoid setting up a task
ref for the lock hold, which wouldn't have been torn down otherwise.  By
adding the unlock we restore balance to the force and can get rid of
that flag.

Cool, cool, cool.

Signed-off-by: Zach Brown <zab@versity.com>
2017-12-08 17:00:44 -06:00
Mark Fasheh
8064a161f0 scoutfs: better tracking of recursive lock holders
This replaces the fragile recursive locking logic in dlmglue. In particular
that code fails when we have a pending downconvert and a process comes in
for a level that's compatible with the existing level. The downconvert will
still happen which causes us to now believe we are holding a lock that we
are not! We could go back to checking for holders that raced our downconvert
worker but that had problems of its own (see commit e8f7ef0).

Instead of trying to infer from lock state what we are allowed to do, let's
be explicit. Each lock now has a tree of task refs. If you come in to
acquire a lock, we look for our task in that tree. If it's not there, we
know this is the first time this task wanted that lock, so we can continue.
Otherwise we increment a count on the task ref and return the already
locked lock. Unlock does the opposite - it finds the task ref and decreases
the count. On zero it will proceed with the actual unlock.

The owning task is the only process allowed to manipulate a task ref, so we
only have to lock manipulation of the tree. We make an exception for
global locks which might be unlocked from another process context (in this
case that means the node id lock).

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-12-08 10:25:30 -08:00
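
A sketch of the lookup side of that task ref tree, keyed by the owning
task (names and locking details are illustrative assumptions):

    #include <linux/rbtree.h>
    #include <linux/sched.h>

    /* hypothetical per-lock record of a task that already holds the lock */
    struct task_ref {
            struct rb_node node;
            struct task_struct *task;
            unsigned int count;
    };

    static struct task_ref *find_task_ref(struct rb_root *root,
                                          struct task_struct *task)
    {
            struct rb_node *n = root->rb_node;

            while (n) {
                    struct task_ref *ref = rb_entry(n, struct task_ref, node);

                    if (task < ref->task)
                            n = n->rb_left;
                    else if (task > ref->task)
                            n = n->rb_right;
                    else
                            return ref;
            }
            return NULL;
    }

    /* lock: no ref found means this is the task's first acquire, so take
     * the lock and insert a ref with count 1; a found ref just gets its
     * count bumped.  unlock: decrement, and only do the real unlock and
     * free the ref when the count hits zero. */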
Zach Brown
cb879d9f37 scoutfs: add network greeting message
Add a network greeting message that's exchanged between the client and
server on every connection to make sure that we have the correct file
system and format hash.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-12 13:57:31 -07:00
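
A sketch of the sort of greeting exchange described (struct layout and
field names are assumptions, not the real wire format):

    #include <linux/types.h>

    /* hypothetical greeting sent by both sides on every connection */
    struct net_greeting {
            __le64 fsid;
            __le64 format_hash;
    } __packed;

    /* reject the peer if it isn't the same fs with the same format */
    static int check_greeting(struct net_greeting *gr, u64 fsid,
                              u64 format_hash)
    {
            if (le64_to_cpu(gr->fsid) != fsid ||
                le64_to_cpu(gr->format_hash) != format_hash)
                    return -EINVAL;

            return 0;
    }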
Zach Brown
1da18d17cf scoutfs: use trylock for global server lock
Shared unmount hasn't worked for a long time because we had no way to
wake the server work out of blocking while trying to acquire the lock.
In the old lock code the wait conditions didn't test ->shutdown.

dlmglue doesn't give us a reasonable way to break a caller out of a
blocked lock.  We could add some code to do it with a global context
that'd have to wake all locks or add a call with a lock resource name,
not a held lock, that'd wake that specific lock.  Neither sounds great.

So instead we'll use trylock to get the server lock.  It's guaranteed to
make reasonable forward progress.  The server work is already requeued
with a delay to retry.

While we're at it we add a global server lock instead of using the weird
magical inode lock in the fs space.  The server lock doesn't need keys
or to participate in item cache consistency, etc.

With this unmount works.  All mounts will now generate regular
background trylock requests.

Signed-off-by: Zach Brown <zab@versity.com>
2017-10-09 15:31:29 -07:00
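
The retry behaviour described above amounts to something like this
sketch (names are assumed):

    /* hypothetical: the server work attempts a trylock and relies on its
     * existing delayed requeue rather than blocking on the lock */
    static void server_work_func(struct work_struct *work)
    {
            struct server_info *server = container_of(to_delayed_work(work),
                                                      struct server_info,
                                                      dwork);

            if (server->shutdown)
                    return;

            if (!try_server_lock(server)) {
                    /* didn't get it; try again later instead of blocking */
                    queue_delayed_work(server->wq, &server->dwork, HZ);
                    return;
            }

            run_server(server);    /* assumed: serves while holding the lock */
    }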
Zach Brown
7854471475 scoutfs: fix server wq destroy warning
We were seeing warnings in destroy_workqueue() which meant that work was
queued on the server workqueue after it was drained and before it was
finally destroyed.

The only work that wasn't properly waited for was the commit work.  It
looks like it'd be idle because the server receive threads all wait for
their request processing work to finish.  But the way the commit work is
batched means that a request can have its commit processed by executing
commit work while leaving the work queued for another run.

Fix this by specifically waiting for the commit work to finish after the
server work has waited for all the recv and compaction work to finish.

I wasn't able to reliably trigger the assertion in repeated xfstests
runs.  This survived many runs as well; let's see if it stops the
destroy_workqueue() assertion from triggering in the future.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-12 15:22:03 -07:00
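
The fix described is an ordering change in the shutdown path, roughly
(helper and field names are assumptions):

    /* hypothetical shutdown ordering: quiesce the work that can queue
     * commits first, then wait for the commit work itself */
    static void server_shutdown(struct server_info *server)
    {
            /* recv and compaction work stop queueing new commit work */
            wait_for_recv_and_compaction(server);       /* assumed helper */

            /* a request can leave the commit work queued for one more
             * run even after its commit was processed, so wait for it */
            flush_work(&server->commit_work);

            destroy_workqueue(server->wq);
    }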
Zach Brown
51e03dcb7a scoutfs: refactor inode locking function
This is based on Mark Fasheh <mfasheh@versity.com>'s series that
introduced inode refreshing after locking and a trylock for readpage.

Rework the inode locking function so that it's more clearly named and
takes flags and the inode struct.

We have callers that want to lock the logical inode but aren't doing
anything with the vfs inode so we provide that specific entry point.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:37:59 -07:00
Zach Brown
87ab27beb1 scoutfs: add statfs network message
The ->statfs method was still using the super_block in the super_info
that was read during mount.  This will get progressively more out
of date.

We add a network message to ask the server for the current fields that
impact statfs.  This is always racy and the fields are mostly nonsense,
but we try our best.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-11 10:43:35 -07:00
Zach Brown
c1b2ad9421 scoutfs: separate client and server net processing
The networking code was really suffering by trying to combine the client
and server processing paths into one file.  The code can be a lot
simpler by giving the client and server their own processing paths that
take their different socket lifecycles into account.

The client maintains a single connection.  Blocked senders work on the
socket under a sending mutex.  The recv path runs in work that can be
canceled after first shutting down the socket.

A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets.  Each accepted socket has
a single recv work blocked waiting for requests.  That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.

All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server.  This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing other work while it was being drained.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:47:42 -07:00