scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-01-07 12:35:28 +00:00

Author	SHA1	Message	Date
Zach Brown	f3764b873b	Save previous connected client address Our connection state spans sockets that can disconnect and reconnect. While sockets are connected we store the socket's remote address in the connection's peername and we clear it as sockets disconnect. Fencing wants to know the last connected address of the mount. It's a bit of metadata we know about the mount that can be used to find it and fence it. As we store the peer address we also stash it away as the last known peer address for the socket. Fencing can then use that instead of the current socket peer address which is guaranteed to be uninitialized because there's no socket connected. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-26 14:18:28 -07:00
Zach Brown	9ebc9d0f66	Manage client reconnect delay The client currently always queues immediate connect work if its notify_down is called. It was assuming that notify_down is only called from a healthy established connection. But it's also called for unsuccessful connect attempts that might not have timed out. Say the host is up but the port isn't listening. This results in spamming connection attempts while an old stale leader block persists, until a new server is elected, fences the previous leader, and updates their quorum block. The fix is to explicitly manage the connection work queueing delay. We only set it to immediately queue on mount and when we see a greeting reply from the server. We always set it to a longer timeout as we start a connection attempt. This means we'll always have a long reconnect delay unless we really connected to a server. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-26 14:18:28 -07:00
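The delay handling described in commit 9ebc9d0f66 can be sketched in a few lines. This is a minimal user-space model, not the scoutfs client code: the class name `ReconnectPolicy`, its method names, and the `long_delay_ms` value are all hypothetical and chosen only for illustration.

```python
class ReconnectPolicy:
    """Hypothetical model of the connect work queueing delay."""

    def __init__(self, long_delay_ms=1000):
        # long_delay_ms is an assumed value; the commit does not state one.
        self.long_delay_ms = long_delay_ms
        # On mount the first connect attempt is queued immediately.
        self.delay_ms = 0

    def start_attempt(self):
        # Every attempt arms the long delay, so a failed attempt
        # (for example, host up but port not listening) cannot respin at once.
        self.delay_ms = self.long_delay_ms

    def greeting_reply(self):
        # Only a greeting reply proves we reached a real server, so only
        # then is the next reconnect allowed to be immediate.
        self.delay_ms = 0

    def notify_down(self):
        # Returns the delay to use when queueing the next connect work.
        return self.delay_ms
```

With this shape, a connection that drops after a successful greeting reconnects immediately, while an attempt that never reached a server waits for the long delay.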
Zach Brown	8b78f701a1	Add fence-and-reclaim test Add a test which exercises the various reasons for fencing mounts and checks that we reclaim the resources that they had. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-26 14:18:28 -07:00
Zach Brown	1f1f40f079	Add fence agent that processes fence requests Signed-off-by: Zach Brown <zab@versity.com>	2021-05-26 14:18:28 -07:00
Zach Brown	943351944a	Call fencing from the server The server is responsible for calling the fencing subsystem. It is the source of fencing requests as it decides that previous mounts are unresponsive. It is responsible for reclaiming resources for fenced mounts and freeing their associated fence request. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-26 14:18:28 -07:00
Zach Brown	b060eb4f5d	Add fencing subsystem Add the subsystem which tracks pending fence requests and exposes them to userspace for processing. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-26 14:18:25 -07:00
Zach Brown	2dde729791	Add sysfs create attr w/ parent Add sysfs attribute creation that can provide the parent dir kobject instead of always creating the sysfs object dir off of the main per-mount dir. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-26 14:18:19 -07:00
Zach Brown	ccb7c0bf4b	Add rw sysfs attr wrapper Add a wrapper around __ATTR_RW so that callers can add attributes with a _store function. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-26 14:18:07 -07:00
Zach Brown	e9d04dcf8d	Add forced unmount support Add super_ops->umount_begin so that we can implement a forced unmount which tries to avoid issuing any more network or storage ops. It can return errors and lose unsynchronized data. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-26 14:02:20 -07:00
Zach Brown	5dceac32db	Merge pull request #40 from versity/zab/data_alloc_zones Zab/data alloc zones	2021-05-24 13:00:48 -07:00
Zach Brown	ef440ead28	Add -z to run-test for data-alloc-zone-blocks Add an option to run-tests which gets passed through to the data-alloc-zone-blocks argument for mkfs. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-21 15:31:02 -07:00
Zach Brown	d0b04e790c	Add data-alloc-zone-blocks argument to mkfs Add an argument to mkfs which sets the data_alloc_zone_blocks volume option. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-21 15:31:02 -07:00
Zach Brown	54644a5074	Add data_alloc_zone_blocks volume option Add the data_alloc_zone_blocks volume option. This changes the behaviour of the server to try and give mounts free data extents which fall in exclusive fixed-size zones. We add the field to the scoutfs_volume_options struct and add it to the set_volopt server handler which enforces constraints on the size of the zones. We then add fields to the log_trees struct which record the size of the zones and set bits for the zones that contain free extents in the data_avail allocator root. The get_log_trees handler is changed to read all the zone bitmaps from all the items, pass those bitmaps in to _alloc_move to direct data allocations, and finally update the bitmaps in the log_trees items to cover the newly allocated extents. The log_trees data_alloc_zone fields are cleared as the mount's logs are reclaimed to indicate that the mount is no longer writing to the zone. The policy mechanism of finding free extents based on the bitmaps is implemented down in _data_alloc_move(). Signed-off-by: Zach Brown <zab@versity.com>	2021-05-21 15:31:02 -07:00
Zach Brown	52c2a465db	Add zone awareness to scoutfs_alloc_move() Add parameters so that scoutfs_alloc_move() can first search for source extents in specified zones. It uses relatively cheap searches through the order items to find extents that intersect with the regions described by the zone bitmaps. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-21 15:31:02 -07:00
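The zone-directed search in commits 54644a5074 and 52c2a465db can be illustrated with a small model. This is only a sketch under assumptions: extents are plain `(start, length)` tuples, a zone bitmap is a Python integer with bit `z` set for zone `z`, and the helper names are hypothetical rather than taken from scoutfs.

```python
def extent_zones(start, length, zone_blocks):
    """Return the zone indexes touched by a non-empty extent."""
    first = start // zone_blocks
    last = (start + length - 1) // zone_blocks
    return range(first, last + 1)


def zone_bitmap(extents, zone_blocks):
    """Set one bit for every fixed-size zone that contains free blocks."""
    bits = 0
    for start, length in extents:
        if length == 0:
            continue
        for zone in extent_zones(start, length, zone_blocks):
            bits |= 1 << zone
    return bits


def pick_extent(free_extents, zone_blocks, wanted_zones):
    """Return the first free extent intersecting a wanted zone, else None."""
    for start, length in free_extents:
        if length == 0:
            continue
        for zone in extent_zones(start, length, zone_blocks):
            if (wanted_zones >> zone) & 1:
                return (start, length)
    return None
```

A caller would fall back to an unrestricted search when `pick_extent` returns `None`; that fallback is an assumption of this sketch, not something the commit messages spell out.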
Zach Brown	bc4975fad4	Add scoutfs_alloc_extents_cb() Add an allocator call for getting a callback for all the extents in btree items in an allocator root. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-21 15:31:02 -07:00
Zach Brown	9de3ae6dcb	Index free extents by order of length Allocators store free extents in two items, one sorted by their blkno position and the other by their precise length. The length index makes it easy to search for precise extent lengths, but it makes it hard to search for a large extent within a given blkno region. Skipping in the blkno dimension has to be done for every precise length value. We don't need that level of precision. If we index the extents by a coarser order of the length then we have a fixed number of orders in which we have to skip in the blkno dimension when searching within a specific region. This changes the length item to be stored at the log(8) order of the length of the extents. This groups extents into orders that are close to the human-friendly base 10 orders of magnitude. With this change the order field in the key no longer stores the precise extent length. To preserve the length of the extent we need to use another field. The only 64bit field remaining is the first which is a higher comparison priority than the type. So we use the highest comparison priority zone field to differentiate the position and order indexes and can now use all three 64bit fields in the key. Finally, we have to be careful when constructing a key to use _next when searching for a large extent. Previously keys were relying on the magic property that building a key from an extent length of 0 ended up at the key value -0 = 0. That only worked because we never stored zero length extents. We now store zero length orders so we can't use the negative trick anymore. We explicitly treat 0 length extents carefully when building keys and we subtract the order from U64_MAX to store the orders from largest to smallest. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-21 15:25:56 -07:00
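The log(8) ordering in commit 9de3ae6dcb can be worked through with a short example. This sketch assumes the order is the floor of the base-8 logarithm of the length and that a key is a three-field tuple; the function names and the exact field layout are hypothetical and do not reproduce the scoutfs key format.

```python
U64_MAX = (1 << 64) - 1


def extent_order(length):
    """floor(log8(length)), with zero-length handled explicitly as order 0."""
    if length == 0:
        return 0
    # floor(log2(length)) is bit_length() - 1; each base-8 order spans 3 bits.
    return (length.bit_length() - 1) // 3


def order_key(length, blkno):
    """Hypothetical key: inverted order first, so larger orders sort first."""
    return (U64_MAX - extent_order(length), blkno, length)


# (length, blkno) pairs sorted by key come out grouped from largest order down.
extents = [(5, 100), (4096, 10), (70, 50)]
by_order = sorted(extents, key=lambda e: order_key(e[0], e[1]))
```

Because every length from 8 through 63 shares order 1, a search restricted to a blkno region only has to skip through a small fixed number of orders instead of once per distinct length.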
Zach Brown	0aa6005c99	Add volume options super, server, and sysfs Introduce global volume options. They're stored in the superblock and can be seen in sysfs files that use network commands to get and set the options on the server. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-19 14:15:06 -07:00
Zach Brown	973dc4fd1c	Merge pull request #38 from versity/zab/read_xattr_deadlocks Zab/read xattr deadlocks	2021-05-03 09:44:57 -07:00
Zach Brown	a5ca5ee36d	Put back-to-back invalidated locks back on list A lock that is undergoing invalidation is put on a list of locks in the super block. Invalidation requests put locks on the list. While locks are invalidated they're temporarily put on a private list. To support a request arriving while the lock is being processed we carefully manage the invalidation fields in the lock between the invalidation worker and the incoming request. The worker correctly noticed that a new invalidation request had arrived but it left the lock on its private list instead of putting it back on the invalidation list for further processing. The lock was unreachable, wouldn't get invalidated, and caused everyone trying to use the lock to block indefinitely. When the worker sees another request arrive for an invalidating lock it needs to move the lock from the private list back to the invalidation list. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-30 10:00:07 -07:00
Zach Brown	603af327ac	Ignore I_FREEING in all inode hash lookups Previously we added an ilookup variant that ignored I_FREEING inodes to avoid a deadlock between lock invalidation (lock->I_FREEING) and eviction (I_FREEING->lock). Now we're seeing similar deadlocks between eviction (I_FREEING->lock) and fh_to_dentry's iget (lock->I_FREEING). I think it's reasonable to ignore all inodes with I_FREEING set when we're using our _test callback in ilookup or iget. We can remove the _nofreeing ilookup variant and move its I_FREEING test into the iget_test callback provided to both ilookup and iget. Callers will get the same result, it will just happen without waiting for a previously I_FREEING inode to leave. They'll get NULL instead of waiting from ilookup. They'll allocate and start to initialize a newer instance of the inode and insert it alongside the previous instance. We don't have inode number re-use so we don't have the problem where a newly allocated inode number is relying on inode cache serialization to not find a previously allocated inode that is being evicted. This change does allow for concurrent iget of an inode number that is being deleted on a local node. This could happen in fh_to_dentry with a raw inode number. But this was already a problem between mounts because they don't have a shared inode cache to serialize them. Once we fix that between nodes, we fix it on a single node as well. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-28 12:22:10 -07:00
Zach Brown	ca320d02cb	Get i_mutex before cluster lock in file aio_read The vfs often calls filesystem methods with i_mutex held. This creates a natural ordering of i_mutex outside of cluster locks. The file aio_read method acquired i_mutex after its cluster lock, creating a deadlock with other vfs methods like setattr. The acquisition of i_mutex after the cluster lock was due to using the pattern where we use the per-task lock to discover if we're the first user of the lock in a call chain. Readpage has to do this, but file aio_read doesn't. It should never be called recursively. So we can acquire the i_mutex outside of the cluster lock and warn if we ever are called recursively. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-28 12:11:06 -07:00
Zach Brown	5231cf4034	Add export-lookup-evict-race test Add a test that creates races between fh_to_dentry and eviction triggered by lock invalidation. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-28 12:11:06 -07:00
Andy Grover	f631058265	Merge pull request #37 from versity/zab/test_mkdir_rename_unlink Add mkdir-rename-rmdir test	2021-04-27 13:21:27 -07:00
Zach Brown	1b4e60cae4	Add mkdir-rename-rmdir test Add a test which performs mkdir, two renames of the dir, and rmdir on all possible combinations of mounts. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-27 12:01:43 -07:00
Andy Grover	6eeaab3322	Merge pull request #35 from versity/zab/invalidate_already_pending Handle back to back invalidation requests	2021-04-23 16:40:45 -07:00
Andy Grover	ac68d14b8d	Merge pull request #36 from versity/zab/move_blocks_next_einval Fix accidental EINVAL in move_blocks	2021-04-23 14:39:29 -07:00
Zach Brown	ecfc8a0d0e	Merge pull request #33 from versity/zab/open_ino_map Zab/open ino map	2021-04-23 10:55:11 -07:00
Zach Brown	63148d426e	Fix accidental EINVAL in move_blocks When move blocks is staging it requires an overlapping offline extent to cover the entire region to move. It performs the stage by modifying one extent at a time. If there are fragmented source extents it will modify each of them in turn in the region. When looking for the extent to match the source extent it looked from the iblock of the start of the whole operation, not the start of the source extent it's matching. This meant that it would find the previous extent it had just modified, which was now online rather than offline, and would return -EINVAL. The fix is to have it search from the logical start of the extent it's trying to match, not the start of the region. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-23 10:39:34 -07:00
Zach Brown	a27c54568c	Handle back to back invalidation requests The client's incoming lock invalidation request handler triggers a BUG_ON if it gets a request for a lock that is already processing a previous invalidation request. The server is supposed to only send one request at a time. The problem is that the batched invalidation request handling will send responses outside of spinlock coverage before reacquiring the lock and finishing processing once the response send has been successful. This gives a window for another invalidation request to arrive after the response was sent but before the invalidation finished processing. This triggers the bug. The fix is to mark the lock such that we can recognize a valid second request arriving after we send the response but before we finish processing. If it arrives we'll continue invalidation processing with the arguments from the new request. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-22 17:00:50 -07:00
Zach Brown	dfc2f7a4e8	Remove unused scoutfs_free_unused_locks nr arg The nr argument wasn't used. It always tries to free as many as the shrinker call will let it. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	94dd86f762	Process lock invalidation after shutdown Lock teardown during unmount involves first calling shutdown and then destroy. The shutdown call is meant to ensure that it's safe to tear down the client network connections. Once shutdown returns locking is promising that it won't call into the client to send new lock requests. The current shutdown implementation is very heavy handed and shuts down everything. This creates a deadlock. After calling lock shutdown, the client will send its farewell and wait for a response. The server might not send the farewell response until other mounts have unmounted if our client is in the server's mount. In this case we still have to be processing lock invalidation requests to allow other unmounting clients to make forward progress. This is reasonably easy and safe to do. We only use the shutdown flag to stop lock calls that would change lock state and send requests. We don't have it stop incoming request processing in the work queueing functions. It's safe to keep processing incoming requests between _shutdown and _destroy because the requests already come in through the client. As the client shuts down it will stop calling us. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	841d22e26e	Disable task reclaim flags for block cache vmalloc Even though we can pass in gfp flags to vmalloc it eventually calls pte alloc functions which ignore the caller's flags and use user gfp flags. This risks reclaim re-entering fs paths during allocations in the block cache. These allocs that allowed reclaim deep in the fs were causing lockdep to add RECLAIM dependencies between locks and holler about deadlocks. We apply the same pattern that xfs does for disabling reclaim while allocating vmalloced block payloads. Setting PF_MEMALLOC_NOIO causes reclaim in that task to clear __GFP_IO and __GFP_FS, regardless of the individual allocation flags in the task, preventing recursion. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	ba8bf13ae1	Update dmesg whitelist for recovery The shared recovery layer outputs different messages than when it ran only for lock_recovery in the lock server. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	2949b6063f	Clear lock invalidate_pending during destroy Locks have a bunch of state that reflects concurrent processing. Testing that state determines when it's safe to free a lock because nothing is going on. During unmount we abruptly stop processing locks. Unmount will send a farewell to the server which will remove all the state associated with the client that's unmounting for all its locks, regardless of the state the locks were in. The client unmount path has to clean up the interrupted lock state and free it, carefully avoiding assertions that would otherwise indicate that we're freeing used locks. The move to async lock invalidation forgot to clean up the invalidation state. Previously a synchronous work function would set and clear invalidate_pending while it was running. Once we finished waiting for it invalidate_pending would be clear. The move to async invalidation work meant that we can still have invalidate_pending with no work executing. Lock destruction removed locks from the invalidation list but forgot to clear the invalidate_pending flag. This triggered assertions during unmount that were otherwise harmless. There was no other use of the lock, we just forgot to clean up the lock state. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	1e88aa6c0f	Shutdown data after trans The data_info struct holds the data allocator that is filled by transactions as they commit. We have to free it after we've shutdown transactions. It's more like the forest in this regard so we move its destruction down by the forest to group similar behaviour. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	d9aea98220	Shutdown locking before transactions Shutting down the lock client waits for invalidation work and prevents future work from being queued. We're currently shutting down the subsystems that lock calls before lock itself, leading to crashes if we happen to have invalidations executing as we unmount. Shutting down locking before its dependencies fixes this. This was hit in testing during the inode deletion fixes because it created the perfect race by acquiring locks during unmount so that the server was very likely to send invalidations to one mount on behalf of another as they both unmounted. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	04f4b8bcb3	Perform final transaction write before shutdown Shutting down the transaction during unmount relied on the vfs unmount path to perform a sync of any remaining dirty transaction. There are ways that we can dirty a transaction during unmount after it calls the fs sync, so we try to write any remaining dirty transaction before shutting down. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	fead263af3	Remove unused sb_info shutdown We're no longer using the shutdown field in our sb info struct. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	4389c73c14	Fix deadlock between lock invalidate and evict We've had a long-standing deadlock between lock invalidation and eviction. Invalidating a lock wants to lookup inodes and drop their resources while blocking locks. Eviction wants to get a lock to perform final deletion while the inode has I_FREEING set which blocks lookups. We only saw this deadlock a handful of times in all of the time we've run the code, but it's much more common now that we're acquiring locks in iput to test that nlink is zero instead of only when nlink is zero. I see unmount hang regularly when testing final inode deletion. This adds a lookup variant for invalidation which will refuse to return freeing inodes so they won't be waited on. Once they're freeing they can't be seen by future lock users so they don't need to be invalidated. This keeps the lock invalidation promise and avoids sleeping on freeing inodes which creates the deadlock. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	dba88705f7	Fix t_umount mount point number t_umount had a typo that had it try to unmount a mount based on a caller's variable, which accidentally happened to work for its only caller. Future callers would not have been so lucky. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	715c29aad3	Proactively drop dentry/inode caches outside locks Previously we wouldn't try and remove cached dentries and inodes as lock revocation removed cluster lock coverage. The next time we tried to use the cached dentries or inodes we'd acquire a lock and refresh them. But now cached inodes prevent final inode deletion. If they linger outside cluster locking then any final deletion will need to be deferred until all its cached inodes are naturally dropped at some point in the future across the cluster. It might take refreshing the dentries or for memory pressure to push out the old cached inodes. This tries to proactively drop cached dentries and inodes as we lose cluster lock coverage if they're not actively referenced. We need to be careful not to perform final inode deletion during lock invalidation because it will deadlock, so we defer an iput which could delete during evict out to async work. Now deletion can be done synchronously in the task that is performing the unlink because previous use of the inode on remote mounts hasn't left unused cached inodes sitting around. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	b244b2d59c	Add inode-deletion test Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	22371fe5bd	Fully destroy inodes after all mounts evict Today an inode's items are deleted once its nlink reaches zero and the final iput is called in a local mount. This can delete inodes from under other mounts which have opened the inode before it was unlinked on another mount. We fix this by adding cached inode tracking. Each mount maintains groups of cached inode bitmaps at the same granularity as inode locking. As a mount performs its final iput it gets a bitmap from the server which indicates if any other mount has inodes in the group open. This makes the two fast paths of opening and closing linked files and of deleting a file that was unlinked locally only pay a moderate cost of either maintaining the bitmap locally or getting the open map once per lock group. Removing many files in a group will only lock and get the open map once per group. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	c6fd807638	Use recov to manage lock recovery Now that we have the recov layer we can have the lock server use it to track lock recovery. The lock server no longer needs its own recovery tracking structures and can instead call recov. We add a call for the server to call to kick lock processing once lock recovery finishes. We can get rid of the persistent lock_client items now that the server is driving recovery from the mounted_client items. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 12:10:35 -07:00
Zach Brown	592f472a1c	Use recov in server to recover client greetings The server starts recovery when it finds mounted client items as it starts up. The clients are done recovering once they send their greeting. If they don't recover in time then they'll be fenced. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 12:10:35 -07:00
Zach Brown	a65775588f	Add server recovery helpers Add a little set of functions to help the server track which clients are waiting to recover which state. The open map messages need to wait for recovery so we're moving recovery out of being only in the lock server. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 12:10:35 -07:00
Zach Brown	da1af9b841	Add scoutfs inode ino lock coverage Add lock coverage which tracks if the inode has been refreshed and is covered by the inode group cluster lock. This will be used by drop_inode and evict_inode to discover that the inode is current and doesn't need to be refreshed. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 12:10:35 -07:00
Zach Brown	accd680a7e	Fix block setup always returning 0 Another case of returning 0 instead of ret. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 12:10:35 -07:00
Andy Grover	cbb031bb5d	Merge pull request #32 from versity/zab/block_rhashtable_insert_fixes Zab/block rhashtable insert fixes	2021-04-13 10:42:17 -07:00
Zach Brown	c3290771a0	Block cache use rht _lookup_ insert for EEXIST The sneaky rhashtable_insert_fast() can't return -EEXIST despite the last line of the function REALLY making it look like it can. It just inserts new objects at the head of the bucket lists without comparing the insertion with existing objects. The block cache was relying on insertion to resolve duplicate racing allocated blocks. Because it couldn't return -EEXIST we could get duplicate cached blocks present in the hash table. rhashtable_lookup_insert_fast() fixes this by actually comparing the inserted object's key with the objects found in the insertion bucket. A racing allocator trying to insert a duplicate cached block will get an error, drop their allocated block, and retry their lookup. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 09:24:23 -07:00
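The difference that commit c3290771a0 describes can be modelled with a toy bucketed hash table. This is a user-space illustration of the behaviour stated in the commit message, not the kernel rhashtable API; the class `ToyTable` and its methods are hypothetical names that merely mirror the two kernel functions.

```python
import errno


class ToyTable:
    """Toy chained hash table contrasting blind insert with lookup-insert."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert_fast(self, key, obj):
        # Blind insert: push onto the head of the bucket without comparing
        # against existing entries, so duplicates are silently accepted.
        self._bucket(key).insert(0, (key, obj))
        return 0

    def lookup_insert_fast(self, key, obj):
        # Compare the new key with every entry in the bucket first and
        # refuse the insert if it is already present.
        bucket = self._bucket(key)
        if any(existing == key for existing, _ in bucket):
            return -errno.EEXIST
        bucket.insert(0, (key, obj))
        return 0

    def count(self, key):
        return sum(1 for existing, _ in self._bucket(key) if existing == key)
```

In the racing-allocator case from the commit, the loser of the race would see `-errno.EEXIST`, drop the block it allocated, and retry its lookup to find the winner's block.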


1376 Commits