scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-07-29 19:43:13 +00:00

Author	SHA1	Message	Date
Zach Brown	a61b8d9961	Fix renaming into root directory The VFS performs a lot of checks on renames before calling the fs method. We acquire locks and refresh inodes in the rename method so we have to duplicate a lot of the vfs checks. One of the checks involves loops with ancestors and subdirectories. We missed the case where the root directory is the destination and doesn't have any parent directories. The backref walker it calls returns -ENOENT instead of 0 with an empty set of parents and that error bubbled up to rename. The fix is to notice when we're asking for ancestors of the one directory that can't have ancestors and short circuit the test. Signed-off-by: Zach Brown <zab@versity.com>	2023-03-08 11:00:59 -08:00
Zach Brown	2e2ccb6f61	Allow replaying srch file rotation When a client no longer needs to append to a srch file, for whatever reason, we move the reference from the log_trees item into a specific srch file btree item in the server's srch file tracking btree. Zeroing the log_trees item and inserting the server's btree item are done in a server commit and should be written atomically. But commit_log_trees had an error handling case that could leave the newly inserted item dirty in memory without zeroing the srch file reference in the existing log_trees item. Future attempts to rotate the file reference, perhaps by retrying the commit or by reclaiming the client's rid, would get EEXIST and fail. This fixes the error handling path to ensure that we'll keep the dirty srch file btree and log_trees item in sync. The desynced items can still exist in the world so we'll tolerate getting EEXIST on insertion. After enough time has passed, or if repair zeroed the duplicate reference, we could remove this special case from insertion. Signed-off-by: Zach Brown <zab@versity.com>	2023-01-17 14:33:27 -08:00
Zach Brown and GitHub	01c8bba56d	Merge pull request #109 from versity/zab/server_statfs_stable_blocks Zab/server statfs stable blocks	2023-01-12 09:58:48 -08:00
Zach Brown and GitHub	17cb1fe84b	Merge pull request #110 from versity/zab/partial_alloc_move Allow partial extent motion	2023-01-12 09:58:12 -08:00
Zach Brown	a23e7478a0	Fix move_blocks loop exit conditions The move_blocks ioctl intends to only move extents whose bytes fall inside i_size. This is easy except for a final extent that straddles an i_size that isn't aligned to 4K data blocks. The code that either checked for an extent being entirely past i_size or for limiting the number of blocks to move by i_size clumsily compared i_size offsets in bytes with extent counts in 4KB blocks. In just the right circumstances, probably with the help of a byte length to move that is much larger than i_size, the length calculation could result in trying to move 0 blocks. Once this hit the loop would keep finding that extent and calculating 0 blocks to move and would be stuck. We fix this by clamping the count of blocks in extents to move in terms of byte offsets at the start of the loop. This gets rid of the extra size checks and byte offset use in the loop. We also add a sanity check to make sure that we can't get stuck if, say, corruption resulted in an otherwise impossible zero length extent. Signed-off-by: Zach Brown <zab@versity.com>	2023-01-10 09:34:52 -08:00
Zach Brown	7c2d83e2f8	Remove saved super block in scoutfs_sb_info Now that we've removed its users we can remove the global saved copy of the super block from scoutfs_sb_info. Signed-off-by: Zach Brown <zab@versity.com>	2023-01-06 11:15:45 -08:00
Zach Brown	40aa47c888	Have the server keep a private dirty super block As the server does its work its transactions modify a dirty super block in memory. This used the global super block in scoutfs_sb_info which was visible to everything, including the client. Move the dirty super block over to the private server info so that only the server can see it. This is mostly boring storage motion but we do change that the quorum code hands the server a static copy of the quorum config to use as it starts up before it reads the most recent super block. Signed-off-by: Zach Brown <zab@versity.com>	2023-01-06 11:15:45 -08:00
Zach Brown	c1bd7bcce5	Allow partial extent motion Refilling a client's data_avail is the only alloc_move call that doesn't try and limit the number of blocks that it dirties. If it doesn't find sufficiently large extents it can exhaust the server's alloc budget without hitting the target. It'll try to dirty blocks and return a hard error. This changes that behaviour to allow returning 0 if it moved any extents. Other callers can deal with partial progress as they already limit the blocks they dirty. This will also return ENOSPC if it hadn't moved anything just as the current code would. The result is that data fill can not necessarily hit the target. It might take multiple commits to fill the data_avail btree. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-15 20:47:41 -08:00
Zach Brown	7720222588	Have statfs use unlocked stable roots The server's statfs request handler was intending to lock dirty structures as they were walked to get sums used for statfs fields. Other callers walk stable structures, though, so the summation calls had grown iteration over other structures that the server didn't know it had to lock. This meant that the server was walking unlocked dirty structures as they were being modified. The races are very tight, but it can result in request handling errors that shut down connections and IO errors from trying to read inconsistent refs as they were modified by the locked writer. We've built up infrastructure so the server can now walk stable structures just like the other callers. It will no longer wander into dirty blocks so it doesn't need to lock them and it will retry if its walk of stale data crosses a broken reference. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	fff07ce19c	Use stale block read retrying helper Transition from manual checking for persistent ESTALE to the shared helper that we just added. This should not change behavior. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	464de56d28	Add stale block read retrying helper Many readers had little implementations of the logic to decide to retry stale reads with different refs or decide that they're persistent and return hard errors. Let's move that into a small helper. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	342c206550	Have scoutfs_forest_inode_count return stale reads scoutfs_forest_inode_count() assumed it was called with stable refs and would always translate ESTALE to EIO. Change it so that it passes ESTALE to the caller who is responsible for handling it. The server will use this to retry reading from stable supers that it's storing in memory. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	fe4734d019	Save a full stable super in the server The server has a mechanism for tracking the last stable roots used by network rpcs. We expand it a bit to include the entire super so that we can add users in the server which want the last full stable super. We can still use the stable super to give out the stable roots. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	b1a43bb312	Make quorum config use more precise The quorum code was using the copy of the super block in the sb info for its config. With that going away we make different users more carefully reference the config. The quorum agent has a copy that it reads on setup, the client rarely reads a copy when trying to connect, and the server uses its super. This is about data access isolation and should have no functional effect other than to cause more super reads. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	929703213f	Add fsid sbi field A few paths throughout the code get the fsid for the current mount by using the copy of the super block that we store in the scoutfs_sb_info for the mount. We'd like to remove the super block from the sbi and it's cleaner to have a specific constant field for the fsid of the mount which will not change. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	8e067b3d3f	Truncate dirties zero tail extension When we truncate away from a partial block we need to zero its tail that was past i_size and dirty it so that it's written. We missed the typical vfs boilerplate of calling block_truncate_page from setattr->set_size that does this. We need to be a little careful to pass our file lock down to get_block and then queue the inode for writeback so it's written out with the transaction. This follows the pattern in .write_end. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-06 10:31:31 -08:00
Zach Brown	276fbebdac	Avoid dput in lock invalidation The d_prune_aliases in lock invalidation was thought to be safe because the caller had an inode reference, surely it can't get into iput_final. I missed the fundamental dcache pattern that dput can ascend through parents and end up in inode eviction for entirely unrelated inodes. It's very easy for this to deadlock, imagine if nothing else that the inode invalidation is blocked on in dput->iput->evict->delete->lock is itself in the list of locks to invalidate in the caller. We fix this by always kicking off d_prune and dput into async work. This increases the chance that inodes will still be referenced after invalidation and prevent inline deletion. More deletions can be deferred until the orphan scanner finds them. It should be rare, though. We're still likely to put and drop invalidated inodes before a writer gets around to removing the final unlink and asking us for the omap that describes our cached inodes. To perform the d_prune in work we make it a behavioural flag and make our queued iputs a little more robust. We use much safer and understandable locking to cover the count and the new flags and we put the work in re-entrant work in their own workqueue instead of one work instance in the system_wq. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-02 12:28:13 -08:00
Zach Brown	71ed4512dc	Include primary lock write_seq for write_only vers FS items are deleted by logging a deletion item that has a greater item version than the item to delete. The versions are usually maintained by the write_seq of the exclusive write lock that protects the item. Any newer write hold will have a greater version than all previous write holds so any items created under the lock will have a greater vers than all previous items under the lock. All deletion items will be merged with the older item and both will be dropped. This doesn't work for concurrent write-only locks. The write-only locks match with each other so their write_seqs are assigned in the order that they are granted. That grant order can be mismatched with item creation order. We can get deletion items with lesser versions than the item to delete because of when each creation's write-only lock was granted. Write only locks are used to maintain consistency between concurrent writers and readers, not between writers. Consistency between writers is done with another primary write lock. For example, if you're writing seq items to a write-only region you need to have the write lock on the inode for the specific seq item you're writing. The fix, then, is to pass these primary write locks down to the item cache so that it can choose an item version that is the greatest amongst the transaction, the write-only lock, and the primary lock. This now ensures that the primary lock's increasing write_seq makes it down to the item, bringing item version ordering in line with exclusive holds of the primary lock. All of this to fix concurrent inode updates sometimes leaving behind duplicate meta_seq items because old seq item deletions ended up with older versions than the seq item they tried to delete, nullifying the deletion. Signed-off-by: Zach Brown <zab@versity.com>	2022-11-15 13:26:32 -08:00
Zach Brown	aed4313995	Simplify dentry verification Now that we've removed the hash and pos from the dentry_info struct we can do without it. We can store the refresh gen in the d_fsdata pointer (sorry, 64bit only for now.. could allocate if we needed to.) This gets rid of the lock coverage spinlocks and puts a bit more pressure on lock lookup, which we already know we have to make more efficient. We can get rid of all the dentry info allocation calls. Now that we're not setting d_op as we allocate d_fsdata we put the ops on the super block so that we get d_revalidate called on all our dentries. We also are a bit more precise about the errors we can return from verification. If the target of a dentry link changes then we return -ESTALE rather than silently performing the caller's operation on another inode. Signed-off-by: Zach Brown <zab@versity.com>	2022-10-27 14:32:06 -07:00
Zach Brown	61d86f7718	Add scoutfs_lock_ino_refresh_gen Add a lock call to get the current refresh_gen of a held lock. If the lock doesn't exist or isn't readable then we return 0. This can be used to track lock coverage of structures without the overhead and lifetime binding of the lock coverage struct. Signed-off-by: Zach Brown <zab@versity.com>	2022-10-27 14:16:07 -07:00
Zach Brown	717b56698a	Remove __exit from scoutfs_sysfs_exit() scoutfs_sysfs_exit() is called during error handling in module init. When scoutfs is built-in (so, never.) the __exit section won't be loaded. Remove the __exit annotation so it's always available to be called. Signed-off-by: Zach Brown <zab@versity.com>	2022-10-26 16:42:27 -07:00
Zach Brown	c92a7ff705	Don't use dentry private hash/pos for deletion The dentry cache life cycles are far too crazy to rely on d_fsdata being kept in sync with the rest of the dentry fields. Callers can do all sorts of crazy things with dentries. Only unlink and rename need these fields and those operations are already so expensive that item lookups to get the current actual hash and pos are lost in the noise. Signed-off-by: Zach Brown <zab@versity.com>	2022-10-26 16:42:26 -07:00
Zach Brown	ef2daf8857	Make data preallocation tunable Make mount options for the size of preallocation and whether or not it should be restricted to extending writes. Disabling the default restriction to streaming writes lets it preallocate in aligned regions of the preallocation size when they contain no extents. Signed-off-by: Zach Brown <zab@versity.com>	2022-10-14 14:03:35 -07:00
Zach Brown	ddc5d9f04d	Allow setting orphan_scan_delay_ms option The orphan_scan_delay_ms option setting code mistakenly set the default before testing the option for -1 (not the default) to discover if multiple options had been set. This made any attempt to set fail. Initialize the option to -1 so the first set succeeds and apply the default if we don't set the value. Signed-off-by: Zach Brown <zab@versity.com>	2022-09-28 10:36:10 -07:00
Zach Brown	433a80c6fc	Add compat for changing posix_acl_valid arguments Signed-off-by: Zach Brown <zab@versity.com>	2022-09-28 10:36:10 -07:00
Zach Brown	29538a9f45	Add POSIX ACL support Add support for the POSIX ACLs as described in acl(5). Support is enabled by default and can be explicitly enabled or disabled with the acl or noacl mount options, respectively. Signed-off-by: Zach Brown <zab@versity.com>	2022-09-28 10:36:10 -07:00
Zach Brown	1826048ca3	Add _locked xattr get and set calls The upcoming acl support wants to be able to get and set xattrs from callers who already have cluster locks and transactions. We refactor the existing xattr get and set calls into locked and unlocked variants. It's mostly boring code motion with the unfortunate situation that the caller needs to acquire the totl cluster lock before holding a transaction before calling into the xattr code. We push the parsing of the tags to the caller of the locked get and set so that they can know to acquire the right lock. (The acl callers will never be setting scoutfs. prefixed xattrs so they will never have tags.) Signed-off-by: Zach Brown <zab@versity.com>	2022-09-28 10:11:24 -07:00
Zach Brown	798fbb793e	Move to xattr_handler xattr prefix dispatch Move to the use of the array of xattr_handler structs on the super to dispatch set and get from generic_ based on the xattr prefix. This will make it easier to add handling of the pseudo system. ACL xattrs. Signed-off-by: Zach Brown <zab@versity.com>	2022-09-21 14:24:52 -07:00
Zach Brown	1cbc927ccb	Only clear trying inode deletion bit when set try_delete_inode_items() is responsible for making sure that it's safe to delete an inode's persistent items. One of the things it has to check is that there isn't another deletion attempt on the inode in this mount. It sets a bit in lock data while it's working and backs off if the bit is already set. Unfortunately it was always clearing this bit as it exited, regardless of whether it set it or not. This would let the next attempt perform the deletion again before the working task had finished. This was often not a problem because background orphan scanning is the only source of regular concurrent deletion attempts. But it's a big problem if a deletion attempt takes a very long time. It gives enough time for an orphan scan attempt to clear the bit then try again and clobber on whoever is performing the very slow deletion. I hit this in a test that built files with an absurd number of fragmented extents. The second concurrent orphan attempt was able to proceed with deletion and performed a bunch of duplicate data extent frees and caused corruption. The fix is to only clear the bit if we set it. Now all concurrent attempts will back off until the first task is done. Signed-off-by: Zach Brown <zab@versity.com>	2022-07-29 11:25:01 -07:00
Zach Brown	233fbb39f3	Limit alloc_move per-call allocator consumption Recently scoutfs_alloc_move() was changed to try and limit the amount of metadata blocks it could allocate or free. The intent was to stop concurrent holders of a transaction from fully consuming the available allocator for the transaction. The limiting logic was a bit off. It stopped when the allocator had the caller's limit remaining, not when it had consumed the caller's limit. This is overly permissive and could still allow concurrent callers to consume the allocator. It was also triggering warning messages when a call consumed more than its allowed budget while holding a transaction. Unfortunately, we don't have per-caller tracking of allocator resource consumption. The best we can do is sample the allocators as we start and return if they drop by the caller's limit. This is overly conservative in that it accounts any consumption during concurrent callers to all callers. This isn't perfect but it makes the failure case less likely and the impact shouldn't be significant. We don't often have a lot of concurrency and the limits are larger than callers will typically consume. Signed-off-by: Zach Brown <zab@versity.com>	2022-07-29 11:25:01 -07:00
Zach Brown	198d3cda32	Add scoutfs_alloc_meta_low_since() Add scoutfs_alloc_meta_low_since() to test if the metadata avail or freed resources have been used by a given amount since a previous snapshot. Signed-off-by: Zach Brown <zab@versity.com>	2022-07-29 11:24:10 -07:00
Zach Brown	e8c64b4217	Move freed data extents in multiple server commits As _get_log_trees() in the server prepares the log_trees item for the client's commit, it moves all the freed data extents from the log_trees item into core data extent allocator btree items. If the freed blocks are very fragmented then it can exceed a commit's metadata allocation budget trying to dirty blocks in the free data extent btree. The fix is to move the freed data extents in multiple commits. First we move a limited number in the main commit that does all the rest of the work preparing the commit. Then we try to move the remaining freed extents in multiple additional commits. Signed-off-by: Zach Brown <zab@versity.com>	2022-07-28 11:42:33 -07:00
Zach Brown	ba9a106f72	Free send attempts to disconnected clients Callers who send to specific client connections can get -ENOTCONN if their client has gone away. We forgot to free the send tracking struct in that case. Signed-off-by: Zach Brown <zab@versity.com>	2022-07-06 15:16:20 -07:00
Zach Brown	310725eb72	Free omap rid list as server exits The omap code keeps track of rids that are connected to the server. It only freed the tracked rids as the server told it that rids were being removed. But that removal only happened as clients were evicted. If the server shut down it'd leave the old rid entries around. They'd be leaked as the mount was unmounted and could linger and create duplicate entries if the server started back up and the same clients reconnected. The fix is to free the tracking rids as the server shuts down. They'll be rebuilt as clients reconnect if the server restarts. Signed-off-by: Zach Brown <zab@versity.com>	2022-07-06 15:16:19 -07:00
Zach Brown	51a8236316	Fix missed partial fill_super teardown If we return an error from .fill_super without having set sb->s_root then the vfs won't call our put_super. Our fill_super is careful to call put_super so that it can tear down partial state, but we weren't doing this with a few very early errors in fill_super. This tripped leak detection when we weren't freeing the sbi when returning errors from bad option parsing. Signed-off-by: Zach Brown <zab@versity.com>	2022-07-06 15:16:19 -07:00
Zach Brown	f3dd00895b	Don't allocate zero size net info Clients don't use the net conn info and specified that it has 0 size. The net layer would try and allocate a zero size region which returns the magic ZERO_SIZE_PTR, which it would then later try and free. While that works, it's a little goofy. We can avoid the allocation when the size is 0. The pointer will remain null which kfree also accepts. Signed-off-by: Zach Brown <zab@versity.com>	2022-07-06 15:16:19 -07:00
Zach Brown	31e474c5fa	Protect get_log_trees corruption with assertion Like a lot of places in the server, get_log_trees() doesn't have the tools it needs to safely unwind partial changes in the face of an error. In the worst case, it can have moved extents from the mount's log_trees item into the server's main data allocator. The dirty data allocator reference is in the super block so it can be written later. The dirty log_trees reference is on stack, though, so it will be thrown away on error. This ends up duplicating extents in the persistent structures because they're written in the new dirty allocator but still remain in the unwritten source log_trees allocator. This change makes it harder for that to happen. It dirties the log_trees item and always tries to update so that the dirty blocks are consistent if they're later written out. If we do get an error updating the item we throw an assertion. It's not great, but it matches other similar circumstances in other parts of the server. Signed-off-by: Zach Brown <zab@versity.com>	2022-06-17 14:22:59 -07:00
Zach Brown	ae55fa3153	Set sk_allocation on TCP sockets We were setting sk_allocation on the quorum UDP sockets to prevent entering reclaim while using sockets but we missed setting it on the regular messaging TCP sockets. This could create deadlocks where the sending socket could enter scoutfs reclaim and wait for server messages while holding the socket lock, preventing the receive thread from receiving messages while it blocked on the socket lock. The fix is to prevent entering the FS to reclaim during socket allocations. Signed-off-by: Zach Brown <zab@versity.com>	2022-06-14 08:21:19 -07:00
Zach Brown	0d4bf83da3	Reclaim log_trees alloc roots in multiple commits Client log_trees allocator btrees can build up quite a number of extents. In the right circumstances fragmented extents can have to dirty a large number of paths to leaf blocks in the core allocator btrees. It might not be possible to dirty all the blocks necessary to move all the extents in one commit. This reworks the extent motion so that it can be performed in multiple commits if the meta allocator for the commit runs out while it is moving extents. It's a minimal fix with as little disruption to the ordering of commits and locking as possible. It simply bubbles up an error when the allocators run out and retries functions that can already be retried in other circumstances. Signed-off-by: Zach Brown <zab@versity.com>	2022-06-08 11:53:53 -07:00
Zach Brown and GitHub	45d90a5ae4	Merge pull request #86 from versity/zab/increase_server_commit_block_budget Increase server commit dirty block budget	2022-05-06 09:47:47 -07:00
Zach Brown	48f1305a8a	Increase server commit dirty block budget We're seeing allocator motion during get_log_trees dirty quite a lot of blocks, which makes sense. Let's continue to up the budget. If we still need significantly larger budgets we'll want to look into capping the dirty block use of the allocator extent movers which will mean changing callers to support partial progress. Signed-off-by: Zach Brown <zab@versity.com>	2022-05-05 12:11:14 -07:00
Zach Brown	ca526e2bc0	Lock recovery uses old mode while invalidating When a new server starts up it rebuilds its view of all the granted locks with lock recovery messages. Clients give the server their granted lock modes which the server then uses to process all the resent lock requests from clients. The lock invalidation work in the client is responsible for transitioning an old granted mode to a new invalidated mode from an unsolicited message from the server. It has to process any client state that'd be incompatible with the new mode (write dirty data, drop caches). While it is doing this work, as an implementation short cut, it sets the granted lock mode to the new mode so that users that are compatible with the new invalidated mode can use the lock while it's being invalidated. Picture readers reading data while a write lock is invalidating and writing dirty data. A problem arises when a lock recover request is processed during lock invalidation. The client lock recover request handler sends a response with the current granted mode. The server takes this to mean that the invalidation is done but the client invalidation worker might still be writing data, dropping caches, etc. The server will allow the state machine to advance which can send grants to pending client requests which believed that the invalidation was done. All of this can lead to a grant response handler in the client tripping the assertion that there can not be cached items that were incompatible with the old mode in a grant from the server. Invalidation might still be invalidating caches. Hitting this bug is very rare and requires a new server starting up while a client has both a request outstanding and an invalidation being processed when the lock recover request arrives. The fix is to record the old mode during invalidation and send that in lock recover responses. This can lead the lock server to resend invalidation requests to the client. The client already safely handles duplicate invalidation requests from other failover cases. Signed-off-by: Zach Brown <zab@versity.com>	2022-04-27 12:20:56 -07:00
Zach Brown	65654ee7c0	Fix getxattr with large values giving EINVAL The change to only allocate a buffer for the first xattr item with kmalloc instead of the entire logical xattr payload with vmalloc included a regression for getting large xattrs. getxattr used to copy the entire payload into the large vmalloc so it could unlock just after get_next_xattr. The change to only getting the first item buffer added a call to copy from the rest of the items but those copies weren't covered by the locks. This would often work because the lock pointer still pointed to a valid lock. But if the lock was invalidated then the mode would no longer be compatible and _item_lookup would return EINVAL. The fix is to extend xattr_rwsem and cluster lock coverage to the rest of the function body, which includes the value item copies. This also makes getxattr's lock coverage consistent with setxattr and listxattr which might reduce the risk of similar mistakes in the future. Signed-off-by: Zach Brown <zab@versity.com>	2022-04-04 12:49:50 -07:00
Zach Brown	d8231016f8	Free fewer log btree blocks per server commit After we've merged a log btree back into the main fs tree we kick off work to free all its blocks. This would fully fill the transaction's free blocks list before stopping to apply the commit. Consuming the entire free list makes it hard to have concurrent holders of a commit who also want to free things. This changes the log btree block freeing to limit itself to a fraction of the budget that each holder gets. That coarse limit avoids us having to precisely account for the allocations and frees while modifying the freeing item while still freeing many blocks per commit. Signed-off-by: Zach Brown <zab@versity.com>	2022-04-01 15:28:20 -07:00
Zach Brown	3c2b329675	Limit alloc consumption in server commits Server commits use an allocator that has a limited number of available metadata blocks and entries in a list for freed blocks. The allocator is refilled between commits. Holders can't fully consume the allocator during the commit and that tended to work out because server commit holders commit before sending responses. We'd tend to commit frequently enough that we'd get a chance to refill the allocators before they were consumed. But there was no mechanism to ensure that this would be the case. Enough concurrent server holders were able to fully consume the allocators before committing. This causes scoutfs_meta_alloc and _free to return errors, leading the server to fail in the worst cases. This changes the server commit tracking to use more robust structures which limit the number of concurrent holders so that the allocators aren't exhausted. The commit_users struct stops holders from making progress once the allocators don't have room for more holders. It also lets us stop future holders from making progress once the commit work has been queued. The previous cute use of a rwsem didn't allow for either of these protections. We don't have precise tracking of each holder's allocation consumption so we don't try and reserve blocks for each holder. Instead we have a maximum consumption per holder and make sure that all the holders can't consume the allocators if they all use their full limit. All of this requires the holding code paths to be well behaved and not use more than the per-hold limit. We add some debugging code to print the stacks of holders that were active when the total holder limit was exceeded. This is the motivation for having state in the holders. We can record some data at the time their hold started that'll make it a little easier to track down which of the holders exceeded their limit. Signed-off-by: Zach Brown <zab@versity.com>	2022-04-01 15:28:17 -07:00
Zach Brown	96ad8dd510	Add scoutfs_alloc_meta_remaining Add helper function to give the caller the number of blocks remaining in the first list block that's used for meta allocation and freeing. Signed-off-by: Zach Brown <zab@versity.com>	2022-04-01 15:21:44 -07:00
Zach Brown	44f38a31ec	Make server commit access private again There was a brief time where we exported the ability to hold and apply commits outside of the main server code. That wasn't a great idea, and the few users have since been reworked to not require directly manipulating server transactions, so we can reduce risk and make these functions private again. Signed-off-by: Zach Brown <zab@versity.com>	2022-04-01 15:21:43 -07:00
Zach Brown	bb3db7e272	Send quorum heartbeats while fencing Quorum members will try to elect a new leader when they don't receive heartbeats from the currently elected leader. This timeout is short to encourage restoring service promptly. Heartbeats are sent from the quorum worker thread and are delayed while it synchronously starts up the server, which includes fencing previous servers. If fence requests take too long then heartbeats will be delayed long enough for remaining quorum members to elect a new leader while the recently elected server is still busy fencing. To fix this we decouple server startup from the quorum main thread. Server starting and stopping becomes asynchronous so the quorum thread is able to send heartbeats while the server work is off starting up and fencing. The server used to call into quorum to clear a flag as it exited. We remove that mechanism and have the server maintain a running status that quorum can query. We add some state to the quorum work to track the asynchronous state of the server. This lets the quorum protocol change roles immediately as needed while remembering that there is a server running that needs to be acted on. The server used to also call into quorum to update quorum blocks. This is a read-modify-write operation that has to be serialized. Now that we have both the server starting up and the quorum work running they both can't perform these read-modify-write cycles. Instead we have the quorum work own all the block updates and it queries the server status to determine when it should update the quorum block to indicate that the server has fenced or shut down. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-31 10:29:43 -07:00
Zach Brown	c8d7221ec5	Show data device numbers in sysfs file It can be handy to associate mounts with their sysfs directory by their data device number. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-25 14:43:25 -07:00
Zach Brown	4e8a088cc5	Don't use vmalloc in get/set xattr Extended attribute values can be larger than a reasonable maximum size for our btree items so we store xattrs in many items. The first pass at this code used vmalloc to make it relatively easy to work with a contiguous buffer that was cut up into multiple items. The problem, of course, is that vmalloc() is expensive. Well, the problem is that I always forget just how expensive it can be and use it when I shouldn't. We had loads on high cpu count machines that were catastrophically cpu bound on all the contentious work that vmalloc does to maintain a coherent global address space. This removes the use of vmalloc and only allocates a small buffer for the first compound item. The later items directly reference regions of value buffer rather than copying it to and from the large intermediate vmalloced buffer. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-21 21:44:11 -07:00
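
The loop fix described in a23e7478a0 (clamping the block count in terms of byte offsets before the loop starts) can be illustrated with a small userspace sketch. This is not the scoutfs code: the helper name `clamp_move_blocks`, its parameters, and the macro names are hypothetical, and only the 4KB data block size comes from the commit message.

```c
#include <stdint.h>

/* 4KB data blocks, as stated in the commit message. Macro names are hypothetical. */
#define SKETCH_BLOCK_SHIFT 12

/*
 * Hypothetical helper: given a starting byte offset, a requested byte
 * length, and i_size, return how many blocks starting at off's block
 * contain at least one byte inside i_size. All clamping is done on byte
 * offsets first; the conversion to blocks happens once at the end, so
 * the caller's loop never mixes byte and block units.
 */
static uint64_t clamp_move_blocks(uint64_t off, uint64_t byte_len,
                                  uint64_t i_size)
{
    uint64_t end;

    /* Nothing to move if the range starts at or past i_size. */
    if (off >= i_size || byte_len == 0)
        return 0;

    /* Clamp the end of the range to i_size, guarding against overflow. */
    end = off + byte_len;
    if (end < off || end > i_size)
        end = i_size;

    /* end > off here, so (end - 1) is the last byte to move. */
    return ((end - 1) >> SKETCH_BLOCK_SHIFT) + 1 -
           (off >> SKETCH_BLOCK_SHIFT);
}
```

With an unaligned i_size of 5000 bytes and a much larger requested length, the sketch returns 2 blocks from offset 0 and 0 blocks from offset 8192, so a caller can stop instead of repeatedly computing a zero-length move.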

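The rule from 1cbc927ccb, "only clear the bit if we set it", can be sketched in userspace with C11 atomics. The struct, flag, and function names below are hypothetical stand-ins, not the scoutfs lock data or the kernel bit operations that try_delete_inode_items() uses.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag marking that a deletion attempt is in progress. */
#define SKETCH_TRYING_DELETE 0x1UL

/* Hypothetical stand-in for the per-lock data that holds the bit. */
struct sketch_lock_data {
    atomic_ulong flags;
};

/*
 * Atomically set the bit and report whether this caller was the one
 * that set it. A caller that sees the bit already set must back off.
 */
static bool sketch_begin_delete(struct sketch_lock_data *ld)
{
    unsigned long old = atomic_fetch_or(&ld->flags, SKETCH_TRYING_DELETE);

    return !(old & SKETCH_TRYING_DELETE);
}

/*
 * Common exit path. The bug described in the commit message was the
 * equivalent of clearing unconditionally here; the fix is to clear
 * only when this caller set the bit.
 */
static void sketch_end_delete(struct sketch_lock_data *ld, bool we_set)
{
    if (we_set)
        atomic_fetch_and(&ld->flags, ~SKETCH_TRYING_DELETE);
}
```

A second concurrent caller gets `false` from `sketch_begin_delete()` and its call to `sketch_end_delete()` leaves the bit alone, so the first task keeps its exclusion until it finishes.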