scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-06-02 17:56:20 +00:00

Author	SHA1	Message	Date
Zach Brown	5bc95fac7d	Add scoutfs_unmounting() Add a quick helper that can be used to avoid doing work if we know that we're already shutting down. This can be a single coarser indicator than adding functions to each subsystem to track that we're shutting down. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	36fcc4665d	Align first free ino to lock group Currently the first inode number that can be allocated directly follows the root inode. This means the first batch of allocated inodes are in the same lock group as the root inode. The root inode is a bit special. It is always hot as absolute path lookups and inode-to-path resolution always read directory entries from the root. Let's try aligning the first free inode number to the next inode lock group boundary. This will stop work in those inodes from necessarily conflicting with work in the root inode. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	b0a08eb922	Remove lock grace period We had some logic to try and delay lock invalidation while the lock was still actively in use. This was trying to reduce the cost of pathological lock conflict cases but it had some severe fairness problems. It was first introduced to deal with bad patterns in userspace that no longer exist and it was built on top of the LSM transaction machinery that also no longer exists. It hasn't aged well. Instead of introducing invalidation latency in the hopes that it leads to more batched work, which it can't always, let's aim more towards reducing latency in all parts of the write-invalidate-read path and also aim towards reducing contention in the first place. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	bb571377dc	Don't merge newer items past older We have a problem where items can appear to go backwards in time because of the way we chose which log btrees to finalize and merge. Because we don't have versions in items in the fs_root, and even might not have items at all if they were deleted, we always assume items in log btrees are newer than items in the fs root. This creates the requirement that we can't merge a log btree if it has items that are also present in older versions in other log btrees which are not being merged. The unmerged old item in the log btree would take precedence over the newer merged item in the fs root. We weren't enforcing this requirement at all. We used the max_item_seq to ensure that all items were older than the current stable seq but that says nothing about the relationship between older items in the finalized and active log btrees. Nothing at all stops an active btree from having an old version of a newer item that is present in another mount's finalized log btree. To reliably fix this we create a strict item seq discontinuity between all the finalized merge inputs and all the active log btrees. Once any log btree is naturally finalized the server forces all the clients to group up and finalize all their open log btrees. A merge operation can then safely operate on all the finalized trees before any new trees are given to clients who would start using increasing item seqs. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	5897f4d889	Add a trivial trace_printk wrapper Make it a bit easier to include the fsid and rid in trace_printk messages when we're experimenting. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:12:20 -07:00
Zach Brown	999093bfc9	Add sync log trees network command Add a command for the server to request that clients commit their open transaction. This will be used to create groups of finalized log btrees for consistent merging. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:12:17 -07:00
Zach Brown	05b5d93365	Verify that quorum_slot_nr references valid slot We were checking that quorum_slot_nr was within the range of possible slots allowed by the format as it was parsed. We weren't checking that it referenced a configured slot. Make sure, and give a nice error message that shows the configured slots. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:11:40 -07:00
Zach Brown	4d7191dc48	Print messages on extent ins/rem errors Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:11:40 -07:00
Zach Brown	4495dbdce6	Set initial quorum term from max of all blocks During rough forced unmount testing we saw a seemingly mysterious concurrent election. It could be explained if mounts coming up don't start with the same term. Let's try having mounts initialize their term to the greatest of all the terms they can see in the quorum blocks. This will prevent the situation where some new quorum actors with greater terms start out ignoring all the messages from others. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:11:40 -07:00
Zach Brown	70569b0448	Trivial quorum test;set -> test_and_set Nothing interesting here, just a minor convenience to use test and set instead of testing and then setting. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:11:40 -07:00
Zach Brown	823838cf01	Add more messages to server processing errors The server doesn't give us much to go on when it gets an error handling requests to work with log trees from the client. This adds a lot of specific error messages so we can get a better understanding of failures. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-24 09:11:40 -07:00
Zach Brown	89b5865a4c	Verify that log tree commit is for sending rid We were trusting the rid in the log trees struct that the client sent. Compare it to our recorded rid on the connection and fail if the client sent the wrong rid. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-17 12:13:01 -07:00
Zach Brown	7cf9cd8c20	Merge pull request #48 from versity/zab/missed_invalidate_wakeup Queue invalidation during previous request	2021-08-09 09:50:39 -07:00
Zach Brown	65ac42831f	Queue invalidation during previous request The locking protocol only allows one outstanding invalidation request for a lock at a time. The client invalidation state is a bit hairy and involves removing the lock from the invalidation list while it is being processed which includes sending the response. This means that another request can arrive while the lock is not on the invalidation list. We have fields in the lock to record another incoming request which puts the lock back on the list. But the invalidation work wasn't always queued again in this case. It looks like the incoming request path would queue the work, but by definition the lock isn't on the invalidation list during this race. If it's the only lock in play then the invalidation list will be empty and the work won't be queued. The lock can get stuck with a pending invalidation if nothing else kicks the invalidation worker. We saw this in testing when the root inode lock group missed the wakeup. The fix is to have the work requeue itself after putting the lock back on the invalidation list when it notices that another request came in. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-06 15:41:11 -07:00
Zach Brown	dde6dab0a1	Merge pull request #47 from versity/zab/stability_fixes Zab/stability fixes	2021-08-02 12:22:44 -07:00
Zach Brown	cb1726681c	Fix net BUG_ON if reconnection farewell send races When a client socket disconnects we save the connection state to re-use later if the client reconnects. A newly accepted connection finds the old connection associated with the reconnecting client and migrates state from the old idle connection to the newly accepted connection. While moving messages between the old and new send and resend queues the code had an aggressive BUG_ON that was asserting that the newly accepted connection couldn't have any messages in its resend queue. This BUG can be tripped due to the ordering of greeting processing and connection state migration. The server greeting processing path sends the farewell response to the client before it calls the net code to migrate connection state. When it "sends" the farewell response it puts the message on the send queue and kicks the send work. It's possible for the send work to execute and move the farewell response to the resend queue and trip the BUG_ON. This is harmless. The sent greeting response is going to end up on the resend queue either way, there's no reason for the reconnection migration to assert that it can't have happened yet. It is going to be dropped the moment we get a message from the client with a recv_seq that is necessarily past the greeting response which always gets a seq of 1 from the newly accepted connection. We remove the BUG_ON and try to splice the old resend queue after the possible response at the head of the resend_queue so that it is the first to be dropped. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-02 11:15:57 -07:00
Zach Brown	cdff272163	Fix alloc list exhaustion calculation The last thing server commits do is move extents from the freed list into freed extents. It moves as many as it can until it runs out of avail meta blocks and space for freed meta blocks in the current allocator's lists. The calculation for whether the lists had resources to move an extent was quite off. It missed that the first move might have to dirty the current allocator or the list block, that the btree could join/split blocks at each level down the paths, and boy does it look like the height component of the calculation was just bonkers. With the wrong calculation the server could overflow the freed list while moving extents and trigger a BUG_ON. We rarely saw this in testing. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-01 14:31:57 -07:00
Zach Brown	7e935898ab	Avoid premature metadata enospc server_get_log_trees() sets the low flag in a mount's meta_avail allocator, triggering enospc for any space consuming allocations in the mount, if the server's global meta_avail pool falls below the reserved block count. Before each server transaction opens we swap the global meta_avail and meta_freed allocators to ensure that the transaction has at least the reserved count of blocks available. This creates a risk of premature enospc as the global meta_avail pool drains and swaps to the larger meta_freed. The pool can be close to the reserved count, perhaps at it exactly. _get_log_trees can fill the client's mount, even a little, and drop the global meta_avail total under the reserved count, triggering enospc, even though meta_freed could have had quite a lot of blocks. The fix is to ensure that the global meta_avail has 2x the reserved count and swapping if it falls under that. This ensures that a server transaction can consume an entire reserved count and still have enough to avoid triggering enospc. This fixes a scattering of rare premature enospc returns that were hitting during tests. It was rare for meta_avail to fall just at the reserved count and for get_log_trees to have to refill the client allocator, but it happened. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:26:32 -07:00
Zach Brown	6d0694f1b0	Add resize_devices ioctl and scoutfs command Add a scoutfs command that uses an ioctl to send a request to the server to safely use a device that has grown. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:26:32 -07:00
Zach Brown	fd686cab86	Fix total_data_blocks calculation in mkfs mkfs was incorrectly initializing total_data_blocks. The field is meant to record the number of blocks from the start of the device that the filesystem could access. mkfs was subtracting the initial reserved area of the device, meaning the number of blocks that the filesystem might access. This could allow accesses past devices if mount checks the device size against the smaller total_data_blocks. And we're about to use total_data_blocks as the start of a new extent to add when growing the volume. It needs to be fixed so that this new grown free extent doesn't overlap with the end of the existing free extents. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:26:32 -07:00
Zach Brown	4c1181c055	Remove first_ and last_ super blkno fields There are fields in the super block that specify the range of blocks that would be used for metadata or data. They are from the time when a single block device was carved up into regions for metadata and data. They don't make sense now that we have separate metadata and data block devices. The starting blkno is static and we go to the end of the device. This removes the fields now that they serve no purpose. The only use of them to check that freed extents fell within the correct bounds can still be performed by using the static starting number or roughly using the size of the devices. It's not perfect, but this is already only a check to see that the blknos aren't utter nonsense. We're removing the fields now to avoid having to update them while worrying about users when resizing devices. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	d6bed7181f	Remove almost all interruptible waits As subsystems were built I tended to use interruptible waits in the hope that we'd let users break out of most waits. The reality is that we have significant code paths that have trouble unwinding. Final inode deletion during iput->evict in a task is a good example. It's madness to have a pending signal turn an inode deletion from an efficient inline operation to a deferred background orphan inode scan deletion. It also happens that golang built pre-emptive thread scheduling around signals. Under load we see a surprising amount of signal spam and it has created surprising error cases which would have otherwise been fine. This changes waits to expect that IOs (including network commands) will complete reasonably promptly. We remove all interruptible waits with the notable exception of breaking out of a pending mount. That requires shuffling setup around a little bit so that the first network message we wait for is the lock for getting the root inode. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	4893a6f915	scoutfs_dirents_equal should return bool It looks like it returned u64 because it was derived from _name_hash(). Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	384590f016	Sync net shouldn't wait for errored submits If async network request submission fails then the response handler will never be called. The sync request wrapper made the mistake of trying to wait for completion when initial submission failed. This never happened in normal operation but we're able to trigger it with some regularity with forced unmount during tests. Unmount would hang waiting for work to shutdown which was waiting for request responses that would never happen. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	192f077c16	Update data_version when fallocate changes size Changing the file size can change the file contents -- reads will change when they stop returning data. fallocate can change the file size and if it does it should increment the data_version, just like setattr does. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	a9baeab22e	stage_tmpfile test gets current data_version The stage_tmpfile test util was written when fallocate didn't update data_version for size extensions. It is more correct to get the data_version after fallocate changes data_versions for however many transactions, extent allocations, and i_size extensions it took to allocate space. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	b7ab26539a	Avoid lockdep warning about upstream inversion Some kernels have blkdev_reread_part acquire the bd_mutex and then call into drop_partitions which calls fsync_bdev which acquires s_umount. This inverts the usual pattern of deactivate_super getting s_umount and then using blkdev_put in kill_sb->put_super to drop a second device. The inversion has been fixed upstream by years of rewrites. We can't go back in time to fix the kernels that we're testing against, unfortunately, so we disable lockdep around our valid leg of the inversion that lockdep is noticing in our testing. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	c51f0c37da	Defer dirty inode data writeback (and use list) iput() can only be used in contexts that could perform final inode deletion which requires cluster locks and transactions. This is absolutely true for the transaction committing worker. We can't have deletion during transaction commit trying to get locks and dirty more items in the transaction. Now that we're properly getting locks in final inode deletion and O_TMPFILE support has put pressure on deletion, we're seeing deadlocks between inode eviction during transaction commit getting a index lock and index lock invalidation trying to commit. We use the newly offered queued iput to defer the iput from walking our dirty inodes. The transaction commit will be able to proceed while the iput worker is off waiting for a lock. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:20:40 -07:00
Zach Brown	52107424dd	Promote deferred iput to inode call Lock invalidation had the ability to kick iput off to work context. We need to use it for inode writeback as well so we move the mechanism over to inode.c and give it a proper call. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	099a65ab07	Try recovering from truncate errors and more info We're seeing errors during truncate that are surprising. Let's try and recover from them and provide more info when they happen so that we can dig deeper. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	21c5724dd5	Update fenced service file StartLimitBurst The first draft was written against an older schema, StartLimitBurst is in [Service] now. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	3974d98f6b	Don't use "/dev/*" redirections near systemd It sets up stdout and stderr as sockets, not pipes, so these links don't work. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	2901b43906	Also allow omap requests to disconnected clients We recently fixed problems sending omap responses to originating clients which can race with the clients disconnecting. We need to handle the requests sent to clients on behalf of an origination request in exactly the same way. The send can race with the client being evicted. It'll be cleaned after the race is safely ignored by the client's rid being removed from the server's request tracking. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	03d7a4e7fe	Show relative times in quorum status file output The times in the quorum status file are in absolute monotonic kernel time since bootup. That's not particularly helpful especially when comparing across hosts with different boot times. This shows relative times in timespec64 seconds until or since the times in question. While we're at it we also collect the send and receive timestamps closer to each send or receive call. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	d5d3b12986	Specifically shutdown quorum during forced unmount Generally, forced unmount works by returning errors for all IO. Quorum is pretty resilient in that it can have the IO errors eaten by server startup and does its own messaging that won't return errors. Trying to force unmount can have the quorum service continually participate in electing a server that immediately fails and shuts down. This specifically shuts down the internal quorum service when it sees that unmount is being forced. This is easier and cleaner than having the network IO return errors and then having that trigger shutdown. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	e4dca8ddcc	Don't shutdown quorum if server startup fails The quorum service shuts down if it sees errors that mean that it can't do its job. This is mostly fatal errors gathering resources at startup or runtime IO errors but it was also shutting down if server startup fails. That's not quite right. This should be treated like the server shutting down on errors. Quorum needs to stay around to participate in electing the next server. Fence timeouts could trigger this. A quorum mount could crash, the next server without a fence script could have a fence request timeout and shutdown, and now the third remaining server is left to indefinitely send vote requests into the void. With this fixed, continuing that example, the quorum service in the second mount remains to elect the third server with a working fence script after the second server shuts down after its fence request times out. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	011b7d52e5	Merge pull request #45 from versity/ben/systemd_configs Add fenced systemd and example configs	2021-07-09 08:39:18 -07:00
Ben McClelland	3a9db45194	Add fenced systemd and example configs This should be good enough to get single node mounts up and running with fenced with minimal effort. The example config will need to be copied to /etc/scoutfs/scoutfs-fenced.conf for it to be functional, so this still requires specific opt-in and won't accidentally run for multi-node systems. Signed-off-by: Ben McClelland <ben.mcclelland@versity.com>	2021-07-09 08:22:39 -07:00
Zach Brown	53f11f5479	Merge pull request #46 from versity/zab/orphan_deletion_and_enospc Zab/orphan deletion and enospc	2021-07-08 10:52:53 -07:00
Zach Brown	b4ede2ac6a	Allow omap responses to disconnected originators The omap message lifecycle is a little different than the server's usual handling that sends a response from the request handler. The response is sent long after the initial receive handler is pinning the connection to the client. It's fine for the response to be dropped. The main server request handler handled this case but other response senders didn't. Put this error handling in the server response sender itself so that all callers are covered. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-08 09:36:07 -07:00
Zach Brown	cbe8d77f78	Prevent duplicate inode item deletion We hide I_FREEING inodes from inode lookup to avoid inversions with cluster locking. This can result in duplicate inode structs for a given inode number. They can both race to try and delete the same items for their shared inode number. This leads to error messages from evict_inode and could lead to corruption if they, for example, both try and free the same data extents. This adds very basic serialization so only one instance can try to delete items at a time. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	5f682dabb5	Item cache invalidation uses seqs to avoid readers The item cache has to be careful not to insert stale read items when previously dirty items have been written and invalidated while a read was in flight. This was previously done by recording the possible range of items that a reader could see based on the key range of its lock. This is disastrous when a workload operates entirely within one lock. I ran into this when testing a small number of files with massive amounts of xattrs. While any reader is in flight all pages can't be invalidated because they all intersect with the one lock that covers all the items in use. The fix is to more naturally reflect the problem by tracking the greatest item seq in pages and the earliest seq that any readers can't see. This lets invalidate only skip pages with items that weren't visible to the earliest reader. This more naturally reflects that the problem is due to the age of the items, not their position in the key space. Now only a few of the most recently modified pages could be skipped and they'll be at the end of the LRU and won't typically be visited. As an added benefit it's now much cheaper to add, delete, and test the active readers. This fix stopped rm -rf of a full system's worth of xattrs from taking minutes constantly spinning skipping all pages in the LRU to seconds of doing real removal work. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	120c2d342a	Add create_xattr_loop test tool Add a quick tool that creates xattrs in a tight loop. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	84454b38c5	Add mkfs -A for small device sizes Normally mkfs would fail if we specify meta or data devices that are too small. We'd like to use small devices for test scenarios, though, so add an option to allow specifying sizes smaller than the minimum required sizes. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	29cfa81574	Remove unused leftovers from quorum changes These forward declarations were for interfaces that have since been removed or changed and are no longer needed. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	73bf916182	Return ENOSPC as space gets low Returning ENOSPC is challenging because we have clients working on allocators which are a fraction of the whole and we use COW transactions so we need to be able to allocate to free. This adds support for returning ENOSPC to client posix allocators as free space gets low. For metadata, we reserve a number of free blocks for making progress with client and server transactions which can free space. The server sets the low flag in a client's allocator if we start to dip into reserved blocks. In the client we add an argument to entering a transaction which indicates if we're allocating new space (as opposed to just modifying existing data or freeing). When an allocating transaction runs low and the server low flag is set then we return ENOSPC. Adding an argument to transaction holders and having it return ENOSPC gave us the opportunity to clean it up and make it a little clearer. More work is done outside the wait_event function and it now specifically waits for a transaction to cycle when it forces a commit rather than spinning until the transaction worker acquires the lock and stops it. For data the same pattern applies except there are no reserved blocks and we don't COW data so it's a simple case of returning the hard ENOSPC when the data allocator flag is set. The server needs to consider the reserved count when refilling the client's meta_avail allocator and when swapping between the two meta_avail and meta_free allocators. We add the reserved metadata block count to statfs_more so that df can subtract it from the free meta blocks and make it clear when enospc is going to be returned for metadata allocations. We increase the minimum device size in mkfs so that small testing devices provide sufficient reserved blocks. And finally we add a little test that makes sure we can fill both metadata and data to ENOSPC and then recover by deleting what we filled. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	9db3b475c0	Stop log merge work earlier during unmount The forest log merge work calls into the client to send commit requests to the server. The forest is usually destroyed relatively late in the sequence and can still be running after the client is destroyed. Adding a _forest_stop call lets us stop the log merging work before the client is destroyed. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:54:56 -07:00
Zach Brown	24d682bf81	Add orphan-inodes test Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:54:56 -07:00
Zach Brown	2957f3e301	Avoid warnings when evict has signals pending Killing a task can end up in evict and break out of acquiring the locks to perform final inode deletion. This isn't necessarily fatal. The orphan task will come around and will delete the inode when it is truly no longer referenced. So let's silence the error and keep track of how many times it happens. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:54:56 -07:00
Zach Brown	07210b5734	Reliably delete orphaned inodes Orphaned items haven't been deleted for quite a while -- the call to the orphan inode scanner has been commented out for ages. The deletion of the orphan item didn't take rid zone locking into account as we moved deletion from being strictly local to being performed by whoever last used the inode. This reworks orphan item management and brings back orphan inode scanning to correctly delete orphaned inodes. We get rid of the rid zone that was always _WRITE locked by each mount. That made it impossible for other mounts to get a _WRITE lock to delete orphan items. Instead we rename it to the orphan zone and have orphan item callers get _WRITE_ONLY locks inside their inode locks. Now all nodes can create and delete orphan items as they have _WRITE locks on the associated inodes. Then we refresh the orphan inode scanning function. It now runs regularly in the background of all mounts. It avoids creating cluster lock contention by finding candidates with unlocked forest hint reads and by testing inode caches locally and via the open map before properly locking and trying to delete the inode's items. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:52:46 -07:00
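
The reasoning in commit 7e935898ab ("Avoid premature metadata enospc") can be illustrated with a small arithmetic model. This is only a sketch under stated assumptions, not scoutfs code: the function name `low_flag_after_refill`, the `swap_factor` parameter, and the block counts are all hypothetical, and the model assumes the pools are swapped whenever meta_avail is under the threshold. With a factor of 1 (swap only below the reserved count) a pool sitting exactly at the reserved count is not swapped and a small client refill trips the low flag; with a factor of 2 the swap happens first and the flag stays clear.

```python
def low_flag_after_refill(meta_avail, meta_freed, reserved, refill, swap_factor):
    """Return True if the client's allocator would be flagged low (ENOSPC).

    Hypothetical model, not scoutfs code: all counts are in blocks.
    """
    # Before the server transaction opens, swap the two global pools when
    # meta_avail is under the swap threshold (swap_factor * reserved).
    if meta_avail < swap_factor * reserved:
        meta_avail, meta_freed = meta_freed, meta_avail

    # Refilling the client's allocator drains the global meta_avail pool.
    meta_avail -= refill

    # The low flag is set when the global pool falls below the reserved count.
    return meta_avail < reserved


reserved = 100
# meta_avail sits exactly at the reserved count while meta_freed is large.
old = low_flag_after_refill(100, 5000, reserved, refill=10, swap_factor=1)
new = low_flag_after_refill(100, 5000, reserved, refill=10, swap_factor=2)
print(old, new)  # True False
```

In this model the old threshold reports low space even though 5000 blocks sit in meta_freed, which matches the premature enospc the commit message describes.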


1470 Commits