scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-04-30 09:56:55 +00:00

Author	SHA1	Message	Date
Auke Kok	0360462a35	Clear ref_blkno output when block is already dirty block_dirty_ref() skipped setting *ref_blkno when the block was already dirty, leaving the caller with a stale value. Set it to 0 on the already-dirty fast path so callers do not try to free a block that was not allocated. Signed-off-by: Auke Kok <auke.kok@versity.com>	2026-04-22 13:49:19 -07:00
Auke Kok	6ec131da03	Add cond_resched in block_free_work I'm seeing consistent CPU soft lockups in block_free_work on my bare metal system that aren't reached by VM instances. The reason is that the bare metal machine has a ton more memory available causing the block free work queue to grow much larger in size, and then it has so much work that it can take 30+ seconds before it goes through it all. This is all with a debug kernel. A non debug kernel will likely zoom through the outstanding work here at a much faster rate. Signed-off-by: Auke Kok <auke.kok@versity.com>	2026-04-22 13:49:11 -07:00
Auke Kok	d76a217ff8	Set BLOCK_BIT_ERROR on bio submit failure during forced unmount block_submit_bio will return -ENOLINK if called during a forced shutdown, the bio is never submitted, and thus no completion callback will fire to set BLOCK_BIT_ERROR. Any other task waiting for this specific bp will end up waiting forever. To fix, fall through to the existing block_end_io call on the error path instead of returning directly. That means moving the forcing_unmount check past the setup calls so block_end_io's bookkeeping stays balanced. block_end_io then sets BLOCK_BIT_ERROR and wakes up waiters just as it would on a failed async completion. Signed-off-by: Auke Kok <auke.kok@versity.com>	2026-04-22 13:49:08 -07:00
Zach Brown	6a70ee03b5	Dump block alloc stacks for leaked blocks The typical pattern of spinning isolating a list_lru results in a livelock if there are blocks with leaked refcounts. We're rarely seeing this in testing. We can have a modest array in each block that records the stack of the caller that initially allocated the block and dump that stack for any blocks that we're unable to shrink/isolate. Instead of spinning shrinking, we can give it a good try and then print the blocks that remain and carry on with unmount, leaking a few blocks. (Past events have had 2 blocks.) Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 16:16:58 -07:00
Zach Brown	89387fb192	Use list_lru for block cache shrinking The block cache had a bizarre cache eviction policy that was trying to avoid precise LRU updates at each block. It had pretty bad behaviour, including only allowing reclaim of maybe 20% of the blocks that were visited by the shrinker. We can use the existing list_lru facility in the kernel to do a better job. Blocks only exhibit contention as they're allocated and added to per-node lists. From then on we only set accessed bits and the private list walkers move blocks around on the list as we see the accessed bits. (It looks more like a fifo with lazy promotion than a "LRU" that is actively moving list items around as they're accessed.) Using the facility means changing how we remove blocks from the cache and hide them from lookup. We clean up the refcount inserted flag a bit to be expressed more as a base refcount that can be acquired by whoever's removing from the cache. It seems a lot clearer. Signed-off-by: Zach Brown <zab@versity.com>	2025-10-29 10:12:52 -07:00
Chris Kirby	c72bf915ae	Use ENOLINK as a special error code during forced unmount Tests such as quorum-heartbeat-timeout were failing with EIO messages in dmesg output due to expected errors during forced unmount. Use ENOLINK instead, and filter all errors from dmesg with this errno (67). Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-10-22 10:58:44 -07:00
Auke Kok	0b7b9d4a5e	Avoid trigger munching of block_remove_stale trigger. It's entirely likely that the trigger here is munched by a read on a dirty block from any unrelated or background read. Avoid that by putting the trigger at the end of the condition list. Now that the order is swapped, we have to avoid a null deref in block_is_dirty(bp) here, as well. Signed-off-by: Auke Kok <auke.kok@versity.com>	2025-10-06 12:27:25 -05:00
Auke Kok	546b437df7	Shrinkers are now registered with a name. v5.19-rc4-52-ge33c267ab70d Adds shrinker names to the registration call to aid with shrinker debugging, which is highly opaque. To enable you'll have to recompile the kernel with CONFIG_SHRINKER_DEBUG=y though, since it's disabled by default in OSV kernels. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	2d58ee2a37	Account for new bio_alloc() args. Block device and opf are now passed through and set. We mimic compat code to do the same. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	1f0dd7f025	__vmalloc defaults to PAGE_KERNEL everywhere, so the arg was removed. v5.7-523-g88dca4ca5a93 __vmalloc no longer has the 3rd argument. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	c30172210f	Use blk_opf_t to pass bio op flags Compat is back to unsigned int. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Auke Kok	0204e092e4	FIELD_SIZEOF was deprecated. We could use sizeof_field as a direct replacement (which is the same) except that this entire thing can directly use offsetofend(). Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Zach Brown	48716461e4	Add tracepoint as block read returns ESTALE Block reads can return ESTALE naturally as mounts read through old cached blocks. We won't always log it as an error but we should add a tracepoint that can be inspected. Signed-off-by: Zach Brown <zab@versity.com>	2024-06-10 11:03:38 -07:00
Zach Brown	cca4fcb788	Use count/scan objects shrinking interface Move to the more recent interfaces for counting and scanning cached objects to shrink. Signed-off-by: Zach Brown <zab@versity.com> Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-10-09 15:35:40 -04:00
Zach Brown	28f03d3558	Use more modern bio interfaces Move towards modern bio interfaces, while unfortunately carrying along a bunch of compat functions that let us still work with the old incompatible interfaces. Signed-off-by: Zach Brown <zab@versity.com> Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-10-09 15:35:40 -04:00
Zach Brown	4275f6e6e5	Use memalloc_nofs_save memalloc_nofs_save() was introduced as preferential to trying to use GFP flags to indicate that a task should not recurse during reclaim. We use it instead of the _noio_ we were using before. Signed-off-by: Zach Brown <zab@versity.com>	2023-10-09 15:35:40 -04:00
Zach Brown	acafb869e7	Avoid deadlock from block reclaim in rht resize The RCU hash table uses deferred work to resize the hash table. There's a time during resize when hash table iteration will return EAGAIN until resize makes more progress. During this time resize can perform GFP_KERNEL allocations. Our shrinker tries to iterate over its RCU hash table to find blocks to reclaim. It tries to restart iteration if it gets EAGAIN on the assumption that it will be usable again soon. Combine the two and our shrinker can get stuck retrying iteration indefinitely because it's shrinking on behalf of the hash table resizing that is trying to allocate the next table before making iteration work again. We have to stop shrinking in this case so that the resizing caller can proceed. Signed-off-by: Zach Brown <zab@versity.com>	2023-06-15 14:45:26 -07:00
Zach Brown	464de56d28	Add stale block read retrying helper Many readers had little implementations of the logic to decide to retry stale reads with different refs or decide that they're persistent and return hard errors. Let's move that into a small helper. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	929703213f	Add fsid sbi field A few paths throughout the code get the fsid for the current mount by using the copy of the super block that we store in the scoutfs_sb_info for the mount. We'd like to remove the super block from the sbi and it's cleaner to have a specific constant field for the fsid of the mount which will not change. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-12 14:59:22 -08:00
Zach Brown	d6bed7181f	Remove almost all interruptible waits As subsystems were built I tended to use interruptible waits in the hope that we'd let users break out of most waits. The reality is that we have significant code paths that have trouble unwinding. Final inode deletion during iput->evict in a task is a good example. It's madness to have a pending signal turn an inode deletion from an efficient inline operation to a deferred background orphan inode scan deletion. It also happens that golang built pre-emptive thread scheduling around signals. Under load we see a surprising amount of signal spam and it has created surprising error cases which would have otherwise been fine. This changes waits to expect that IOs (including network commands) will complete reasonably promptly. We remove all interruptible waits with the notable exception of breaking out of a pending mount. That requires shuffling setup around a little bit so that the first network message we wait for is the lock for getting the root inode. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	e9d04dcf8d	Add forced unmount support Add super_ops->umount_begin so that we can implement a forced unmount which tries to avoid issuing any more network or storage ops. It can return errors and lose unsynchronized data. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-26 14:02:20 -07:00
Zach Brown	841d22e26e	Disable task reclaim flags for block cache vmalloc Even though we can pass in gfp flags to vmalloc it eventually calls pte alloc functions which ignore the caller's flags and use user gfp flags. This risks reclaim re-entering fs paths during allocations in the block cache. These allocs that allowed reclaim deep in the fs were causing lockdep to add RECLAIM dependencies between locks and holler about deadlocks. We apply the same pattern that xfs does for disabling reclaim while allocating vmalloced block payloads. Setting PF_MEMALLOC_NOIO causes reclaim in that task to clear __GFP_IO and __GFP_FS, regardless of the individual allocation flags in the task, preventing recursion. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	accd680a7e	Fix block setup always returning 0 Another case of returning 0 instead of ret. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 12:10:35 -07:00
Zach Brown	c3290771a0	Block cache use rht _lookup_ insert for EEXIST The sneaky rhashtable_insert_fast() can't return -EEXIST despite the last line of the function REALLY making it look like it can. It just inserts new objects at the head of the bucket lists without comparing the insertion with existing objects. The block cache was relying on insertion to resolve duplicate racing allocated blocks. Because it couldn't return -EEXIST we could get duplicate cached blocks present in the hash table. rhashtable_lookup_insert_fast() fixes this by actually comparing the inserted objects key with the objects found in the insertion bucket. A racing allocator trying to insert a duplicate cached block will get an error, drop their allocated block, and retry their lookup. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 09:24:23 -07:00
Zach Brown	cf3cb3f197	Wait for rhashtable to rehash on insert EBUSY The rhashtable can return EBUSY if you insert fast enough to trigger an expansion of the next table size that is waiting to be rehashed in an rcu callback. If we get EBUSY from rhashtable_insert we call synchronize_rcu to wait for the rehash to complete before trying again. This was hit in testing restores of a very large namespace and took a few hours to hit. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 09:24:23 -07:00
Zach Brown	9ee7f7b9dc	Block cache shrink restart waits for rcu callbacks We're seeing cpu livelocks in block shrinking where counters show that a single block cache shrink call is only getting EAGAIN from repeated rhashtable walk attempts. It occurred to me that the running task might be preventing an RCU grace period from ending by never blocking. The hope of this commit is that by waiting for rcu callbacks to run we'll ensure that any pending rebalance callback runs before we retry the rhashtable walk again. I haven't been able to reproduce this easily so this is a stab in the dark. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-07 12:50:50 -07:00
Zach Brown	2d393f435b	Warn on leaked block refs on unmount By the time we get to destroying the block cache we should have put all our block references. Warn as we tear down the blocks if we see any blocks that still have references, implying a ref leak. This caught a leak caused by srch compaction forgetting to put allocator list block refs. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-01 13:04:06 -07:00
Zach Brown	208c51d1d2	Update stale block reading test The previous test that triggered re-reading blocks, as though they were stale, was written in the era where it only hit btree blocks and everything else was stored in LSM segments. This reworks the test to make it clear that it affects all our block readers today. The test only exercises the core read retry path, but it could be expanded to test callers retrying with newer references after they get -ESTALE errors. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:50:00 -08:00
Zach Brown	9450959ca4	Protect stale block readers from local dirtying Our block cache consistency mechanism allows readers to try and read stale block references. They check block headers of the block they read to discover if it has been modified and they should retry the read with newer block references. For this to be correct the block contents can't change under the readers. That's obviously true in the simple imagined case of one node writing and another node reading. But we also have the case where the stale reader and dirtying writer can be concurrent tasks in the same mount which share a block cache. There were two failure cases that derive from the order of readers and writers working with blocks. If the reader goes first, the writer could find the existing block in the cache and modify it while the reader assumes that it is read only. The fix is to have the writer always remove any existing cached block and insert a newly allocated block into the cache with the header fields already changed. Any existing readers will still have their cached block references and any new readers will see the modified headers and return -ESTALE. The next failure comes from readers trying to invalidate dirty blocks when they see modified headers. They assumed that the existing cached block was old and could be dropped so that a new current version could be read. But in this case a local writer has clobbered the reader's stale block and the reader should immediately return -ESTALE. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:49:59 -08:00
Zach Brown	6237f0adc5	Add _block_dirty_ref to dirty blocks in one place To create dirty blocks in memory each block type caller currently gets a reference on a created block and then dirties it. The reference it gets could be an existing cached block that stale readers are currently using. This creates a problem with our block consistency protocol where writers can dirty and modify cached blocks that readers are currently reading in memory, leading to read corruption. This commit is the first step in addressing that problem. We add a scoutfs_block_dirty_ref() call which returns a reference to a dirtied block from the block core in one call. We're only changing the callers in this patch but we'll be reworking the dirtying mechanism in an upcoming patch to avoid corrupting readers. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:49:17 -08:00
Zach Brown	0969a94bfc	Check one block_ref struct in block core Each of the different block types had a reading function that read a block and then checked their reference struct for their block type. This gets rid of each block reference type and has a single block_ref type which is then checked by a single ref reading function in the block core. By putting ref checking in the core we no longer have to export checking the block header crc, verifying headers, invalidating blocks, or even reading raw blocks themselves. Everyone reads refs and leaves the checking up to the core. The changes don't have a significant functional effect. This is mostly just changing types and moving code around. (There are some changes to visible counters.) This shares code, which is nice, but this is putting the block reference checking in one place in the block core so that in a few patches we can fix problems with writers dirtying blocks that are being read. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:49:17 -08:00
Zach Brown	b1b75cbe9f	Fix block cache shrink and read racing crash The block cache wasn't safely racing readers walking the rcu radix_tree and the shrinker walking the LRU list. A reader could get a reference to a block that had been removed from the radix and was queued for freeing. It'd clobber the free's llist_head union member by putting the block back on the lru and both the read and free would crash as they each corrupted each other's memory. We rarely saw this in heavy load testing. The fix is to clean up the use of rcu, refcounting, and freeing. First, we get rid of the LRU list. Now we don't have to worry about resolving racing accesses of blocks between two independent structures. Instead of shrinking walking the LRU list, we can mark blocks on access such that shrinking can walk all blocks randomly and expect to quickly find candidates to shrink. To make it easier to concurrently walk all the blocks we switch to the rhashtable instead of the radix tree. It also has nice per-bucket locking so we can get rid of the global lock that protected the LRU list and radix insertion. (And it isn't limited to 'long' keys so we can get rid of the check for max meta blknos that couldn't be cached.) Now we need to tighten up when read can get a reference and when shrink can remove blocks. We have presence in the hash table hold a refcount but we make it a magic high bit in the refcount so that it can be differentiated from other references. Now lookup can atomically get a reference to blocks that are in the hash table, and shrinking can atomically remove blocks when it is the only other reference. We also clean up freeing a bit. It has to wait for the rcu grace period to ensure that no other rcu readers can reference the blocks it's freeing. It has to iterate over the list with _safe because it's freeing as it goes. Interestingly, when reworking the shrinker I noticed that we weren't scaling the nr_to_scan from the pages we returned in previous shrink calls back to blocks. We now divide the input from pages back into blocks. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:49:15 -08:00
Andy Grover	cf278f5fa0	scoutfs: Tidy some enum usage Prefer named to anonymous enums. This helps readability a little. Use enum as param type if possible (a couple spots). Remove unused enum in lock_server.c. Define enum spbm_flags using shift notation for consistency. Rename get_file_block()'s "gfb" parameter to "flags" for consistency. Signed-off-by: Andy Grover <agrover@versity.com>	2020-11-30 13:35:44 -08:00
Andy Grover	9f151fde92	scoutfs: Use separate block devices for metadata and data Require a second path to metadata bdev be given via mount option. Verify meta sb matches sb also written to data sb. Change code as needed in super.c to allow both to be read. Remove check for overlapping meta and data blknos, since they are now on entirely separate bdevs. Use meta_bdev for superblock, quorum, and block.c reads and writes. Signed-off-by: Andy Grover <agrover@versity.com>	2020-11-19 11:41:20 -08:00
Zach Brown	f48112e2a7	scoutfs: allocate contig block pages with nowarn We first attempt to allocate our large logically contiguous cached blocks with physically contiguous pages to minimize the impact on the tlb. When that fails we fall back to vmalloc()ed blocks. Sadly, high-order page allocation failure is expected and we forgot to provide the flag that suppresses the page allocation failure message. Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:12 -07:00
Zach Brown	0a47e8f936	Revert "scoutfs: add block visited bit" The radix allocator no longer uses the block visited bit because it maintains its own much richer private per-block data stored off the priv pointer. Signed-off-by: Zach Brown <zab@versity.com> This reverts commit 294b6d1f79e6d00ba60e26960c764d10c7f4b8a5.	2020-08-26 14:39:12 -07:00
Zach Brown	e6ae397d12	Revert "scoutfs: switch block cache to rbtree" We had switched away from the radix_tree because we were adding a _block_move call which couldn't fail. We no longer need that call, so we can go back to storing cached blocks in the radix tree which can use RCU lookups. This revert has some conflict resolution around recent commits to add the IO_BUSY block flag and the switch to _LG_ blocks. This reverts commit 10205a5670dd96af350cf481a3336817871a9a5b. Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:12 -07:00
Zach Brown	e5f5ee2679	Revert "scoutfs: add scoutfs_block_move" We added _block_move for the radix allocator, but it no longer needs it. This reverts commit 6bb0726689981eb9699296ae2cb4c8599add5b90.	2020-08-26 14:39:12 -07:00
Zach Brown	177af7f746	scoutfs: use larger metadata blocks Introduce different constants for small and large metadata block sizes. The small 4KB size is used for the super block, quorum blocks, and as the granularity of file data block allocation. The larger 64KB size is used for the radix, btree, and forest bloom metadata block structures. The bulk of this are obvious transitions from the old single constant to the appropriate new constant. But there are a few more involved changes, though just barely. The block crc calculation now needs the caller to pass in the size of the block. The radix function to return free bytes instead returns free blocks and the caller is responsible for knowing how big its managed blocks are. Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:12 -07:00
Zach Brown	44ac668afa	scoutfs: add small private block io read and write Add two quick functions which perform IO on small fixed size 4K blocks to or from the caller's buffer with privately allocated pages and bios. Callers have no interaction with each other. This matches the behaviour expected by callers of scoutfs_read_super and _write_super. Signed-off-by: Zach Brown <zab@versity.com>	2020-04-07 14:06:00 -07:00
Zach Brown	757ee85520	scoutfs: don't lose block wakeups The block end_io path could lose wakeups. Both the bio submission task and a bio's end_io completion could see an io_count > 1 and neither would set the block uptodate before dropping their io_count and waking. It got into this mess because readers were waiting for io_count to drop to 0. We add a io_busy bit which indicates that io is still in flight which waiters now wait for. This gives the final io_count drop a chance to do work before clearing io_busy and dropping their reference before waking. Signed-off-by: Zach Brown <zab@versity.com>	2020-02-28 11:34:02 -08:00
Zach Brown	8681f920e0	scoutfs: add scoutfs_block_move Add a call to move a block's location in the cache without failure. The radix allocator is going to use this to dirty radix blocks while making atomic changes to multiple paths through multiple radix trees. Signed-off-by: Zach Brown <zab@versity.com>	2020-02-25 12:03:46 -08:00
Zach Brown	809d4be58e	scoutfs: switch block cache to rbtree Switch the block cache from indexing blocks in a radix tree to using an rbtree. We lose the RCU lookups but we gain being able to move blocks around in the cache without allocation failure. And we no longer have the problem of not being able to index large blocks with a 32bit long radix key. Signed-off-by: Zach Brown <zab@versity.com>	2020-02-25 12:03:46 -08:00
Zach Brown	05a8573054	scoutfs: add block visited bit Add functions for callers to maintain a visited bit in cached blocks. The radix allocator is going to use this to count the number of clean blocks it sees across paths through the radix which can share parent blocks. Signed-off-by: Zach Brown <zab@versity.com>	2020-02-25 12:03:46 -08:00
Zach Brown	986e66d6c6	scoutfs: add block tracing Add tracing of operations on our block cache. Signed-off-by: Zach Brown <zab@versity.com>	2020-01-17 11:21:36 -08:00
Zach Brown	d20c950c17	scoutfs: restore our block cache Previous versions of the system had a simple block cache. This brings it back with support for blocks that are larger than page size, a more efficient LRU, and an explicit writer context. Signed-off-by: Zach Brown <zab@versity.com>	2020-01-17 11:21:36 -08:00
Zach Brown	e19716a0f2	scoutfs: clean up super block use The code that works with the super block had drifted a bit. We still had two from an old design and we weren't doing anything with its crc. Move to only using one super block at a fixed blkno and store and verify its crc field by sharing code with the btree block checksumming. Signed-off-by: Zach Brown <zab@versity.com>	2018-06-29 15:56:42 -07:00
Zach Brown	97cb75bd88	Remove dead btree, block, and buddy code Remove all the unused dead code from the previous btree block design. Signed-off-by: Zach Brown <zab@versity.com>	2017-04-18 13:44:55 -07:00
Zach Brown	6fd5396fbe	Add block cache shrinker Now that we have our own allocated block cache struct we need to add a shrinker so that it's reclaimed under memory pressure. We keep clean blocks in a simple lru list that the shrinker walks to free the oldest blocks. Signed-off-by: Zach Brown <zab@versity.com> Reviewed-by: Mark Fasheh <mfasheh@versity.com>	2016-11-16 14:45:07 -08:00
Zach Brown	d71f7a24ec	Don't check meta seq before locking The block lock functions were trying to compare the block header seq and the super seq to decide if the block is stable and if it should lock, or not. Readers trying to lock race with transaction commits. Transaction commit can update the super after the reader locks and before it unlocks. The unlock will then fail the test and fail to unlock. fsstress triggered this in xfstests generic/013. Instead we can always acquire the read lock on stable blocks. We'll be bouncing the rwsem cacheline around like the refcount cacheline. If this is a problem we can carefully maintain bits in the block to safely indicate if it should be locked or unlocked but let's not go there if we don't have to. Signed-off-by: Zach Brown <zab@versity.com> Reviewed-by: Mark Fasheh <mfasheh@versity.com>	2016-11-16 14:45:07 -08:00
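
Reading aid for commit b1b75cbe9f in the log above: the message describes letting presence in the hash table hold a reference expressed as a "magic high bit" in the refcount, so that lookup only takes a reference while the block is still inserted and shrink only removes a block that has no other references. The following is a minimal userspace sketch of that idea using C11 atomics. It is not the scoutfs implementation; the names `BLOCK_REF_INSERTED`, `struct block`, and all four functions are hypothetical, and the choice of bit 31 is an assumption made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical: high bit is set while the block is present in the hash table. */
#define BLOCK_REF_INSERTED (1U << 31)

struct block {
	atomic_uint refcount;
};

/* A freshly inserted block holds only the table's "inserted" reference. */
static void block_init_inserted(struct block *bp)
{
	atomic_init(&bp->refcount, BLOCK_REF_INSERTED);
}

/* Lookup: take a reference only while the inserted bit is still set. */
static bool block_get_if_inserted(struct block *bp)
{
	unsigned int cur = atomic_load(&bp->refcount);

	while (cur & BLOCK_REF_INSERTED) {
		if (atomic_compare_exchange_weak(&bp->refcount, &cur, cur + 1))
			return true;
		/* a failed exchange reloaded cur; the loop rechecks the bit */
	}
	return false;
}

/* Drop a reference that was taken by lookup. */
static void block_put(struct block *bp)
{
	unsigned int old = atomic_fetch_sub(&bp->refcount, 1);

	/* there must have been at least one non-table reference to drop */
	assert((old & ~BLOCK_REF_INSERTED) > 0);
}

/* Shrink: remove only when the table's bit is the sole remaining reference. */
static bool block_try_remove(struct block *bp)
{
	unsigned int expected = BLOCK_REF_INSERTED;

	return atomic_compare_exchange_strong(&bp->refcount, &expected, 0);
}
```

In this sketch a reader that loses the race with `block_try_remove()` sees the cleared bit and gets `false` from `block_get_if_inserted()`, so it never revives a block that is already queued for freeing, which is the crash the commit message describes. The real code additionally defers the free past an RCU grace period, which this sketch does not model.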

73 Commits