Commit Graph

459 Commits

Author SHA1 Message Date
Mark Fasheh
c5e6676b04 scoutfs: remove some ifdef'd out dlmglue code
We can dump the ocfs2-specifics as well as any definitions that have been
exported via the header file. This makes reading through and modifying
dlmglue much more palatable.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-10-04 08:55:14 -07:00
Mark Fasheh
17c6025cb7 scoutfs: clean up some comments in lock.c
We finished the lock lru work and can remove these TODO comments.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-10-04 08:55:14 -07:00
Zach Brown
ccf5301c37 scoutfs: add -Werror for build errors
We insist on a warning free build but it's up to human diligence to
discover and address warnings.  We've also caught errors when compilers
in automated testing saw problems that the compilers in developer
environments didn't.  That is, a human could only have noticed by
investigating the output from successful test runs.

Let's put some weight behind our promise of a warning free build and
turn gcc warnings into errors.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
Zach Brown
b6c592f099 scoutfs: don't dirty btree buffers
The btree sets some buffer head flags before it writes them to satisfy
submit_bh() requirements.  It was setting dirty which wasn't required.
Nothing ever cleared dirty so those buffers sat around and were never
freed.  Each btree block we wrote sat around forever.  Eventually the vm
gets clogged up and the world backs up trying to allocate pages to
write and we see massive stalls.

With this fix we no longer see the 'buffers' vm stat continuously grow
and IO rates are consistent over time.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
Zach Brown
4bc565be39 scoutfs: silence bulk_alloc gcc warning
Some versions of gcc correctly noticed that bulk_alloc() had a
case where it wouldn't initialize ret if the first segno was 0.  This
won't happen because the client response processing returns an error in
this case.  So this just shuts up the warning.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
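
A minimal userspace sketch of the warning and the fix, with hypothetical
names (the real bulk_alloc() lives in the scoutfs client code and
differs in detail): ret is only assigned inside the loop, so gcc can't
prove it's initialized on every path.

	/* maybe-uninit.c: build with gcc -O2 -Wall to see this class of warning */
	#include <stdio.h>

	static int bulk_alloc_sketch(const unsigned long *segnos, int nr)
	{
		int ret = -22;		/* -EINVAL; previously left uninitialized here */
		int i;

		for (i = 0; i < nr; i++) {
			if (!segnos[i])
				break;	/* the server response should never contain 0 */
			ret = 0;	/* the real code records the segno here */
		}

		return ret;
	}

	int main(void)
	{
		unsigned long segnos[] = { 1, 2, 3 };

		printf("%d\n", bulk_alloc_sketch(segnos, 3));
		return 0;
	}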
Zach Brown
c5ddec7058 scoutfs: more aggressively shrink items
The old item shrinking was very conservative.  It would only try to
reclaim items from the front of the range of the oldest items in the
lru.  It would stop making progress if all the items at the front of the
lru were in a range whose initial item couldn't be reclaimed.

With item shrinking not making progress, memory fills with items.
Eventually the system backs up behind an allocation during segment
writing, blocked waiting for free pages.

We fix this by much more aggressively shrinking items.  We now look for
a region of items around the oldest items to shrink.  If those fall in
the middle of a range then we use the memory from the items to construct
a new range and split the existing range.  Now the only way we'll refuse
to shrink items is if they're dirty.  We have a reasonably small cap on
the number of dirty items so we shouldn't get stuck.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
Zach Brown
fd509840d4 scoutfs: use pages for seg shrink object count
The VM wasn't very excited about trying to reclaim our segments when we
returned a small count of large segment objects available for reclaim.
Each segment represents a ton of memory so we want to give the VM more
visibility into the scale of the cache to encourage it to shrink it.  We
define the object count for the seg shrinker as the number of pages of
segments in the lru.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
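
A sketch of the unit conversion described above, with illustrative sizes
(the real segment size and shrinker wiring are scoutfs specifics):
reporting pages rather than whole segments gives the VM a number
proportional to the memory actually held.

	#include <stdio.h>

	#define SEG_SIZE	(1024 * 1024)	/* hypothetical segment size */
	#define PAGE_SZ		4096

	/* report the seg lru to the shrinker in units of pages, not segments */
	static unsigned long seg_shrink_count(unsigned long segs_on_lru)
	{
		return segs_on_lru * (SEG_SIZE / PAGE_SZ);
	}

	int main(void)
	{
		printf("%lu objects\n", seg_shrink_count(10));
		return 0;
	}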
Zach Brown
ccefffe74f scoutfs: add item, range, lock alloc/free counters
Add some counters to track allocation and freeing of our structures that
are subject to shrinking.  This lets us eyeball the counters to see if
we have runaway leaks.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
Zach Brown
15aa09b0c2 scoutfs: add shrink exit trace points
Add trace points that show the incoming nr_to_scan and resulting object
count for shrinker calls.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
43a2d63f79 scoutfs: replace trace_printk in bio.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
e67e500940 scoutfs: turn off tracing in dlmglue.c
Put this behind a #define. Leave the asserts (mlog_bug_on_msg) though
and redefine their macro to printk instead of going to the trace
buffer.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
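
A sketch of the approach, modeled in userspace with illustrative macro
definitions (the real dlmglue macros wrap the kernel's trace_printk()
and printk(); here stderr and abort() stand in for them): tracing
compiles out behind a #define while the assert macro stays and prints
instead of tracing.

	#include <stdio.h>
	#include <stdlib.h>

	#define DLMGLUE_TRACE 0

	#if DLMGLUE_TRACE
	#define mlog(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)
	#else
	#define mlog(fmt, ...) do { } while (0)	/* compiled out */
	#endif

	/* asserts stay enabled and report to the console, not the trace buffer */
	#define mlog_bug_on_msg(cond, fmt, ...)				\
		do {							\
			if (cond) {					\
				fprintf(stderr, fmt, ##__VA_ARGS__);	\
				abort();				\
			}						\
		} while (0)

	int main(void)
	{
		mlog("this call vanishes when DLMGLUE_TRACE is 0\n");
		mlog_bug_on_msg(0, "this assert never fires\n");
		return 0;
	}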
Mark Fasheh
3430edb60b scoutfs: replace trace_printk in item.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
2a07e6f642 scoutfs: replace trace_printk in data.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
deecfa0ad5 scoutfs: replace trace_printk in trans.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
285842086d scoutfs: replace trace_printk in ioctl.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
8ad6ff9d41 scoutfs: replace trace_printk in inode.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
44a19b63c0 scoutfs: replace trace_printk in segment.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
cf3f9fee75 scoutfs: replace trace_printk in lock.c
Also clean up these traces a bit and make a lock_info trace class
which we can expand in a future patch.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
7739a0084e scoutfs: replace trace_printk in xattr.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
87adeb9306 scoutfs: replace trace_printk in manifest.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
0d28930271 scoutfs: replace trace_printk in super.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
a5283e6f2c scoutfs: replace trace_printk in dir.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
2c1f117bef scoutfs: replace trace_printk in compact.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Mark Fasheh
3a5093c6ae scoutfs: replace trace_printk in alloc.c
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-28 13:59:49 -07:00
Zach Brown
215ba7d4ad scoutfs: more reliably set btree parent item bits
Most paths correctly calculated the bits to set in a parent item by
combining the bits set in the child with the half bit of the child
block's position in the ring.

The exception was fixing up the parent item bits after descent by
walking the path.  This mistake caused the parent item half bits to be
zero and prevented migration of blocks from the old half of the ring.

Fix it by introducing a helper function to calculate the parent ref item
bits and consistently using it.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-20 10:26:40 -07:00
Zach Brown
42b33d616e scoutfs: fix btree bit iteration
store_pos_bits() was trying to iterate over bits that were different
between the existing bits set in the item and the new bits that will be
set.  It used an overly clever for_each helper that tried to iterate
only as many times as there were bits.  But it messed up and only used
ffs to find the next bit on the first iteration.  From then on it would
iterate over bits that weren't different.  This caused the counts to be
changed when the bits didn't change, ending up wildly wrong.

Fix this by using a much clearer loop.  It still breaks out when there
are no more different bits and we're only using a few low bits so the
number of iterations is tiny.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-20 10:26:40 -07:00
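
A userspace sketch of the clearer loop, with hypothetical names standing
in for store_pos_bits() and its counts: walk every bit of the xor of the
old and new masks instead of trusting a clever ffs-based iterator.

	#include <stdio.h>

	/* adjust per-bit counts only for bits that actually changed */
	static void adjust_counts(unsigned int old_bits, unsigned int new_bits,
				  int *counts)
	{
		unsigned int diff = old_bits ^ new_bits;
		int bit;

		for (bit = 0; diff; bit++, diff >>= 1) {
			if (!(diff & 1))
				continue;
			if (new_bits & (1u << bit))
				counts[bit]++;	/* bit newly set */
			else
				counts[bit]--;	/* bit newly cleared */
		}
	}

	int main(void)
	{
		int counts[8] = { 0 };

		adjust_counts(0x5, 0x6, counts);	/* bits 0 and 1 differ */
		printf("%d %d %d\n", counts[0], counts[1], counts[2]);
		return 0;
	}

The loop still breaks out as soon as no differing bits remain, and only
a few low bits are in use, so the iteration count stays tiny.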
Zach Brown
0b15cfe7f8 scoutfs: split on btree deletion
We were hitting asserts during deletion: the insertion done while
updating a parent item didn't have enough room in the block.

Our btree has variable length keys.  During a merge it's possible that
the items moved between the blocks can result in the final key of a
block changing from a small key to a large key.  To update the parent
ref the parent block must have as much free space as the difference in
the key sizes.

We ensure free space in parents during descent by trying to split the
block.  Deletion wasn't doing that, it was only trying to merge blocks.
We need to try to split as well as merge during deletion.

And we have to update the merge threshold so that we don't just split
the resulting block again if it doesn't have the min free space for a
new parent item.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-19 21:19:25 -07:00
Zach Brown
1012ee5e8f scoutfs: use block mapping items
Move to static mapping items instead of unbounded extents.
We get more predictable data structures and simpler code but still get
reasonably dense metadata.

We no longer need all the extent code needed to split and merge extents,
test for overlaps, and all that.  The functions that use the mappings
(get_block, fiemap, truncate) now have a pattern where they decode the
mapping item into an allocated native representation, do their work, and
encode the result back into the dense item.

We do have to grow the largest possible item value to fit the worst case
encoding expansion of random block numbers.

The local allocators are no longer two extents but are instead simple
bitmaps: one for full segments and one for individual blocks.  There are
helper functions to free and allocate segments and blocks, with careful
coordination of, for example, freeing a segment once all of its
constituent blocks are free.

_fiemap is refactored a bit to make it more clear what's going on.
There's one function that either merges the next bit with the currently
building extent or fills the current and starts recording from a
non-mergable additional block.  The old loop worked this way but was
implemented with a single squirrely iteration over the extents.  This
wasn't feasible now that we're also iterating over blocks inside the
mapping items.  It's a lot clearer to call out to merge or fill the
fiemap entry.

The dirty item reservation counts for using the mappings are reduced
significantly because each modification no longer has to assume that it
might merge with two adjacent contiguous neighbours.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-19 11:25:38 -07:00
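
A toy version of the decode/modify/encode pattern described above, with
a deliberately naive one-byte-per-block encoding (the real mapping item
uses a denser variable-length encoding and real block numbers):

	#include <stdio.h>

	#define MAP_BLOCKS 8

	/* unpack the item into a native array the callers can work with */
	static void decode_map(const unsigned char *item, unsigned long long *map)
	{
		int i;

		for (i = 0; i < MAP_BLOCKS; i++)
			map[i] = item[i];
	}

	/* pack the modified mapping back into the dense item */
	static void encode_map(unsigned char *item, const unsigned long long *map)
	{
		int i;

		for (i = 0; i < MAP_BLOCKS; i++)
			item[i] = (unsigned char)map[i];
	}

	int main(void)
	{
		unsigned char item[MAP_BLOCKS] = { 0 };
		unsigned long long map[MAP_BLOCKS];

		decode_map(item, map);
		map[3] = 42;		/* e.g. get_block mapping logical block 3 */
		encode_map(item, map);

		printf("block 3 -> %u\n", item[3]);
		return 0;
	}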
Zach Brown
c4f3c26343 scoutfs: add scoutfs_item_update_dirty()
Add a function for updating the value of a dirty item.  Callers can use
this to make changes without having to worry about failure.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-19 11:25:38 -07:00
Zach Brown
5325aff698 scoutfs: add item count update function
The item cache maintains a count of the number of dirty items, keys, and
values.  It updates the counts as it dirties and cleans items.  There
are callers who want to modify the accounting directly instead of having
the accounting updated as a side effect of cleaning and re-dirtying the
item.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-19 11:25:38 -07:00
Zach Brown
7854471475 scoutfs: fix server wq destroy warning
We were seeing warnings in destroy_workqueue() which meant that work was
queued on the server workqueue after it was drained and before it was
finally destroyed.

The only work that wasn't properly waited for was the commit work.  It
looks like it'd be idle because the server receive threads all wait for
their request processing work to finish.  But the way the commit work is
batched means that a request can have its commit processed by executing
commit work while leaving the work queued for another run.

Fix this by specifically waiting for the commit work to finish after the
server work has waited for all the recv and compaction work to finish.

I wasn't able to reliably trigger the assertion in repeated xfstests
runs.  This fix also survived many runs; let's see if it stops the
destroy_workqueue() assertion from triggering in the future.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-12 15:22:03 -07:00
Zach Brown
f0a7c4f294 scoutfs: make trans item count const rhs
The item count estimate functions didn't obviously differentiate between
adding to a count and resetting it.  Most callers initialized the count
struct to 0 on the stack, incremented their estimate once, and passed it
in.  The problem is that those same functions that increment once in
callers are also used in other estimates to build counts based on
multiple operations.

This tripped up the data truncate path.  It looped and kept incrementing
its count while truncating a file with lots of extents until the count
got so large that it didn't fit in a segment by itself and blocked
forever.

This cleans up the item count code so that it's much harder to get
wrong.  We differentiate between the SIC_*() high level count estimates
that are meant to be passed in to _hold_trans(), and the internal
__count_*() functions which are used to add up the item counts that make
up an aggregate operation.

With this fix the only way to use the count in extent truncation is to
correctly reset it to the item count for each transaction.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-12 10:41:55 -07:00
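
A userspace sketch of the split between the SIC_*()-style estimates and
the internal __count_*() helpers; the struct fields, names, and numbers
are illustrative, not the real scoutfs counts.  Because the high-level
estimate is returned by value, a looping caller gets a fresh count every
time and can't accidentally accumulate one.

	#include <stdio.h>

	struct item_count {
		long items;
		long keys;
		long vals;
	};

	/* internal helpers add into an aggregate count */
	static void __count_dirty_inode(struct item_count *cnt)
	{
		cnt->items += 1;
		cnt->keys += 1;
		cnt->vals += 32;
	}

	static void __count_extent_item(struct item_count *cnt)
	{
		cnt->items += 1;
		cnt->keys += 2;
		cnt->vals += 16;
	}

	/* high-level estimate: always built from zero, returned by value */
	static struct item_count SIC_truncate_extent(void)
	{
		struct item_count cnt = { 0, 0, 0 };

		__count_dirty_inode(&cnt);
		__count_extent_item(&cnt);
		return cnt;
	}

	int main(void)
	{
		/* each pass of a truncate loop gets a freshly reset estimate */
		struct item_count cnt = SIC_truncate_extent();

		printf("items %ld keys %ld vals %ld\n",
		       cnt.items, cnt.keys, cnt.vals);
		return 0;
	}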
Zach Brown
e165d89f7f scoutfs: warn on invalid item counts
We had a bug where a caller was slowly increasing their item count for
every transaction they attempted in a loop.  Eventually the item count
grew to be too large to fit in a segment and they slept indefinitely.

Let's warn on invalid and impossibly large item counts as we enter
transactions.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-12 10:41:55 -07:00
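
A sketch of the kind of sanity check described, with an entirely
illustrative cap (the real limit would be derived from the segment
size):

	#include <assert.h>
	#include <stdio.h>

	#define MAX_TRANS_ITEMS	10000	/* hypothetical: must fit in one segment */

	static int hold_trans_sketch(long items)
	{
		if (items <= 0 || items > MAX_TRANS_ITEMS) {
			fprintf(stderr, "warning: bad trans item count %ld\n", items);
			return -1;
		}
		/* ... enter the transaction ... */
		return 0;
	}

	int main(void)
	{
		assert(hold_trans_sketch(100) == 0);
		assert(hold_trans_sketch(1L << 30) < 0);
		return 0;
	}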
Zach Brown
ba40899e84 scoutfs: remove scoutfs_trans_wake_holders()
This was used by compaction to wake local holders who were waiting for
compaction to free up level 0 segments for them to enter the
transaction.  Throttling level 0 segment writes works differently now
and doesn't involve blocking trans holders.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-12 10:41:55 -07:00
Mark Fasheh
5fa97018e5 scoutfs: get rid of dlmglues ocfs2_uninit_super
All this does is uninitialize the dlmglue debug infrastructure, but we
already do that in ocfs2_dlm_shutdown(), which resulted in some double
frees on unmount.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-11 21:57:31 -07:00
Zach Brown
fbfbe910aa scoutfs: return error from lock_name_keys
xfstests generic/028 was crashing dereferencing NULL locks.   It'd hit
either rename trying to refresh an inode with a NULL lock or lookup
trying to pass a NULL lock's end to item lookup.

The addition of the lock LRU fixed a bug in lock_name_keys() where it
wouldn't drop a lock when _cluster_lock() returned an error.  But it
always returned 0 instead of returning the error.  Returning 0 without
setting the lock caused the callers to deref their NULL locks.

We also forcefully NULL the lock at the start of the function.  It was
lucky that callers had already NULLed their locks.  If they hadn't they
would have been following random on-stack memory and it might have been
harder to debug.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-11 09:27:57 -07:00
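
A userspace sketch of the shape of the fix, with stand-in names and a
stubbed lock call: clear the caller's pointer up front and propagate the
real error instead of returning 0 with a NULL lock.

	#include <stdio.h>

	struct lock { int unused; };

	/* stub standing in for the clustered lock call failing */
	static struct lock *cluster_lock_stub(int *err)
	{
		*err = -5;	/* pretend -EIO */
		return NULL;
	}

	static int lock_name_keys_sketch(struct lock **ret_lock)
	{
		struct lock *lck;
		int err;

		*ret_lock = NULL;	/* don't leave callers with stack garbage */

		lck = cluster_lock_stub(&err);
		if (!lck)
			return err;	/* the bug was returning 0 here */

		*ret_lock = lck;
		return 0;
	}

	int main(void)
	{
		struct lock *lck;
		int ret = lock_name_keys_sketch(&lck);

		printf("ret %d lock %p\n", ret, (void *)lck);
		return 0;
	}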
Mark Fasheh
b1fff0997e scoutfs: dlmglue should initialize res->l_debug_list
We're missing initialization of this field. It should never cause a
problem today because we always do a list add immediately afterwards
but let's be extra careful here and initialize it just in case. We
also add a sanity check in ocfs2_add_lockres_tracking() that the
lockres hasn't already been put on the debug list.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-08 13:29:56 -07:00
Mark Fasheh
f276771d8c scoutfs: we need to uninitialize the dlmglue lockres
The dlmglue lockres gets put on a debugging list at initialization time,
and taken off the list at uninit time. We were missing the uninit
portion of this cycle, causing some list debugging warnings.

Call ocfs2_lock_res_free() in free_scoutfs_lock(). In addition, we had
a raw kfree() of a scoutfs lock in find_alloc_scoutfs_lock() which also
needed to be replaced.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-08 13:29:56 -07:00
Zach Brown
2d6d113e03 scoutfs: continue index walk after lock
We saw inode index queries spinning.  They were finding no cached
entries in their locked region but the next key in the segments was in
the region.  This can happen if an item has been deleted in the current
transaction.  The query won't walk up into the new dirty seq but it
will try to walk the old seq.  The item will still be in the segments
but won't be visible to item_next because it's marked deleted.  The
query will spin finding the next stale key to read from and finding it
missing in the cache.

This is fixed by taking the current coherent cache at its word.  When
it tells us there are no entries we advance the key that we check the
manifest for past the locked region.  In this case it'll skip past the
cached delete item and look for the next key in the segments.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-08 10:27:07 -07:00
Mark Fasheh
9461104f8e scoutfs: Use LRU for locks
The move to dlmglue necessitated removing the old lru code.  That's
fine; it didn't work anyway.

We can't drop locks in the shrinker directly, so instead we have the
shrinker put them on a workqueue where they are dropped.

The rules for the LRU are simple. Locks get a users count. Any process
that holds the lock or is in the process of acquiring the lock
increments this count. When unlock is called, the count is decremented.
We can use the value of this count to manage the LRU - scoutfs_unlock
puts locks on the LRU when the count reaches zero, lock_name_keys()
always takes them off.

If the lock is selected for reclaim, callers wanting to use the lock
will need to wait.  We achieve this with a pair of flags.
SCOUTFS_LOCK_RECLAIM is used to indicate that the lock is now queued for
reclaim.  Once the lock is ready to be destroyed, we set the
SCOUTFS_LOCK_DROPPED flag, telling callers to put the lock and retry
their rbtree search.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-08 09:57:04 -07:00
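
A toy model of the LRU rules spelled out above; the real code uses an
rbtree, spinlocks, shrinker callbacks, and a reclaim workqueue, so
everything here is illustrative.

	#include <stdbool.h>
	#include <stdio.h>

	struct toy_lock {
		int users;	/* holders plus in-progress acquirers */
		bool on_lru;
		bool reclaim;	/* queued for reclaim: new users must wait */
		bool dropped;	/* destroyed: callers retry their rbtree search */
	};

	/* lock_name_keys() always takes the lock off the LRU */
	static void toy_lock_get(struct toy_lock *lck)
	{
		lck->users++;
		lck->on_lru = false;
	}

	/* scoutfs_unlock puts idle locks on the LRU */
	static void toy_unlock(struct toy_lock *lck)
	{
		if (--lck->users == 0)
			lck->on_lru = true;
	}

	int main(void)
	{
		struct toy_lock lck = { 0 };

		toy_lock_get(&lck);
		toy_unlock(&lck);
		printf("users %d on_lru %d\n", lck.users, (int)lck.on_lru);
		return 0;
	}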
Mark Fasheh
4bb5cadaf0 scoutfs: remove dead code in lock.[ch]
With the dlmglue transition over, we can finally remove some of the no
longer used portions of lock.[ch].

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-09-08 09:57:04 -07:00
Zach Brown
79110a74eb scoutfs: prevent partial block stage, except final
The staging ioctl is just a thin wrapper around writing.  If we allowed
partial-block staging then the write would zero a newly allocated block
and only stage in the partial region of the block, leaving zeros in the
file that didn't exist before.

We prevent staging when the starting offset isn't block aligned.  We
prevent staging when the final offset isn't block aligned unless it
matches the size because the stage ends in the final partial block of
the file.

This is verified by an xfstest (scoutfs/003) that is in flight.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-07 13:49:37 -07:00
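
The alignment rule is easy to state in code.  A sketch with an
illustrative block size:

	#include <stdbool.h>
	#include <stdio.h>

	#define BLOCK_SIZE 4096

	/* start must be block aligned; the end must be block aligned unless
	 * it lands exactly on i_size, the final partial block of the file */
	static bool stage_allowed(long long start, long long len, long long i_size)
	{
		long long end = start + len;

		if (start & (BLOCK_SIZE - 1))
			return false;
		if ((end & (BLOCK_SIZE - 1)) && end != i_size)
			return false;
		return true;
	}

	int main(void)
	{
		printf("%d\n", stage_allowed(0, 4096, 10000));	  /* 1: aligned */
		printf("%d\n", stage_allowed(0, 100, 10000));	  /* 0: partial, not at EOF */
		printf("%d\n", stage_allowed(8192, 1808, 10000)); /* 1: ends at i_size */
		return 0;
	}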
Zach Brown
e7b5cd4c66 scoutfs: limit get_block bh use
We were seeing __block_write_begin spin when staging writes were called
after a read of an offline region saw an error.  It turns out that
the way __block_write_begin iterates through buffer heads on a page will
livelock if b_size is 0.

Our get_block was clearing b_blocknr and b_size before doing anything.
It'd set them when it allocated blocks or found existing mapped
blocks.  But it'd leave them 0 on an error and trigger this hang.

So we'll back off and only do the same things to the result bh that
ext2/3 do; presumably that's what's actually supported.  We only set
mapped, set or clear new, and set b_size to less than the input b_size.

While we're at it we remove a totally bogus extent flag check that's
done before seeing if the next extent we found even intersects with the
logical block that we're searching for.  The extra test is performed
again correctly inside the check for the extents overlapping.  It is an
artifact from the days when the "extents" were a single block and didn't
need to check for overlaps.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-07 13:49:37 -07:00
Zach Brown
51e8b614e5 scoutfs: stop livelocking in item_next
scoutfs_item_next() could livelock given the right key and segment key
boundaries.  This was easiest to trigger with an fio command that wrote
a lot of data:

	fio --filesize=100m --nrfiles=25 --name=100m --numjobs=1 \
		--iodepth=1 --ioengine=sync --fallocate=0 \
		--rw=write --openfiles=256 \
		--directory=$TEST_DIR

There were two problems.

First, if it found a cached region that didn't contain a next item it
would try to read the *end* of the existing cached region instead of
trying to populate more items by reading from the key past the existing
cached region.  This is fixed by incrementing the key to read from after
setting it to the end of the cached region.

Second, it got totally confused by non-merged but adjacent cached
regions.  It would find a cached region that contains the search key and
try to read from the key after that region, but that key could also be
cached and just not merged with its previous region.  This is fixed by
(duh) having an allocated pos key that we set as we walk through cached
regions.  It used to always try to read from the search key, which was
bonkers.

With these fixes fio now completes.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-05 14:22:47 -07:00
Zach Brown
82f8daaebf scoutfs: print trailing key bytes
The key printing functions only output the key material that's described
by the format.  We have some callers that need to increment or decrement
keys so they expand them to full size keys.  This expansion, and the
extra high-precision but low-significance bytes, was hidden from the
traces.

This adds a helper that prints the key material with the format and then
appends an encoding of the trailing bytes.

The key printer was a huge mess of cases and ifs that made it hard to
integrate a sane helper.  We also take the opportunity to break it up
into zone|type key printer functions.  The isolation makes it much
clearer to see what's going on.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-05 14:22:47 -07:00
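
A sketch of the helper's behaviour with hypothetical names: print the
fields the format describes, then hex-encode whatever trailing bytes are
left over from expanding the key to full size.

	#include <stdio.h>

	static void print_key_and_trailer(const char *formatted,
					  const unsigned char *key, size_t key_len,
					  size_t fmt_len)
	{
		size_t i;

		printf("%s", formatted);
		for (i = fmt_len; i < key_len; i++)
			printf("%s%02x", i == fmt_len ? " trailer " : "", key[i]);
		printf("\n");
	}

	int main(void)
	{
		unsigned char key[] = { 0x01, 0x02, 0x03, 0xff, 0xff };

		/* pretend the format only describes the first 3 bytes */
		print_key_and_trailer("ino 0x10203", key, sizeof(key), 3);
		return 0;
	}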
Zach Brown
2eddeb5db4 scoutfs: delete unused net key types
Remove these key types from the format which haven't been used for a
while.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-05 14:22:47 -07:00
Zach Brown
76cf28b442 scoutfs: warn if lock with trans held
We can't block on a lock while holding the transaction open because
that'd stop lock downconversion from syncing to write out items while it
is converting from EX.  Add a warning if we try to acquire a blocking
lock while holding a transaction.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:38:00 -07:00
Zach Brown
599269e539 scoutfs: don't return uninit index entries
Initially the index walking ioctl only ever output a single entry per
iteration.  So the number of entries to return and the next entry
pointer to copy to userspace were maintained in the post-increment of
the for loop.

When we added locking of the index item results we made it possible to
not copy any entries in a loop iteration.  When that happened the nr and
pointer would be incremented without initializing the entry.  The ioctl
caller would see a garbage entry in the results.

This was visible in scoutfs/002 test results on a volume that had an
interesting file population after having run through all the other
scoutfs tests.  The uninitialized entries would show up as garbage in
the size index portion of the test.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:38:00 -07:00
Zach Brown
a8db7e5b74 scoutfs: stop iteration at lock end value
If item iteration finds a hole in the cache it tries to read items.
After the items are read it can look at the cached region and return
items or -ENOENT.  We recently added an end key to limit how far we can
read and cache items.

The end key addition correctly limited the cache read to the lock end
value.  It could never read item cache ranges beyond that.  This means
that it can't iterate past the end value and should return -ENOENT if it
gets past end.  But the code forgot to do that; it only checked for
iteration past last before returning -ENOENT.  It spins continually,
finding a hole past end but inside last, trying to read items but
limiting them to end, then finding the same hole again.

Triggering this requires a lock end that's nearer than the last
iteration key.  That's hard to do because most of our item reads are
covered by inode group locks which extend well past iteration inside a
given inode.  Inode index items can easily trigger this if there are no
items.  I tripped over it when walking empty indexes (data_seq or
online_blocks with no regular files).

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:38:00 -07:00
Zach Brown
ca78757ca5 scoutfs: more careful client connect timeouts
The client connection loop was a bit of a mess.  It only slept between
retries in one particular case.  Other failures to connect would spin
and livelock forever.

The fixed loop now has a much more orderly reconnect procedure.  Each
connecting sender always tries once.  Then retry attempts back off
exponentially, settling at a nice long timeout.  After long enough it'll
return errors.

This fixes livelocks in the xfstests that mount and unmount around
dm-flakey config.  generic/{034,039,040} would easily livelock before
this fix.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:38:00 -07:00
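
A sketch of the backoff schedule described above; the delays and cap
here are illustrative, not the client's actual timeouts.

	#include <stdio.h>

	#define MIN_DELAY_MS	100
	#define MAX_DELAY_MS	(60 * 1000)

	/* first attempt is immediate, then the wait doubles up to a cap */
	static unsigned int next_delay_ms(unsigned int attempt)
	{
		unsigned long delay;

		if (attempt == 0)
			return 0;
		if (attempt > 20)		/* avoid shifting past the cap anyway */
			return MAX_DELAY_MS;

		delay = (unsigned long)MIN_DELAY_MS << (attempt - 1);
		return delay > MAX_DELAY_MS ? MAX_DELAY_MS : (unsigned int)delay;
	}

	int main(void)
	{
		unsigned int i;

		for (i = 0; i < 12; i++)
			printf("attempt %u: wait %u ms\n", i, next_delay_ms(i));
		return 0;
	}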