get_manifest_refs was using the btree root in its stale copy of the
super block. It is supposed to use the btree root that it was given by
its caller, who went to the trouble of finding a sufficiently current
btree root.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab: added commit message and fixed formatting]
Signed-off-by: Zach Brown <zab@versity.com>
Otherwise we get into a problem where the listen lock conflicts with
regular inode group requests. Since we never drop the listen lock and it (by
design) blocks progress on another node, those inode group requests may
hang.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
The delayed downconvert work wasn't being canceled on shutdown. 60s
after unmount, at least the net lock's timer would fire and crash trying
to queue the delayed work on the destroyed workqueue.
Proactively unlocking the locks isn't always beneficial to begin with.
The relative costs of mispredicting the future are wildly different if
we have to re-read item caches from segments or have to downconvert a
blocking read lock.
So we can just remove the delayed work to fix the bug and remove a
moving piece that would need to be considered and tuned. There's still
a race where we can get basts after destroying the workqueue but before
we destroy the lockspace; we'll get there.
Signed-off-by: Zach Brown <zab@versity.com>
These transformations are mechanical and there aren't many callers of
these so we combine them into one commit.
Signed-off-by: Zach Brown <zab@versity.com>
Add an end argument to _set_batch to specify the limit of
items we'll read into the cache.
And it turns out that the loop in _set_batch that was meant to cache all the
items covered by the batch didn't try hard enough. It would stop once
the first key was covered but didn't make sure that the coverage
extended to cover last. This can happen if segment boundaries happen to
fall within the items that make up the batch. Fix it up while we're in
here.
Signed-off-by: Zach Brown <zab@versity.com>
Add locks around inode index item iteration. This is tricky because the
inode index items are enormous and we can't default to coarse locks that
let it read and iterate over the entire key space. We use the manifest
to find the next small fixed size region to lock and iterate from.
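Roughly, the walk locks one small region at a time, something like the
sketch below (the u64 key model and all of the helper names here are
hypothetical stand-ins, just to show the shape of the loop):

  #include <stdbool.h>
  #include <stdint.h>

  #define REGION_KEYS 1024ULL     /* assumed small fixed size lock region */

  /* hypothetical stand-ins for the manifest query and lock calls */
  bool manifest_next_key(uint64_t from, uint64_t *found);
  void lock_index_region(uint64_t start, uint64_t end);
  void unlock_index_region(uint64_t start, uint64_t end);
  int iterate_region(uint64_t key, uint64_t end);

  /* lock one small region of the index key space at a time and walk it */
  int iterate_inode_index(uint64_t key, uint64_t last)
  {
          uint64_t found, start, end;
          int ret = 0;

          while (key <= last) {
                  /* ask the manifest where the next index item might be */
                  if (!manifest_next_key(key, &found) || found > last)
                          break;

                  /* lock a fixed size region, never the whole key space */
                  start = found - (found % REGION_KEYS);
                  end = start + REGION_KEYS - 1;

                  lock_index_region(start, end);
                  ret = iterate_region(found, end < last ? end : last);
                  unlock_index_region(start, end);

                  if (ret < 0)
                          break;

                  key = end + 1;
          }

          return ret;
  }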
Signed-off-by: Zach Brown <zab@versity.com>
Add an end key to the item_next calls to limit how many items will be
read into the cache. Callers typically get this from the lock they hold
that covers the iteration. We differentiate between iteration and
caching so that a series of small iterations (listxattr on inodes,
namespace walk in small dirs) can be satisfied by a single read of
adjacent items from segments.
Signed-off-by: Zach Brown <zab@versity.com>
Add a locking wrapper for the inode index items. It maps the index
fields to a lock name for each index type.
Signed-off-by: Zach Brown <zab@versity.com>
Add an item reading variant that just returns the next key that it finds
in segments after the given key. This will be used while iterating
to find the next key to lock and then try to iterate towards.
Signed-off-by: Zach Brown <zab@versity.com>
The item cache can only be populated with items that are covered by
locks. Require callers to provide the farthest key that can be covered
by the locks. Locks provide a key for exactly this purpose.
Signed-off-by: Zach Brown <zab@versity.com>
Let both check_range and read_items take a NULL end. check_range just
doesn't do anything with the end of the range. read_items defaults
to trying to read as many items as it can but clamps to the extent of
the segments that intersect with the key.
This will let us incrementally add end arguments to the item functions
that are initially passed in as NULL by callers as we add lock coverage.
Signed-off-by: Zach Brown <zab@versity.com>
We don't need the dlm to track key ranges if we implement ranges by
mapping keys to resources which represent ranges of the key space.
Signed-off-by: Zach Brown <zab@versity.com>
Instead of locking one resource with ranges we'll have callers map their
logical resources to a tuple name that we'll store in lock resources.
The names still map to ranges for cache reading and cache invalidation
but the ranges aren't exposed to the DLM. This lets us use the stock
DLM and distribute resources across masters.
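As a rough sketch of the idea (the field names, sizes, and group size
below are made up, not the actual lock name format): the name stored in
the resource is a small fixed tuple, and the same tuple deterministically
maps back to the range of keys the lock covers:

  #include <stdint.h>

  /* hypothetical tuple packed into each DLM resource name */
  struct lock_name {
          uint8_t scope;                  /* which logical resource type */
          uint8_t zone;                   /* key zone the lock covers */
          uint64_t first;                 /* e.g. first inode in the group */
          uint64_t second;                /* e.g. index value, if any */
  };

  #define INODE_GROUP_SIZE 64ULL          /* assumed inodes per lock */

  /*
   * The name alone tells us which item cache keys to read under the
   * lock and which to invalidate when it's dropped; the DLM itself only
   * ever sees the opaque name.
   */
  static void lock_name_to_key_range(const struct lock_name *name,
                                     uint64_t *start_ino, uint64_t *end_ino)
  {
          *start_ino = name->first - (name->first % INODE_GROUP_SIZE);
          *end_ino = *start_ino + INODE_GROUP_SIZE - 1;
  }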
Signed-off-by: Zach Brown <zab@versity.com>
We intend to use more of the dlm lock levels. Let's use its modes
directly so we don't have to maintain a mental map between differently
named modes.
Signed-off-by: Zach Brown <zab@versity.com>
Holding a DLM lock protects a range of the key space. The DLM locks
span inodes or regions of inodes. We need the sort order in LSM items
to match the DLM range keys so that we can read all the items covered by
a lock into the cache from a region of LSM segments. If their orders
differed then we'd have to jump around segments to find all the items
covered by a given DLM lock.
Previously we were sorting by type then, within types, by inode. Now we
want to sort by inode then by type. But there are structures which
previously had a type but weren't then sorted by inode. We introduce
zones as the primary sort key. Inode index and node zones are sorted by
the inode fields and node ids respectively. Then comes the fs zone,
sorted first by inode and then by the type of the key.
The bulk of this is the mechanical introduction of the zone field to the
keys, moving the type field down, and a bulk rename of _KEY to _TYPE.
But there are some more substantial changes.
The orphan keys needed to be put in a zone. They fit in the NODE zone
which is all about resources that nodes hold and would need to be
cleaned up if the node went away.
The key formatting is significantly changed to match the new layout.
Formatted keys are now generally of the form "zone.primary.type..."
And finally with the keys now properly sorted by inodes we can correctly
construct a single range of item cache keys to invalidate when unlocking
the inode group locks.
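The resulting ordering can be pictured with a comparison like the
following sketch (the struct layout and field widths are illustrative,
not the on-disk key format):

  #include <stdint.h>

  /* illustrative key layout: zone is the primary sort field */
  struct key {
          uint8_t zone;           /* inode index, node, or fs zone */
          uint64_t primary;       /* inode number or node id */
          uint8_t type;           /* item type within the zone */
          uint64_t offset;        /* remaining type-specific fields */
  };

  /*
   * Sort by zone, then inode/node id, then type: all the items for one
   * inode are adjacent, so one lock covers one contiguous run of keys,
   * and formatted keys read as "zone.primary.type..."
   */
  static int compare_keys(const struct key *a, const struct key *b)
  {
          if (a->zone != b->zone)
                  return a->zone < b->zone ? -1 : 1;
          if (a->primary != b->primary)
                  return a->primary < b->primary ? -1 : 1;
          if (a->type != b->type)
                  return a->type < b->type ? -1 : 1;
          if (a->offset != b->offset)
                  return a->offset < b->offset ? -1 : 1;
          return 0;
  }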
Signed-off-by: Zach Brown <zab@versity.com>
We're going to need to be able to sample the current stable manifest
root occasionally. We're adding it now because we don't yet
have the lock plumbing that would provide the lvb. Eventually
this call will bubble up into the locking and the root will be
stored in the lock instead of always requested.
Signed-off-by: Zach Brown <zab@versity.com>
Convert the net server metadata dirtying and committing code to use the
btree instead of the ring. It has to be careful to set up and tear down
the btree info as it starts up and shuts down the server.
This fixes up some questionable setup/teardown changes made in the
previous patches to convert the manifest and allocator. We could rebase
the patches to merge those together. But given that the previous
patches don't work at all without the net updates it might not be worth
the trouble.
Signed-off-by: Zach Brown <zab@versity.com>
Convert the segment allocator to store its free region bitmaps in the
btree.
This is a very straightforward mechanical transformation. We split the
allocator region into a big-endian index key and the bitmap value
payload. We're careful to operate on aligned copies of the bitmaps so
that they're long aligned.
We can remove all the funky functions that were needed when writing the
ring. All we're left with is a call to apply the pending allocations to
dirty btree blocks before writing the btree.
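The shape of the items looks roughly like this sketch (sizes and names
are made up; the point is the big-endian index key and copying the
bitmap into a long aligned buffer before operating on it):

  #include <endian.h>
  #include <stdint.h>
  #include <string.h>

  #define REGION_BITS     4096                            /* segnos per region */
  #define REGION_LONGS    (REGION_BITS / (8 * sizeof(long)))

  /* illustrative btree key: one region index per bitmap value */
  struct alloc_key {
          uint64_t region_index_be;
  };

  static void init_alloc_key(struct alloc_key *key, uint64_t region_index)
  {
          /* big-endian so the btree's memcmp ordering sorts regions */
          key->region_index_be = htobe64(region_index);
  }

  /*
   * The btree value is just the raw bitmap bytes with no alignment
   * guarantee, so copy it into a long aligned buffer before using bit
   * operations on it and copy it back afterwards.  A set bit b in
   * region r means segno r * REGION_BITS + b is free.
   */
  static void load_region(const uint8_t *value, unsigned long *bits)
  {
          memcpy(bits, value, REGION_LONGS * sizeof(long));
  }

  static void store_region(uint8_t *value, const unsigned long *bits)
  {
          memcpy(value, bits, REGION_LONGS * sizeof(long));
  }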
Signed-off-by: Zach Brown <zab@versity.com>
Convert the manifest to store entries in persistent btree keys and
values instead of using the rbtree in memory from the ring.
The btree doesn't have a sort function. It just compares variable
length keys. The most complicated part of this transformation is
dealing with the fallout of this. The compare function can't compare
different search keys and item keys so searches need to construct full
synthetic btree keys to search. It also can't return different
comparisons, like overlapping, so the caller needs to do a bit more work
to use key comparisons to find overlapping segments. And it can't
compare differently depending on the level of the manifest so we store
the manifest in keys differently depending on whether it's in level 0 or
not.
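For example, finding overlapping segments ends up looking something like
the sketch below, where searches build full synthetic first/last keys
and the caller does the interval test that the btree compare can't
express (memcmp keys and made-up names, not the real code):

  #include <string.h>

  #define KEY_LEN 24      /* made-up fixed synthetic key length */

  struct seg_entry {
          unsigned char first[KEY_LEN];   /* first key in the segment */
          unsigned char last[KEY_LEN];    /* last key in the segment */
  };

  /*
   * The btree compare can only say less/equal/greater, so the caller
   * tests overlap itself: a segment overlaps the search range if its
   * first key isn't past the range's end and its last key isn't before
   * the range's start.
   */
  static int seg_overlaps(const struct seg_entry *ent,
                          const unsigned char *start, const unsigned char *end)
  {
          return memcmp(ent->first, end, KEY_LEN) <= 0 &&
                 memcmp(ent->last, start, KEY_LEN) >= 0;
  }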
All mount clients can now see the manifest blocks. They can query the
manifest directly when trying to find segments to read. We can get rid
of all the networking calls that were finding the segments for readers.
We change the manifest functions that relied on the ring so that they
make changes in the manifest persistent. We don't touch the allocator
or the rest of the manifest server, though, so this commit breaks the
world. It'll be restored in future patches as we update the segment
allocator and server to work with the btree.
Signed-off-by: Zach Brown <zab@versity.com>
Add a cow btree whose blocks are stored in a persistently allocated
ring. This will let us incrementally index very large data sets
efficiently.
This is an adaptation of the previous btree code which now uses the
ring, stores variable length keys, and augments the items with bits that
are or'd up through parents.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have a dlm, this is a needless indirection. Merge all fields
back into the lock_info struct.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We refcount our locks and hold them across system calls. If another node
wants access to a given lock we'll mark it as blocking in the bast and queue
a work item so that the lock can later be released. Otherwise locks are
freed under memory pressure, at unmount, or after a timer fires.
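In outline the pattern looks like the sketch below, using the stock
fs/dlm bast callback shape; the scoutfs_lock fields and the workqueue
are hypothetical names for illustration:

  #include <linux/bitops.h>
  #include <linux/dlm.h>
  #include <linux/kref.h>
  #include <linux/workqueue.h>

  /* hypothetical lock struct; only the shape of the pattern is real */
  struct scoutfs_lock {
          struct kref ref;                /* held across system calls */
          unsigned long flags;
          struct work_struct dc_work;     /* drops the lock later */
  };
  #define LOCK_BLOCKING 0

  extern struct workqueue_struct *lock_wq;        /* assumed lock workqueue */

  /* blocking ast: another node wants a conflicting mode */
  static void scoutfs_bast(void *astarg, int mode)
  {
          struct scoutfs_lock *lock = astarg;

          /* just mark it and queue work; the lock is released once the
           * local holders that have it refcounted are done with it */
          set_bit(LOCK_BLOCKING, &lock->flags);
          queue_work(lock_wq, &lock->dc_work);
  }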
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
The refactoring of compaction to use the skip lists changed the nature
of item insertion. Previously it would precisely count the number of
items to insert. Now it discovers that the current output segment is
full by having _append_item() return false.
In this case the cursors currently point to the item that would have
been inserted but failed. The compact_items() caller loops around to
allocate the next segment. Then it calls compact_items() again and it
mistakenly advances *past* the current item that still needed to be
inserted.
Hiding next_item() away from the segment loop made it hard to see this
mechanism. Let's drop the compact_items() function and bring item
iteration and item appending into the main loop so we can more carefully
advance or not as we write and allocate new segments.
This stops losing items at segment boundaries.
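The important property of the reworked loop is that the cursor only
advances once an item has actually been appended; a standalone model of
that shape (made-up types, not the scoutfs code) looks like:

  #include <stdbool.h>
  #include <stddef.h>

  struct item { int key; };
  struct seg { struct item items[4]; size_t nr; };

  /* returns false when the current output segment is full */
  static bool append_item(struct seg *seg, const struct item *item)
  {
          if (seg->nr == 4)
                  return false;
          seg->items[seg->nr++] = *item;
          return true;
  }

  /* write items into as many output segments as it takes; returns the
   * number of items written */
  static size_t write_items(const struct item *items, size_t nr,
                            struct seg *segs, size_t nr_segs)
  {
          size_t i = 0, s = 0;

          while (i < nr) {
                  if (append_item(&segs[s], &items[i])) {
                          /* only advance after a real append */
                          i++;
                  } else if (++s == nr_segs) {
                          /* out of output segments */
                          break;
                  }
                  /* else: retry the very same item in the new segment */
          }
          return i;
  }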
Signed-off-by: Zach Brown <zab@versity.com>
If the starting key for a segment read doesn't fall in a segment then we
have to include the next segment from that level in the read. If we
don't then the read can think that there are no more items at that level
and assume that all the items in the upper level are all that exist.
Signed-off-by: Zach Brown <zab@versity.com>
Writers can add level 0 segments much faster (~20x) than compaction can
compact them down into the lower levels. Without a limit on the number
of level 0 segments item reading can try to read an extraordinary number
of level 0 segments and wedge the box with nonreclaimable page allocations.
Signed-off-by: Zach Brown <zab@versity.com>
Back when we changed the transaction commit to ask the server to update
the commit we accidentally lost the put of the level0 segment that was
just written. This leaked refcount would pin segments over time and
eventually drag the box into crippling oom.
Signed-off-by: Zach Brown <zab@versity.com>
We want to be able to read a region of items from a segment by searching
for the key that starts the item. In the first version of the segment
format we find a key by performing a binary search across an array of
offsets that point to the items.
Unfortunately the current format requires that we know the number of
items before we start writing. With thousands of items per segment it's
a little bonkers to ask compaction to walk through all the items twice.
Worse still, we didn't want the item offset array entries to span pages
so they're rounded up to a power of two once they hold seqs, offsets,
and lengths. This makes them surprisingly large and sometimes they can
consume up to 60% (!) of a segment.
We know that we're inserting in sort order so it's very easy to build an
index as we insert. Skip lists give us a nice simple way to ensure
O(log n) lookups with only an average of two links per node.
CPU use is greatly reduced by removing a full redundant item walk and we
now use up almost all of the space in segments. There are still little
gaps at the ends of blocks as items still won't cross block boundaries.
Most of this change is safely mechanical. The big difference is in how
the compaction loop is built. It used to count the items beforehand.
It would never try to append when out of segments and writing would stop
after the exact number of items. Now it discovers it's out of items by
allocating and trying to append and finding that there's no more work to
do. It required rethinking the loop exit and segment allocation and
stopping conditions.
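The indexing idea itself, independent of the segment layout, is just an
ordinary skip list built during the in-order append; a toy standalone
version (not the on-disk format) looks like:

  #include <stdlib.h>

  #define MAX_LEVEL 16

  struct node {
          int key;
          struct node *next[MAX_LEVEL];
  };

  static struct node head;                        /* sentinel, all levels */
  static struct node *tails[MAX_LEVEL];           /* last node per level */

  /* flip coins: level i is used with probability 1/2^i, ~2 links/node */
  static int random_level(void)
  {
          int level = 1;

          while (level < MAX_LEVEL && (rand() & 1))
                  level++;
          return level;
  }

  /* append a key that sorts after everything inserted so far */
  static void append_key(int key)
  {
          struct node *node = calloc(1, sizeof(*node));
          int level = random_level();
          int i;

          node->key = key;
          for (i = 0; i < level; i++) {
                  struct node *tail = tails[i] ? tails[i] : &head;

                  tail->next[i] = node;
                  tails[i] = node;
          }
  }

  /* O(log n) expected search: descend levels, walking right when we can */
  static struct node *search(int key)
  {
          struct node *node = &head;
          int i;

          for (i = MAX_LEVEL - 1; i >= 0; i--)
                  while (node->next[i] && node->next[i]->key < key)
                          node = node->next[i];
          node = node->next[0];
          return node && node->key == key ? node : NULL;
  }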
Signed-off-by: Zach Brown <zab@versity.com>
Add a relatively small universal value size limit. This will be needed
by more dense item packing to predict the worst case padding to avoid
full items crossing block boundaries.
We refactor the existing symlink and xattr item value limit to use this
new limit.
Signed-off-by: Zach Brown <zab@versity.com>
We're shrinking the max item value size so we need to store symlinks
with large target paths in multiple items. The arbitrary max value size
defined here will be replaced in the future with the new global maximum
value size.
Signed-off-by: Zach Brown <zab@versity.com>
It's possible for the next segno to fall at the end of an allocation
region that doesn't have any bits set. The code shouldn't return -EIO
in that case, it should carry on to the next region.
Signed-off-by: Zach Brown <zab@versity.com>
The item reading limit was intended to minimize latency when we were
directly reading cached manifests. We're now asking the server to walk
the manifest for us and that's a lot more expensive than querying local
cached blocks.
Let's gulp in an entire segment's worth of items if we can. We'll have
plenty of opportunity to tune this down later.
Signed-off-by: Zach Brown <zab@versity.com>
When reading items into the cache we have the range of keys that were
missing from the cache. The item walk was stopping when it hit the end
of the missing cache range, not when it hit the end of the keys that
were covered by all the segments.
This would manifest as huge regions of missing items. The read would
walk off the relatively closed end of the highest level segment. It
would keep reading while there were items in the upper levels but all
those keys that would have been found in additional lower level segments
are missing. Eventually it'd hit the end of the higher level segment
and mark that region as cached.
With it fixed it now stops the read appropriately and will come around
next time to read the range that covers the next lowest level segment.
Signed-off-by: Zach Brown <zab@versity.com>
We read into the item cache when we hit a region that isn't cached.
Inode index items are created without cached range coverage. It's easy
to trigger read attempts that overlap with dirty inode index items.
insert_item() had a bug where its notion of overwriting only applied to
logical presence. It always let an insertion overwrite an existing item
if it was a deletion.
But that only makes sense for new item creation. Item cache population
can't do this. In this inode index case it can replace a correct dirty
inode index item with its old pre-deletion item from the read. This
clobbers the deletions and leaks the old inode index item versions.
So teach the item insertion for caching to never, ever, replace an
existing item. This fixes assertion failures from trying to
immediately walk meta seq items after creating a few thousand dirty
entries.
Signed-off-by: Zach Brown <zab@versity.com>
As we write ring blocks we need to update the first_seq to point at
the first live block in the ring.
The existing calculation gets it wrong and stores the seq of the first
block that we wrote in this commit, not the first ring block that
is still live and would need to be read.
Fix the calculation so that we set first_seq to the first live block
in the ring.
This fixes the bug where a mount can spin printing the super it's using.
This is the server constantly trying to start up, as each server start
fails because it can't read the ring.
Signed-off-by: Zach Brown <zab@versity.com>
As we write segments we're not limiting the number of segments they
intersect at the next level. Compactions are limited to a fanout's
worth of overlapping segments. This means that we can get a compaction
where the upper level segment overlaps more lower level segments than are
part of the compaction. In this case we can't write the remaining upper
level items at the lower level because then we could have a level with
segments whose keys intersect.
Instead we detect this compaction case. We call it sticky because after
merging with the lower level segments the remaining items in the upper
level need to stick to the upper level. The next time compaction comes
around it'll compact the remaining items with the additional lower
overlapping segments.
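Detecting the case boils down to a key comparison along these lines
(u64 keys and the names here are illustrative only):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  struct seg_range {
          uint64_t first;         /* first key in the segment */
          uint64_t last;          /* last key in the segment */
  };

  /*
   * If the upper level segment's keys extend past the last lower level
   * segment included in this compaction then the leftover upper items
   * can't be written to the lower level without creating intersecting
   * lower segments, so they stick to the upper level for now.
   */
  static bool compaction_is_sticky(const struct seg_range *upper,
                                   const struct seg_range *lower,
                                   size_t nr_lower)
  {
          return nr_lower > 0 && upper->last > lower[nr_lower - 1].last;
  }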
Signed-off-by: Zach Brown <zab@versity.com>
Add an LRU and shrinker to reclaim old cached items under memory
pressure. This is pretty awful today because of the separate cached
range structs and rbtree. We do our best to blow away enough of the
cache and range to try and make progress.
Signed-off-by: Zach Brown <zab@versity.com>
Some item tracing functions were really just tracing a key. Refactor it
into a trace class with event users. Later patches can then use the key
trace class.
Signed-off-by: Zach Brown <zab@versity.com>
To actually use it, we first have to copy symbols over from the dlm build
into the scoutfs source directory. Make that happen automatically for us in
the Makefile.
The only users of locking at the moment are mount, unmount and xattr
read/write. Adding more locking calls should be a straight-forward endeavor.
The LVB based server IP communication didn't work out, and LVBs as they are
written don't make sense in a range locking world. So instead, we record the
server IP address in the superblock. This is protected by the listen lock,
which also arbitrates which node will be the manifest server.
We take and drop the dlm lock on each lock/unlock call. Lock caching will
come in a future patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Using the new interval tree code we add a tree for each lock status list to
efficiently track ranged requests. Internally, most operations on a
resource's lock status list (granted, waiting, converting) are then turned
into operations within a given range.
There is no API change other than a new call, dlm_lock_range() and a new
structure, 'struct dlm_key' to define our range endpoints. Keys can have
arbitrary lengths and are compared via memcmp. A ranged blocking ast type is
defined so that users of dlm_lock_range() can know which range they are
blocking.
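Conceptually the endpoints are just length-prefixed byte strings
compared with memcmp, along the lines of this sketch (the exact field
names in the patch may differ):

  #include <string.h>

  /* conceptual sketch of a range endpoint */
  struct dlm_key {
          unsigned int len;
          const void *val;
  };

  /* memcmp ordering: shared prefix first, then the shorter key sorts first */
  static int dlm_key_cmp(const struct dlm_key *a, const struct dlm_key *b)
  {
          unsigned int len = a->len < b->len ? a->len : b->len;
          int cmp = memcmp(a->val, b->val, len);

          if (cmp)
                  return cmp;
          return a->len < b->len ? -1 : a->len > b->len ? 1 : 0;
  }

  /* two requests can only conflict if their [start, end] ranges overlap */
  static int ranges_overlap(const struct dlm_key *a_start,
                            const struct dlm_key *a_end,
                            const struct dlm_key *b_start,
                            const struct dlm_key *b_end)
  {
          return dlm_key_cmp(a_start, b_end) <= 0 &&
                 dlm_key_cmp(b_start, a_end) <= 0;
  }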
A rudimentary test, dlmtest.ko is included.
TODO:
- Update userspace entry points, need to add one for new lock call
- Manage backwards compatibility with network protocol
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Users pass in a comparison function which is used when endpoints need to be
checked against each other. We also put each ITTYPE local definition on its
own line to facilitate the use of pointers. An upcoming dlm patch will make
use of this to allow for keyed, ranged locking.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Also wire it into the build system. We have to figure out how to get scoutfs
pulling in the right headers but that can wait until we have something more
usable.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>