Commit Graph

323 Commits

Zach Brown
fc50072cf9 scoutfs: store manifest entries in the btree
Convert the manifest to store entries in persistent btree keys and
values instead of using the rbtree in memory from the ring.

The btree doesn't have a sort function.  It just compares variable
length keys.  The most complicated part of this transformation is
dealing with the fallout of this.  The compare function can't compare
search keys of a different form against item keys, so searches need to
construct full synthetic btree keys to search with.  It also can't
return richer comparison results, like overlapping, so the caller needs
to do a bit more work with plain key comparisons to find overlapping
segments.  And it can't compare differently depending on the level of
the manifest, so we store the manifest keys differently depending on
whether an entry is in level 0 or not.

All mount clients can now see the manifest blocks.  They can query the
manifest directly when trying to find segments to read.  We can get rid
of all the networking calls that were finding the segments for readers.

We change the manifest functions that relied on the ring to make their
changes to the manifest persistent.  We don't touch the allocator
or the rest of the manifest server, though, so this commit breaks the
world.  It'll be restored in future patches as we update the segment
allocator and server to work with the btree.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
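A toy sketch of the kind of synthetic search key construction described
above.  The struct layout, field names, and init_search_key() are
invented for illustration and are not the real scoutfs on-disk format:

    #include <linux/types.h>
    #include <linux/string.h>
    #include <asm/byteorder.h>

    /* Invented packed key: level 0 entries also carry a seq because their
     * key ranges can overlap, while higher levels are ordered by their
     * first item key alone. */
    struct manifest_btree_key {
            __u8 level;
            __be64 seq;                     /* only meaningful at level 0 */
            __u8 first_key[32];             /* item key, zero padded */
    } __attribute__((packed));

    /* Searches build a complete synthetic key because the btree compare
     * function can only compare two full keys of the same form. */
    static void init_search_key(struct manifest_btree_key *key, u8 level,
                                u64 seq, const void *first, size_t len)
    {
            memset(key, 0, sizeof(*key));
            key->level = level;
            if (level == 0)
                    key->seq = cpu_to_be64(seq);
            if (len > sizeof(key->first_key))
                    len = sizeof(key->first_key);
            memcpy(key->first_key, first, len);
    }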
Zach Brown
3eaabe81de scoutfs: add btree stored in persistent ring
Add a cow btree whose blocks are stored in a persistently allocated
ring.   This will let us incrementally index very large data sets
efficiently.

This is an adaptation of the previous btree code which now uses the
ring, stores variable length keys, and augments the items with bits
that are ORed up through parents.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Mark Fasheh
eb439ccc01 scoutfs: s/lck/lock/ lock.[ch]
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:18:58 -05:00
Mark Fasheh
136cbbed29 scoutfs: only release lockspace/workqueues in lock_destroy if they exist
Mount failure means these might be NULL.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:17:22 -05:00
Mark Fasheh
19f6f40fee scoutfs: get rid of held_locks construct
Now that we have a dlm, this is a needless redirection. Merge all fields
back into the lock_info struct.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:17:22 -05:00
Mark Fasheh
250e9d2701 scoutfs: remove unused function, can_complete_shutdown
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:17:22 -05:00
Mark Fasheh
e6f3b3ca8f scoutfs: add lock caching
We refcount our locks and hold them across system calls. If another node
wants access to a given lock we'll mark it as blocking in the bast and queue
a work item so that the lock can later be released. Otherwise locks are
freed under memory pressure, at unmount, or after a timer fires.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:15:11 -05:00
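A minimal sketch of the blocking-ast pattern described above, with
invented names (cached_lock, CL_BLOCKING, example_bast); the real
scoutfs lock caching code differs:

    #include <linux/workqueue.h>
    #include <linux/kref.h>
    #include <linux/bitops.h>

    struct cached_lock {
            struct kref refcount;           /* held across system calls */
            unsigned long flags;
            struct work_struct drop_work;   /* drops the dlm lock later */
    };
    #define CL_BLOCKING 0

    /* The blocking ast only marks the lock and queues work; the work item
     * releases the dlm lock once local holders have gone away. */
    static void example_bast(void *astarg, int mode)
    {
            struct cached_lock *cl = astarg;

            set_bit(CL_BLOCKING, &cl->flags);
            schedule_work(&cl->drop_work);
    }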
Zach Brown
76a73baefd scoutfs: don't lose items between segments
The refactoring of compaction to use the skip lists changed the nature
of item insertion.  Previously it would precisely count the number of
items to insert.  Now it discovers that the current output segment is
full by having _append_item() return false.

In this case the cursors still point to the item that would have been
inserted but failed.  The compact_items() caller loops around to
allocate the next segment.  Then it calls compact_items() again, which
mistakenly advances *past* the current item that still needed to be
inserted.

Hiding next_item() away from the segment loop made it hard to see this
mechanism.  Let's drop the compact_items() function and bring item
iteration and item appending into the main loop so we can more carefully
advance or not as we write and allocate new segments.

This stops losing items at segment boundaries.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
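A self-contained toy model of the loop shape this commit moves to, where
the cursor is only advanced after a successful append; the item and
segment types here are stand-ins, not the scoutfs structures:

    #include <stdio.h>
    #include <stdbool.h>

    #define SEG_CAP 4                       /* toy segment capacity */

    struct seg {
            int items[SEG_CAP];
            int nr;
    };

    static bool append_item(struct seg *seg, int item)
    {
            if (seg->nr == SEG_CAP)
                    return false;           /* segment full, caller retries */
            seg->items[seg->nr++] = item;
            return true;
    }

    static void compact(const int *input, int count)
    {
            struct seg seg = { .nr = 0 };
            int i = 0;

            while (i < count) {
                    if (!append_item(&seg, input[i])) {
                            printf("wrote segment with %d items\n", seg.nr);
                            seg.nr = 0;     /* "allocate" the next segment */
                            continue;       /* retry the same item */
                    }
                    i++;                    /* advance only after appending */
            }
            if (seg.nr)
                    printf("wrote segment with %d items\n", seg.nr);
    }

Feeding compact() more items than fit in one segment shows the item at
each boundary landing in the following segment instead of being skipped.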
Zach Brown
e7655f00ee scoutfs: read items from next segment in level
If the starting key for a segment read doesn't fall in a segment then we
have to include the next segment from that level in the read.  If we
don't then the read can think that there are no more items at that level
and assume that the items in the upper level are all that exist.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
f7701177d2 scoutfs: throttle addition of level 0 segments
Writers can add level 0 segments much faster (~20x) than compaction can
compact them down into the lower levels.  Without a limit on the number
of level 0 segments, item reading can try to read an extraordinary number
of level 0 segments and wedge the box with nonreclaimable page allocations.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
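One way the throttle could be shaped, as a hedged sketch with invented
names and an arbitrary limit; the real patch may use different
primitives:

    #include <linux/wait.h>
    #include <linux/atomic.h>

    #define LEVEL0_LIMIT 8                  /* arbitrary for illustration */

    static atomic_t level0_count = ATOMIC_INIT(0);
    static DECLARE_WAIT_QUEUE_HEAD(level0_wq);

    /* writers wait for compaction to drain level 0 before adding more */
    static int add_level0_segment(void)
    {
            int ret;

            ret = wait_event_interruptible(level0_wq,
                            atomic_read(&level0_count) < LEVEL0_LIMIT);
            if (ret)
                    return ret;
            atomic_inc(&level0_count);
            return 0;
    }

    /* compaction calls this after merging a level 0 segment away */
    static void level0_segment_removed(void)
    {
            atomic_dec(&level0_count);
            wake_up(&level0_wq);
    }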
Zach Brown
9f545782fb scoutfs: add missing segment put
Back when we changed the transaction commit to ask the server to update
the commit, we accidentally lost the put of the level 0 segment that was
just written.  This leaked refcount would pin segments over time and
eventually drag the box into crippling OOM.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
823a5bed34 scoutfs: add some segment cache life cycle tracing
Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
70c7178e6a scoutfs: index segment items with skip list
We want to be able to read a region of items from a segment by searching
for the key that starts the item.  In the first version of the segment
format we find a key by performing a binary search across an array of
offsets that point to the items.

Unfortunately the current format requires that we know the number of
items before we start writing.  With thousands of items per segment it's
a little bonkers to ask compaction to walk through all the items twice.

Worse still, we didn't want the item offset array entries to span pages,
so they're rounded up to a power of two once they hold seqs, offsets,
and lengths.  This makes them surprisingly large and sometimes they can
consume up to 60% (!) of a segment.

We know that we're inserting in sort order so it's very easy to build an
index as we insert.  Skip lists give us a nice simple way to ensure
O(log n) lookups with only an average of two links per node.

CPU use is greatly reduced by removing a full redundant item walk and we
now use up almost all of the space in segments.  There are still little
gaps at the ends of blocks as items still won't cross block boundaries.

Most of this change is safely mechanical.  The big difference is in how
the compaction loop is built.  It used to count the items beforehand.
It would never try to append when out of segments, and writing would stop
after the exact number of items.  Now it discovers it's out of items by
allocating and trying to append and finding that there's no more work to
do.  This required rethinking the loop exit, segment allocation, and
stopping conditions.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
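A toy userspace skip list showing the properties the commit relies on:
items arrive in sorted order so insertion just links at the tails, a
coin flip per level gives an average of two links per node, and search
is O(log n).  None of this is the actual segment format:

    #include <stdlib.h>
    #include <string.h>

    #define MAX_LEVEL 16

    struct node {
            const char *key;
            struct node *next[MAX_LEVEL];
    };

    static struct node head;                /* sentinel, items hang off next[] */
    static struct node *tail[MAX_LEVEL];

    static int random_height(void)
    {
            int h = 1;

            while (h < MAX_LEVEL && (rand() & 1))   /* p = 1/2: ~2 links/node */
                    h++;
            return h;
    }

    /* items are appended in sorted order, so we only link at the tails */
    static void append(struct node *n)
    {
            int h = random_height();
            int i;

            memset(n->next, 0, sizeof(n->next));
            for (i = 0; i < h; i++) {
                    struct node *prev = tail[i] ? tail[i] : &head;

                    prev->next[i] = n;
                    tail[i] = n;
            }
    }

    /* O(log n): drop down a level whenever the next key is too big */
    static struct node *search(const char *key)
    {
            struct node *n = &head;
            int i;

            for (i = MAX_LEVEL - 1; i >= 0; i--)
                    while (n->next[i] && strcmp(n->next[i]->key, key) < 0)
                            n = n->next[i];
            n = n->next[0];
            return (n && strcmp(n->key, key) == 0) ? n : NULL;
    }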
Zach Brown
463a696575 scoutfs: add value length limit
Add a relatively small universal value size limit.  This will be needed
by denser item packing to predict the worst case padding needed to avoid
full items crossing block boundaries.

We refactor the existing symlink and xattr item value limit to use this
new limit.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
1724bab8ea scoutfs: store large symlinks in multiple items
We're shrinking the max item value size so we need to store symlinks
with large target paths in multiple items.  The arbitrary max value size
defined here will be replaced in the future with the new global maximum
value size.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
8d59e6d071 scoutfs: fix alloc eio for free region
It's possible for the next segno to fall at the end of an allocation
region that doesn't have any bits set.  The code shouldn't return -EIO
in that case, it should carry on to the next region.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
793f84b86b scoutfs: remove item reading limit
The item reading limit was intended to minimize latency when we were
directly reading cached manifests.  We're now asking the server to walk
the manifest for us and that's a lot more expensive than querying local
cached blocks.

Let's gulp in an entire segment's worth of items if we can.  We'll have
plenty of opportunity to tune this down later.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
3b56161ed3 scoutfs: fix item read seg walk limit
When reading items into the cache we have the range of keys that were
missing from the cache.  The item walk was stopping when it hit the end
of the missing cache range, not when it hit the end of the keys that
were covered by all the segments.

This would manifest as huge regions of missing items.  The read would
walk off the relatively closed end of the highest level segment.  It
would keep reading while there were items in the upper levels but all
those keys that would have been found in additional lower level segments
are missing.  Eventually it'd hit the end of the higher level segment
and mark that region as cached.

With it fixed it now stops the read appropriately and will come around
next time to read the range that covers the next lowest level segment.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
71711c8b56 scoutfs: add manifest and item tracing
Add some tracing to get visibility into compaction and item reading.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
ef551776ae scoutfs: item cache reads shouldn't clobber dirty
We read into the item cache when we hit a region that isn't cached.
Inode index items are created without cached range coverage.   It's easy
to trigger read attempts that overlap with dirty inode index items.

insert_item() had a bug where its notion of overwriting only applied to
logical presence.  It always let an insertion overwrite an existing item
if it was a deletion.

But that only makes sense for new item creation.  Item cache population
can't do this.  In this inode index case it can replace a correct dirty
inode index item with its old pre-deletion item from the read.  This
clobbers the deletions and leaks the old inode index item versions.

So teach the item insertion for caching to never, ever, replace an
existing item.   This fixes assertion failures from trying to
immediately walk meta seq items after creating a few thousand dirty
entries.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
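The policy distinction boils down to something like the following
sketch; the enum and helper are invented to make the rule explicit:

    #include <stdbool.h>

    enum insert_mode {
            INSERT_CREATE,          /* new item creation in a transaction */
            INSERT_READ_POPULATE,   /* filling the cache from segments */
    };

    /* Creation may overwrite an existing deletion item, but read
     * population must never replace anything already cached: the cached
     * item can be dirty and newer than what was read from segments. */
    static bool may_replace(enum insert_mode mode, bool existing_is_deletion)
    {
            if (mode == INSERT_READ_POPULATE)
                    return false;
            return existing_is_deletion;
    }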
Zach Brown
bf7b3ac506 scoutfs: fix ring first_seq calculation
As we write ring blocks we need to update the first_seq to point at
the first live block in the ring.

The existing calculation gets it wrong and stores the seq of the first
block that we wrote in this commit, not the first ring block that
is still live and would need to be read.

Fix the calculation so that we set first_seq to the first live block
in the ring.

This fixes the bug where a mount can spin printing the super it's using.
This is the server constantly trying to start up, as each server start
fails because it can't read the ring.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
2eecbbe78a scoutfs: add item cache key ioctls
These ioctls let userspace see the items and ranges that are cached.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
5f5729b2a4 scoutfs: add sticky compaction
As we write segments we're not limiting the number of segments they
intersect at the next level.  Compactions are limited to a fanout's
worth of overlapping segments.  This means that we can get a compaction
where the upper level segment overlaps more segments than are included
in the compaction.  In this case we can't write the remaining upper
level items at the lower level because now we can have a level with
segments whose keys intersect.

Instead we detect this compaction case.  We call it sticky because after
merging with the lower level segments the remaining items in the upper
level need to stick to the upper level.  The next time compaction comes
around it'll compact the remaining items with the additional lower
overlapping segments.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
d52f09449d scoutfs: reclaim item cache
Add an LRU and shrinker to reclaim old cached items under memory
pressure.  This is pretty awful today because of the separate cached
range structs and rbtree.  We do our best to blow away enough of the
cache and range to try and make progress.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
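A rough sketch of an item LRU wired to a shrinker.  The names are
invented, the separate cached range structs are ignored, and the
count/scan shrinker API is assumed, which may not match the kernel this
was written against:

    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/shrinker.h>

    struct cached_item {
            struct list_head lru_entry;
    };

    static LIST_HEAD(item_lru);
    static DEFINE_SPINLOCK(item_lru_lock);
    static unsigned long item_lru_count;

    static unsigned long item_cache_count(struct shrinker *shrink,
                                          struct shrink_control *sc)
    {
            return item_lru_count;
    }

    static unsigned long item_cache_scan(struct shrinker *shrink,
                                         struct shrink_control *sc)
    {
            unsigned long freed = 0;

            spin_lock(&item_lru_lock);
            while (freed < sc->nr_to_scan && !list_empty(&item_lru)) {
                    struct cached_item *item;

                    item = list_entry(item_lru.prev, struct cached_item,
                                      lru_entry);
                    list_del_init(&item->lru_entry);
                    item_lru_count--;
                    freed++;
                    /* the real code also drops the item from its rbtree
                     * and trims the covering cached range */
            }
            spin_unlock(&item_lru_lock);

            return freed;
    }

    static struct shrinker item_shrinker = {
            .count_objects  = item_cache_count,
            .scan_objects   = item_cache_scan,
            .seeks          = DEFAULT_SEEKS,
    };

Registering item_shrinker at mount and unregistering it at unmount would
wire the cache into memory pressure.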
Zach Brown
94e78414f9 scoutfs: add key trace class
Some item tracing functions were really just tracing a key.  Refactor
them into a trace class with event users.  Later patches can then use the key
trace class.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Mark Fasheh
e711c15acf scoutfs: use dlm for locking
To actually use it, we first have to copy symbols over from the dlm build
into the scoutfs source directory. Make that happen automatically for us in
the Makefile.

The only users of locking at the moment are mount, unmount and xattr
read/write. Adding more locking calls should be a straightforward endeavor.

The LVB based server IP communication didn't work out, and LVBs as they are
written don't make sense in a range locking world. So instead, we record the
server IP address in the superblock. This is protected by the listen lock,
which also arbitrates which node will be the manifest server.

We take and drop the dlm lock on each lock/unlock call. Lock caching will
come in a future patch.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-06-23 15:08:02 -05:00
Mark Fasheh
08bf1fea79 dlm: Give fs/dlm the notion of ranges
Using the new interval tree code we add a tree for each lock status list to
efficiently track ranged requests. Internally, most operations on a
resource's lock status list (granted, waiting, converting) are then turned
into operations within a given range.

There is no API change other than a new call, dlm_lock_range() and a new
structure, 'struct dlm_key' to define our range endpoints. Keys can have
arbitrary lengths and are compared via memcmp. A ranged blocking ast type is
defined so that users of dlm_lock_range() can know which range they are
blocking.

A rudimentary test, dlmtest.ko, is included.

TODO:
 - Update userspace entry points, need to add one for new lock call
 - Manage backwards compatibility with network protocol

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-06-23 15:07:10 -05:00
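A sketch of what memcmp-ordered, variable-length range endpoints look
like and how overlap would be tested.  The struct here is deliberately
named dlm_key_example; it is not the actual 'struct dlm_key' definition
from the patch:

    #include <string.h>

    struct dlm_key_example {
            unsigned int len;
            unsigned char val[32];
    };

    /* memcmp over the shared prefix, shorter key sorts first on a tie */
    static int key_cmp(const struct dlm_key_example *a,
                       const struct dlm_key_example *b)
    {
            unsigned int len = a->len < b->len ? a->len : b->len;
            int cmp = memcmp(a->val, b->val, len);

            if (cmp)
                    return cmp;
            return (a->len > b->len) - (a->len < b->len);
    }

    /* two lock requests conflict only if their [start, end] ranges overlap */
    static int ranges_overlap(const struct dlm_key_example *a_start,
                              const struct dlm_key_example *a_end,
                              const struct dlm_key_example *b_start,
                              const struct dlm_key_example *b_end)
    {
            return key_cmp(a_start, b_end) <= 0 && key_cmp(b_start, a_end) <= 0;
    }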
Mark Fasheh
0c1c2691e0 interval-tree: Allow user defined objects as endpoints
Users pass in a comparison function which is used when endpoints need to be
checked against each other. We also put each ITTYPE local definition on its
own line to facilitate the use of pointers. An upcoming dlm patch will make
use of this to allow for keyed, ranged locking.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-06-08 18:10:40 -05:00
Mark Fasheh
dfc220ad6f Import fs/dlm/* from linux-3.10.0-327.36.1.el7
Also wire it into the build system. We have to figure out how to get scoutfs
pulling in the right headers but that can wait until we have something more
usable.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-06-08 18:10:40 -05:00
Zach Brown
85cbe7dc97 scoutfs: add a counter add macro to match inc
Just a quick wrapper around the related percpu_counter call.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-07 08:52:11 -07:00
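The wrapper is presumably along these lines; the counters struct and
macro names below are made up, and only percpu_counter_add() itself is
the real kernel call:

    #include <linux/percpu_counter.h>

    struct example_counters {
            struct percpu_counter dirty_item_bytes;
            /* ... one percpu_counter per counted stat ... */
    };

    /* add an arbitrary amount to a named counter */
    #define example_add_counter(counters, which, cnt) \
            percpu_counter_add(&(counters)->which, cnt)

    /* the existing inc macro becomes a special case of add */
    #define example_inc_counter(counters, which) \
            example_add_counter(counters, which, 1)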
Zach Brown
0280971fab scoutfs: add bug on for out of order seg items
We've seen some cases where compaction writes a new segment that
contains items that aren't sorted.  This eventually leads to read being
misled in its binary search of the items in a segment and failing to
find the items it was looking for.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 15:45:15 -07:00
Zach Brown
a1dadd9763 scoutfs: lock around dirty item writing
Writing dirty items into a segment wasn't protected by locking.  It's
not racing with item dirtying, but it's absolutely racing with reads
while modifying the rbtree.  And shrinking will be modifying the item
cache at any old time in the future.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 15:35:05 -07:00
Zach Brown
1652512af7 scoutfs: remove ancient dirty item comment
This is just old and wrong.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 15:34:42 -07:00
Zach Brown
43e9d2caa2 scoutfs: trace compaction manifest entries
Trace the manifest entries compaction received from the server.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 15:31:47 -07:00
Zach Brown
b5ee282f6b scoutfs: minor manifest ring comparison tracing
It was nice to watch the ring compare nodes so leave behind the trace
and clean up the callers so that uninitialized keys are cleanly null.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 15:30:21 -07:00
Zach Brown
54d286d91c scoutfs: format strings for all key types
Add all the missing key types to scoutfs_key_str() so that we can get
traces and printks of all key types.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 15:01:37 -07:00
Zach Brown
1485b02554 scoutfs: add SK_ helpers for printing keys
Add some percpu string buffers so that we can pass formatted strings as
arguments when printing keys.  The percpu struct uses a different buffer
for each argument.  We wrap the whole print call in a wrapper that
disables and enables preemption.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 14:48:09 -07:00
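A sketch of the per-cpu buffer pattern this describes, with invented
names and a stand-in formatter.  Because the arguments are evaluated
inside the wrapper, the buffers are only touched while preemption is
disabled:

    #include <linux/percpu.h>
    #include <linux/preempt.h>
    #include <linux/kernel.h>

    #define SK_BUF_LEN 80
    #define SK_NR_ARGS 4

    struct sk_bufs {
            char buf[SK_NR_ARGS][SK_BUF_LEN];
    };
    static DEFINE_PER_CPU(struct sk_bufs, sk_bufs);

    /* format a key into this cpu's buffer for argument slot 'slot' */
    static char *example_key_str(const void *key, int slot)
    {
            char *buf = this_cpu_ptr(&sk_bufs)->buf[slot];

            snprintf(buf, SK_BUF_LEN, "key %p", key);   /* stand-in format */
            return buf;
    }

    /* wrap the whole print so the percpu buffers can't be preempted away */
    #define SK_PRINTK(fmt, args...)         \
    do {                                    \
            preempt_disable();              \
            printk(fmt, ##args);            \
            preempt_enable();               \
    } while (0)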
Zach Brown
a050152254 scoutfs: fix ring next/prev walk comparison test
The scoutfs_ring_next() and _prev() functions had a really dumb bug
where they checked the sign of comparisons by comparing with 1.  For
example, next would miss that the walk traversed a lesser item
and wouldn't return the next item.

This was causing compaction to miss underlying segments, creating
segments in levels that had overlapping keys, which then totally
confused reading and kept it from finding the items it was looking for.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 14:32:29 -07:00
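The bug class in miniature, as a hedged userspace example rather than
the actual ring code: comparators return any negative, zero, or positive
value, so testing for equality with 1 silently misses most greater-than
results:

    #include <string.h>

    static int compare_keys(const char *a, const char *b)
    {
            return strcmp(a, b);    /* may return 2, 37, -5, ... not just +/-1 */
    }

    static int next_is_greater_buggy(const char *a, const char *b)
    {
            return compare_keys(a, b) == 1;         /* WRONG */
    }

    static int next_is_greater_fixed(const char *a, const char *b)
    {
            return compare_keys(a, b) > 0;          /* only the sign matters */
    }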
Zach Brown
79de18443b scoutfs: don't extend key in dec_cur_len
A copy and paste bug had us extending the length of keys that were
decremented at their previous length.  The whole point of the _cur_len
functions is that they don't have to extend the key buf out to full
precision.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 14:30:43 -07:00
Zach Brown
2bd698b604 scoutfs: set NODELAY and REUSEADDR on net sockets
Add a helper that creates a socket, sets nodelay on all sockets, and
sets reuseaddr on listening sockets.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 14:29:05 -07:00
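A sketch of such a helper against the older in-kernel socket interfaces
(sock_create_kern() without a namespace argument, kernel_setsockopt());
error handling is trimmed and the function name is invented:

    #include <linux/types.h>
    #include <linux/net.h>
    #include <net/sock.h>
    #include <linux/tcp.h>
    #include <linux/in.h>

    static int example_create_sock(struct socket **sockp, bool listener)
    {
            struct socket *sock;
            int one = 1;
            int ret;

            ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
            if (ret)
                    return ret;

            /* nodelay on every socket, reuseaddr only when listening */
            kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY,
                              (char *)&one, sizeof(one));
            if (listener)
                    kernel_setsockopt(sock, SOL_SOCKET, SO_REUSEADDR,
                                      (char *)&one, sizeof(one));

            *sockp = sock;
            return 0;
    }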
Zach Brown
c84250b8c6 scoutfs: add item_set_batch trace point
Restore the item_set_batch trace point by changing the current
insert_batch tracepoint to a class and defining insert and set as class
trace points.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-02 09:20:02 -07:00
Zach Brown
a2ef5ecb33 scoutfs: remove item_forget
It's pretty dangerous to forcefully remove items without writing
deletion items to LSM segments.  This was only used for magical
ephemeral items when we were having them store file data.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:59:24 -07:00
Zach Brown
1f933016f0 scoutfs: remove ephemeral items
Ephemeral items were only used by the page cache which tracked page
contents in items whose values pointed to the pages.  Remove their
special case.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:58:09 -07:00
Zach Brown
b7bbad1fba scoutfs: add precise transaction item reservations
We had a simple mechanism for ensuring that a transaction didn't create
more items than would fit in a single written segment.  We calculated
the most dirty items that a holder could generate and assumed that all
holders dirtied that much.

This had two big problems.

The first was that it wasn't accounting for nested holds.
write_begin/end calls the generic inode dirtying path while holding a
transaction.  This ended up deadlocking as the dirty inode waited to be
able to write while its trans hold, taken back in write_begin, prevented
writeout.

The second was that the worst case (full size xattr) item dirtying is
enormous and meaningfully restricts concurrent transaction holders.
With no currently dirty items you can have fewer than 16 full size xattr
writes.  This concurrency limit only gets worse as the transaction fills
up with dirty items.

This fixes those problems.  It adds precise accounting of the dirty
items that can be created while a transaction is held.  These
reservations are tracked in journal_info so that they can be used by
nested holds.  The precision allows much greater concurrency as
something like a create will try to reserve a few hundred bytes instead
of 64k.  Normal sized xattr operations won't try to reserve the largest
possible space.

We add some feedback from the item cache to the transaction to issue
warnings if a holder dirties more items than it reserved.

Now that we have precise item/key/value counts (segment space
consumption is a function of all three :/) we can't have a single atomic
track transaction holders.  We add a long-overdue trans_info and put a
proper lock and fields there and much more clearly track transaction
serialization amongst the holders and writer.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:15:13 -07:00
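A simplified sketch of reservation tracking through journal_info so that
nested holds ride on the outer reservation; the struct and function
names are invented and the actual accounting of item/key/value space is
elided:

    #include <linux/sched.h>
    #include <linux/slab.h>
    #include <linux/errno.h>

    struct trans_reservation {
            int holders;            /* nesting depth on this task */
            unsigned int items;     /* precise reserved item count */
            unsigned int keys;      /* ... key bytes */
            unsigned int vals;      /* ... value bytes */
    };

    static int example_hold_trans(unsigned int items, unsigned int keys,
                                  unsigned int vals)
    {
            struct trans_reservation *res = current->journal_info;

            if (res) {
                    /* nested hold: covered by the outer reservation */
                    res->holders++;
                    return 0;
            }

            res = kzalloc(sizeof(*res), GFP_NOFS);
            if (!res)
                    return -ENOMEM;

            /* the real code waits here until the segment has room */
            res->holders = 1;
            res->items = items;
            res->keys = keys;
            res->vals = vals;
            current->journal_info = res;
            return 0;
    }

    static void example_release_trans(void)
    {
            struct trans_reservation *res = current->journal_info;

            if (--res->holders == 0) {
                    current->journal_info = NULL;
                    kfree(res);
            }
    }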
Zach Brown
297b859577 scoutfs: deletion items maintain counts
When we turned existing items into deletion items we'd remove their
values.  But we didn't update the count of dirty values to reflect that
removal so the dirty value count would slowly grow without bound.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:12:30 -07:00
Zach Brown
5f11cdbfe5 scoutfs: add and index inode meta and data seqs
For each transaction we send a message to the server asking for a
unique sequence number to associate with the transaction.  When we
change metadata or data of an inode we store the current transaction seq
in the inode and we index it with index items like the other inode
fields.

The server remembers the sequences it gives out.  When we go to walk the
inode sequence indexes we ask the server for the largest stable seq and
limit results to that seq.  This ensures that we never return seqs that
are past dirty items, so we never have inodes and seqs appear in the past.

Nodes use the sync timer to regularly cycle through seqs and ensure that
inode seq index walks don't get stuck on their otherwise idle seq.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:12:24 -07:00
Zach Brown
b291818448 scoutfs: add sync deadline timer
Make sure that data is regularly synced.  We switch to a delayed work
struct that is always queued with the sync deadline.  If we need an
immediate sync we mod it to now.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-19 11:19:56 -07:00
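The delayed work pattern described above looks roughly like this; the
names and the deadline value are placeholders:

    #include <linux/workqueue.h>
    #include <linux/jiffies.h>

    #define SYNC_DEADLINE (10 * HZ)         /* placeholder deadline */

    static void example_sync_func(struct work_struct *work);
    static DECLARE_DELAYED_WORK(example_sync_dwork, example_sync_func);

    static void example_sync_func(struct work_struct *work)
    {
            /* ... write out the dirty transaction ... */

            /* always re-arm so data is synced within the deadline */
            schedule_delayed_work(&example_sync_dwork, SYNC_DEADLINE);
    }

    /* callers needing an immediate sync pull the deadline in to "now" */
    static void example_sync_now(void)
    {
            mod_delayed_work(system_wq, &example_sync_dwork, 0);
    }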
Zach Brown
373def02f0 scoutfs: remove trade_time message
This was mostly just a demonstration for how to add messages.  We're
about to add a message that we always send on mount so this becomes
completely redundant.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-18 10:52:04 -07:00
Zach Brown
8ea414ac68 scoutfs: clear seg rb node after replacing
When inserting a newly allocated segment we might find an existing
cached stale segment.  We replace it in the cache so that its user can
keep using its stale contents while we work on the new segment.

Replacing doesn't clear the rb_node, though, so we trip over a warning
when we finally free the segment and it looks like it's still present in
the rb tree.

Clear the node after we replace it so that freeing sees a clear node and
doesn't issue a warning.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 14:51:36 -07:00
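The fix amounts to the pattern below, using the stock rbtree helpers;
the function wrapping them is invented:

    #include <linux/rbtree.h>

    /* after swapping a stale segment's node out of the tree, clear it so
     * the eventual free doesn't look like it's freeing a linked node */
    static void replace_stale_node(struct rb_node *stale, struct rb_node *new,
                                   struct rb_root *root)
    {
            rb_replace_node(stale, new, root);
            RB_CLEAR_NODE(stale);
    }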
Zach Brown
5307c56954 scoutfs: add a stat_more ioctl
We have inode fields that we want to return to userspace with very low
overhead.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 14:28:10 -07:00