Commit Graph

214 Commits

Author SHA1 Message Date
Zach Brown
9a293bfa75 Add item delete dirty and many interfaces
Add item functions for deleting items that we know to be dirty and add a
user of them in another function that deletes many items without leaving
partial deletions behind in the case of errors.
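
A hedged sketch of the two-phase shape this implies, with hypothetical
stand-in names rather than the real scoutfs interfaces: dirty every item
first, since that is the only step that can fail, then perform the
deletions, which can't fail and so can't be left half done.

  struct demo_key;                              /* stand-in key type */
  int item_dirty(struct demo_key *key);         /* may fail (hypothetical) */
  void item_delete_dirty(struct demo_key *key); /* can't fail (hypothetical) */

  static int delete_many(struct demo_key **keys, unsigned int nr)
  {
          unsigned int i;
          int ret;

          /* phase 1: dirty every item; nothing is deleted on failure */
          for (i = 0; i < nr; i++) {
                  ret = item_dirty(keys[i]);
                  if (ret)
                          return ret;
          }

          /* phase 2: deleting dirty items can't fail, so no partial state */
          for (i = 0; i < nr; i++)
                  item_delete_dirty(keys[i]);

          return 0;
  }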

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
f139cf4a5e Convert unlink and orphan processing
Restore unlink functionality by converting unlink and orphan item
processing from the old btree interface to the new item cache interface.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9d6d70bd89 Add an item next for key len ignoring val
Add scoutfs_item_next_same() which requires that the key lengths be
identical but which allows any values, including no value by way of a
null kvec.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9d68e272cc Allow creation of items with no value
Item creation always tried to allocate a value.  We have some item types
which don't have values.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
8f63196318 Add key inc/dec variants for partial keys
Some callers know that it's safe to increment their partial keys.  Let
them do so without trying to expand the keys to full precision and
triggering warnings that their buffers aren't large enough.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
2ac239a4cb Add deletion items
So far we were only able to add items to the segments.  To support
deletion we have to insert deletion items and then remove them and the
item they reference when their segments are compacted.

As callers attempt to delete items from the item cache we replace the
existing item with a deletion marker with the key but no value.  Now
that there are deletion items in the cache we have to teach the other
item cache operations to skip them.  There's some noise in the patch
from moving functions around so that item insertion can free a deletion
item it finds.

The deletion items are written out to the segment as usual except now
the in-segment item struct has a flag to mark a deletion item and the
deletion item is removed from the cache once it's written to the segment.

Item reading knows to skip deletion items and not add them back into
the cache.

Compaction proceeds as usual for most of the levels with the deletion
item clobbering any older higher level items with the same key.
Eventually the deletion item itself is removed by skipping over it when
compacting to the largest final level.  We support this by adding a
little call that describes the max level of the tree at the time the
compaction starts so that compaction can tell when it should skip
copying the deletion item to the final lower level.

All of this is for deletion of items with a precise key.  In the future
we'll expand the deletion items so that they can reference a contiguous
range of keys.
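
A rough sketch of that rule with made-up names (the real flag and
compaction context differ): a deletion item is only dropped once the
compaction is writing into the tree's last level, because only then is
there nothing older beneath it left to clobber.

  #define DEMO_FLAG_DELETION 0x1        /* hypothetical in-segment item flag */

  /* should compaction skip copying this item into its output segments? */
  static int drop_deletion_item(unsigned int item_flags,
                                int lower_level, int last_level)
  {
          return (item_flags & DEMO_FLAG_DELETION) &&
                 lower_level == last_level;
  }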

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
cfc6d72263 Remove item off and len packing
The key and value offsets and lengths were aggressively packed into the
item structs in the segments.  This saved a few bytes per item but
didn't leave any room for expansion without growing the item.  We
want to add a deletion item flag so let's just grow the item struct.  It
now has room for full precision offsets and lengths that we can access
natively so we can get rid of the packing and unpacking functions.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
685eb1f2dc Fix segment block item alignment build bug
The BUILD_BUG_ON() to test that the start of the items in the segment
header is naturally aligned had a typo that masked the length instead of
checking the remainder of division by the length.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
736d5765fc Add a shrinker for the segment cache
After segments have finished IO and while they're in the rbtree we track
them with an LRU.  Under memory pressure we can remove the oldest
segments from the rbtree and free them.
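
A loose sketch of the eviction path with hypothetical structures around
the usual kernel list and rbtree helpers; the real code also has to
worry about references and IO state.

  struct seg_cache {
          spinlock_t lock;
          struct rb_root root;            /* cached segments by segno */
          struct list_head lru_list;      /* oldest unused segments first */
  };

  struct cached_seg {
          struct rb_node rb_node;
          struct list_head lru_entry;
  };

  void seg_free(struct cached_seg *seg);  /* hypothetical */

  static unsigned long seg_shrink(struct seg_cache *cac, unsigned long nr)
  {
          unsigned long freed = 0;

          spin_lock(&cac->lock);
          while (freed < nr && !list_empty(&cac->lru_list)) {
                  struct cached_seg *seg =
                          list_first_entry(&cac->lru_list,
                                           struct cached_seg, lru_entry);

                  list_del_init(&seg->lru_entry);
                  rb_erase(&seg->rb_node, &cac->root);
                  seg_free(seg);
                  freed++;
          }
          spin_unlock(&cac->lock);

          return freed;
  }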

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
2bc1617280 Use contiguous key struct instead of kvecs
Using kvecs for keys seemed like a good idea because there were a few
uses that had keys in fragmented memory: dirent keys made up of an
on-stack struct and the file name in the dentry, and keys straddling the
pages that make up a cached segment.

But it hasn't worked out very well.  The code to perform ops on keys
by iterating over vectors is pretty fiddly.  And the raw kvecs only
describe the actively referenced key, they know nothing about the total
size of the buffer that the key resides in.  Some ops can't check that
they're not clobbering things, they're relying on callers not to mess
up.

And critically, the kvec iteration's become a bottleneck.  It turns out
that comparing keys is a very hot path in the item cache.  All the code
to initialize and iterate over two key vectors adds up when each high
level fs operation is a few tree descents and each tree descent is a
bunch of compares.

So let's back off and have a specific struct for tracking keys that are
stored in contiguous memory regions.  Users ensure that keys are
contiguous.  The code ends up being a lot clearer, code now can see how
big the full key buffer is, and the rbtree node comparison fast path is
now just a memcmp.
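
A minimal sketch of that shape with illustrative names rather than the
exact scoutfs definitions: the key bytes, the length in use, and the
total buffer size live together, and the hot compare is a memcmp with
the shorter key sorting first on a tie.

  #include <string.h>

  struct demo_key_buf {
          void *data;              /* contiguous key bytes */
          unsigned int key_len;    /* bytes used by the current key */
          unsigned int buf_len;    /* total buffer size, for bounds checks */
  };

  static int demo_key_cmp(const struct demo_key_buf *a,
                          const struct demo_key_buf *b)
  {
          unsigned int len = a->key_len < b->key_len ? a->key_len : b->key_len;
          int cmp = memcmp(a->data, b->data, len);

          if (cmp)
                  return cmp;
          return a->key_len < b->key_len ? -1 :
                 a->key_len > b->key_len ?  1 : 0;
  }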

Almost all of the changes in the patch are mechanical semantic changes
involving types, function names, args, and occasionally slightly
different return conventions.

A slightly more involved change is that now dirent key users have to
manage an allocated contiguous key with a copy of the path from the
dentry.

Item reading is now a little more clever about calculating the greatest
range it can cache by initially walking all the segments instead of
trying to do it as it runs out of items in each segment.

The largest meaningful change is that now keys can't straddle page
boundaries in memory which means they can't cross block boundaries in
the segment.  We align key offsets to the next block as we write keys to
segments that would have straddled a block.

We then also have to account for that padding when building segments.
We add a helper that calculates whether a given number of items will fit
in a segment, which is used by item dirtying, segment writing, and
compaction.

I left the tracepoint formatting for another patch.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
8a302609f2 Add some item cache/range counters
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
822ce205c5 Let compaction skip segments as needed
Previously the only clever compaction avoidance we'd try was in the
manifest walk.  If we found that there were no overlapping segments in
the next level we'd just move the entry down a level and skip compaction
entirely.

But that's just one specific instance of the general case: the lower and
upper segments might not overlap with each other at all.  There can be
many lower level segments that intersect with the full range of keys in
the upper level segment but which don't actually intersect with any
items in the upper segment.

So we refactor the compaction to notice this case.  We get the first and
last keys and use them to decide whether to skip each segment as we first
start to iterate through it.
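
A hedged sketch of the test, reusing the illustrative key type and
compare from the earlier sketch: a candidate segment can be skipped when
its key range and the upper segment's item range don't intersect at all.

  static int seg_can_skip(const struct demo_key_buf *upper_first,
                          const struct demo_key_buf *upper_last,
                          const struct demo_key_buf *seg_first,
                          const struct demo_key_buf *seg_last)
  {
          /* disjoint when one range ends before the other begins */
          return demo_key_cmp(seg_last, upper_first) < 0 ||
                 demo_key_cmp(seg_first, upper_last) > 0;
  }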

We don't want to read segments that we never actually have to copy items
from so we read each segment on demand instead of concurrently as the
compaction starts.  This means that item iteration can now have to read
a segment and can now return errors.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
519b9c35c4 Correctly wrap when finding compaction entries
Compaction looks for the next entry at a given level to compact.  It
only tested for not finding a next entry when it needs to wrap the key
and start over in the level; it missed the case where the next entry is
at a greater level.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
930f541c7b Add a scoutfs_seg_get
Compaction is going to want to get additional references on a segment.
It could just "read" it again while holding a reference but this is more
clear.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
3407576ced Don't use bio size in end_io
Some drives don't set bi_size so just track the number of IOs.  (And the
size argument to end_io has been removed in recent kernels.)
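
A rough sketch of the counting approach with hypothetical names (and an
end_io signature that varies across kernel versions): completion fires
when the last submitted bio finishes, without ever trusting bi_size.

  struct demo_seg {
          atomic_t bios_in_flight;        /* set when the bios are submitted */
          int io_err;
  };

  void demo_seg_io_done(struct demo_seg *seg);    /* hypothetical */

  static void demo_end_io(struct bio *bio, int err)
  {
          struct demo_seg *seg = bio->bi_private;

          if (err)
                  seg->io_err = err;      /* remember that an error happened */
          if (atomic_dec_and_test(&seg->bios_in_flight))
                  demo_seg_io_done(seg);
          bio_put(bio);
  }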

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
963b04701f Add some bio tracing
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
0a5fb7fd83 Add some counters
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
ded184b481 Add a pile of tracing printks
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
3f812fa9a7 More thoroughly integrate compaction
The first pass at compaction just kicked a thread any time we added a
segment that brought its level's count over the limit.  Tasks could
create dirty items and write level0 segments regardless of the progress
of compaction.

This ties the writing rate to compaction.  Writers have to wait to hold
a transaction until the dirty item count is under a segment's worth and
there are no level0 segments.  Usually more level0 segments would be
allowed, but we're aggressively pushing compaction; we'll relax this later.

This also more forcefully ensures that compaction makes forward
progress.  We kick the compaction thread if we exceed the level count,
wait for level0 to drain, or successfully complete a compaction.  We
tweak scoutfs_manifest_next_compact() to return 0 if there's no
compaction work to do so that the compaction thread can exit without
triggering another.

For clarity we also kick off a sync after compaction so that we don't
sit around with a dirty manifest until the next sync.  This may not be
wise.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
aad5a34290 Don't prematurely write dirty super
A previous refactoring messed up and had scoutfs_trans_write_func()
always write the dirty super even when nothing was dirty and there was
nothing for the sync attempt to do.  This was very confusing and made it
look like the segment and treap writes were being lost when in fact it
was the super write that shouldn't have happened.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
a333c507fb Fix how dirty treap is tracked
The transaction writing thread tests if the manifest and alloc treaps
are dirty.  It did this by testing if there were any dirty nodes in the
treap.

But this misses the case where the treap has been modified and all nodes
have been removed.  In that case the root references no dirty nodes but
needs to be written.

Instead let's specifically mark the treap dirty when it's modified.
From then on sync will always try to write it out.  We also integrate
updating the persistent root as part of writing the dirty nodes to the
persistent ring.  It's required and every caller did it so it was silly
to make it a separate step.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
30b088377f Fix setting trans_task
Some recent refactoring accidentally set the trans task to null instead
of the current task.  It's not used but until it's removed it should be
correct.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
c21dc4ec20 Refactor level_count and protect with seqcount
We were manually manipulating the level counts in the super in a bunch
of places under the manifest rwsem.  This refactors them into simple get
and add functions.  We protect them with a seqcount so that we'll be
able to read them without blocking (from trans hold attempts).  We also add
a helper for testing that a level is full because we already used
different comparisons in two call sites.
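
A hedged sketch of the read and write sides using the kernel seqcount
API, with hypothetical struct and field names standing in for the real
manifest:

  struct demo_manifest {
          seqcount_t seqcount;
          u64 level_counts[16];           /* hypothetical level limit */
  };

  static u64 demo_level_count_get(struct demo_manifest *mani, int level)
  {
          unsigned int seq;
          u64 count;

          do {
                  seq = read_seqcount_begin(&mani->seqcount);
                  count = mani->level_counts[level];
          } while (read_seqcount_retry(&mani->seqcount, seq));

          return count;
  }

  /* writers are already serialized by the manifest rwsem */
  static void demo_level_count_add(struct demo_manifest *mani, int level,
                                   s64 delta)
  {
          write_seqcount_begin(&mani->seqcount);
          mani->level_counts[level] += delta;
          write_seqcount_end(&mani->seqcount);
  }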

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
2083793ae0 Add first pass at segment compaction
This is the first draft of compaction which has the core mechanics.

Add segment functions to free a segment's segno and to delete the entry
that refers to the given segment.

Add manifest functions that lock the manifest and dirty and delete
manifest entries.  These are used by the compaction thread to atomically
modify the manifest with the result of a compaction.

Sort the level 0 entries in the manifest by their sequence.  This lets
compaction use the oldest entry first, and reading can walk them
backwards to get them in order without having to sort.  We also more
carefully use the sequence field in the manifest search key to
differentiate between finding high level entries that overlap and
finding specific entries identified by their seq.

Add some fields to the per-super compact_info struct which support
compaction.  We need to know the limit on the number of segments per
level and we record keys per level which tell us which segment to use
next time that level is compacted.

We kick a compaction thread when we add a manifest entry and that brings
the level count over the limit.

scoutfs_manifest_next_compact() is the first meaty function.  The
compaction thread uses this to get all the segments involved in a
compaction.  It does a quick manifest update if the next manifest
candidate doesn't overlap with any segments in the next level.

The compaction operation itself is a pretty straightforward
read-modify-write operation.  It asks the manifest to give it references
to the segments it'll need, reads them in, iterates over them to count
and copies items in order to output segments, and atomically updates the
manifest.
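
A small user-space sketch of the central copy loop under those
assumptions, with purely illustrative types: two key-sorted item
streams are merged into the output, and when keys collide the newer
upper item wins and the older lower item is dropped.

  #include <string.h>

  struct demo_item {
          const char *key;
          struct demo_item *next;         /* items in a segment are key sorted */
  };

  static int merge_items(struct demo_item *upper, struct demo_item *lower,
                         struct demo_item **out, int out_max)
  {
          int nr = 0;

          while ((upper || lower) && nr < out_max) {
                  int cmp = !lower ? -1 : !upper ? 1 :
                            strcmp(upper->key, lower->key);

                  if (cmp <= 0) {
                          out[nr++] = upper;              /* newer wins ties */
                          if (cmp == 0)
                                  lower = lower->next;    /* drop clobbered item */
                          upper = upper->next;
                  } else {
                          out[nr++] = lower;
                          lower = lower->next;
                  }
          }

          return nr;
  }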

Now that the manifest can be dirty without any dirty segments we need to
fix the transaction writing function's assumption that everything flows
from dirty segments.  It also has to now lock and unlock the manifest as
it adds the entry for its level 0 segment.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
a45661e5b6 Add _prev version of treap lookup and iteration
_lookup() and _lookup_next() each had nearly identical loops that took a
dirty boolean.  We combine them into one walker with flags for dirty and
next and add prev as well, giving us all the exported functions
with combinations of the flags.

We also add _last() to match _first() and _prev() to match _next().

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
2522509ec8 Fix scoutfs_treap_next() parent walk comparison
While walking up parents looking for the next node we were comparing the
child with the wrong parent pointer.  This is easily verified by
glancing at rb_next() :).

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
80da7fefa7 fix treap deletion
Treap deletion was pretty messed up.  It forgot to reset parent and ref
for the swapped node before using them to finally delete.  And it didn't
get all the weird cases right where the child node to swap is the direct
child of the node.  In that case we can't just swap the parent pointers
and node pointers, they need to be special cased.

So nuts to all that.  We'll just rotate the node down until it doesn't
have both children.  The two approaches result in pretty similar patterns and the
rotation mechanism is much simpler to understand.
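
A compact user-space sketch of the rotation approach with plain pointers
and made-up names (the real treap works through refs into the ring):
rotate the higher priority child up until the doomed node has at most
one child, then splice it out.

  struct tnode {
          struct tnode *left;
          struct tnode *right;
          unsigned int prio;              /* heap priority */
  };

  static void rotate_left(struct tnode **ref)
  {
          struct tnode *node = *ref;
          struct tnode *right = node->right;

          node->right = right->left;
          right->left = node;
          *ref = right;
  }

  static void rotate_right(struct tnode **ref)
  {
          struct tnode *node = *ref;
          struct tnode *left = node->left;

          node->left = left->right;
          left->right = node;
          *ref = left;
  }

  static void treap_delete(struct tnode **ref)
  {
          struct tnode *node = *ref;

          while (node->left && node->right) {
                  /* keep the heap property while rotating the node down */
                  if (node->left->prio > node->right->prio)
                          rotate_right(ref);
                  else
                          rotate_left(ref);
                  ref = ((*ref)->left == node) ? &(*ref)->left :
                                                 &(*ref)->right;
          }

          *ref = node->left ? node->left : node->right;
          /* the node is now unlinked; the caller frees it */
  }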

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
e5e7a25ecd Don't use null node when repairing aug
We were dereferencing the null parent when deleting a single node in a tree.
There's no need to use parent_ref() here, we know that there's no node
and we can just clear the root's aug bits.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
cd6cd000ce Add ifdefed out quick treap printer
This was pretty handy for debugging weird failure cases.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
eb94092f2f Add kvec big endian inc and dec
Add helpers that increment or decrement kvecs as though they were
big endian values.
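
A minimal sketch of the core byte walk, shown over a single contiguous
buffer for clarity; the real helpers step through the kvec fragments but
the carry and borrow logic is the same.

  #include <stddef.h>

  /* increment a big endian byte string in place; returns 1 on wrap */
  static int be_inc(unsigned char *bytes, size_t len)
  {
          size_t i = len;

          while (i--) {
                  if (++bytes[i] != 0)
                          return 0;       /* no carry out of this byte */
          }
          return 1;
  }

  /* decrement is the mirror image: keep borrowing while a byte wraps */
  static int be_dec(unsigned char *bytes, size_t len)
  {
          size_t i = len;

          while (i--) {
                  if (bytes[i]-- != 0)
                          return 0;
          }
          return 1;
  }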

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
a15d37783e Set read node parent through ref
The code was using parent_ref() to set the parent ref's node pointer.
But parent_ref() uses the parent's left node pointer to determine which
ref points to the node.  If we were setting the left it would return the
right because the left isn't set yet.  This messed up the tree shape and
all hell broke loose.

Just set it through the ref, we have it anyway.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
157b9294fa Be sure not to overfill a segment with items
The segment writing loop was assuming that the currently dirty items
will fit in a segment.  That's not true.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
be497f3fcf Make sure to bubble the node aug bits up the treap
We forgot to OR in a node's children's augmentation bits when setting
the augmentation bits up in the parent's ref.  This stopped ring
dirtying from finding all the dirty nodes in the treap.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
3333d89f82 Assign next seg seq from super
We hadn't yet assigned real sequence numbers to the segments.
Let's track the next sequence in the super block and assign it to
segments as we write the first new item in each.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
db9f2be728 Switch to indexed manifest using treap ring
The first pass manifest and allocator storage used a simple ring log
that was entirely replayed into memory to be used.  That risked the
manifest being too large to fit in memory, especially with large keys
and large volumes.

So we move to using an indexed persistent structure that can be read on
demand and cached.  We use a treap of byte referenced nodes stored in a
circular ring.

The code interface is modeled a bit on the in-memory rbtree interface,
except that we manage allocation and can hit IO errors, so we return data
pointers to the item payload instead of item structs and can return
errors.

The manifest and allocator are converted over and the old ring code is
removed entirely.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
3dfe8e10df Add little cmp u64 helper
We have use of this little u64 comparison function in a few more places
so let's make it a proper usable inline function in a header.
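
The helper's shape is presumably something like the sketch below (the
real name may differ); simply returning the difference would be wrong
because a u64 difference doesn't fit in an int.

  static inline int demo_cmp_u64s(u64 a, u64 b)
  {
          return a < b ? -1 : a > b ? 1 : 0;
  }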

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
b8d7e04262 Add negative caching of item ranges
The item cache only knew about present items in the rbtree.  Attempts to
read items that didn't exist would always trigger expensive manifest
and segment searches.

This reworks the item cache and item reading code to support the notion
of cached ranges of keys.  When we read items we also communicate the
range of keys that we searched.  This lets the cache return negative lookups
for key values in the search that don't have items.

The item cache gets an rbtree of key ranges.  Each item lookup method
now uses it to determine if a missing item needs to trigger a read.
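
A hedged sketch of that check, using the kernel rbtree helpers and the
illustrative key type and compare from an earlier sketch: walk the
rbtree of non-overlapping cached ranges and only fall back to a manifest
and segment read when no range covers the key.

  struct demo_range {
          struct rb_node rb_node;
          struct demo_key_buf start;
          struct demo_key_buf end;
  };

  static int key_is_cached(struct rb_root *root, struct demo_key_buf *key)
  {
          struct rb_node *node = root->rb_node;

          while (node) {
                  struct demo_range *rng = rb_entry(node, struct demo_range,
                                                    rb_node);

                  if (demo_key_cmp(key, &rng->start) < 0)
                          node = node->rb_left;
                  else if (demo_key_cmp(key, &rng->end) > 0)
                          node = node->rb_right;
                  else
                          return 1;       /* covered: a miss here is negative */
          }

          return 0;                       /* not covered: trigger a read */
  }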

Item reading is now performed in batches instead of one at a time.  This
lets us specify the cache range along with the batch and apply them all
atomically under the lock.

The item range code is much more robust now that it has to track
the range of keys that it searches.  The read items call now takes a range.
It knows to look for all level0 segments that intersect that range, not
just the first key.  The manifest segment references now include the
min and max keys for the segment so we can use those to define the
item search range.

Since the refs now include keys we no longer have them as a dumb
allocated array but instead have a list of allocated ref structs.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
cbb4282429 Add kvec key helpers
Add a suite of simple kvec functions to work with kvecs that point to
file system keys.

The ones worth mentioning format keys into strings.  They're used to
add formatted strings for the keys to tracepoints.  They're still a
little rough but this is a functional first step.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
57a6ff087f Add max key and max key size to format
We're going to use these to support tracking cached item ranges.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
1b4bab3217 Fix ring read nr/part confusion
Some parts of the ring reading were still using the old 'nr' for the
number of blocks to read, but it's now the total number of blocks
in the ring.  Use part instead.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
c8d61c2e01 Manifest item reading tracked wrong key
When iterating over items the manifest would always insert
whatever values it found at the caller's key, instead of the
key that it found in the segment.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
c74787a848 Harden, simplify, and shrink the kvecs
The initial kvec code was a bit wobbly.  It had raw loops, some weird
constructs, and had more elements than we need.

Add some iterator helpers that make it less likely that we'll screw up
iterating over different length vectors.  Get rid of reliance on a
trailing null pointer and always use the count of elements to stop
iterating.  With that in place we can shrink the number of elements to
just the greatest user.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
51a84447dd Fix calculation of number of dirty segments
The estimate for the number of dirty segment bytes was wildly
overcounting the number of segment headers by confusing the length of the
segment header with the length of segments.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
acee97ba2a Fix tail ring entry zeroing
The space calculation didn't include the terminating zero entry.  That
ensured that the space for the entry would never be consumed.  But the
remaining space was used to zero the end of the block so the final entry
wasn't being zeroed.

So have the space remaining include the terminating entry and factor
that into the space consumption of each entry being appended.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
967e90e5ef Fix kvec overlapping comparison
The comparisons were a bit wrong when comparing overlapping kvec
endpoints.  We want to compare the starts and ends with the ends and
starts, respectively.
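
Shown with plain u64 endpoints for brevity (the real comparison is over
kvec keys), the corrected test is the usual interval check:

  /* [a_start, a_end] and [b_start, b_end] overlap iff each range
   * starts at or before the other one ends */
  static int ranges_overlap(u64 a_start, u64 a_end, u64 b_start, u64 b_end)
  {
          return a_start <= b_end && b_start <= a_end;
  }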

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
b251f91842 Fix updating parent dirty bits
We were trying to propagate dirty bits from a node itself when its dirty
bit is set.  But its bits are consistent so it stops immediately.  We
need to propagate from the parent of the node that changed.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
b7b43de8c7 Queue trans work in our work queue
We went to the trouble of allocating a work queue with one work in
flight but then didn't use it, which meant we could have had concurrent
trans write func execution.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
a5cac107a1 Set END_IO on allocated segs
A reader that hits an allocated segment would wait on IO forever.
Setting the end_io bit lets readers use written segments.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
f9ca1885f9 Specify ring blocks with index,nr
Specifying the ring blocks with a head and tail index led to pretty
confusing code to figure out how many blocks to read and if we had
passed the tail.

Instead specify the ring with a starting index and number of blocks.
The code to read and write the ring blocks naturally falls out and is a
lot more clear.
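
A hedged sketch, with hypothetical fields, of why this form is simpler:
the blocks to process are just index, index + 1, ... index + nr - 1,
wrapped modulo the size of the ring.

  struct demo_ring {
          u64 first_blkno;                /* first block of the ring on disk */
          u64 total_blocks;               /* total blocks in the ring */
  };

  static u64 demo_ring_blkno(struct demo_ring *ring, u64 index, u64 i)
  {
          /* block i of the run that starts at index, wrapping in the ring */
          return ring->first_blkno + ((index + i) % ring->total_blocks);
  }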

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00
Zach Brown
b598bf855d Fix ring appending and writing
When appending to the ring, the next entry header cursor was assigned to
point past the caller's src header, not at the next header to write to in
the block.

The writing block index and blkno calculations were just bad.  Pretend
they never happened.

And finally we need to point the dirty super at the ring index for the
commit and we need to reset the append state for the next commit.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:53 -07:00