scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-01-06 12:06:26 +00:00

Author	SHA1	Message	Date
Zach Brown	f7f7a2e53f	scoutfs: add scoutfs_block_zero_from() We already have a function that zeros the end of a block starting at a given offset. Some callers have a pointer to the byte to zero from so let's add a convenience function that calculates the offset from the pointer. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-28 13:42:27 -07:00
Zach Brown	cf0199da00	scoutfs: allow more concurrent btree locking The btree locking so far was a quick interim measure to get the rest of the system going. We want to clean it up both for correctness and performance but also to make way for using the btree for block allocation. We were unconditionally using the buffer head lock for tree block locking. This is bad for at least four reasons: it's invisible to lockdep, it doesn't allow concurrent reads, it doesn't allow reading while a block is being written during the transaction, and it's not necessary at all when the for stable read-only blocks. Instead we add a rwsem to the buffer head private which we use to lock the block when it's writable. We clean up the locking functions to make it clearer that btree_walk holds one lock at a time and either returns it to the caller with the buffer head or unlocks the parent if its returning an error. We also add the missing sibling block locking during splits and merges. Locking the parent prevented walks from descending down our path but it didn't protect against previous walks that were already down at our sibling's level. Getting all this working with lockdep adds a bit more class/subclass plumbing calls but nothing too ornerous. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-21 13:41:18 -07:00
Zach Brown	bb3a5742f4	scoutfs: drop sib bh ref in split We forgot to drop the sibling bh reference while splitting. Oopsie! Signed-off-by: Zach Brown <zab@versity.com>	2016-09-21 10:04:07 -07:00
Zach Brown	84f23296fd	scoutfs: remove btree cursor The btree cursor was built to address two problems. First it accelerates iteration by avoiding full descents down the tree by holding on to leaf blocks. Second it lets callers reference item value contents directly to avoid copies. But it also has serious complexity costs. It pushes refcounting and locking out to the caller. There have already been a few bugs where callers did things while holding the cursor without realizing that they're holding a btree lock and can't perform certain btree operations or even copies to user space. Future changes to the allocator to use the btree motivates cleaning up the tree locking which is complicated by the cursor being a stand alone lock reference. Instead of continuing to layer complexity onto this construct let's remove it. The iteration acceleration will be addressed the same way we're going to accelerate the other btree operations: with per-cpu cached leaf block references. Unlike the cursor this doesn't push interface changes out to callers who want repeated btree calls to perform well. We'll leave the value copying for now. If it becomes an issue we can add variants that call a function to operate on the value. Let's hope we don't have to go there. This change replaces the cursor with a vector to memory that the value should be copied to and from. The vector has a fixed number of elements and is wrapped in a struct for easy declaration and initialization. This change to the interface looks noisy but each caller's change is pretty mechanical. They tend to involve: - replace the cursor with the value struct and initialization - allocate some memory to copy the value in to - reading functions return the number of value bytes copied - verify copied bytes makes sense for item being read - getting rid of confusing ((ret = _next())) looping - _next now returns -ENOENT instead of 0 for no next item - _next iterators now need to increase the key themselves - make sure to free allocated mem Sometimes the order of operations changes significantly. Now that we can't modify in place we need to read, modify, write. This looks like changing a modification of the item through the cursor to a lookup/update pattern. The symlink item iterators didn't need to use next because they walk a contiguous set of keys. They're changed to use simple insert or lookup. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-21 10:04:07 -07:00
Zach Brown	a9afa92482	scoutfs: correctly set the last symlink item The final symlink item insertion was taking the min of the entire path and the max symlink item size, not the min of the remaining length of the path after having created all the previous items. For paths larger than the max item size this could use too much space. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-21 10:04:07 -07:00
Zach Brown	10a42724a9	scoutfs: add scoutfs_dec_key() This is analagous to scoutfs_inc_key(). It decreases the next highest order key value each time its decrement wraps. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-21 10:04:07 -07:00
Zach Brown	161063c8d6	scoutfs: remove very noisy bh ref tracing This wasn't adding much value and was exceptionally noisy. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-21 10:04:07 -07:00
Zach Brown	2bed78c269	scoutfs: specify btree root The btree functions currently don't take a specific root argument. They assume, deep down in btree_walk, that there's only one btree in the system. We're going to be adding a few more to support richer allocation. To prepare for this we have the btree functions take an explicit btree argument. This should make no functional difference. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-21 10:04:07 -07:00
Zach Brown	d2a696f4bd	scoutfs: add zero key set and test functions Add some quick functions to set a key to all zeros and to test if a key is all zeros. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-21 10:04:07 -07:00
Zach Brown	3bb0c80686	scoutfs: fix buddy stable bit test The buddy allocator had the test for non-existant stable bitmap blocks backwards. An uninitialized block implies that all the bits are marked free and we don't need to test that the specific bits are free. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-21 10:02:19 -07:00
Zach Brown	1dd4a14d04	scoutfs: don't dereference IS_ERR buffer_head The check for aligned buffer head data pointers was trying to dereference a bad IS_ERR pointer when allocation of a new block failed with ENOSPC. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-08 15:36:25 -07:00
Zach Brown	49c3d5ed34	scoutfs: add btree block verification Add a function to verify that a btree block is valid. It's disabled for now because it's expensive. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-08 14:49:37 -07:00
Zach Brown	f44306757c	scoutfs: add btree deletion trace message Add a simple trace message with the result of item deletion calls. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-08 14:43:13 -07:00
Zach Brown	b55da5ecb7	scoutfs: compact btree more carefully when merging The btree block merging code knew to try and compact the destination block if it was going to move more bytes worth of items than there was contiguous free space in the destination block. But it missed the case where item movement moves more than the hint because the last item it moves was big. In the worst case this creates an item which overlaps the item offsets and ends up looking like corrupt items. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-08 14:36:35 -07:00
Zach Brown	164bcb5d99	scoutfs: bug if btree item creation corrupts Add a BUG_ON() assertion for the case where we create an item that starts in the item offset array. This happens if the callers free space calculations are incorrect. It shouldn't be triggerable by corrupt blocks if we're verifying the blocks as we read them in. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-08 14:28:22 -07:00
Zach Brown	5375ed5f38	scoutfs: fill nameidata with symlink path Our follow_link method forgot to fill the nameidata with the target path of the symlink. The uninitialized nameidata tripped up the generic readlink code in a debugging kernel. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-08 13:47:19 -07:00
Zach Brown	04e0df4f36	scoutfs: forgot to initialize file alloc lock Thank goodness for lockdep! Signed-off-by: Zach Brown <zab@versity.com>	2016-09-08 13:38:58 -07:00
Zach Brown	b2e12a9f27	scoutfs: sync large transactions as released We don't want very large transactions to build up and create huge commit latencies. All blocks are written to free space so we use a count of allocations to count dirty blocks. We arbitrarily limit the transaction to 128MB and try to kick off commits when we release transactions that have gotten that big. Signed-off-by: Zach Brown <zab@versity.com>	2016-09-06 15:16:50 -07:00
Zach Brown	06c718e16a	scoutfs: remove unlinked inode items Wire up the inode callbacks that let us remove all the persistent items associated with an unlinked inode as its final reference is dropped. This is the first part of full truncate and orphan inode support. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-31 09:31:23 -07:00
Zach Brown	64b82e1ac3	scoutfs: add symlink support Symlinks are easily implemented by storing the target path in btree items. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-29 10:21:27 -07:00
Zach Brown	df93073971	scoutfs: don't unlock err bh after validation If block validation failed then we'd end up trying to unlock an IS_ERR buffer_head pointer. Fix it so that we drop the ref and set the pointer after unlocking. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-26 16:51:47 -07:00
Zach Brown	cb318982c9	scoutfs: add support for statfs To do a credible job of this we need to track the number of free blocks. We add counters of order allocations free to the indirect blocks so that we can quickly scan them. We also need a bit of help to count inodes. Finally I noticed that we were miscalculating the number of slots in the indirect blocks because we were using the size of the buddy block header, not the size of the indirect block header. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-24 15:52:54 -07:00
Zach Brown	c90710d26b	scoutfs: add find xattr ioctls Add ioctls that return the inode numbers that probably contain the given xattr name or value. To support these we add items that index inodes by the presence of xattr items whose names or values hash to a give hash value. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-23 12:14:55 -07:00
Zach Brown	634114f364	scoutfs: update CKF key format The previous %llu for the key type came from the weird tracing functions that cast all the arguments to long long. Those have since been removed. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-23 12:05:59 -07:00
Zach Brown	6c12e7c38b	scoutfs: add hard link support Now that we have the link backrefs let's add support for hard links so we can verify that an inode can have multiple backrefs. (It can.) It's a straight forward refactoring of mknod to let callers either allocate or use existing inodes. We push all the btree item specific work into a function called by mknod and link. The only surprising bit is the small max link count. It's limiting the worst case buffer size for the inode_paths ioctl. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-17 16:22:00 -07:00
Zach Brown	0991622a21	scoutfs: add inode_paths ioctl This adds the ioctl that returns all the paths from the root to a given inode. The implementation only traverses btree items to keep it isolated from the vfs object locking and life cycles, but that could be a performance problem. This is another motivation to accelerate the btree code. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-11 16:46:18 -07:00
Zach Brown	77e0ffb981	scoutfs: track data blocks in bmap items Up to this point we'd been storing file data in large fixed size items. This obviously needed to change to get decent large file IO patterns. This wires the file IO into the usual page cache and buffer head paths so that we write data blocks into allocations referenced by btree items. We're aggressively trying to find the highest ratio of performance to implementation complexity. Writing dirty metadata blocks during transaction commit changes a bit. We need to discover if we have dirty blocks before trying to sync the inodes. We add our _block_has_dirty() function back and use it to avoid write attempts during transaction commit. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-10 15:18:45 -07:00
Zach Brown	198ec2ed5b	scoutfs: have btree_update return errors We can certainly have btree update callers that haven't yet dirtied the blocks but who can deal with errors. So make it return errors and have its only current caller freak out if it fails. This will let the file data block mapping code attempt to get a dirty item without first dirtying. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-09 17:03:30 -07:00
Zach Brown	8a6715ff02	scoutfs: add buddy was_free and free_extent Add helpers to discover if a given allocation was free and to free all the buddy order allocations that make up an abritrary block extent. These are going to be used by the file data block mapping code. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-09 16:56:27 -07:00
Zach Brown	1fde47170b	scoutfs: simplify btree block format Now that we are using fixed smaller blocks we can make the btree format significantly simpler. The fixed small block size limits the number of items that will be stored in each block. We can use a simple sorted array of item offsets to maintain the item sort order instead of the treap. Getting rid of the treap not only removes a bunch of code, it makes tasks like verifying or repairing a btree block a lot simpler. The main impact on the code is that now an item doesn't record its position in the sort order. Users of sorted item position now need to track an items sorted position instead of just the item. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-02 13:28:08 -07:00
Zach Brown	8bc2b15e3d	scoutfs: remove scoutfs_buddy_dirty The buffer head rewrite got rid of the only caller who needed to ensure that a free couldn't fail. Let's get rid of this. We can always bring it back if it's needed again. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-02 13:28:08 -07:00
Zach Brown	7b18bce2e2	scoutfs: use buffer heads Now that we have a fixed small block size we don't need our own code for tracking contiguous memory for blocks that are larger than the page size. We can use buffer heads which support block sizes smaller than the page size. Our block API remains to enforce transactions, cheksumming, cow, and eventually invalidating and retrying reads of stale bloks. We set the logical blocksize of the bdev buffer cache to our fixed block size. We use a private bh state bit to indicate that the contents of a block have had their checksum verified. We use a small structure stored at b_private to track dirty blocks so that we can control when they're written. The btree block traversal code uses the buffer_head lock to serialize access to btree block contents now that the block rwsem has gone away. This isn't great but works for now. Not being able to relocate blocks in the buffer cache (really fragments of pages in the bdev page cache.. blkno determines memory location) means that the cow path always has to copy. Callers are easily translated: use struct buffer_head instead of scoutfs_block and use a little helper instead of dereferencing ->data directly. I took the opportunity to clean up some of the inconsistent block function names. Now more of the functions follow the scoutfs_block_*() pattern. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-02 13:28:08 -07:00
Zach Brown	f024c70802	scoutfs: decrease block size File data extent tracking can get very complicated if we have to worry about page sized writes that are less than the block size. We can avoid all that complexity if we define the block size to be the smallest possible page size. Signed-off-by: Zach Brown <zab@versity.com>	2016-08-02 13:27:14 -07:00
Zach Brown	0e017ff0dc	scoutfs: free btree unused btree blocks The btree code wasn't freeing blocks either when it had removed references to them or when an operation fails after having allocated a new block. Now that the allocator is more capable we can add in these free calls. Signed-off-by: Zach Brown <zab@versity.com>	2016-07-27 16:40:22 -07:00
Zach Brown	ad34f40744	scoutfs: free source blkno after cow As we update references to point to newly allocated dirty blocks in a transaction we need to free the old referenced blknos. By using a two-phase dirty/free interface we can avoid freeing failing after we've made it through stages of the cow processing which can't be easily undone. Signed-off-by: Zach Brown <zab@versity.com>	2016-07-27 16:13:37 -07:00
Zach Brown	dcef9c0ada	scoutfs: store the buddy allocator in a radix The current implementation of the allocator was built for for a world where blocks were much, much, larger. It could get away with keeping the entire bitmap resident and having to read it its entirety before being able to use it for the first time. That will not work in the current architecture that's built around a smaller metadata block size. The raw size of the allocator gets large enough that all of those behaviours become problematic at scale. This shifts the buddy allocator to be stored in a radix of blocks instead of in a ring log. This brings it more in line with the structure of the btree item indexes. It can be initially read, cached, and invalidated at block granularity. In addition, it cleverly uses the cow block structures to solve the unreferenced space allocation constraint that the previous allocator hadn't. It can compare the dirty and stable blocks to discover free blocks that aren't referenced by the old stable state. The old allocator would have grown a bunch of extra special complexity to address this. There's still work to be done but this is a solid start. Signed-off-by: Zach Brown <zab@versity.com>	2016-07-27 12:09:29 -07:00
Zach Brown	e226927174	scoutfs: add support for cowing blocks The current block interface for allocating a dirty copy of a given stable block didn't cow. It moved the existing stable block into its new dirty location. This is fine for the btree which will never reference old stable blocks. It's not optimal for the allocator which is going to want to combine the previous stable allocator blocks with the current dirty allocator blocks to determine which free regions can satisfy allocations. If we invalidate the old stable cached copy it'll immediately read it back in. And it turns out that it was a little buggy in how it moved the stable block to its new dirty location. It didn't remove any old blocks at the new blkno. So we offer two high level interfaces for either moving or copying the contents of the dirty block. And we're sure to always invalidate old cached blocks at the new dirty blkno location. Signed-off-by: Zach Brown <zab@versity.com>	2016-07-27 11:30:53 -07:00
Zach Brown	90a73506c1	scoutfs: remove homebrew tracing Oh, thank goodness. It turns out that there's a crash extension for working with tracepoints in crash dumps. Let's use standard tracepoints and pretend this tracing hack never happened. Signed-off-by: Zach Brown <zab@versity.com>	2016-07-20 12:08:12 -07:00
Zach Brown	b51511466a	scoutfs: add inodes_since ioctl Add the ioctl that let's us find out about inodes that have changed since a given sequence number. A sequence number is added to the btree items so that we can track the tree update that it last changed in. We update this as we modify items and maintain it across item copying for splits and merges. The big change is using the parent item ref and item sequence numbers to guide iteration over items in the tree. The easier change is to have the current iteration skip over items whose sequence number is too old. The more subtle change has to do with how iteration is terminated. The current termination could stop when it doesn't find an item because that could only happen at the final leaf. When we're ignoring items with old seqs this can happen at the end of any leaf. So we change iteration to keep advancing through leaf blocks until it crosses the last key value. We add an argument to btree walking which communicates the next key that can be used to continue iterating from the next leaf block. This works for the normal walk case as well as the seq walking case where walking terminates prematurely in an interior node full of parent items with old seqs. Now that we're more robustly advancing iteration with btree walk calls and the next key we can get rid fo the 'next_leaf' hack which was trying to do the same thing inside the btree walk code. It wasn't right for the seq walking case and was pretty fiddly. The next_key increment could wrap the maximal key at the right spine of the tree so we have _inc saturate instead of wrap. And finally, we want these inode scans to not have to skip over all the other items associated with each inode as it walks looking for inodes with the given sequence number. We change the item sort order to first sort by type instead of by inode. We've wanted this more generally to isolate item types that have different access patterns. Signed-off-by: Zach Brown <zab@versity.com>	2016-07-05 14:46:20 -07:00
Zach Brown	3efec0c094	scoutfs: add scoutfs_set_max_key() It's nice to have a helper that sets the max possible key instead of messing around with memset or ~0 manually. Signed-off-by: Zach Brown <zab@versity.com>	2016-07-05 14:24:10 -07:00
Zach Brown	ae748f0ebc	scoutfs: allow tracing with a null sb The sb counter field isn't necessary, allow a null sb pointer arg which then results in a counter output of 0. Signed-off-by: Zach Brown <zab@versity.com>	2016-07-05 14:20:19 -07:00
Zach Brown	59b1f62df8	scoutfs: add basic xattr support Add basic support for extended attributes. The next steps are to add support for more prefixes, including ACLs, and to properly delete them on unlink. Signed-off-by: Zach Brown <zab@versity.com>	2016-07-04 10:59:43 -07:00
Zach Brown	cedeacacb8	scoutfs: add file with simple name functions Directory entries and extended attributes similarly hash and compare strings so we'll give them some shared functions for doing so. Signed-off-by: Zach Brown <zab@versity.com>	2016-07-04 10:49:41 -07:00
Zach Brown	a64ca8018a	scoutfs: add scoutfs_btree_hole() for finding keys Directory entries found a hole in the key range between the first and last possible hash value for a new entry's key. The xattrs want to do the same thing so let's extract this into a proper function. Signed-off-by: Zach Brown <zab@versity.com>	2016-07-04 10:45:17 -07:00
Zach Brown	5c7ba5ed39	scoutfs: remove wrlock and roster These were interesting experiments in how to manage locks across the cluster but we'll be going in a more flexible direction. Signed-off-by: Zach Brown <zab@versity.com>	2016-07-01 21:03:40 -07:00
Zach Brown	4689bf0881	scoutfs: free once granted wrlock entries The entry free routine only frees entries that don't have any references from its context. Callers are supposed to try to free entries after removing references to them. Callers that were removing entries from a shard's granted pointer were trying to free the entry before removing the pointer to the entry. Entries that were last removed from shard granted pointers were never freed. Signed-off-by: Zach Brown <zab@versity.com>	2016-06-01 21:17:30 -07:00
Zach Brown	171aea62cd	scoutfs: add some wrlock tracing Add a bunch of tracing to the wrlock code paths. Signed-off-by: Zach Brown <zab@versity.com>	2016-06-01 21:10:16 -07:00
Zach Brown	c9caebc117	scoutfs: remove unused held_trans The held lock struct had an unused 'held_trans' field from a previous version of the code that specifically tried to track if a held lock had the trans open. Signed-off-by: Zach Brown <zab@versity.com>	2016-06-01 21:07:05 -07:00
Zach Brown	ad5a58c348	scoutfs: make trace format a little nicer The first trace format was pretty noisy. Now the time is printed in a gettimeofday timeval so that it can be correlated with other time stamps. The super block gets a counter instead of a pointer. The pid and cpu are printed without a lavel and we add the line number so that it's easy to grep the source to find a trace caller. Signed-off-by: Zach Brown <zab@versity.com>	2016-06-01 20:34:31 -07:00
Zach Brown	7d6dd91a24	scoutfs: add tracing messages This adds tracing functionality that's cheap and easy to use. By constantly gathering traces we'll always have rich history to analyze when something goes wrong. Signed-off-by: Zach Brown <zab@versity.com>	2016-05-28 11:11:15 -07:00

1 2 3

102 Commits