Commit Graph

124 Commits

Zach Brown
d4571b6db3 Add scoutfs_block_forget()
Add scoutfs_block_forget() which ensures that a block won't satisfy
future lookups and will not be written out.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
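
A minimal sketch of what a forget operation like this can look like, assuming
a hypothetical private block cache indexed by blkno in a radix tree with a
dirty bit per cached block (struct blk_cache, struct cached_block, and
block_forget() are illustrative names, not the real scoutfs code):

    #include <linux/types.h>
    #include <linux/bitops.h>
    #include <linux/spinlock.h>
    #include <linux/atomic.h>
    #include <linux/radix-tree.h>

    /* Illustrative stand-ins, not the real scoutfs structures. */
    struct blk_cache {
            spinlock_t lock;
            struct radix_tree_root radix;
    };

    struct cached_block {
            u64 blkno;
            unsigned long bits;             /* bit 0: dirty */
            atomic_t refcount;
    };

    #define CACHED_BLOCK_DIRTY 0

    /*
     * Forget a cached block: clear its dirty bit so writeback skips it
     * and remove it from the radix tree so future lookups miss.  The
     * caller's own reference still pins the memory until it is put.
     */
    static void block_forget(struct blk_cache *cac, struct cached_block *bl)
    {
            spin_lock(&cac->lock);
            clear_bit(CACHED_BLOCK_DIRTY, &bl->bits);
            radix_tree_delete(&cac->radix, bl->blkno);
            spin_unlock(&cac->lock);
            atomic_dec(&bl->refcount);      /* drop the cache's reference */
    }
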
Zach Brown
f57c07381a Go back to having our own scoutfs_block cache
We used to have 16k blocks in our own radix_tree cache.  When we
introduced the simple file block mapping code it preferred to have block
size == page size.  That let us remove a bunch of code and reuse all the
kernel's buffer head code.

But it turns out that the buffer heads are just a bit too inflexible.

We'd like to have blocks larger than page size, obviously, but it turns
out there are real functional differences.

Resolving the problem of unlocked readers and allocating writers working
with the same blkno is the most powerful example of this.  It's trivial
to fix by always inserting newly allocated blocks into the cache.  But
solving it with buffer heads requires expensive and risky locking around
the buffer head cache, which can only support a single physical instance
of a given blkno because there can be multiple blocks per page.

So this restores the simple block cache that was removed back in commit
'c8e76e2 scoutfs: use buffer heads'.  There's still work to do to get
this fully functional but it's worth it.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
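
The fix described above, always inserting newly allocated blocks into the
cache so readers and allocating writers share one instance per blkno, could
look roughly like this hypothetical sketch (cache_insert(), struct blk_cache,
and struct cached_block are illustrative, not the actual scoutfs
implementation):

    #include <linux/types.h>
    #include <linux/spinlock.h>
    #include <linux/radix-tree.h>

    /* Illustrative only: one cached object per blkno in a private cache. */
    struct cached_block {
            u64 blkno;
            void *data;
    };

    struct blk_cache {
            spinlock_t lock;
            struct radix_tree_root radix;
    };

    /*
     * Insert a newly allocated block, or return the instance that is
     * already cached for this blkno.  Readers and allocating writers
     * always end up operating on the same single object.
     */
    static struct cached_block *cache_insert(struct blk_cache *cac,
                                             struct cached_block *new)
    {
            struct cached_block *found;

            spin_lock(&cac->lock);
            found = radix_tree_lookup(&cac->radix, new->blkno);
            if (!found && radix_tree_insert(&cac->radix, new->blkno, new) == 0)
                    found = new;
            spin_unlock(&cac->lock);

            return found;   /* NULL if insertion failed, e.g. -ENOMEM */
    }
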
Zach Brown
4042927519 Make btree nr_items le16
If we increase the block size the btree is going to need to be able to
store more than 255 items in a btree block.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:07 -08:00
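
An illustrative view of the widened field, assuming a btree block header
along these lines (the layout and field names are hypothetical; only the
u8 to __le16 change mirrors the commit):

    #include <linux/types.h>
    #include <asm/byteorder.h>

    /* Illustrative on-disk btree block header, not the exact scoutfs layout. */
    struct btree_block_hdr {
            __le64 blkno;
            __le64 seq;
            u8     level;
            __le16 nr_items;        /* was a u8, which capped blocks at 255 items */
    } __packed;

    static inline unsigned int btree_nr_items(struct btree_block_hdr *hdr)
    {
            return le16_to_cpu(hdr->nr_items);
    }
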
Zach Brown
03787f23d3 Add scoutfs_block_data_from_contents()
The btree code needs to get a pointer to a whole block from just
pointers to elements that it's sorting.  It had some manual code that
assumed details of the blocks.  Let's give it a real block interface to
do what it wants and make it the block API's problem to figure out how
to do it.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:44:54 -08:00
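
One plausible way to implement such an interface, assuming block data buffers
are allocated block-size aligned so the containing block can be found by
rounding a contents pointer down (the constants and function name are
illustrative):

    #include <linux/types.h>

    #define EX_BLOCK_SHIFT  12                      /* illustrative 4KB blocks */
    #define EX_BLOCK_SIZE   (1UL << EX_BLOCK_SHIFT)

    /*
     * Sketch only: if block data buffers are block-size aligned, a
     * pointer to any byte inside a block's contents can be rounded
     * down to find the start of the block's data.
     */
    static inline void *block_data_from_contents(const void *ptr)
    {
            return (void *)((unsigned long)ptr & ~(EX_BLOCK_SIZE - 1));
    }
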
Zach Brown
3d66a4b3dd Block API offers scoutfs_block instead of bh
Our thin block wrappers exposed buffer heads to all the callers.  We're
about to revert back to the block interface that uses its own
scoutfs_block struct instead of buffer heads.  Let's reduce the churn of
that patch by first having the block API give callers an opaque struct
scoutfs_block.  Internally it's still buffer heads but the callers don't
know that.

scoutfs_write_dirty_super() is the exception that has magical knowledge
of buffer heads.  That's fixed once the new block API offers a function
for writing a single block.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:44:40 -08:00
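
A hedged sketch of the opaque-handle pattern the commit describes: the header
only forward-declares the struct while the implementation still wraps a
buffer head (scoutfs_block_data() and the struct layout here are illustrative
accessor names, not the actual interface):

    #include <linux/buffer_head.h>

    /* block.h (sketch): callers only ever see an opaque handle. */
    struct scoutfs_block;
    void *scoutfs_block_data(struct scoutfs_block *bl);

    /* block.c (sketch): internally it's still a buffer head for now. */
    struct scoutfs_block {
            struct buffer_head *bh;
    };

    void *scoutfs_block_data(struct scoutfs_block *bl)
    {
            return bl->bh->b_data;
    }
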
Zach Brown
c65b70f2aa Use full radix for buddy and record first set
The first pass of the buddy allocator had a fixed indirect block so it
couldn't address large devices.  It didn't index set bits or slots for
each order so we spent a lot of cpu searching for free space.  And it
didn't precisely account for stable free space so it could spend a lot
of cpu time discovering that free space can't be used because it wasn't
stable.

This fixes these initial critical flaws in the buddy allocator.  Before,
it could only address a few hundred megs; now it can address 2^64
blocks.  Before, bulk inode creation was limited by searching for slots
and leaf bits; now other components are much higher in the profiles and
create rates are greater.

First we remove the special case single indirect block.  The root now
references a block that can be at any height.  The root records the
height and each block records its level.  We descend until we hit the
leaf.  We add a stack of the blocks traversed so that we can ascend and
fix up parent indexing after we modify a leaf.

Now that we can have quite a lot of parent indirect blocks we can no
longer have a static bitmap for allocating buddy blocks.  We instead
precisely preallocate two blocks for every buddy block that will be used
to address all the device blocks.  The blkno offset of these pairs of
buddy blocks can be calculated for a given position in the tree.
Allocating a blkno xors the low bit of the blkno and freeing is a nop.
This happily gets rid of the specific allocation of buddy blocks with
its regions and worrying about stable free blocks itself.

Then we record the index of the first set bit in a block for each order.  In parent
blocks this tells you the slot you can traverse to find a free region of
that order.  In leaf blocks it tells you the specific block offset of
the first free extent.  This is kept up to date as we set and clear
buddy bits in leaves and free_order bits in parent slots.  Allocation
now is a simple matter of block reads and array dereferencing.

And we now precisely account for frees that should not satisfy
allocation until after a transaction commit.  We record frees of stable
data in extent nodes in an rbtree after their buddy blocks have been
dirtied.  Because their blocks are dirtied we can free them as the
transaction commits without errors.  Similarly, we can also revert them
if the transaction commit fails so that they don't satisfy allocation.
This prevents us from having to hang or go read-only if a transaction
commit fails.

The two changes visible to callers are easy argument changes:
scoutfs_buddy_free() now takes a seq to specify when the allocation was
first allocated, and scoutfs_buddy_alloc_same() has its arguments match
that it only makes sense for single block allocations.

Unfortunately all these changes are interrelated so the resulting patch
amounts to a rewrite.  The core buddy bitmap helper functions and loops
are the same but the surrounding block container code changes
significantly.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
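
The pair-of-blknos trick described above can be illustrated with a small
standalone example; the offsets and helper names are made up, only the
xor-the-low-bit idea comes from the commit:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Each logical buddy block owns a preallocated pair of blknos whose
     * offset is calculable from the position in the tree.  Writing a
     * new dirty copy just flips the low bit of the previous blkno, so
     * the stable copy is never overwritten and freeing the old copy is
     * a no-op.
     */
    static uint64_t buddy_pair_blkno(uint64_t first_buddy_blkno, uint64_t pos)
    {
            return first_buddy_blkno + (pos * 2);
    }

    static uint64_t buddy_alloc_blkno(uint64_t old_blkno)
    {
            return old_blkno ^ 1ULL;        /* write the other half of the pair */
    }

    int main(void)
    {
            uint64_t base = buddy_pair_blkno(1024, 7);      /* 1038 */
            uint64_t a = buddy_alloc_blkno(base);           /* 1039 */
            uint64_t b = buddy_alloc_blkno(a);              /* 1038 again */

            printf("%llu %llu %llu\n", (unsigned long long)base,
                   (unsigned long long)a, (unsigned long long)b);
            return 0;
    }
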
Zach Brown
17ec4a1480 Add seq field to block map item
The file block mapping code needs to know if an existing block mapping
is dirty in the current transaction or not.  It was doing that by
calling in to the allocator.

Instead of calling in to the allocator we can store the seq of the
block in the mapping item.  We also probably want to know the seq of
data blocks to make it possible to discover regions of files that have
changed since a previous seq.

This does increase the size of the block mapping items but they're not
long for this world.  We're going to replace them with proper extent
items in the near future.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
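
A hypothetical sketch of a mapping item that carries a per-block seq and the
dirtiness check it enables (the struct layout, field names, and element count
are illustrative, not the real on-disk format):

    #include <linux/types.h>
    #include <asm/byteorder.h>

    #define EX_MAPPING_BLOCKS 8     /* illustrative element count */

    struct ex_block_mapping {
            struct {
                    __le64 blkno;
                    __le64 seq;     /* transaction seq when the block was allocated */
            } blocks[EX_MAPPING_BLOCKS];
    } __packed;

    /*
     * A mapped block can be overwritten in place only if it was
     * allocated in the transaction that is currently being built.
     */
    static inline bool mapping_block_is_dirty(struct ex_block_mapping *map,
                                              int i, u64 current_seq)
    {
            return le64_to_cpu(map->blocks[i].seq) == current_seq;
    }
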
Zach Brown
c8d1703196 Add blkno and level to bad btree printk
Add the blkno and level to the output for a btree block that fails
verification.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
Zach Brown
44588c1d8b Lock btree merges
The btree locking wasn't covering the merge candidate block before the
siblings were locked.  In that unlocked window it could compact the
block, corrupting it for any other tree walk that might only have the
merge candidate locked after having unlocked the parent.

This extends locking coverage to merge and split attempts by acquiring
the block lock immediately after we read it.  Split doesn't have to lock
its destination block but it does have to know to unlock the block on
errors.  Merge has to more carefully lock both of its existing blocks in
a consistent order.

To clearly implement this we simplify the locking helpers to just unlock
and lock a given block, falling back to the btree rwsem if there isn't a
block.

I started down this road while chasing allocator bugs that manifested as
tree corruption.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
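
One common way to take two existing block locks in a consistent order,
sketched here under the assumption of a per-block rwsem and ordering by blkno
(the struct and helper are illustrative, not the scoutfs locking code):

    #include <linux/kernel.h>
    #include <linux/lockdep.h>
    #include <linux/rwsem.h>
    #include <linux/types.h>

    /* Illustrative: each cached btree block carries its own rwsem. */
    struct btree_blk {
            u64 blkno;
            struct rw_semaphore rwsem;
    };

    /*
     * Lock two existing blocks (say a merge candidate and its sibling)
     * in a consistent order so that concurrent merges can't deadlock
     * against each other: always take the lower blkno first.
     */
    static void lock_block_pair(struct btree_blk *a, struct btree_blk *b)
    {
            if (a->blkno > b->blkno)
                    swap(a, b);
            down_write(&a->rwsem);
            down_write_nested(&b->rwsem, SINGLE_DEPTH_NESTING);
    }
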
Zach Brown
a77f88386c Add scoutfs_btree_prev()
We haven't yet had a pressing need for a search for the previous item
before a given key value.  File extent items offer the first strong
candidate.  We'd like to naturally store the start of the extent in the
key, so to find an extent that overlaps a block we need to find the
previous key before the search block offset.

The _prev search is easy enough to implement.  We have to update tree
walking to update the prev key and update leaf block processing to find
the correct item position after the binary search.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
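
The leaf-level part of a _prev search can be sketched as a binary search that
finds the last item at or below the search key; this standalone example uses
bare u64 keys rather than real scoutfs keys:

    #include <stdint.h>

    /*
     * Binary search for the position of the last item whose key is at
     * or below the search key; returns -1 when every item is greater.
     */
    static int find_prev_pos(const uint64_t *keys, int nr, uint64_t key)
    {
            int lo = 0, hi = nr;

            while (lo < hi) {
                    int mid = lo + (hi - lo) / 2;

                    if (keys[mid] <= key)
                            lo = mid + 1;
                    else
                            hi = mid;
            }

            return lo - 1;
    }
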
Zach Brown
f32365321d Remove unused btree internal WALK_NEXT
In the past the WALK_NEXT enum was used to tell the walking core that
the caller was iterating and that they'd need to advance to sibling
blocks if their key landed off the end of a leaf.  In the current code
that's now handled by giving the walk caller a next_key which will
continue the search from the next leaf.  WALK_NEXT is unused and we can
remove it.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:37 -08:00
Zach Brown
1cbd84eece scoutfs: wire up sop->dirty_inode
We're using the generic block buffer_head write_begin and write_end
functions.  They call sop->dirty_inode() to update the inode i_size.  We
didn't have that method wired up so updates to the inode in the write
path weren't dirtying the inode item.  Lost i_size updates would
trivially lose data but we first noticed this when looking at inode item
sequence numbers while overwriting.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:36 -08:00
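
A minimal sketch of wiring up the method, with a hypothetical helper standing
in for whatever updates the dirty inode item (the exact ->dirty_inode
signature varies by kernel version; the flags argument shown here is the
modern form):

    #include <linux/fs.h>

    /* Hypothetical helper: would dirty and fill the btree inode item
     * for this inode in the current transaction. */
    static void ex_update_inode_item(struct inode *inode)
    {
            /* ... */
    }

    /*
     * ->dirty_inode is called by the generic write_begin/write_end
     * paths when they update the VFS inode (e.g. i_size), so this is
     * where the persistent inode item gets dirtied.
     */
    static void ex_dirty_inode(struct inode *inode, int flags)
    {
            ex_update_inode_item(inode);
    }

    static const struct super_operations ex_super_ops = {
            .dirty_inode    = ex_dirty_inode,
    };
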
Zach Brown
165d833c46 Walk stable trees in _since ioctls
The _since ioctls walk btrees and return items that are newer than a
given sequence number.  The intended behaviour is that items will
appear with a greater sequence number if they change after appearing
in the queries.  This promise doesn't hold for items that are being
modified in the current transaction.  The caller would have to always
ask for seq X + 1 after seeing seq X to make sure it got all the changes
that happened in seq X while it was the current dirty transaction.

This is fixed by having the interfaces walk the stable btrees from the
previous transaction.  The results will always be a little stale but
userspace already has to deal with stale results because it can't lock
out change, and it can use sync (and a commit age tunable we'll add) to
limit how stale the results can be.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:36 -08:00
Mark Fasheh
2fc1b99698 scoutfs: replace some open coded corruption checks
We can trivially do the simple check of value length against what the caller
expects in btree.c.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-27 17:25:05 -05:00
Mark Fasheh
ebbb2e842e scoutfs: implement inode orphaning
This is pretty straightforward - we define a new item type,
SCOUTFS_ORPHAN_KEY. We don't need to store any value with this; the inode
and type fields are enough for us to find which inode has been orphaned.

Otherwise this works as one would expect. Unlink sets the item, and
->evict_inode removes it. On mount, we scan for orphan items and remove any
corresponding inodes.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-24 16:41:45 -05:00
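
An illustrative orphan key along the lines the commit describes: a type byte
and the inode number, with no value (the type constant, struct, and helper
names are hypothetical):

    #include <linux/types.h>
    #include <asm/byteorder.h>

    #define EX_ORPHAN_KEY 0x10      /* illustrative type value */

    /*
     * An orphan item needs no value: the key type plus the inode number
     * is enough.  Unlink inserts one when the link count hits zero,
     * ->evict_inode deletes it, and mount scans this key range to
     * remove inodes left orphaned by a crash.
     */
    struct ex_orphan_key {
            u8     type;            /* EX_ORPHAN_KEY */
            __le64 ino;
    } __packed;

    static inline void ex_init_orphan_key(struct ex_orphan_key *key, u64 ino)
    {
            key->type = EX_ORPHAN_KEY;
            key->ino = cpu_to_le64(ino);
    }
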
Nic Henke
d4355dd587 Add all target for make
Adding in an 'all' target allows us to use canned build scripts for any
of the scoutfs related repositories.

Signed-off-by: Nic Henke <nic.henke@versity.com>
Signed-off-by: Zach Brown <zab@zabbo.net>
2016-10-20 13:55:31 -07:00
Nic Henke
ad2f5b33ee Use make variable CURDIR instead of PWD
When running make in a limited shell or in docker, there is no PWD from
the shell. By using CURDIR we avoid worrying about the environment and let
make take care of this for us.

Signed-off-by: Nic Henke <nic.henke@versity.com>
Signed-off-by: Zach Brown <zab@zabbo.net>
2016-10-20 13:55:26 -07:00
Zach Brown
16e94f6b7c Search for file data that has changed
We don't overwrite existing data.  Every file data write has to allocate
new blocks and update block mapping items.

We can search for inodes whose data has changed by filtering block
mapping item walks by the sequence number.  We do this by using the
exact same code for finding changed inodes but using the block mapping
key type.

Signed-off-by: Zach Brown <zab@versity.com>
2016-10-20 13:55:14 -07:00
Mark Fasheh
5b7f9ddbe2 Trace scoutfs btree functions
We make an event class for the two most common btree op patterns, and reuse
that to make our tracepoints for each function. This covers all the entry
points listed in btree.h. We don't get every single parameter of every
function but this is enough that we can see which keys are being queried /
inserted.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-13 14:08:08 -07:00
Mark Fasheh
31d182e2db Add 'make clean' target
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-10-13 13:52:34 -07:00
Zach Brown
5601f8cef5 scoutfs: add scoutfs_block_forget()
The upcoming allocator changes have a need to forget dirty blocks so
they're not written.  It probably won't be the only one.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-28 13:46:18 -07:00
Zach Brown
9d08b34791 scoutfs: remove excessive block locking tracing
I accidentally left some lock tracing in the btree locking commit that
is very noisy and not particularly useful.  Let's remove it.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-28 13:44:31 -07:00
Zach Brown
f7f7a2e53f scoutfs: add scoutfs_block_zero_from()
We already have a function that zeros the end of a block starting at a
given offset.  Some callers have a pointer to the byte to zero from so
let's add a convenience function that calculates the offset from the
pointer.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-28 13:42:27 -07:00
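
A sketch of the convenience wrapper, assuming an existing zero-from-offset
helper and a fixed block size (both names and the size constant are
illustrative):

    #include <linux/string.h>
    #include <linux/types.h>

    #define EX_BLOCK_SIZE 4096      /* illustrative */

    /* Existing-style helper: zero from a byte offset to the end of the block. */
    static void ex_block_zero(void *data, size_t off)
    {
            if (off < EX_BLOCK_SIZE)
                    memset(data + off, 0, EX_BLOCK_SIZE - off);
    }

    /* Convenience wrapper: callers hand in a pointer into the block and
     * the offset is derived for them. */
    static void ex_block_zero_from(void *data, void *from)
    {
            ex_block_zero(data, (char *)from - (char *)data);
    }
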
Zach Brown
cf0199da00 scoutfs: allow more concurrent btree locking
The btree locking so far was a quick interim measure to get the rest of
the system going.  We want to clean it up both for correctness and
performance but also to make way for using the btree for block
allocation.

We were unconditionally using the buffer head lock for tree block
locking.  This is bad for at least four reasons:  it's invisible to
lockdep, it doesn't allow concurrent reads, it doesn't allow reading
while a block is being written during the transaction, and it's not
necessary at all for stable read-only blocks.

Instead we add a rwsem to the buffer head private which we use to lock
the block when it's writable.  We clean up the locking functions to make
it clearer that btree_walk holds one lock at a time and either returns
it to the caller with the buffer head or unlocks the parent if it's
returning an error.

We also add the missing sibling block locking during splits and merges.
Locking the parent prevented walks from descending down our path but it
didn't protect against previous walks that were already down at our
sibling's level.

Getting all this working with lockdep adds a bit more class/subclass
plumbing calls but nothing too onerous.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 13:41:18 -07:00
Zach Brown
bb3a5742f4 scoutfs: drop sib bh ref in split
We forgot to drop the sibling bh reference while splitting.  Oopsie!

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
84f23296fd scoutfs: remove btree cursor
The btree cursor was built to address two problems.  First it
accelerates iteration by avoiding full descents down the tree by holding
on to leaf blocks.  Second it lets callers reference item value contents
directly to avoid copies.

But it also has serious complexity costs.  It pushes refcounting and
locking out to the caller.  There have already been a few bugs where
callers did things while holding the cursor without realizing that
they're holding a btree lock and can't perform certain btree operations
or even copies to user space.

Future changes to the allocator to use the btree motivate cleaning up
the tree locking, which is complicated by the cursor being a stand-alone
lock reference.  Instead of continuing to layer complexity onto this
construct let's remove it.

The iteration acceleration will be addressed the same way we're going to
accelerate the other btree operations: with per-cpu cached leaf block
references.  Unlike the cursor this doesn't push interface changes out
to callers who want repeated btree calls to perform well.

We'll leave the value copying for now.  If it becomes an issue we can
add variants that call a function to operate on the value.  Let's hope
we don't have to go there.

This change replaces the cursor with a vector describing the memory
that the value should be copied to and from.  The vector has a fixed
number of elements and is wrapped in a struct for easy declaration and
initialization.

This change to the interface looks noisy but each caller's change is
pretty mechanical.  They tend to involve:

 - replace the cursor with the value struct and initialization
 - allocate some memory to copy the value in to
 - reading functions return the number of value bytes copied
 - verify that the copied bytes make sense for the item being read
 - getting rid of confusing ((ret = _next())) looping
 - _next now returns -ENOENT instead of 0 for no next item
 - _next iterators now need to increase the key themselves
 - make sure to free allocated mem

Sometimes the order of operations changes significantly.  Now that we
can't modify in place we need to read, modify, write.  This looks like
changing a modification of the item through the cursor to a
lookup/update pattern.

The symlink item iterators didn't need to use next because they walk a
contiguous set of keys.  They're changed to use simple insert or lookup.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
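
A hedged sketch of what such a fixed-element value vector might look like,
using struct kvec entries and an initializer macro (the struct name, element
count, and macro are illustrative, not the real scoutfs interface):

    #include <linux/uio.h>          /* struct kvec */

    #define EX_VAL_NR 2             /* illustrative fixed element count */

    /*
     * Callers describe the memory an item value should be copied to or
     * from; btree calls copy through it and return the number of bytes
     * copied, or -ENOENT when there is no such item.
     */
    struct ex_val {
            struct kvec vec[EX_VAL_NR];
            unsigned int nr;
    };

    #define EX_VAL_SINGLE(ptr, len)                                         \
            ((struct ex_val) {                                              \
                    .vec = { { .iov_base = (ptr), .iov_len = (len) } },     \
                    .nr = 1,                                                \
            })
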
Zach Brown
a9afa92482 scoutfs: correctly set the last symlink item
The final symlink item insertion was taking the min of the entire path
length and the max symlink item size, not the min of the remaining length of
the path after having created all the previous items.  For paths larger
than the max item size this could use too much space.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
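
The corrected length calculation amounts to taking the min against the
remaining bytes rather than the whole path; a small standalone sketch (the
names and max item size are illustrative):

    #include <stddef.h>

    #define EX_MAX_ITEM 255         /* illustrative max symlink item size */

    /*
     * Each item holds at most EX_MAX_ITEM bytes of the *remaining*
     * path, not of the whole path, so the final item only consumes the
     * bytes that are actually left.
     */
    static size_t next_symlink_item_len(size_t total_len, size_t copied)
    {
            size_t remaining = total_len - copied;

            return remaining < EX_MAX_ITEM ? remaining : EX_MAX_ITEM;
    }
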
Zach Brown
10a42724a9 scoutfs: add scoutfs_dec_key()
This is analogous to scoutfs_inc_key().  It decreases the next highest
order key value each time its decrement wraps.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
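
The borrow-on-wrap behaviour can be sketched with a simple multi-field
decrement; this standalone example uses an array of u64 fields rather than
the real key struct:

    #include <stdint.h>

    #define KEY_FIELDS 3    /* illustrative, e.g. offset, type, ino */

    /*
     * Decrement the lowest-order field; each time a field wraps from 0
     * to ~0 borrow from the next higher-order field, mirroring the
     * carry in the key increment.
     */
    static void key_dec(uint64_t *fields)  /* fields[0] is lowest order */
    {
            int i;

            for (i = 0; i < KEY_FIELDS; i++) {
                    if (fields[i]-- != 0)
                            break;  /* no wrap, no borrow needed */
            }
    }
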
Zach Brown
161063c8d6 scoutfs: remove very noisy bh ref tracing
This wasn't adding much value and was exceptionally noisy.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
2bed78c269 scoutfs: specify btree root
The btree functions currently don't take a specific root argument.  They
assume, deep down in btree_walk, that there's only one btree in the
system.  We're going to be adding a few more to support richer
allocation.

To prepare for this we have the btree functions take an explicit btree
root argument.  This should make no functional difference.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
d2a696f4bd scoutfs: add zero key set and test functions
Add some quick functions to set a key to all zeros and to test if a key
is all zeros.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
3bb0c80686 scoutfs: fix buddy stable bit test
The buddy allocator had the test for non-existent stable bitmap blocks
backwards.  An uninitialized block implies that all the bits are marked
free and we don't need to test that the specific bits are free.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:02:19 -07:00
Zach Brown
1dd4a14d04 scoutfs: don't dereference IS_ERR buffer_head
The check for aligned buffer head data pointers was trying to
dereference a bad IS_ERR pointer when allocation of a new block failed
with ENOSPC.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 15:36:25 -07:00
Zach Brown
49c3d5ed34 scoutfs: add btree block verification
Add a function to verify that a btree block is valid.  It's disabled for
now because it's expensive.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 14:49:37 -07:00
Zach Brown
f44306757c scoutfs: add btree deletion trace message
Add a simple trace message with the result of item deletion calls.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 14:43:13 -07:00
Zach Brown
b55da5ecb7 scoutfs: compact btree more carefully when merging
The btree block merging code knew to try and compact the destination
block if it was going to move more bytes worth of items than there was
contiguous free space in the destination block.  But it missed the case
where item movement moves more bytes than the hint because the last
item moved was big.  In the worst case this creates an item which overlaps
the item offsets and ends up looking like corrupt items.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 14:36:35 -07:00
Zach Brown
164bcb5d99 scoutfs: bug if btree item creation corrupts
Add a BUG_ON() assertion for the case where we create an item that
starts in the item offset array.  This happens if the callers free space
calculations are incorrect.  It shouldn't be triggerable by corrupt
blocks if we're verifying the blocks as we read them in.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 14:28:22 -07:00
Zach Brown
5375ed5f38 scoutfs: fill nameidata with symlink path
Our follow_link method forgot to fill the nameidata with the target path
of the symlink.  The uninitialized nameidata tripped up the generic
readlink code in a debugging kernel.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 13:47:19 -07:00
Zach Brown
04e0df4f36 scoutfs: forgot to initialize file alloc lock
Thank goodness for lockdep!

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-08 13:38:58 -07:00
Zach Brown
b2e12a9f27 scoutfs: sync large transactions as released
We don't want very large transactions to build up and create huge commit
latencies.  All blocks are written to free space so we use a count of
allocations to count dirty blocks.  We arbitrarily limit the transaction
to 128MB and try to kick off commits when we release transactions that
have gotten that big.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-06 15:16:50 -07:00
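
The 128MB cutoff reduces to simple arithmetic on the allocation count; a
standalone sketch with an assumed block size (the constants and names are
illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define EX_BLOCK_SHIFT          12              /* illustrative 4KB blocks */
    #define EX_TRANS_LIMIT_BYTES    (128ULL << 20)  /* the 128MB described above */

    /*
     * Every write allocates fresh blocks, so the count of allocations
     * in the open transaction approximates its dirty size.  Kick off a
     * commit once that crosses the limit.
     */
    static bool trans_needs_commit(uint64_t blocks_allocated)
    {
            return blocks_allocated >= (EX_TRANS_LIMIT_BYTES >> EX_BLOCK_SHIFT);
    }
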
Zach Brown
06c718e16a scoutfs: remove unlinked inode items
Wire up the inode callbacks that let us remove all the persistent items
associated with an unlinked inode as its final reference is dropped.
This is the first part of full truncate and orphan inode support.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-31 09:31:23 -07:00
Zach Brown
64b82e1ac3 scoutfs: add symlink support
Symlinks are easily implemented by storing the target path in btree
items.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-29 10:21:27 -07:00
Zach Brown
df93073971 scoutfs: don't unlock err bh after validation
If block validation failed then we'd end up trying to unlock an IS_ERR
buffer_head pointer.  Fix it so that we drop the ref and set the
pointer after unlocking.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-26 16:51:47 -07:00
Zach Brown
cb318982c9 scoutfs: add support for statfs
To do a credible job of this we need to track the number of free blocks.
We add counters of free allocations of each order to the indirect blocks so that
we can quickly scan them.  We also need a bit of help to count inodes.

Finally I noticed that we were miscalculating the number of slots in the
indirect blocks because we were using the size of the buddy block
header, not the size of the indirect block header.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-24 15:52:54 -07:00
Zach Brown
c90710d26b scoutfs: add find xattr ioctls
Add ioctls that return the inode numbers that probably contain the given
xattr name or value.  To support these we add items that index inodes by
the presence of xattr items whose names or values hash to a given hash
value.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-23 12:14:55 -07:00
Zach Brown
634114f364 scoutfs: update CKF key format
The previous %llu for the key type came from the weird tracing functions
that cast all the arguments to long long.  Those have since been
removed.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-23 12:05:59 -07:00
Zach Brown
6c12e7c38b scoutfs: add hard link support
Now that we have the link backrefs let's add support for hard links so
we can verify that an inode can have multiple backrefs.  (It can.)

It's a straightforward refactoring of mknod to let callers either
allocate or use existing inodes.  We push all the btree item specific
work into a function called by mknod and link.

The only surprising bit is the small max link count.  It's limiting
the worst case buffer size for the inode_paths ioctl.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-17 16:22:00 -07:00
Zach Brown
0991622a21 scoutfs: add inode_paths ioctl
This adds the ioctl that returns all the paths from the root to a given
inode.  The implementation only traverses btree items to keep it
isolated from the vfs object locking and life cycles, but that could be
a performance problem.  This is another motivation to accelerate the
btree code.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-11 16:46:18 -07:00
Zach Brown
77e0ffb981 scoutfs: track data blocks in bmap items
Up to this point we'd been storing file data in large fixed size items.
This obviously needed to change to get decent large file IO patterns.

This wires the file IO into the usual page cache and buffer head paths
so that we write data blocks into allocations referenced by btree items.
We're aggressively trying to find the highest ratio of performance to
implementation complexity.

Writing dirty metadata blocks during transaction commit changes a bit.
We need to discover if we have dirty blocks before trying to sync the
inodes.  We add our _block_has_dirty() function back and use it to avoid
write attempts during transaction commit.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-10 15:18:45 -07:00
Zach Brown
198ec2ed5b scoutfs: have btree_update return errors
We can certainly have btree update callers that haven't yet dirtied the
blocks but who can deal with errors.  So make it return errors and have
its only current caller freak out if it fails.  This will let the file
data block mapping code attempt to get a dirty item without first
dirtying.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-09 17:03:30 -07:00