Prefer named to anonymous enums. This helps readability a little.
Use enum as param type if possible (a couple spots).
Remove unused enum in lock_server.c.
Define enum spbm_flags using shift notation for consistency.
Rename get_file_block()'s "gfb" parameter to "flags" for consistency.
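For reference, the shift-notation style looks roughly like this; the
flag names are placeholders, only the 1 << n form is the point:

  enum spbm_flags {
          SPBM_FLAG_ONE   = 1 << 0,
          SPBM_FLAG_TWO   = 1 << 1,
  };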
Signed-off-by: Andy Grover <agrover@versity.com>
Update the README.md introduction to scoutfs to mention the need for and
use of metadata and data block devices.
Signed-off-by: Zach Brown <zab@versity.com>
Require a second path to metadata bdev be given via mount option.
Verify meta sb matches sb also written to data sb. Change code as needed
in super.c to allow both to be read. Remove check for overlapping
meta and data blknos, since they are now on entirely separate bdevs.
Use meta_bdev for superblock, quorum, and block.c reads and writes.
Signed-off-by: Andy Grover <agrover@versity.com>
Write locks are given an increasing version number as they're granted,
which makes its way into items in the log btrees and is used to find the
most recent version of an item.
The initialization of the lock server's next write_version for granted
locks dates back to the initial prototype of the forest of log btrees.
It is only initialized to zero as the module is loaded. This means that
reloading the module, perhaps by rebooting, resets all the item versions
to 0 and can lead to newly written items being ignored in favour of
older existing items with greater versions from a previous mount.
To fix this we initialize the lock server's write_version to the
greatest of all the item versions in the log btrees. We add a field to
the log_trees struct which records the greatest version, maintained as
we write out items in transactions. These fields are read by the server
as it starts.
Then lock recovery needs to include the write_version so that the
lock_server can be sure to set the next write_version past the greatest
version in the currently granted locks.
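Roughly, with made-up struct and field names, the server's
initialization becomes something like:

  /* seed the next write_version from the greatest version recorded in
   * the log_trees items and in recovered locks */
  static u64 initial_write_version(struct lock_server *server)
  {
          struct log_trees_info *lti;
          u64 vers = server->recovered_write_version;

          list_for_each_entry(lti, &server->log_trees_list, head)
                  vers = max(vers, lti->max_item_vers);

          return vers + 1;
  }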
Signed-off-by: Zach Brown <zab@versity.com>
The log_trees structs store the data that is used by client commits.
The primary struct is communicated over the wire so it includes the rid
and nr that identify the log. The _val struct was stored in btree item
values and was missing the rid and nr because those were stored in the
item's key.
It's madness to duplicate the entire struct just to shave off those two
fields. We can remove the _val struct and store the main struct in item
values, including the rid and nr.
Signed-off-by: Zach Brown <zab@versity.com>
Audit code for structs allocated on the stack without initialization,
or allocated with kmalloc() instead of kzalloc(). A sketch of the
zeroing pattern follows the list.
- avl.c: zero padding in avl_node on insert.
- btree.c: Verify item padding is zero, or WARN_ONCE.
- inode.c: scoutfs_inode contains scoutfs_timespecs, which have padding.
- net.c: zero pad in net header.
- net.h: scoutfs_net_addr has padding, zero it in scoutfs_addr_from_sin().
- xattr.c: scoutfs_xattr has padding, zero it.
- forest.c: item_root in forest_next_hint() appears to either be
assigned-to or unused, so no need to zero it.
- key.h: Ensure padding is zeroed in scoutfs_key_set_{zeros,ones}.
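The general pattern for the fixes above, shown with a hypothetical
struct:

  /* hypothetical struct with a compiler padding hole after 'type' */
  struct example {
          __u8 type;
          __le64 value;
  };

  static int fill_examples(struct example **heap_ex, struct example *out)
  {
          struct example stk;

          /* heap: ask the allocator for zeroed memory */
          *heap_ex = kzalloc(sizeof(**heap_ex), GFP_NOFS);
          if (!*heap_ex)
                  return -ENOMEM;

          /* stack: clear the struct, padding included, before use */
          memset(&stk, 0, sizeof(stk));
          stk.type = 1;
          stk.value = cpu_to_le64(2);
          *out = stk;

          return 0;
  }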
Signed-off-by: Andy Grover <agrover@versity.com>
Instead, explicitly add padding fields, and adjust member ordering to
eliminate compiler-added padding between members and at the end of the
struct (if possible: some structs end in a u8[0] array).
This should prevent unaligned accesses. Not a big deal on x86_64, but
other archs like aarch64 really want this.
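As a sketch of the idea, with made-up struct and field names, members
are ordered from largest to smallest and any remaining hole is named
explicitly so writers can zero it:

  struct example_persistent {
          __le64 big;
          __le32 medium;
          __u8   small;
          __u8   __pad[3];        /* explicit, zeroed by writers */
  };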
Signed-off-by: Andy Grover <agrover@versity.com>
This will ensure that structs, which are internally 8-byte aligned,
remain so when in the item cache.
16-byte alignment doesn't seem necessary, so just do 8.
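A sketch of the rounding, with a made-up helper; ALIGN() is the
kernel's round-up macro:

  /* keep each cached value 8-byte aligned so 64-bit fields in the
   * structs it holds stay naturally aligned */
  static unsigned int place_value(struct page *pg, unsigned int off,
                                  void *val, unsigned int val_len)
  {
          off = ALIGN(off, 8);
          memcpy(page_address(pg) + off, val, val_len);
          return off + val_len;
  }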
Signed-off-by: Andy Grover <agrover@versity.com>
We were using a trailing owner offset to iterate over btree item values
from the back of the block towards the front. We did this to reclaim
fragmented free space in a block to satisfy an allocation instead of
having to split the block, which is expensive mostly because it has to
allocate and free metadata blocks.
In the before times, we used to compact items by sorting items by their
offset, moving them, and then sorting them by their keys again. The
sorting by keys was expensive so we added these owner offsets to be able
to compact without sorting.
But the complexity of maintaining the owner metadata is not worth it.
We can avoid the expensive sorting by keys by allocating a temporary
array of item offsets and sorting just that array by value offset.
That's nice and quick; it was the key comparisons that were expensive.
Then we
can remove the owner offset entirely, as well as the block header final
free region that compaction needed.
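Roughly, with made-up names, the compaction path becomes something like
this (sort() is the kernel's lib/sort.c helper):

  static int cmp_off_desc(const void *a, const void *b)
  {
          const u16 *x = a, *y = b;

          /* descending by value offset so values pack from the back */
          return (int)*y - (int)*x;
  }

  static void compact_values(u16 *val_offs, int nr_items)
  {
          sort(val_offs, nr_items, sizeof(val_offs[0]), cmp_off_desc, NULL);

          /* walk val_offs, memmove() each value to its new packed
           * offset at the back of the block, and update the owning
           * item's value offset as we go */
  }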
And we also don't compact as often in the modern era because we do the
bulk of our work in the item cache instead of in the btree, and we've
changed the split/merge/compaction heuristics to avoid constantly
splitting/merging/compacting when an item population happens to hover
right around a shared threshold.
Signed-off-by: Zach Brown <zab@versity.com>
Remove the old superblock fields which were used to track free blocks
found in the radix allocators. We now walk all the allocators when we
need to know the free totals, rather than trying to keep fields in sync.
Signed-off-by: Zach Brown <zab@versity.com>
Before the introduction of the AVL tree to sort btree items, the items
were sorted by sorting a small packed array of offsets. The final
offset in that array pointed to the item in the block with the greatest
key.
With the move to sorting items in an AVL tree by nodes embedded in item
structs, we now don't have the array of offsets and instead have a dense
array of items. Creation and deletion of items always works with the
final item in the array.
last_item() used to return the item with the greatest key by returning
the item pointed to by the final entry in the sorted offset array.
After the change it returned the final entry in the item array, which
was what creation and deletion wanted, but that was no longer the item
with the greatest key.
But splitting and joining still used last_item() to find the item in
the block with the greatest key when updating references to blocks in
parents. Since the introduction of the AVL tree, splitting and joining
have been corrupting the tree by setting parent block reference keys
from whatever item happened to be at the end of the array, not the item
with the greatest key.
The extent code recently pushed hard enough to hit this by working with
relatively random extent items in the core allocation btrees.
Eventually the parent block reference keys got out of sync and we'd fail
to find items by descending into the wrong children when looking for
them. Extent deletion hit this during allocation, returned -ENOENT, and
the allocator turned that into -ENOSPC.
With this fixed we can repeatedly create and delete millions of files
with heavily fragmented extents on a tiny metadata device. Eventually it
actually runs out of space instead of spuriously returning ENOSPC in a
matter of minutes.
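The distinction being fixed, as a sketch with made-up names and
pointer-based AVL nodes for clarity (the real nodes store offsets):

  struct avl_node {
          struct avl_node *left;
          struct avl_node *right;
  };

  struct block_item {
          struct avl_node avl;
          /* key, value offset, ... */
  };

  /* the greatest key lives at the rightmost AVL node, not at the end
   * of the item array */
  static struct block_item *greatest_key_item(struct avl_node *node)
  {
          if (!node)
                  return NULL;
          while (node->right)
                  node = node->right;
          return container_of(node, struct block_item, avl);
  }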
Signed-off-by: Zach Brown <zab@versity.com>
With the introduction of incremental srch file compaction we added some
fields to the srch_compact struct to record the position of compaction
in each file. This increased the size of the struct past the limit the
btree places on the size of item values.
We decrease the number of files per compaction from 8 to 4 to cut the
size of the srch_compact struct in half. This compacts twice as often,
but still relatively infrequently, and it uses half the space for srch
files waiting to hit the compaction threshold.
Signed-off-by: Zach Brown <zab@versity.com>
Previously the srch compaction work would output the entire compacted
file and delete the input files in one atomic commit. The server would
send the input files and an allocator to the client, and the client
would send back an output file and an allocator that included the
deletion of the input files. The server would merge in the allocator
and replace the input file items with the output file item.
Doing it this way required giving an enormous allocation pool to the
client in a radix, which would deal with recursive operations
(allocating from and freeing to the radix that is being modified). We
no longer have the radix allocator, and we use single block avail/free
lists instead of recursively modifying the btrees with free extent
items. The compaction RPC needs to work with a finite amount of
allocator resources that can be stored in an alloc list block.
The compaction work now does a fixed amount of work per iteration, and
a single compaction operation spans multiple iterations.
A single compaction struct is now sent between the client and server in
the get_compact and commit_compact messages. The client records any
partial progress in the struct. The server writes that position into
PENDING items. It first searches for pending items to give to clients
before searching for files to start a new compaction operation.
The compact struct has flags to indicate whether the output file is
being written or the input files are being deleted. The server manages
the flags and sets the input file deletion flag only once the result of
the compaction has been reflected in the btree items which record srch
files.
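A rough sketch of the shape of the shared struct, with hypothetical
field and flag names; the point is the per-file progress and the phase
flags the server manages:

  struct srch_compact {
          __le64 flags;           /* which phase the server is driving */
          struct {
                  __le64 blk;     /* partial progress within this input */
                  __le64 pos;
          } in[4];                /* fixed number of input srch files */
          /* output file reference, allocator list heads, ... */
  };

  #define SRCH_COMPACT_FLAG_WRITE  (1ULL << 0)  /* writing the output file */
  #define SRCH_COMPACT_FLAG_DELETE (1ULL << 1)  /* deleting the input files */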
We added the progress fields to the compaction struct, making it even
bigger than it already was, so we take the time to allocate them rather
than declaring them on the stack.
It's worth mentioning that each operation now taking a reasonably
bounded amount of time will make it feasible to decide that it has
failed and needs to be fenced.
Signed-off-by: Zach Brown <zab@versity.com>
The total_{meta,data}_blocks scoutfs_super_block fields initialized by
mkfs aren't visible to userspace anywhere. Add them to statfs_more so
that tools can get the totals (and use them for df, in this particular
case).
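From userspace this looks roughly like the following; the ioctl and
struct field names here are guesses, only the total_{meta,data}_blocks
values come from the super block:

  struct scoutfs_ioctl_statfs_more sfm = {0};

  if (ioctl(fd, SCOUTFS_IOC_STATFS_MORE, &sfm) == 0)
          printf("%llu meta and %llu data blocks total\n",
                 (unsigned long long)sfm.total_meta_blocks,
                 (unsigned long long)sfm.total_data_blocks);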
Signed-off-by: Zach Brown <zab@versity.com>
Remove the statfs RPC from the client and server now that we're using
allocator iteration to calculate free blocks.
Signed-off-by: Zach Brown <zab@versity.com>
Use alloc_foreach to count the free blocks in all the allocators instead
of sending an RPC to the server. We cache the results so that constant
df calls don't generate a constant stream of IO.
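A sketch of the caching, with made-up names; the idea is just to reuse
the last walk's totals until they age out:

  static int cached_free_blocks(struct super_block *sb, u64 *meta, u64 *data)
  {
          struct sbi *sbi = sb->s_fs_info;
          int ret = 0;

          if (time_after(jiffies, sbi->free_expiry)) {
                  /* walk the allocators with alloc_foreach */
                  ret = count_free_blocks(sb, &sbi->cached_meta,
                                          &sbi->cached_data);
                  if (ret == 0)
                          sbi->free_expiry = jiffies + 10 * HZ; /* arbitrary */
          }
          if (ret == 0) {
                  *meta = sbi->cached_meta;
                  *data = sbi->cached_data;
          }
          return ret;
  }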
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl which copies details of each persistent allocator to
userspace. This will be used by a scoutfs command to give information
about the allocators in the system.
Signed-off-by: Zach Brown <zab@versity.com>
Add an alloc call which reads all the persistent allocators and calls a
callback for each. This is going to be used to calculate free blocks
in clients for df, and in an ioctl to give a more detailed view of
allocators.
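The shape of the call, with a guessed signature; callers provide a
callback that sees each persistent allocator's identity and totals:

  /* called once per persistent allocator */
  typedef int (*alloc_foreach_cb_t)(struct super_block *sb, void *arg,
                                    int type, u64 rid, u64 free_blocks);

  /* df support just sums the free blocks it is shown */
  static int sum_free(struct super_block *sb, void *arg, int type,
                      u64 rid, u64 free_blocks)
  {
          u64 *total = arg;

          *total += free_blocks;
          return 0;
  }

  /* err = scoutfs_alloc_foreach(sb, sum_free, &total); */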
Signed-off-by: Zach Brown <zab@versity.com>
The algorithm for choosing the split key assumed that there were
multiple items in the page. That wasn't always true and it could result
in choosing the first item as the split key, which could end up
decrementing the left page's end key before its start key.
We've since added compaction to the paths that split pages so we now
guarantee that we have at least two items in the page being split. With
that we can be sure to use the second item's key and ensure that we're
never creating invalid keys for the pages created by the split.
Signed-off-by: Zach Brown <zab@versity.com>
The tests for the various page range intersections were out of order.
The edge overlap case could trigger before the bisection case and we'd
fail to remove the initial items in the page. That would leave items
before the start key which would later be used as a midpoint for a
split, causing all kinds of chaos.
Rework the cases so that the overlap cases are last. The unique bisect
case will be caught before we can mistake it for an edge overlap case.
We also minimize the number of comparisons we calculate by storing the
handful that all the cases need.
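A sketch of the reordered tests, with guessed names, for a page caching
[start_key, end_key] and a removed range [start, end] that are known to
intersect:

  int ss = scoutfs_key_compare(start, &pg->start_key);
  int ee = scoutfs_key_compare(end, &pg->end_key);

  if (ss <= 0 && ee >= 0) {
          /* covers the whole page: every item goes */
  } else if (ss > 0 && ee < 0) {
          /* strictly inside: the unique bisection case, caught before
           * it can be mistaken for an edge overlap */
  } else if (ss <= 0) {
          /* overlaps the front edge: remove the leading items */
  } else {
          /* overlaps the back edge: remove the trailing items */
  }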
Signed-off-by: Zach Brown <zab@versity.com>
The first pass of the item cache didn't try to reclaim freed space at
all. It would leave behind very sparse pages, the oldest of which
would be reclaimed by memory pressure.
While this worked, it created much more stress on the system than is
necessary. Splitting a page with one key also makes it hard to
calculate the boundaries of the split pages, given that the start and
end keys could be the single item.
This adds a header field which tracks the free space in item cache
pages. Free space is created before the alloc offset by removing items
from the rbtree, and also by shrinking item values when updating or
deleting items.
If we try to split a page with sufficient free space to insert the
largest possible item then we compact the page instead of splitting it.
We copy the items into the front of an unused page and swap the pages.
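A sketch of the decision at the split point, with made-up names and a
hypothetical largest-item constant:

  /* reuse the page when compaction would make room for the largest
   * possible item; only split when it truly can't fit */
  if (pg->free_bytes >= MAX_ITEM_BYTES)
          compact_page(cinf, pg, spare);
  else
          split_page(cinf, pg, spare);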
Signed-off-by: Zach Brown <zab@versity.com>
Add a quick function that walks the rbtree and makes sure it doesn't see
any obvious key errors. This is far too expensive to use regularly but
it's handy to have around and add calls to when debugging.
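A minimal sketch of the check, with made-up item and field names:

  /* walk items in order and warn if keys ever go backwards or escape
   * the page's key range */
  static void verify_page_items(struct cached_page *pg)
  {
          struct cached_item *item, *prev = NULL;
          struct rb_node *node;

          for (node = rb_first(&pg->item_root); node; node = rb_next(node)) {
                  item = rb_entry(node, struct cached_item, node);

                  WARN_ON_ONCE(scoutfs_key_compare(&item->key,
                                                   &pg->start_key) < 0);
                  WARN_ON_ONCE(scoutfs_key_compare(&item->key,
                                                   &pg->end_key) > 0);
                  WARN_ON_ONCE(prev && scoutfs_key_compare(&prev->key,
                                                           &item->key) >= 0);
                  prev = item;
          }
  }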
Signed-off-by: Zach Brown <zab@versity.com>
The xattr item stream is constructed from a large contiguous region
that contains the struct header, the key, and the value. The value
can be larger than a page so kmalloc is likely to fail as the system
gets fragmented.
Our recent move to the item cache added a significant source of page
allocation churn which moved the system towards fragmentation much more
quickly and was causing high-order allocation failures in testing.
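The usual remedy for this kind of failure, and presumably the direction
here, is to fall back to vmalloc for large buffers; a sketch with
made-up length names:

  /* kvmalloc() tries kmalloc and falls back to vmalloc when high-order
   * pages are scarce (a real fs path also has to mind GFP_NOFS rules) */
  buf = kvmalloc(sizeof(struct scoutfs_xattr) + name_len + val_len,
                 GFP_KERNEL);
  if (!buf)
          return -ENOMEM;
  /* ... build and use the xattr item stream ... */
  kvfree(buf);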
Signed-off-by: Zach Brown <zab@versity.com>
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly. That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.
By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.
Most of this change is churn from changing allocator function and struct
names.
File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity. All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions. This now means
that fallocate and especially restoring offline extents can use larger
extents. Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing. The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks. This resulted in a lot of bugs. Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction. We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.
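The double buffering amounts to something like this hypothetical
sketch:

  struct server_alloc {
          /* allocations are served from the set made stable by the
           * previous commit; this commit's allocations and frees
           * modify the other set, and 'active' flips as each commit
           * lands */
          struct alloc_root avail[2];
          struct alloc_root freed[2];
          int active;
  };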
The server now only moves free extents into client allocators when they
fall below a low threshold. This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add an allocator which uses btree items to store extents. Both the
client and server will use this for btree blocks, the client will use it
for srch blocks and data extents, and the server will move extents
between the core fs allocator btree roots and the clients' roots.
Signed-off-by: Zach Brown <zab@versity.com>
Add infrastructure for working with extents. Callers provide callbacks
which operate on their extent storage while this code performs the
fiddly splitting and merging of extents. This layer doesn't have any
persistent structures itself; it only operates on native structs in
memory.
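A sketch of the callback interface, with made-up names; callers supply
storage operations and this layer handles splitting and merging of
neighbouring extents:

  struct ext_extent {
          u64 start;
          u64 len;
          u64 map;        /* mapped location, if any */
          u8 flags;
  };

  struct ext_ops {
          int (*next)(struct super_block *sb, void *arg, u64 start,
                      u64 len, struct ext_extent *found);
          int (*insert)(struct super_block *sb, void *arg, u64 start,
                        u64 len, u64 map, u8 flags);
          int (*remove)(struct super_block *sb, void *arg, u64 start,
                        u64 len, u64 map, u8 flags);
  };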
Signed-off-by: Zach Brown <zab@versity.com>
The percpu_counter library merges the per-cpu counters with a shared
count when the per-cpu counter gets larger than a certain value. The
default is very small, so we often end up taking a shared lock to update
the count. Use a larger batch so that we take the lock less often.
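A sketch with an illustrative batch size; depending on kernel version
the call is percpu_counter_add_batch() or __percpu_counter_add():

  #define PCPU_BATCH      (1 << 10)

  /* the per-cpu delta only folds into the shared count, under the
   * counter's lock, once it exceeds the batch */
  percpu_counter_add_batch(&counters->items, 1, PCPU_BATCH);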
Signed-off-by: Zach Brown <zab@versity.com>
Now that the item cache is bearing the load of high frequency item
calls, we can remove all the item granular work that the forest was
trying to do. The item cache amortizes the cost of the forest so its
remaining methods can go straight to the btrees and don't need
complicated state to reduce the overhead of item calls.
Signed-off-by: Zach Brown <zab@versity.com>
Use the new item cache for all the item work in the fs instead of
calling into the forest of btrees. Most of this is mechanical
conversion from the _forest calls to the _item calls. The item cache
no longer supports the kvec argument for describing values so all the
callers pass in the value pointer and length directly.
The item cache doesn't support saving items as they're deleted and later
restoring them from an error unwinding path. There were only two users
of this. Directory entries can easily guarantee that deletion won't
fail by dirtying the items first in the item cache. Xattr updates were
a little trickier. They can combine dirtying, creating, updating, and
deleting to atomically switch between items that describe different
versions of a multi-item value. This also fixed a bug in the srch
xattrs where replacing an xattr would create a new id for the xattr and
leave existing srch items referencing a now deleted id. Replacing now
reuses the old id.
And finally we add back in the locking and transaction item cache
integration.
Signed-off-by: Zach Brown <zab@versity.com>
Add an item cache between fs callers and the forest of btrees. Calling
out to the btrees for every item operation was far too expensive. This
gives us a flexible in-memory structure for working with items that
isn't bound by the constraints of persistent block IO. We can stream
large groups of items to and from the btrees relatively rarely and then
use efficient kernel memory structures for more frequent item
operations.
This adds the infrastructure, nothing is calling it yet.
Signed-off-by: Zach Brown <zab@versity.com>
Add forest calls that the item cache will use. It needs to read all the
items in the leaf blocks of the forest btrees which could contain the key,
write dirty items to the log btree, and dirty bits in the bloom block as
items are dirtied.
Signed-off-by: Zach Brown <zab@versity.com>
Add btree calls to call a callback for all items in a leaf, and to
insert a list of items into their leaf blocks. These will be used by
the item cache to populate the cache and to write dirty items into dirty
btree blocks.
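Guessed shapes of the two calls:

  /* call cb for every item in the leaf block that contains key */
  typedef int (*btree_item_cb_t)(struct super_block *sb, void *arg,
                                 struct scoutfs_key *key, void *val,
                                 int val_len);
  int scoutfs_btree_read_items(struct super_block *sb,
                               struct scoutfs_btree_root *root,
                               struct scoutfs_key *key,
                               btree_item_cb_t cb, void *arg);

  /* insert a sorted list of dirty items into their dirty leaf blocks */
  int scoutfs_btree_insert_list(struct super_block *sb,
                                struct scoutfs_btree_root *root,
                                struct list_head *items);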
Signed-off-by: Zach Brown <zab@versity.com>
The current btree walk recorded the start and end of child subtrees as
it walked, and it could give the caller the next key to iterate towards
after the block it returned. Future methods want to get at the key
bounds of child subtrees, so we add a key range struct that all walk
callers provide and which the walk fills with all the interesting keys
it calculates.
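A sketch of the range struct, with hypothetical field names:

  /* filled by the walk for every caller */
  struct btree_walk_key_range {
          struct scoutfs_key start;       /* smallest key the block covers */
          struct scoutfs_key end;         /* largest key the block covers */
          struct scoutfs_key iter_next;   /* next key to iterate towards */
  };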
Signed-off-by: Zach Brown <zab@versity.com>
Btree traversal doesn't split a block if it has room for the caller's
item. Extract this test into a function so that an upcoming btree call
can test that each of multiple insertions into a leaf will fit.
Signed-off-by: Zach Brown <zab@versity.com>
Remove the last remnants of the indexed xattrs which used fs items.
This makes the significant change of renumbering the key zones so I
wanted it in its own commit.
Signed-off-by: Zach Brown <zab@versity.com>
In a merge where the input and source trees are the same, the input
block can be an initial pre-cow version of the dirty source block.
Dirtying blocks in the change will clear allocations in the dirty source
block but they will remain in the pre-cow input block. The merge can
then set these blocks in the dst, even though they were also used by
allocation, because they're still set in the pre-cow input block.
This fix is clumsy, but minimal and specific to this problem. A more
thorough fix is being worked on which introduces additional staging
allocator trees and should stop calls from modifying the currently
active avail or free trees.
Signed-off-by: Zach Brown <zab@versity.com>
Lock invalidation has to make sure that changes are visible to future
readers. It was syncing if the current transaction was dirty. This was
never optimal, but it wasn't catastrophic when concurrent invalidation
work could all block on one sync in progress.
With the move to a single invalidation worker serially invalidating
locks it became unacceptable. Invalidation happening in the presence of
writers would constantly sync the current transaction while very old
unused write locks were invalidated. Their changes had long since been
committed in previous transactions.
We add a lock field to remember the transaction sequence which could
have been dirtied under the lock. If that transaction has already been
committed by the time we invalidate the lock, it doesn't have to sync.
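A sketch of the check, with made-up names:

  /* while dirtying items under the lock */
  lock->dirty_trans_seq = scoutfs_trans_current_seq(sb);

  /* ... later, in the invalidation worker ... */
  if (lock->dirty_trans_seq > scoutfs_trans_committed_seq(sb))
          ret = scoutfs_trans_sync(sb);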
Signed-off-by: Zach Brown <zab@versity.com>