Commit Graph

257 Commits

Author SHA1 Message Date
Zach Brown
2591e54fdc Make it easier to build scoutfs.ko
We were duplicating the make args a few times so make a little ARGS
variable.

Default to the /lib/modules/$(uname -r) installed kernel source if
SK_KSRC isn't set.

And only try the sparse build, which can fail, when we can actually
execute the sparse command.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2017-04-18 14:03:24 -07:00
Nic Henke
9fc47dedf8 Add unlocked ioctls for directories.
The use of the scoutfs ioctls for inode-since and data-since on the root
directory is a rather helpful boost. This allows user code to start on
blank filesystems and monitor activity without needing to create files.

The ioctl code was already present, so wiring it into the directory
file operations was all that needed to happen.

Signed-off-by: Nic Henke <nic.henke@versity.com>
Reviewed-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2017-04-18 14:03:24 -07:00
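
A minimal sketch of that wiring, assuming the shared handler is named
scoutfs_ioctl (a hypothetical name; the real dir file_operations has
more methods):

    /* dir.c: give directories the same unlocked ioctl entry point as files */
    const struct file_operations scoutfs_dir_fops = {
            .unlocked_ioctl = scoutfs_ioctl,
            /* ... existing readdir and friends ... */
    };
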
Zach Brown
e61697a54e Add generic file and dir seek methods
Two more xfstests pass when we can seek in files and dirs.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2017-04-18 14:03:22 -07:00
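
This likely amounts to pointing the llseek methods at a generic helper;
a hedged sketch, the helper actually chosen in the tree may differ:

    const struct file_operations scoutfs_file_fops = {
            .llseek = generic_file_llseek,
            /* ... */
    };

    const struct file_operations scoutfs_dir_fops = {
            .llseek = generic_file_llseek,
            /* ... */
    };
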
Zach Brown
efd95688d3 Add printf format checking to scoutfs msg funcs
scoutfs_msg() was missing the attribute to check printf formats and
arguments.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2017-04-18 13:59:54 -07:00
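
The fix is the usual kernel __printf() annotation; a sketch assuming
scoutfs_msg() takes the super block as its first argument:

    /* gcc now warns when the format string and arguments disagree */
    __printf(2, 3)
    void scoutfs_msg(struct super_block *sb, const char *fmt, ...);
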
Zach Brown
cec3f9468a Further isolate rings and compaction
Each mount was still loading the manifest and allocator rings and
starting compaction, even when it was coordinating segment reads
and writes with the server.

This moves ring and compaction setup and teardown from mount and
unmount to server startup and shutdown.  Now only the server has the
rings resident and is running compaction.

We had to null some of the super info fields so that we can repeatedly
load and destroy the ring indices over the lifetime of a mount.

We also have to be careful not to call from item transactions into
compaction.  We'll restore this functionality with the server in the
future.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
5eefaf34f8 Server updates ring for level0 segment writes
Transaction commits currently directly modify the ring and super block
as segments are written.  As we introduce shared mounts, only the server
can modify the ring and super blocks.

This adds network messages to let mounts write items in a level 0
segment while the server modifies the allocator and manifest.

The item transaction commit now sends a message to the server to get an
allocated segno for its new level0 segment and sends a manifest entry to
the server once the segment is written.  The request and reply handlers
for the functions are straightforward.  The processing paths are simple
wrappers around the allocation and update functions that transaction
writing used to call directly.

Now that the item transactions aren't updating the super, sync can't
work with the super sequence numbers.

The server needs to make both allocations and manifest updates
persistent before it sends replies to the client.  We add the ability
for the server processing paths to queue and wait for commits of the
rings and super block.  We can hopefully get reasonable batching by using
a work struct for the commit.  We update the other processing path
callers that modify the rings to use the new commit mechanism.

We add a few segment and manifest functions to work with manifest
entries that describe segments.  This creates a bit of similar looking
code throughout the segment and manifest code but we'll come back and
clean this up once we see what the final shared support looks like.

scoutfs_seg_alloc() now takes the segno from the caller for the segment
it's allocating and inserting into the cache.  Transaction commit uses
the segno it got from the server while compaction still allocates
locally.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
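
A hedged sketch of the client-side commit flow described above; apart
from scoutfs_seg_alloc(), which the commit names, the helper names here
are hypothetical:

    static int commit_level0_segment(struct super_block *sb)
    {
            struct scoutfs_segment *seg = NULL;
            u64 segno;
            int ret;

            /* request/reply: the server allocates and persists the segno */
            ret = scoutfs_net_alloc_segno(sb, &segno);
            if (ret)
                    return ret;

            /* the segno now comes from the caller, not a local allocator */
            ret = scoutfs_seg_alloc(sb, segno, &seg);
            if (ret)
                    return ret;

            ret = scoutfs_seg_submit_write(sb, seg);
            if (ret == 0)
                    /* request/reply: the server adds the manifest entry */
                    ret = scoutfs_net_record_segment(sb, seg);

            scoutfs_seg_put(seg);
            return ret;
    }
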
Zach Brown
5487aee6a7 Read items with manifest entries from server
Item reading tries to directly walk the manifest to find segments to
read.  That doesn't work when only the server has read the ring and
loaded the manifest.

This adds a network message to ask the server for the manifest entries
that describe the segments that will be needed to read items.

Previously item reading would walk the manifest and build up native
manifest references in a list that it'd use to read.   To implement the
network message we add request sending, processing, and reply parsing
around those original functions.  Item reading now packs its key range
and sends it to the server.  The server walks the manifest and sends the
entries that intersect with the key range.  Then the reply function
builds up the native manifest references that item reading will use.

The net reply functions needed an argument so that the manifest reading
request could pass in the caller's list that the native manifest
references should be added to.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
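
A hedged sketch of the request side, with entirely hypothetical names
and structures; the point is that the caller's list is handed to the
reply handler as its argument:

    struct manifest_range_req {
            struct scoutfs_key first;
            struct scoutfs_key last;
    };

    static int get_manifest_refs(struct super_block *sb,
                                 struct scoutfs_key *first,
                                 struct scoutfs_key *last,
                                 struct list_head *ref_list)
    {
            struct manifest_range_req req = {
                    .first = *first,
                    .last = *last,
            };

            /* ref_list is passed through to the reply function */
            return scoutfs_net_request(sb, SCOUTFS_NET_MANIFEST_RANGE,
                                       &req, sizeof(req), ref_list);
    }
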
Zach Brown
b50de90196 Alloc inodes from pool from server
Inode allocation was always modifying the in-memory super block.  This
doesn't work when the server is solely responsible for modifying the
super blocks.  We add network messages so that mounts can ask the
server for inodes that they can use to satisfy allocation.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
453715a78d Only shutdown locks that were setup
Lock shutdown was crashing trying to deref a null linf on cleanup from
mount errors that happened before locks were set up.  Make sure lock
shutdown only tries to do work if the locks have been set up.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
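
The fix pattern, assuming the lock info hangs off the scoutfs super
info; the accessor and field names here are guesses:

    void scoutfs_lock_shutdown(struct super_block *sb)
    {
            struct lock_info *linf = SCOUTFS_SB(sb)->lock_info;

            /* mount can fail before lock setup ever runs */
            if (!linf)
                    return;

            /* ... tear down held locks and free linf ... */
    }
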
Zach Brown
45882f5a77 Add some ring tracing
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
5e0e9ac12e Move to much simpler manifest/alloc storage
Using the treap to be able to incrementally read and write the manifest
and allocation storage from all nodes wasn't quite ready for prime time.
The biggest problem is invalidating cached nodes which are the target
of native pointers, either for consistency or under memory pressure.
This was getting in the way of adding shared support as
readers and writers try to use as much of their treap caches as they
can.  There were other serious problems that we'd run into eventually:
memory pressure from duplicate caching in native nodes and the page
cache, small IOs from reading a page at a time, the risk of
pathologically imbalanced treaps, and the ring being corrupted if the
migration balancing doesn't work (the model assumed you could always
dirty an individual node in a transaction, but in fact you have to
dirty all the parents in each new transaction).

Let's back off to a much simpler mechanism while we build the rest of
the system around it.  We can revisit aggressively optimizing this when
it's our worst problem.

We'll store the indexes that the manifest server needs in simple
preallocated rings with log entries.   The server has to read the index
in its entirety into a native rbtree before it can work on it.  We won't
access the physical ring from mounts anymore, they'll send messages to
the server.

The ring callers are now working with a pinned tree in memory so the
interface can be a bit simpler.  By storing the indexes in their own
rings, the code and write path become a lot simpler: we have an IO
submission path for each index instead of "dirtying" calls per index and
then a writing call.

All this is much more robust and much less likely to get in our way as
we stand up the rest of the system around it.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
86d3090982 Tighten lock range error handling
If lock_range returns an error then the caller won't unlock the range.
Make sure to unlock the range if we have it locked when we get errors
that we're going to return to the caller.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
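
The pattern being enforced, as a hedged fragment with hypothetical
names:

    static int locked_op(struct super_block *sb, struct scoutfs_key *start,
                         struct scoutfs_key *end)
    {
            int ret;

            ret = scoutfs_lock_range(sb, start, end);
            if (ret)
                    return ret;

            ret = do_the_op(sb, start, end);

            /* never return to the caller with the range still held */
            scoutfs_unlock_range(sb, start, end);
            return ret;
    }
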
Zach Brown
104bbb06a9 Remove cached range when invalidating items
When invalidating items we need to remove the cached
range that covers the range of keys that we're removing so that
the removed items aren't then considered negative cached items.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
2ea5f1d734 invalidate_others could return uninit ret
Make sure to initialize ret in case there aren't other mounts.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
8c59902b70 scoutfs: cleanup socket callbacks
The first attempt at wiring up the socket callbacks was a bit too
precious.  We can simplify and do what other modern socket callback
users do: don't bother with the callback locks and call shutdown before
release.

We also protect against spurious callbacks by only doing work in the
callbacks when the sk user_data points to a sock_info which points back
to the socket.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
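
A hedged sketch of the spurious-callback guard; the callback signature
varies by kernel version and sock_info is the hypothetical per-socket
struct:

    static void scoutfs_data_ready(struct sock *sk)
    {
            struct sock_info *sinf = sk->sk_user_data;

            /* only do work if user_data still points back at this socket */
            if (!sinf || sinf->sock->sk != sk)
                    return;

            queue_work(sinf->wq, &sinf->recv_work);
    }
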
Zach Brown
27e55eb43c Flesh out some pieces of the scoutfs.md doc
Trying to keep adding coverage across the design.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
39ae89d85f Add network messaging between mounts
We're going to need communication between mounts to update and
distribute the manifest and allocators in the treap ring.

This adds a networking core where one mount becomes the server and other
mounts send requests to it.  The messaging semantics are pretty simple
in that clients reliably send requests and the server passively replies to
requests.  Complexity beyond that is up to the callers implementing the
requests.

It relies on locking to establish the server role and to broadcast the
address of the server socket.  We add a trivial lvb back to our local
test locking implementation to store the address.  We also add the
ability to shut down locking so that the locking networking work stops
blocking.

A little demonstration request is included which just gives visibility
into client and server clocks in the trace logs.  Next up we'll add the
requests that do real work.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
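
A hedged guess at what a minimal wire header for these request/reply
messages might look like; the real layout in the tree may differ:

    struct scoutfs_net_header {
            __le64 id;        /* request id, echoed back in the reply */
            __le16 type;      /* which request this is */
            __le16 status;    /* 0 in requests, result in replies */
            __le16 data_len;  /* payload bytes that follow */
            __u8   data[];
    } __packed;
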
Zach Brown
392ed81c43 Add some simple lock/invalidation tracing
Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
955d940c64 Restore key tracing
Now that the keys are a contiguous buffer we can format them for the
trace buffers with a much more straightforward type check around
per-key snprintfs.  We can get rid of all the weird kvec code that tried
to deal with keys that straddled vectors.

With that fixed we can uncomment the tracing statements that were
waiting on the key formatting.

I was testing with xattr keys so they're added as the code is updated.
The rest of the key types will be added separately as they're used.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
607eff9b7c Add range locking to xattr ops
We can use easy xattrs to test range locking and item consistency
between mounts.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:49:16 -07:00
Zach Brown
b3b2693939 Add simple debugging range locking layer
We can work on shared mechanics without requiring a full locking server.
We can stand up a simple layer which uses shared data structures in a
kernel image to lock between mounts in the same kernel.

On mount we add supers to a list.  Held locks are tracked in an rbtree.
A lock attempt blocks until it doesn't conflict with anything in the
rbtree.

As locks are acquired we walk all the other supers and write/invalidate
any items they have which intersect with the acquired range.  This is
easier to implement and less efficient than caching locks after they're
unlocked and implementing downconvert/blocking/revoke.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:55 -07:00
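
A hedged sketch of the conflict test against the rbtree of held locks;
the walk is deliberately naive since this is only a debugging layer,
and the struct and helper names are hypothetical:

    static bool range_conflicts(struct rb_root *held, struct scoutfs_key *start,
                                struct scoutfs_key *end)
    {
            struct held_lock *hl;
            struct rb_node *node;

            for (node = rb_first(held); node; node = rb_next(node)) {
                    hl = rb_entry(node, struct held_lock, node);
                    /* ranges overlap unless one ends before the other starts */
                    if (scoutfs_key_cmp(start, &hl->end) <= 0 &&
                        scoutfs_key_cmp(end, &hl->start) >= 0)
                            return true;
            }

            return false;
    }
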
Zach Brown
f373f05fb7 Add engineering markdown document
Let's put the engineering doc in the source tree so that eventually
it'll be easily found upstream.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:55 -07:00
Zach Brown
97cb75bd88 Remove dead btree, block, and buddy code
Remove all the unused dead code from the previous btree block design.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:55 -07:00
Zach Brown
6bcdca3cf9 Update dirent last pos and update first comment
The last valid pos for us is now a full u64 because we're storing
entries at an increasing counter instead of at a hash of the entry name.

And might as well add a clarifying comment to the first pos while we're
here.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:55 -07:00
Zach Brown
00fed84c68 Build statfs f_blocks from total_segs
Use the current total_segs field to calculate the total number of blocks
in the system instead of the old and redundant field which is going
away.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
02af35a98e Convert inode since ioctl to the item API
The inode since ioctl was the last user of the btree.  It doesn't yet
work because the item cache doesn't know how to search for items by
sequence.

It's not yet clear exactly how we'll build the data since ioctls.  It'll
be easy enough to refactor the inode since item walk if they follow a
similar pattern again.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
429e1b6eb4 Truncate data items
scoutfs_data_truncate_items() was still using the btree.  This updates
it to use the item cache but doesn't yet support regions being offline.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
92b10e8270 Write super with bio functions
Write our super block from an allocated page with our bio functions
instead of relying on the old block cache layer which is going away.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
75b018a0e7 Add symlinks back
Convert symlinks to use the new item cache API.  This is so much easier
because our max item size matches the symlink size.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
54e07470f1 Update xattrs to use the item cache
Update the xattrs to use the item cache.  Because we now have large keys
we can store the xattr at its full name instead of having to deal with
hashing the name and addressing collisions.

Now that we don't have the find xattr ioctls we don't need to maintain
backrefs.

We also add support for large xattrs that span multiple items.  The key
footer and value header give us the metadata we need to iterate over the
items that make up an xattr.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
64bc145e3c Add scoutfs_item_set_batch()
We're about to update xattrs to use the item cache API and xattrs want
to be pretty big.  scoutfs_item_set_batch() lets the xattr code
atomically update xattrs made up of multiple items.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
a310027380 Remove the find xattr ioctls
The current plan for finding populations of inodes to search no longer
involves xattr backrefs.  We're about to change the xattr storage format
so let's remove these interfaces so we don't have to update them.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
fff6fb4740 Restore link backref items
Convert the link backref code from btree items to the item cache.

Now that the backref items have the full entry name we can traverse a
link with one item lookup.  We don't need to lock the inode and verify
that the entry at the backref offset really points to our inode.  The
link backref walk gets a lot simpler.

But we have to widen the ioctl cursor to store a full dir ino and path
name instead of just the dir's backref counter.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
8def9141bc Add scoutfs_key_init_buf_len()
So far the static key users have key and buffer lengths that match.
We're about to add a link backref caller who searches with a small key
but gets a result copied into a larger buffer.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
6516ce7d57 Report free blocks in statfs
Our statfs callback was still using the old buddy allocator.

We add a free segments field to the super and have it track the number
of free segments in the allocator.  We then use that to calculate the
number of free blocks for statfs.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
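
A hedged sketch of the resulting statfs callback, combining this with
the earlier f_blocks change; the constant and field names other than
total_segs and free_segs are assumptions:

    static int scoutfs_statfs(struct dentry *dentry, struct kstatfs *kst)
    {
            struct scoutfs_super_block *super = &SCOUTFS_SB(dentry->d_sb)->super;

            kst->f_bsize  = SCOUTFS_BLOCK_SIZE;
            kst->f_blocks = le64_to_cpu(super->total_segs) * SCOUTFS_SEGMENT_BLOCKS;
            kst->f_bfree  = le64_to_cpu(super->free_segs) * SCOUTFS_SEGMENT_BLOCKS;
            kst->f_bavail = kst->f_bfree;

            return 0;
    }
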
Zach Brown
9f5e42f7dd Add simple data items
Add basic file data support by managing file data items from the page
cache address space callbacks.

Data is read by copying from cached items into page contents in
readpage.

Writes create new ephemeral items which reference dirty pages.  The
items are deleted once they're written in a transaction or if
invalidatepage removes the dirty page they reference.

There's a lot more to do to remove data copies, avoid compaction bw
overhead, and add support for truncate, o_direct, and mmap.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
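
A hedged sketch of the read side; scoutfs_item_copy_data() is a made-up
helper standing in for however the item cache hands back file data:

    static int scoutfs_readpage(struct file *file, struct page *page)
    {
            struct inode *inode = page->mapping->host;
            void *addr = kmap(page);
            int ret;

            /* fill the page from cached data items, zeroing any holes */
            ret = scoutfs_item_copy_data(inode, page_offset(page), addr, PAGE_SIZE);
            kunmap(page);

            if (ret == 0)
                    SetPageUptodate(page);
            unlock_page(page);
            return ret;
    }
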
Zach Brown
1ad479a1af Add ephemeral items
Ephemeral items exist to reference external values.  They're going to be
used by the page cache to reference dirty pages for writeback.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
c3307e941b Add scoutfs_item_forget()
Add a forget call which forcefully removes an item, no matter its
state.  The page cache will use this in invalidatepage to drop
ephemeral items that reference a dirty page that's being truncated.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9f885b4c12 Fix item erase augmentation
The item cache was getting inconsistent as items were removed.  This
would manifest in failing to find dirty items that it had counted as it
was writing items into the segment and removing deletion items.

For a start it wasn't using the augmented rb_erase().  We make a
function that everyone uses.  There's no augmented rb_replace() so we
just augment erase, restart, and insert.  (We could probably augment on
descent and replace/propagate but that can come later.)

Then the augmentation callbacks got the semantics slightly wrong.  The
rotation callback is named after a caller that happens to use it, not on
any implied relationship between the nodes.  It actually just
recalculates the augmentation value for the two subtrees.  Mischief
managed.

(We'll probably rework the augmentation so the value is for the node and
its children and we can get rid of the extra code we have today to
support our augmentation value that is sensitive to the difference
between the left and right subtrees.)

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
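
The single erase helper amounts to always going through the kernel's
rb_erase_augmented() with the augment callbacks; the struct and
callback names here are guesses:

    static void erase_item(struct item_cache *cac, struct cached_item *item)
    {
            /* everyone erases through here so the augmentation stays correct */
            rb_erase_augmented(&item->node, &cac->items, &scoutfs_item_rb_cb);
    }
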
Zach Brown
568cefa4db Add some item debugging tracing to seg writing
Trace the items that we count and then write to the segment.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
7045e3a6e8 More efficiently destroy item rbtrees
I was auditing rb_erase() use and noticed that we don't need to fully
tear down the item trees.  We can just blow them away with postorder
traversal and raw frees of the nodes.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
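
The postorder teardown is likely just the kernel helper plus kfree; a
sketch assuming a hypothetical cached_item struct with an rb_node
member called node:

    static void destroy_item_tree(struct rb_root *root)
    {
            struct cached_item *item;
            struct cached_item *tmp;

            /* no per-node rebalancing, just free everything bottom-up */
            rbtree_postorder_for_each_entry_safe(item, tmp, root, node)
                    kfree(item);

            *root = RB_ROOT;
    }
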
Zach Brown
0298cbb562 Fix compact cleanup on mount failure
scoutfs_compact_destroy() was testing the wrong pointer to see if
_setup() had built up resources that needed to be torn down.  It'd crash
on mount failure.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
67aec72c77 Add readdir items
Restore readdir functionality by adding readdir items.

The readdir items are keyed by an increasing position in the parent
dir's inode.  We track it in our inode info.  To delete the readdir
items we restore the dentry_info and put the pos in the dentry so unlink
can build the readdir item key.  And finally we put the pos in the
lookup dirent so that it can populate the dentry info on lookup.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9a293bfa75 Add item delete dirty and many interfaces
Add item functions for deleting items that we know to be dirty and add a
user in another function that deletes many items without leaving partial
deletions behind in the case of errors.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
f139cf4a5e Convert unlink and orphan processing
Restore unlink functionality by converting unlink and orphan item
processing from the old btree interface to the new item cache interface.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9d6d70bd89 Add an item next for key len ignoring val
Add scoutfs_item_next_same() which requires that the key lengths be
identical but which allows any values, including no value by way of a
null kvec.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9d68e272cc Allow creation of items with no value
Item creation always tried to allocate a value.  We have some item types
which don't have values.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
8f63196318 Add key inc/dec variants for partial keys
Some callers know that it's safe to increment their partial keys.  Let
them do so without trying to expand the keys to full precision and
triggering warnings that their buffers aren't large enough.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
2ac239a4cb Add deletion items
So far we were only able to add items to the segments.  To support
deletion we have to insert deletion items and then remove them and the
item they reference when their segments are compacted.

As callers attempt to delete items from the item cache we replace the
existing item with a deletion marker with the key but no value.  Now
that there are deletion items in the cache we have to teach the other
item cache operations to skip them.  There's some noise in the patch
from moving functions around so that item insertion can free a deletion
item it finds.

The deletion items are written out to the segment as usual except now
the in-segment item struct has a flag to mark a deletion item and the
deletion item is removed from the cache once it's written to the segment.

Item reading knows to skip deletion items and not add them back into
the cache.

Compaction proceeds as usual for most of the levels with the deletion
item clobbering any older higher level items with the same key.
Eventually the deletion item itself is removed by skipping over it when
compacting into the final level.  We support this by adding a little
call that describes the max level of the tree at the time the compaction
starts so that compaction can tell when it should skip copying the
deletion item into that final level.

All of this is for deletion of items with a precise key.  In the future
we'll expand the deletion items so that they can reference a contiguous
range of keys.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
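
A hedged sketch of the compaction decision described above, with
hypothetical names; the "little call" mentioned would set
ctx->last_level before the compaction starts:

    static bool copy_item_to_output(struct compact_ctx *ctx,
                                    struct native_item *item)
    {
            if (!item_is_deletion(item))
                    return true;

            /*
             * A deletion item still has to clobber older items at higher
             * levels, but once we're writing the final level there is
             * nothing left to shadow and the marker itself is dropped.
             */
            return ctx->output_level < ctx->last_level;
    }
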
Zach Brown
cfc6d72263 Remove item off and len packing
The key and value offsets and lengths were aggressively packed into the
item structs in the segments.   This saved a few bytes per item but
didn't leave any room for expansion without growing the item.  We
want to add a deletion item flag so let's just grow the item struct.  It
now has room for full precision offsets and lengths that we can access
natively so we can get rid of the packing and unpacking functions.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
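
A hedged guess at the grown on-disk item struct; the exact widths and
field names in the format header may differ, the point is the unpacked
offsets and lengths plus the new flags byte:

    struct scoutfs_segment_item {
            __le32 key_off;
            __le32 val_off;
            __le16 key_len;
            __le16 val_len;
            __u8   flags;     /* room for a deletion flag */
    } __packed;
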