Commit Graph

35 Commits

Zach Brown
79110a74eb scoutfs: prevent partial block stage, except final
The staging ioctl is just a thin wrapper around writing.  If we allowed
partial-block staging then the write would zero a newly allocated block
and only stage in the partial region of the block, leaving zeros in the
file that didn't exist before.

We prevent staging when the starting offset isn't block aligned.  We
prevent staging when the final offset isn't block aligned unless it
matches the size because the stage ends in the final partial block of
the file.
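
As a rough illustration of that rule (the block size constant and variable names here are invented for the sketch, not taken from the scoutfs code), the check amounts to something like:

#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE 4096ULL	/* assumed block size for the sketch */

/* return true if staging count bytes at pos into a file of i_size is allowed */
static bool stage_range_allowed(uint64_t pos, uint64_t count, uint64_t i_size)
{
	uint64_t end = pos + count;

	/* the starting offset must always be block aligned */
	if (pos % BLOCK_SIZE)
		return false;

	/* the final offset must be block aligned unless it matches the size,
	 * which covers a stage that ends in the file's final partial block */
	if ((end % BLOCK_SIZE) && end != i_size)
		return false;

	return true;
}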

This is verified by an xfstest (scoutfs/003) that is in flight.

Signed-off-by: Zach Brown <zab@versity.com>
2017-09-07 13:49:37 -07:00
Zach Brown
599269e539 scoutfs: don't return uninit index entries
Initially the index walking ioctl only ever output a single entry per
iteration.  So the number of entries to return and the next entry
pointer to copy to userspace were maintained in the post-increment of
the for loop.

When we added locking of the index item results we made it possible to
not copy any entries in a loop iteration.  When that happened the nr and
pointer would be incremented without initializing the entry.  The ioctl
caller would see a garbage entry in the results.
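
In other words, the fix is to only advance the count and output pointer once an entry has actually been filled in.  A minimal sketch of the corrected shape, with invented names and a stub standing in for the locked lookup:

struct index_entry {
	unsigned long long ino;
	unsigned long long major;
	unsigned int minor;
};

static int copy_next_entry(struct index_entry *ent)
{
	/* stand-in for the locked item lookup; a real version may copy
	 * nothing on a given pass, which is the case the fix handles */
	(void)ent;
	return 0;
}

static int walk_index(struct index_entry *ents, int max)
{
	int nr = 0;

	while (nr < max) {
		/* only count the slot after something was copied into it */
		if (copy_next_entry(&ents[nr]) != 1)
			break;
		nr++;
	}

	return nr;	/* every returned entry was initialized */
}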

This was visible in scoutfs/002 test results on a volume that had an
interesting file population after having run through all the other
scoutfs tests.  The uninitialized entries would show up as garbage in
the size index portion of the test.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-30 10:38:00 -07:00
Mark Fasheh
0011c185a9 scoutfs: plug the rest of our locking into dlmglue
We move struct ocfs2_lock_res_ops and flags to dlmglue.c so that
locks.c can get access to them. Similarly, we export
ocfs2_lock_res_init_common() so that locks.c can initialize each lockres
before use. Also, free_lock_tree() now has to happen before we shut
down the dlm - this gives dlmglue the opportunity to unlock the
underlying dlm locks before we go off freeing the structures.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-24 11:45:15 -05:00
Mark Fasheh
021404bb6a scoutfs: remove inode ctime index
Like the mtime index, this index is unused. Removing it is a near
identical task. Running the same createmany test from our last
patch gives us the following:

 $ createmany -o '/scoutfs/file_%lu' 10000000

 total: 10000000 creates in 598.28 seconds: 16714.59 creates/second

 real    9m58.292s
 user    0m7.420s
 sys     5m44.632s

So after both indices are gone, we go from a 12m56s run time to 9m58s,
saving almost 3 minutes, which translates into a reduction in total run
time of about 23%.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-22 15:59:13 -07:00
Mark Fasheh
d59367262d scoutfs: remove inode mtime index
This index is unused - we can gain some create performance by removing it.

To verify this, I ran createmany for 10 million files:

 $ createmany -o '/scoutfs/file_%lu' 10000000

Before this patch:
 total: 10000000 creates in 776.54 seconds: 12877.56 creates/second

 real    12m56.557s
 user    0m7.861s
 sys     6m56.986s

After this patch:
 total: 10000000 creates in 691.92 seconds: 14452.46 creates/second

 real    11m31.936s
 user    0m7.785s
 sys     6m19.328s

So removing the index gained us about a minute and a half on the test, or
about a 12% increase in creates per second.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-08-22 15:59:13 -07:00
Zach Brown
c7ad9fe772 scoutfs: make release block granular
The existing release interface specified byte regions to release but
that didn't match what the underlying file data mapping structure was
capable of.  What happens if you specify a single byte to release?  Does
it release the whole block?  Does it release nothing?  Does it return an
error?

By making the interface match the capability of the operation we make
the functioning of the system that much more predictable.  Callers are
forced to think about implementing their desires in terms of block
granular releasing.
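
A hypothetical sketch of what a block granular argument struct and a caller's rounding might look like (the struct layout and names below are invented for illustration, not the actual scoutfs ioctl ABI):

#include <stdint.h>

struct release_blocks_args {
	uint64_t block;		/* first block to release */
	uint64_t count;		/* number of blocks to release */
	uint64_t data_version;	/* only release while this still matches */
};

/*
 * A caller that thinks in bytes has to decide the rounding itself, for
 * example by shrinking the range to the whole blocks it fully covers.
 */
static void bytes_to_whole_blocks(uint64_t start, uint64_t len, uint64_t bs,
				  struct release_blocks_args *args)
{
	uint64_t first = (start + bs - 1) / bs;	/* round the start up */
	uint64_t last = (start + len) / bs;	/* round the end down */

	args->block = first;
	args->count = last > first ? last - first : 0;
}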

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-14 09:19:03 -07:00
Zach Brown
c1b2ad9421 scoutfs: separate client and server net processing
The networking code was really suffering by trying to combine the client
and server processing paths into one file.  The code can be a lot
simpler by giving the client and server their own processing paths that
take their different socket lifecycles into account.

The client maintains a single connection.  Blocked senders work on the
socket under a sending mutex.  The recv path runs in work that can be
canceled after first shutting down the socket.

A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets.  Each accepted socket has
a single recv work blocked waiting for requests.  That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.
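
The teardown ordering is the important part: shut the socket down first so blocked recv work returns, then cancel the work.  A rough kernel-style sketch with invented structure names (only kernel_sock_shutdown() and cancel_work_sync() are real kernel calls):

#include <linux/net.h>
#include <linux/workqueue.h>

struct conn_info {
	struct socket *sock;
	struct work_struct recv_work;	/* blocks receiving messages */
};

static void conn_shutdown(struct conn_info *conn)
{
	/* shutting down the socket unblocks recv work stuck in recvmsg ... */
	kernel_sock_shutdown(conn->sock, SHUT_RDWR);

	/* ... so cancel can wait for it to finish without queueing more work */
	cancel_work_sync(&conn->recv_work);
}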

All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server.  This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing other work while it was itself being drained.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:47:42 -07:00
Zach Brown
0b64a4c83f scoutfs: lock inode index item iteration
Add locks around inode index item iteration.  This is tricky because the
inode index items are enormous and we can't default to coarse locks that
let us read and iterate over the entire key space.  We use the manifest
to find the next small fixed size region to lock and iterate from.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
8d29c82306 scoutfs: sort keys by zone, then inode, then type
Holding a DLM lock protects a range of the key space.  The DLM locks
span inodes or regions of inodes.  We need the sort order in LSM items
to match the DLM range keys so that we can read all the items covered by
a lock into the cache from a region of LSM segments.  If their orders
differed then we'd have to jump around segments to find all the items
covered by a given DLM lock.

Previously we were sorting by type then, within types, by inode.  Now we
want to sort by inode then by type.  But there are structures which
previously had a type but weren't then sorted by inode.  We introduce
zones as the primary sort key.  Inode index and node zones are sorted by
the inode fields and node ids respectively.  Then comes the fs zone,
sorted first by inode and then by the type of the key.
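
A small sketch of that comparison order (the struct fields and their widths here are invented for illustration):

#include <stdint.h>

struct item_key {
	uint8_t zone;		/* inode index, node, or fs zone */
	uint64_t ino;		/* primary sort within the fs zone */
	uint8_t type;		/* item type within the inode */
	uint64_t offset;
};

static int cmp_u64(uint64_t a, uint64_t b)
{
	return a < b ? -1 : a > b ? 1 : 0;
}

/* zone first, then inode, then type, so one lock covers a contiguous run */
static int item_key_cmp(const struct item_key *a, const struct item_key *b)
{
	int cmp;

	if ((cmp = (int)a->zone - (int)b->zone))
		return cmp;
	if ((cmp = cmp_u64(a->ino, b->ino)))
		return cmp;
	if ((cmp = (int)a->type - (int)b->type))
		return cmp;
	return cmp_u64(a->offset, b->offset);
}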

The bulk of this is the mechanical introduction of the zone field to the
keys, moving the type field down, and a bulk rename of _KEY to _TYPE.
But there are some more substantial changes.

The orphan keys needed to be put in a zone.   They fit in the NODE zone
which is all about resources that nodes hold and would need to be
cleaned up if the node went away.

The key formatting is significantly changed to match the new sort order.
Formatted keys are now generally of the form "zone.primary.type..."

And finally with the keys now properly sorted by inodes we can correctly
construct a single range of item cache keys to invalidate when unlocking
the inode group locks.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
2eecbbe78a scoutfs: add item cache key ioctls
These ioctls let userspace see the items and ranges that are cached.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
b7bbad1fba scoutfs: add precise transaction item reservations
We had a simple mechanism for ensuring that a transaction didn't create
more items than would fit in a single written segment.  We calculated
the most dirty items that a holder could generate and assumed that all
holders dirtied that much.

This had two big problems.

The first was that it wasn't accounting for nested holds.
write_begin/end calls the generic inode dirtying path while holding a
transaction.  This ended up deadlocking: the dirty inode waited to be
able to write while the transaction hold it took back in write_begin
prevented writeout.

The second was that the worst case (full size xattr) item dirtying is
enormous and meaningfully restricts concurrent transaction holders.
With no currently dirty items you can have fewer than 16 full size xattr
writes.  This concurrency limit only gets worse as the transaction fills
up with dirty items.

This fixes those problems.  It adds precise accounting of the dirty
items that can be created while a transaction is held.  These
reservations are tracked in journal_info so that they can be used by
nested holds.  The precision allows much greater concurrency as
something like a create will try to reserve a few hundred bytes instead
of 64k.  Normal sized xattr operations won't try to reserve the largest
possible space.
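
The nested-hold part leans on the task's journal_info pointer; a very rough sketch with invented types and fields (only current->journal_info itself is the real kernel facility):

#include <linux/sched.h>

struct trans_reservation {
	unsigned int holders;	/* nested hold depth in this task */
	unsigned int items;	/* reserved dirty item count */
	unsigned int keys;	/* reserved key bytes */
	unsigned int vals;	/* reserved value bytes */
};

static void trans_hold(struct trans_reservation *res)
{
	struct trans_reservation *cur = current->journal_info;

	if (cur) {
		/* a nested hold reuses the outer reservation */
		cur->holders++;
		return;
	}

	res->holders = 1;
	current->journal_info = res;
}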

We add some feedback from the item cache to the transaction to issue
warnings if a holder dirties more items than it reserved.

Now that we have precise item/key/value counts (segment space
consumption is a function of all three :/) we can no longer track
transaction holders with a single atomic.  We add a long-overdue
trans_info, put a proper lock and fields there, and much more clearly
track transaction serialization amongst the holders and the writer.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:15:13 -07:00
Zach Brown
5f11cdbfe5 scoutfs: add and index inode meta and data seqs
For each transaction we send a message to the server asking for a
unique sequence number to associate with the transaction.  When we
change metadata or data of an inode we store the current transaction seq
in the inode and we index it with index items like the other inode
fields.

The server remembers the sequences it gives out.  When we go to walk the
inode sequence indexes we ask the server for the largest stable seq and
limit results to that seq.  This ensures that we never return seqs that
could still be dirty, so inodes and seqs never appear to change in the past.
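
The walk-side rule reduces to a simple visibility test; sketched here with invented names:

#include <stdbool.h>
#include <stdint.h>

/* seqs past the server's stable seq might still be dirty, so hide them */
static bool seq_index_entry_visible(uint64_t entry_seq, uint64_t stable_seq)
{
	return entry_seq <= stable_seq;
}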

Nodes use the sync timer to regularly cycle through seqs and ensure that
inode seq index walks don't get stuck on their otherwise idle seq.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:12:24 -07:00
Zach Brown
5307c56954 scoutfs: add a stat_more ioctl
We have inode fields that we want to return to userspace with very low
overhead.
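
A hypothetical userspace caller, just to show the shape of such an ioctl (the ioctl number and struct fields below are invented for the sketch and are not the real scoutfs interface):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct stat_more_example {
	uint64_t data_version;
	uint64_t meta_seq;
	uint64_t data_seq;
};

#define EXAMPLE_IOC_STAT_MORE _IOR('e', 1, struct stat_more_example)

int main(int argc, char **argv)
{
	struct stat_more_example stm = { 0 };
	int fd;

	if (argc != 2)
		return 1;

	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, EXAMPLE_IOC_STAT_MORE, &stm) < 0) {
		perror("stat_more");
		return 1;
	}

	printf("data_version %llu\n", (unsigned long long)stm.data_version);
	close(fd);
	return 0;
}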

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 14:28:10 -07:00
Zach Brown
b97587b8fa scoutfs: add indexing of inodes by fields
Add items for indexing inodes by their fields.  When we update the inode
item we also delete the old index items and create the new items.  We
rename and refactor the old inode since ioctl to now walk the inode
index items.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 10:48:12 -07:00
Zach Brown
e34f8db4a9 scoutfs: add release argument and result tracing
Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 10:48:12 -07:00
Zach Brown
a262a158ce scoutfs: fix single block release
The offset comparison in release that was meant to catch wrapping was
inverted and accidentally prevented releasing a single block.
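
Roughly, the corrected shape of the check (not the literal code) is that a range is only rejected when its last block is strictly before its first, so a single block range where the two are equal is allowed:

#include <stdbool.h>
#include <stdint.h>

static bool release_range_valid(uint64_t start_block, uint64_t last_block)
{
	/* last == start is a single block and must be allowed;
	 * an inverted comparison here rejects exactly that case */
	return last_block >= start_block;
}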

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 10:48:12 -07:00
Zach Brown
97cb75bd88 Remove dead btree, block, and buddy code
Remove all the unused dead code from the previous btree block design.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:55 -07:00
Zach Brown
02af35a98e Convert inode since ioctl to the item API
The inode since ioctl was the last user of the btree.  It doesn't work
yet because the item cache doesn't know how to search for items by
sequence.

It's not yet clear exactly how we'll build the data since ioctls.  It'll
be easy enough to refactor the inode since item walk if they follow a
similar pattern again.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
a310027380 Remove the find xattr ioctls
The current plan for finding populations of inodes to search no longer
involves xattr backrefs.  We're about to change the xattr storage format
so let's remove these interfaces so we don't have to update them.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
fff6fb4740 Restore link backref items
Convert the link backref code from btree items to the item cache.

Now that the backref items have the full entry name we can traverse a
link with one item lookup.  We don't need to lock the inode and verify
that the entry at the backref offset really points to our inode.  The
link backref walk gets a lot simpler.

But we have to widen the ioctl cursor to store a full dir ino and path
name instead of just the dir's backref counter.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
9f5e42f7dd Add simple data items
Add basic file data support by managing file data items from the page
cache address space callbacks.

Data is read by copying from cached items into page contents in
readpage.

Writes create new ephemeral items which reference dirty pages.  The
items are deleted once they're written in a transaction or if
invalidatepage removes the dirty page they reference.

There's a lot more to do to remove data copies, avoid compaction bw
overhead, and add support for truncate, o_direct, and mmap.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:44:54 -07:00
Zach Brown
c6b688c2bf Add staging ioctl
This adds the ioctl for writing archived file contents back into the
file if the data_version still matches.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
df561bbd19 Add offline extent flag and release ioctl
Add the _OFFLINE flag to indicate offline extents.  The release ioctl
frees extents within the release range and sets their _OFFLINE flag if
the data_version still matches.

We tweak the existing truncate item function just a bit to support
making extents offline.  We make it take an explicit range of blocks to
remove instead of just giving it the size and it learns to mark extents
offline and update them instead of always deleting them.

Reads from offline extents return zeros like reading from a sparse
region (later it will trigger demand staging) and writing to offline
extents clears the offline flag (later only staging can do that).

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
5d87418925 Add ioctl for sampling inode data version
Add an ioctl that samples the inode's data_version.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Mark Fasheh
467801de73 scoutfs: use extents for file data
We're very basic here at this stage and simply put a single-block extent
item where we would have previously had a multi-block bmap item.
Multi-block extents will come in future patches.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Signed-off-by: Zach Brown <zab@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
ae6cc83d01 Raise the nlink limit
A few xfstests tests were failing because they tried to create a decent
number of hard links to a file.

We had a small nlink limit because the inode-paths ioctl copied all the
paths for all the hard links to a userspace buffer which could be
enormous if there was a larger nlink limit.

The hard link backref disk format already has a natural counter that
could be used as a cursor to iterate over all the hard links that point
to a given inode.

This refactors the inode_paths ioctl into a ino_path ioctl that returns
a single path for the given counter and returns the counter for the next
path that links to the inode.  Happily this lets us get rid of all the
weird path component lists and allocations.  Now there's just the kernel
path buffer that gets null terminated path components and the userspace
buffer that those are copied to.
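
A hypothetical userspace loop over such a cursor interface (the ioctl number, struct layout, and field names below are invented for illustration):

#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

struct ino_path_example {
	uint64_t ino;		/* inode to resolve */
	uint64_t cursor;	/* counter returned by the previous call */
	char path[PATH_MAX];	/* one null terminated path per call */
};

#define EXAMPLE_IOC_INO_PATH _IOWR('e', 2, struct ino_path_example)

static void print_all_paths(int fd, uint64_t ino)
{
	struct ino_path_example args = { .ino = ino, .cursor = 0 };

	/* each call fills in one path and the cursor for the next link */
	while (ioctl(fd, EXAMPLE_IOC_INO_PATH, &args) == 0)
		printf("%s\n", args.path);

	if (errno != ENOENT)
		perror("ino_path");
}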

We don't fully relax the nlink limit.  stat(2) returns the link count as
a u32.  We go a step further and limit it to S32_MAX so that apps might
avoid sign bugs.  That still gives us a more generous limit than ext4
and btrfs which are around U16_MAX.

Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
2016-11-16 14:45:08 -08:00
Zach Brown
165d833c46 Walk stable trees in _since ioctls
The _since ioctls walk btrees and return items that are newer than a
given sequence number.  The intended behaviour is that items will
appear at a greater sequence number if they change after appearing
in the queries.  This promise doesn't hold for items that are being
modified in the current transaction.  The caller would have to always
ask for seq X + 1 after seeing seq X to make sure it got all the changes
that happened in seq X while it was the current dirty transaction.

This is fixed by having the interfaces walk the stable btrees from the
previous transaction.  The results will always be a little stale but
userspace already has to deal with stale results because it can't lock
out change, and it can use sync (and a commit age tunable we'll add) to
limit how stale the results can be.

Signed-off-by: Zach Brown <zab@versity.com>
2016-11-08 16:05:36 -08:00
Zach Brown
16e94f6b7c Search for file data that has changed
We don't overwrite existing data.  Every file data write has to allocate
new blocks and update block mapping items.

We can search for inodes whose data has changed by filtering block
mapping item walks by the sequence number.  We do this by using the
exact same code for finding changed inodes but using the block mapping
key type.

Signed-off-by: Zach Brown <zab@versity.com>
2016-10-20 13:55:14 -07:00
Zach Brown
84f23296fd scoutfs: remove btree cursor
The btree cursor was built to address two problems.  First it
accelerates iteration by avoiding full descents down the tree by holding
on to leaf blocks.  Second it lets callers reference item value contents
directly to avoid copies.

But it also has serious complexity costs.  It pushes refcounting and
locking out to the caller.  There have already been a few bugs where
callers did things while holding the cursor without realizing that
they're holding a btree lock and can't perform certain btree operations
or even copies to user space.

Future changes to the allocator to use the btree motivate cleaning up
the tree locking, which is complicated by the cursor being a stand-alone
lock reference.  Instead of continuing to layer complexity onto this
construct let's remove it.

The iteration acceleration will be addressed the same way we're going to
accelerate the other btree operations: with per-cpu cached leaf block
references.  Unlike the cursor this doesn't push interface changes out
to callers who want repeated btree calls to perform well.

We'll leave the value copying for now.  If it becomes an issue we can
add variants that call a function to operate on the value.  Let's hope
we don't have to go there.

This change replaces the cursor with a vector describing the memory that
the value should be copied to and from.  The vector has a fixed number of elements
and is wrapped in a struct for easy declaration and initialization.
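
Something along these lines, with placeholder names and an arbitrary element count (not the actual scoutfs definitions):

#include <stddef.h>

#define VAL_NR_VECS 4

struct val_vec {
	struct {
		void *ptr;	/* memory the value is copied to or from */
		size_t len;
	} vec[VAL_NR_VECS];
	unsigned int nr;	/* elements actually in use */
};

/* declare and initialize a single element vector around one buffer */
#define DECLARE_VAL_SINGLE(name, buf, size)			\
	struct val_vec name = {					\
		.vec = { { .ptr = (buf), .len = (size) } },	\
		.nr = 1,					\
	}

Reading calls then return the number of value bytes they copied into the vector, which is what the per-caller verification listed below checks.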

This change to the interface looks noisy but each caller's change is
pretty mechanical.  They tend to involve:

 - replace the cursor with the value struct and initialization
 - allocate some memory to copy the value in to
 - reading functions return the number of value bytes copied
 - verify that the copied bytes make sense for the item being read
 - getting rid of confusing ((ret = _next())) looping
 - _next now returns -ENOENT instead of 0 for no next item
 - _next iterators now need to advance the key themselves
 - make sure to free allocated memory

Sometimes the order of operations changes significantly.  Now that we
can't modify in place we need to read, modify, write.  This looks like
changing a modification of the item through the cursor to a
lookup/update pattern.

The symlink item iterators didn't need to use next because they walk a
contiguous set of keys.  They're changed to use simple insert or lookup.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
2bed78c269 scoutfs: specify btree root
The btree functions currently don't take a specific root argument.  They
assume, deep down in btree_walk, that there's only one btree in the
system.  We're going to be adding a few more to support richer
allocation.

To prepare for this we have the btree functions take an explicit btree
root argument.  This should make no functional difference.

Signed-off-by: Zach Brown <zab@versity.com>
2016-09-21 10:04:07 -07:00
Zach Brown
c90710d26b scoutfs: add find xattr ioctls
Add ioctls that return the inode numbers that probably contain the given
xattr name or value.  To support these we add items that index inodes by
the presence of xattr items whose names or values hash to a given hash
value.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-23 12:14:55 -07:00
Zach Brown
0991622a21 scoutfs: add inode_paths ioctl
This adds the ioctl that returns all the paths from the root to a given
inode.  The implementation only traverses btree items to keep it
isolated from the vfs object locking and life cycles, but that could be
a performance problem.  This is another motivation to accelerate the
btree code.

Signed-off-by: Zach Brown <zab@versity.com>
2016-08-11 16:46:18 -07:00
Zach Brown
90a73506c1 scoutfs: remove homebrew tracing
Oh, thank goodness.  It turns out that there's a crash extension for
working with tracepoints in crash dumps.  Let's use standard tracepoints
and pretend this tracing hack never happened.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-20 12:08:12 -07:00
Zach Brown
b51511466a scoutfs: add inodes_since ioctl
Add the ioctl that lets us find out about inodes that have changed
since a given sequence number.

A sequence number is added to the btree items so that we can track the
tree update that it last changed in.  We update this as we modify
items and maintain it across item copying for splits and merges.

The big change is using the parent item ref and item sequence numbers
to guide iteration over items in the tree.  The easier change is to have
the current iteration skip over items whose sequence number is too old.

The more subtle change has to do with how iteration is terminated.  The
current termination could stop when it doesn't find an item because that
could only happen at the final leaf.  When we're ignoring items with old
seqs this can happen at the end of any leaf.  So we change iteration to
keep advancing through leaf blocks until it crosses the last key value.

We add an argument to btree walking which communicates the next key that
can be used to continue iterating from the next leaf block.  This works
for the normal walk case as well as the seq walking case where walking
terminates prematurely in an interior node full of parent items with old
seqs.

Now that we're more robustly advancing iteration with btree walk calls
and the next key we can get rid of the 'next_leaf' hack which was trying
to do the same thing inside the btree walk code.  It wasn't right for
the seq walking case and was pretty fiddly.

The next_key increment could wrap the maximal key at the right spine of
the tree so we have _inc saturate instead of wrap.
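
A sketch of a saturating increment over a multi-field key (the field names and widths are invented; the real key has more fields):

#include <stdint.h>

struct example_key {
	uint64_t ino;
	uint8_t type;
	uint64_t offset;
};

static void example_key_inc(struct example_key *key)
{
	/* carry from the least significant field upwards */
	if (++key->offset)
		return;
	if (++key->type)
		return;
	if (++key->ino)
		return;

	/* the key was already maximal: saturate instead of wrapping to zero */
	key->offset = UINT64_MAX;
	key->type = UINT8_MAX;
	key->ino = UINT64_MAX;
}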

And finally, we want these inode scans to not have to skip over all the
other items associated with each inode as it walks looking for inodes
with the given sequence number.  We change the item sort order to first
sort by type instead of by inode.  We've wanted this more generally to
isolate item types that have different access patterns.

Signed-off-by: Zach Brown <zab@versity.com>
2016-07-05 14:46:20 -07:00
Zach Brown
7d6dd91a24 scoutfs: add tracing messages
This adds tracing functionality that's cheap and easy to
use.  By constantly gathering traces we'll always have rich
history to analyze when something goes wrong.

Signed-off-by: Zach Brown <zab@versity.com>
2016-05-28 11:11:15 -07:00