scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-05-02 02:45:43 +00:00

Author	SHA1	Message	Date
Auke Kok	4ef64c6fcf	Vfs methods become user namespace mount aware. v5.11-rc4-24-g549c7297717c All of these VFS methods are now passed a user_namespace. Signed-off-by: Auke Kok <auke.kok@versity.com>	2024-10-03 12:41:05 -07:00
Zach Brown	4a8240748e	Add project ID support Add support for project IDs. They're managed through the _attr_x interfaces and are inherited from the parent directory during creation. Signed-off-by: Zach Brown <zab@versity.com>	2024-06-28 15:09:05 -07:00
Zach Brown	fb5331a1d9	Add inode retention bit Add a bit to the private scoutfs inode flags which indicates that the inode is in retention mode. The bit is visible through the _attr_x interface. It can only be set on regular files and when set it prevents modification to all but non-user xattrs. It can be cleared by root. Signed-off-by: Zach Brown <zab@versity.com>	2024-06-28 15:09:05 -07:00
Zach Brown	6931cb7b0e	Add scoutfs_inode_[gs]et_flags Add functions for getting and setting our private inode flags. Signed-off-by: Zach Brown <zab@versity.com>	2024-06-28 14:53:49 -07:00
Zach Brown	8e37be279c	Use seqlock to protect inode fields We were using a seqcount to protect high frequency reads and writes to some of our private inode fields. The writers were serialized by the caller but that's a bit too easy to get wrong. We're already storing the write seqcount update so the additional internal spinlock stores in seqlocks aren't a significant additional overhead. The seqlocks also handle preemption for us. Signed-off-by: Zach Brown <zab@versity.com>	2024-06-25 15:11:20 -07:00
Auke Kok	d480243c11	Support .read/write_iter callbacks in lieu of .aio_read/write The aio_read and aio_write callbacks are no longer used by newer kernels, which now use iter-based readers and writers. We can avoid implementing plain .read and .write as an iter will be generated when needed for us automatically. We add a new data_wait_check_iter() function accordingly. With these methods removed from the kernel, the el8 kernel no longer uses the extended ops wrapper struct and is much closer now to upstream. As a result, a lot of methods are moving around from inode_dir_operations to and from inode_file_operations etc, and perhaps things will look a bit more structured as a result. As a result, we need a slightly different data_wait_check() that accounts for the iter and offset properly. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-10-09 15:35:40 -04:00
Auke Kok	ec50e66fff	Timespec64 changes for yr2038. Provide a fallback `current_time(inode)` implementation for older kernels. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-10-09 15:35:40 -04:00
Zach Brown	276fbebdac	Avoid dput in lock invalidation The d_prune_aliases in lock invalidation was thought to be safe because the caller had an inode reference, surely it can't get into iput_final. I missed the fundamental dcache pattern that dput can ascend through parents and end up in inode eviction for entirely unrelated inodes. It's very easy for this to deadlock, imagine if nothing else that the inode invalidation is blocked on in dput->iput->evict->delete->lock is itself in the list of locks to invalidate in the caller. We fix this by always kicking off d_prune and dput into async work. This increases the chance that inodes will still be referenced after invalidation and prevent inline deletion. More deletions can be deferred until the orphan scanner finds them. It should be rare, though. We're still likely to put and drop invalidated inodes before a writer gets around to removing the final unlink and asking us for the omap that describes our cached inodes. To perform the d_prune in work we make it a behavioural flag and make our queued iputs a little more robust. We use much safer and understandable locking to cover the count and the new flags and we put the work in re-entrant work in their own workqueue instead of one work instance in the system_wq. Signed-off-by: Zach Brown <zab@versity.com>	2022-12-02 12:28:13 -08:00
Zach Brown	71ed4512dc	Include primary lock write_seq for write_only vers FS items are deleted by logging a deletion item that has a greater item version than the item to delete. The versions are usually maintained by the write_seq of the exclusive write lock that protects the item. Any newer write hold will have a greater version than all previous write holds so any items created under the lock will have a greater vers than all previous items under the lock. All deletion items will be merged with the older item and both will be dropped. This doesn't work for concurrent write-only locks. The write-only locks match with each other so their write_seqs are assigned in the order that they are granted. That grant order can be mismatched with item creation order. We can get deletion items with lesser versions than the item to delete because of when each creation's write-only lock was granted. Write only locks are used to maintain consistency between concurrent writers and readers, not between writers. Consistency between writers is done with another primary write lock. For example, if you're writing seq items to a write-only region you need to have the write lock on the inode for the specific seq item you're writing. The fix, then, is to pass these primary write locks down to the item cache so that it can choose an item version that is the greatest amongst the transaction, the write-only lock, and the primary lock. This now ensures that the primary lock's increasing write_seq makes it down to the item, bringing item version ordering in line with exclusive holds of the primary lock. All of this to fix concurrent inode updates sometimes leaving behind duplicate meta_seq items because old seq item deletions ended up with older versions than the seq item they tried to delete, nullifying the deletion. Signed-off-by: Zach Brown <zab@versity.com>	2022-11-15 13:26:32 -08:00
Zach Brown	bddca171ee	Call iput outside cluster locked transactions The final iput of an inode can delete items in cluster locked transactions. It was never safe to call iput within locked transactions but we never saw the problem. Recent work on inode deletion raised the issue again. This makes sure that we always perform iput outside of locked transactions. The only interesting change is making scoutfs_new_inode() return the allocated inode on error so that the caller can put the inode after releasing the transaction. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-11 15:29:20 -08:00
Zach Brown	d846eec5e8	Harden final inode deletion We were seeing a number of problems coming from races that allowed tasks in a mount to try and concurrently delete an inode's items. We could see error messages indicating that deletion failed with -ENOENT, we could see users of inodes behave erratically as inodes were deleted from under them, and we could see eventual server errors trying to merge overlapping data extents which were "freed" (added to transaction lists) multiple times. This commit addresses the problems in one relatively large patch. While we could mechanically split up the fixes, they're all interdependent and splitting them up (bisecting through them) could cause failures that would be devilishly hard to diagnose. First we stop allowing multiple cached vfs inodes. This was initially done to avoid deadlocks between lock invalidation and final inode deletion. We add a specific lookup that's used by invalidation which ignores any inodes which are in I_NEW or I_FREEING. Now that iget can wait on inode flags we call iget5_locked before acquiring the cluster lock. This ensures that we can only have one cached vfs inode for a given inode number in evict_inode trying to delete. Now that we can only have one cached inode, we can rework the omap tracking to use _set and _clear instead of _inc and _put. This isn't strictly necessary but is a simplification and lets us issue warnings if we see that we ever try to set an inode number's bit on behalf of multiple cached inodes. We also add a _test helper. Orphan scanning would try to perform deletion by instantiating a cached inode and then putting it, triggering eviction and final deletion. This was an attempt to simplify concurrency but ended up causing more problems. It no longer tries to interact with the inode cache at all and attempts to safely delete inode items directly. It uses the omap test to determine that it should skip an already cached inode. 
We had attempted to forbid opening inodes by handle if they had an nlink of 0. Since we allowed multiple cached inodes for an inode number this was to prevent adding cached inodes that were being deleted. It was only performing the check on newly allocated inodes, though, so it could get a reference to the cached inode that the scanner had inserted for deleting. We're choosing to keep restricting opening by handle to only linked inodes so we also check existing inodes after they're refreshed. We're left with a task evicting an inode and the orphan scanner racing to delete an inode's items. We move the work of determining if it's safe to delete out of scoutfs_omap_should_delete() and into try_delete_inode_items() which is called directly from eviction and scanning. This is mostly code motion but we do make three critical changes. We get rid of the goofy concurrent deletion detection in delete_inode_items() and instead use a bit in the lock data to serialize multiple attempts to delete an inode's items. We no longer assume that the inode must still be around because we were called from evict and specifically check that the inode item is still present for deleting. Finally, we use the omap test to discover that we shouldn't delete an inode that is locally cached (and would not be included in the omap response). We do all this under the inode write lock to serialize between mounts. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-11 15:28:58 -08:00
Zach Brown	a67ea30bb7	Add orphan_scan_delay_ms mount option Add a mount option to set the delay between scanning of the orphan list. The sysfs file for the option is writable so this option can be set at run time. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-10 11:43:11 -08:00
Zach Brown	15d7eec1f9	Disallow opening unlinked files by handle Our open by handle functions didn't care that the inode wasn't referenced and let tasks open unlinked inodes by number. This interacted badly with the inode deletion mechanisms which required that inodes couldn't be cached on other nodes after the transaction which removed their final reference. If a task did accidentally open a file by inode while it was being deleted it could see the inode items in an inconsistent state and return very confusing errors that look like corruption. The fix is to give the handle iget callers a flag to tell iget to only get the inode if it has a positive nlink. If iget sees that the inode has been unlinked it returns ENOENT. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:40:08 -08:00
Zach Brown	cff17a4cae	Remove unused flags scoutfs_inode_refresh arg The flags argument to scoutfs_inode_refresh wasn't being used. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:40:08 -08:00
Zach Brown	e9b3cc873a	Export scoutfs_inode_init_key We're adding an ioctl that wants to build inode item keys so let's export the private inode key initializer. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-24 09:40:08 -08:00
Zach Brown	e38beee85a	Stop using inode index key type as array index The code that updates inode index items on behalf of indexed fields uses an array to track changes in the fields. Those array indexes were the raw key type values. We're about to introduce some sparse space between all the key values so that we have some room to add keys in the future at arbitrary sort positions amongst the previous keys. We don't want the inode index item updating code to keep using raw types as array indices when the type values are no longer small dense values. We introduce indirection from type values to array indices to keep the tracking array in the in-memory inode struct small. Signed-off-by: Zach Brown <zab@versity.com>	2021-10-28 12:30:47 -07:00
Zach Brown	94e5bc1457	Remove unused scoutfs_last_ino() Signed-off-by: Zach Brown <zab@versity.com>	2021-10-28 12:30:46 -07:00
Zach Brown	46edf82b6b	Add inode crtime creation time Add an inode creation time field. It's created for all new inodes. It's visible to stat_more. setattr_more can set it during restore. Signed-off-by: Zach Brown <zab@versity.com>	2021-09-03 11:14:41 -07:00
Zach Brown	a4f5293e78	Flush invalidate and iput inode references We can be performing final deletion as inodes are evicted during unmount. We have to keep full locking, transactions, and networking up and running for the evict_inodes() call in generic_shutdown_super(). Unfortunately, this means that workers can be using inode references during evict_inodes() which prevents them from being evicted. Those workers can then remain running as we tear down the system, causing crashes and deadlocks as the final iputs try to use resources that have been destroyed. The fix is to first properly stop orphan scanning, which can instantiate new cached inodes, before the call to kill_block_super ends up trying to evict all inodes. Then we just need to wait for any pending iput and invalidate work to finish and perform the final iput, which will always evict because generic_shutdown_super has cleared MS_ACTIVE. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	d6bed7181f	Remove almost all interruptible waits As subsystems were built I tended to use interruptible waits in the hope that we'd let users break out of most waits. The reality is that we have significant code paths that have trouble unwinding. Final inode deletion during iput->evict in a task is a good example. It's madness to have a pending signal turn an inode deletion from an efficient inline operation to a deferred background orphan inode scan deletion. It also happens that golang built pre-emptive thread scheduling around signals. Under load we see a surprising amount of signal spam and it has created surprising error cases which would have otherwise been fine. This changes waits to expect that IOs (including network commands) will complete reasonably promptly. We remove all interruptible waits with the notable exception of breaking out of a pending mount. That requires shuffling setup around a little bit so that the first network message we wait for is the lock for getting the root inode. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	c51f0c37da	Defer dirty inode data writeback (and use list) iput() can only be used in contexts that could perform final inode deletion which requires cluster locks and transactions. This is absolutely true for the transaction committing worker. We can't have deletion during transaction commit trying to get locks and dirty more items in the transaction. Now that we're properly getting locks in final inode deletion and O_TMPFILE support has put pressure on deletion, we're seeing deadlocks between inode eviction during transaction commit getting an index lock and index lock invalidation trying to commit. We use the newly offered queued iput to defer the iput from walking our dirty inodes. The transaction commit will be able to proceed while the iput worker is off waiting for a lock. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:20:40 -07:00
Zach Brown	52107424dd	Promote deferred iput to inode call Lock invalidation had the ability to kick iput off to work context. We need to use it for inode writeback as well so we move the mechanism over to inode.c and give it a proper call. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 11:34:52 -07:00
Zach Brown	73bf916182	Return ENOSPC as space gets low Returning ENOSPC is challenging because we have clients working on allocators which are a fraction of the whole and we use COW transactions so we need to be able to allocate to free. This adds support for returning ENOSPC to client posix allocators as free space gets low. For metadata, we reserve a number of free blocks for making progress with client and server transactions which can free space. The server sets the low flag in a client's allocator if we start to dip into reserved blocks. In the client we add an argument to entering a transaction which indicates if we're allocating new space (as opposed to just modifying existing data or freeing). When an allocating transaction runs low and the server low flag is set then we return ENOSPC. Adding an argument to transaction holders and having it return ENOSPC gave us the opportunity to clean it up and make it a little clearer. More work is done outside the wait_event function and it now specifically waits for a transaction to cycle when it forces a commit rather than spinning until the transaction worker acquires the lock and stops it. For data the same pattern applies except there are no reserved blocks and we don't COW data so it's a simple case of returning the hard ENOSPC when the data allocator flag is set. The server needs to consider the reserved count when refilling the client's meta_avail allocator and when swapping between the two meta_avail and meta_free allocators. We add the reserved metadata block count to statfs_more so that df can subtract it from the free meta blocks and make it clear when ENOSPC is going to be returned for metadata allocations. We increase the minimum device size in mkfs so that small testing devices provide sufficient reserved blocks. And finally we add a little test that makes sure we can fill both metadata and data to ENOSPC and then recover by deleting what we filled. 
Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	07210b5734	Reliably delete orphaned inodes Orphaned items haven't been deleted for quite a while -- the call to the orphan inode scanner has been commented out for ages. The deletion of the orphan item didn't take rid zone locking into account as we moved deletion from being strictly local to being performed by whoever last used the inode. This reworks orphan item management and brings back orphan inode scanning to correctly delete orphaned inodes. We get rid of the rid zone that was always _WRITE locked by each mount. That made it impossible for other mounts to get a _WRITE lock to delete orphan items. Instead we rename it to the orphan zone and have orphan item callers get _WRITE_ONLY locks inside their inode locks. Now all nodes can create and delete orphan items as they have _WRITE locks on the associated inodes. Then we refresh the orphan inode scanning function. It now runs regularly in the background of all mounts. It avoids creating cluster lock contention by finding candidates with unlocked forest hint reads and by testing inode caches locally and via the open map before properly locking and trying to delete the inode's items. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:52:46 -07:00
Zach Brown	603af327ac	Ignore I_FREEING in all inode hash lookups Previously we added an ilookup variant that ignored I_FREEING inodes to avoid a deadlock between lock invalidation (lock->I_FREEING) and eviction (I_FREEING->lock). Now we're seeing similar deadlocks between eviction (I_FREEING->lock) and fh_to_dentry's iget (lock->I_FREEING). I think it's reasonable to ignore all inodes with I_FREEING set when we're using our _test callback in ilookup or iget. We can remove the _nofreeing ilookup variant and move its I_FREEING test into the iget_test callback provided to both ilookup and iget. Callers will get the same result, it will just happen without waiting for a previously I_FREEING inode to leave. They'll get NULL instead of waiting from ilookup. They'll allocate and start to initialize a newer instance of the inode and insert it alongside the previous instance. We don't have inode number re-use so we don't have the problem where a newly allocated inode number is relying on inode cache serialization to not find a previously allocated inode that is being evicted. This change does allow for concurrent iget of an inode number that is being deleted on a local node. This could happen in fh_to_dentry with a raw inode number. But this was already a problem between mounts because they don't have a shared inode cache to serialize them. Once we fix that between nodes, we fix it on a single node as well. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-28 12:22:10 -07:00
Zach Brown	4389c73c14	Fix deadlock between lock invalidate and evict We've had a long-standing deadlock between lock invalidation and eviction. Invalidating a lock wants to look up inodes and drop their resources while blocking locks. Eviction wants to get a lock to perform final deletion while the inode has I_FREEING set which blocks lookups. We only saw this deadlock a handful of times in all of the time we've run the code, but it's much more common now that we're acquiring locks in iput to test that nlink is zero instead of only when nlink is zero. I see unmount hang regularly when testing final inode deletion. This adds a lookup variant for invalidation which will refuse to return freeing inodes so they won't be waited on. Once they're freeing they can't be seen by future lock users so they don't need to be invalidated. This keeps the lock invalidation promise and avoids sleeping on freeing inodes which creates the deadlock. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	715c29aad3	Proactively drop dentry/inode caches outside locks Previously we wouldn't try and remove cached dentries and inodes as lock revocation removed cluster lock coverage. The next time we tried to use the cached dentries or inodes we'd acquire a lock and refresh them. But now cached inodes prevent final inode deletion. If they linger outside cluster locking then any final deletion will need to be deferred until all its cached inodes are naturally dropped at some point in the future across the cluster. It might take refreshing the dentries or for memory pressure to push out the old cached inodes. This tries to proactively drop cached dentries and inodes as we lose cluster lock coverage if they're not actively referenced. We need to be careful not to perform final inode deletion during lock invalidation because it will deadlock, so we defer an iput which could delete during evict out to async work. Now deletion can be done synchronously in the task that is performing the unlink because previous use of the inode on remote mounts hasn't left unused cached inodes sitting around. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	da1af9b841	Add scoutfs inode ino lock coverage Add lock coverage which tracks if the inode has been refreshed and is covered by the inode group cluster lock. This will be used by drop_inode and evict_inode to discover that the inode is current and doesn't need to be refreshed. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 12:10:35 -07:00
Andy Grover	0deb232d3f	Support O_TMPFILE and allow MOVE_BLOCKS into released extents Support O_TMPFILE: Create an unlinked file and put it on the orphan list. If it ever gains a link, take it off the orphan list. Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges. Ioctl callers must set a new flag to enable this operation mode. RH-compat: tmpfile support is actually backported by RH into the 3.10 kernel. We need to use some of their kabi-maintaining wrappers to use it: use a struct inode_operations_wrapper instead of base struct inode_operations, set S_IOPS_WRAPPER flag in i_flags. This lets RH's modified vfs_tmpfile() find our tmpfile fn pointer. Add a test that tests both creating tmpfiles as well as moving their contents into a destination file via MOVE_BLOCKS. xfstests common/004 now runs because tmpfile is supported. Signed-off-by: Andy Grover <agrover@versity.com>	2021-04-05 14:23:44 -07:00
Andy Grover	bed33c7ffd	Remove item accounting Remove kmod/src/count.h Remove scoutfs_trans_track_item() Remove reserved/actual fields from scoutfs_reservation Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-20 17:01:08 -08:00
Zach Brown	807ae11ee9	Protect per-inode extent items with extent_sem Now that we have full precision extents a writer with i_mutex and a page lock can be modifying large extent items which cover much of the surrounding pages in the file. Readers can be in a different page with only the page lock and try to work with extent items as the writer is deleting and creating them. We add a per-inode rwsem which just protects file extent item manipulation. We try to acquire it as close to the item use as possible in data.c which is the only place we work with file extent items. This stops rare read corruption we were seeing where get_block in a reader was racing with extent item deletion in a stager at a further offset in the file. Signed-off-by: Zach Brown <zab@versity.com>	2020-12-15 11:56:50 -08:00
Zach Brown	3a82090ab1	scoutfs: have per-fs inode nr allocators We had previously seen lock contention between mounts that were either resolving paths by looking up entries in directories or writing xattrs in file inodes as they did archiving work. The previous attempt to avoid this contention was to give each directory its own inode number allocator which ensured that inodes created for entries in the directory wouldn't share lock groups with inodes in other directories. But this creates the problem of operating on few files per lock for reasonably small directories. It also creates more server commits as each new directory gets its inode allocation reservation. The fix is to have mount-wide separate allocators for directories and for everything else. This puts directories and files in separate groups and locks, regardless of directory population. Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:12 -07:00
Zach Brown	c010afa8ff	scoutfs: add setattr_more ioctl Add an ioctl that can be used by userspace to restore a file to its offline state. To do that it needs to set inode fields that are otherwise not exposed and create an offline extent. Signed-off-by: Zach Brown <zab@versity.com>	2019-05-30 13:45:52 -07:00
Zach Brown	a6782fc03f	scoutfs: add data waiting One of the core features of scoutfs is the ability to transparently migrate file contents to and from an archive tier. For this to be transparent we need file system operations to trigger staging the file contents back into the file system as needed. This adds the infrastructure which operations use to wait for offline extents to come online and which provides userspace with a list of blocks that the operations are waiting for. We add some waiting infrastructure that callers use to lock, check for offline extents, and unlock and wait before checking again to see if they're still offline. We add these checks and waiting to data io operations that could encounter offline extents. This has to be done carefully so that we don't wait while holding locks that would prevent staging. We use per-task structures to discover when we are the first user of a cluster lock on an inode, indicating that it's safe for us to wait because we don't hold any locks. And while we're waiting our operation is tracked and reported to userspace through an ioctl. This is a non-blocking ioctl, it's up to userspace to decide how often to check and how large a region to stage. Waiters are woken up when the file contents could have changed, not specifically when we know that the extent has come online. This lets us wake waiters when their lock is revoked so that they can block waiting to reacquire the lock and test the extents again. It lets us provide coherent demand staging across the cluster without fine grained waiting protocols sent between the nodes. It may result in some spurious wakeups and work but hopefully it won't, and it's a very simple and functional first pass. Signed-off-by: Zach Brown <zab@versity.com>	2019-05-21 11:33:26 -07:00
Zach Brown	70b2a50c9a	scoutfs: remove individual online/offline calls Remove the functions that operate on online and offline blocks independently now that the file data mapping code isn't using it any more. Signed-off-by: Zach Brown <zab@versity.com>	2018-06-29 14:42:06 -07:00
Zach Brown	036577890f	scoutfs: add atomic online/offline blocks calls Add functions that atomically change and query the online and offline block counts as a pair. They're semantically linked and we shouldn't present counts that don't match if they're in the process of being updated. Signed-off-by: Zach Brown <zab@versity.com>	2018-06-29 14:42:06 -07:00
Zach Brown	9148f24aa2	scoutfs: use single small key struct Variable length keys lead to having a key struct point to the buffer that contains the key. With dirents and xattrs now using small keys we can convert everyone to using a single key struct and significantly simplify the system. We no longer have a separate generic key buf struct that points to specific per-type key storage. All items use the key struct and fill out the appropriate fields. All the code that paired a generic key buf struct and a specific key type struct is collapsed down to a key struct. There's no longer the difference between a key buf that shares a read-only key, has its own precise allocation, or has a max size allocation for incrementing and decrementing. Each key user now has an init function that fills out its fields. It looks a lot like the old pattern but we no longer have separate key storage that the buf points to. A bunch of code now takes the address of static key storage instead of managing allocated keys. Conversely, swapping now uses the full keys instead of pointers to the keys. We don't need all the functions that worked on the generic key buf struct because they had different lengths. Copy, clone, length init, memcpy, all of that goes away. The item API had some functions that tested the length of keys and values. The key length tests vanish, and that gets rid of the _same() call. The _same_min() call only had one user who didn't also test for the value length being too large. Let's leave caller key constraints in callers instead of trying to hide them on the other side of a bunch of item calls. We no longer have to track the number of key bytes when calculating if an item population will fit in segments. This removes the key length from reservations, transactions, and segment writing. The item cache key querying ioctls no longer have to deal with variable length keys. 
They simply specify the start key, the ioctls return the number of keys copied instead of bytes, and the caller is responsible for incrementing the next search key. The segment no longer has to store the key length. It stores the key struct in the item header. The fancy variable length key formatting and printing can be removed. We have a single format for the universal key struct. The SK_ wrappers that bracketed calls to use preempt safe per cpu buffers can turn back into their normal calls. Manifest entries are now a fixed size. We can simply split them between btree keys and values and initialize them instead of allocating them. This means that level 0 entries don't have their own format that sorts by the seq. They're sorted by the key like all the other levels. Compaction needs to sweep all of them looking for the oldest and read can stop sweeping once it can no longer overlap. This makes rare compaction more expensive and common reading less expensive, which is the right tradeoff. Signed-off-by: Zach Brown <zab@versity.com>	2018-04-04 09:15:27 -05:00
Zach Brown	c4de85fd82	scoutfs: cleanup xattr item storage Honoring the XATTR_REMOVE flag in xattr deletion exposed an interesting bug in getxattr(). We were unconditionally returning the max xattr value size when someone tried to probe an existing xattr's value size by calling getxattr with size == 0. Some kernel paths did this to probe the existence of xattrs. They expected to get an error if the xattr didn't exist, but we were giving them the max possible size. This kernel path then tried to remove the xattrs with XATTR_REMOVE and that now failed and caused a bunch of errors in xfstests. The fix is to return the real xattr value size when getxattr is called with size == 0. To do that with the old format we'd have to iterate over all the items which happened to be pretty awkward in the current code paths. So we're taking this opportunity to land a change that had been brewing for a while. We now form the xattr keys from the hash of the name and the item values now store a logical contiguous header, the name, and the value. This makes it very easy for us to have the full xattr value length in the header and return it from getxattr when size == 0. Now all tests pass while honoring the XATTR_CREATE and XATTR_REMOVE flags. And the code is a whole lot easier to follow. And we've removed another barrier for moving to small fixed size keys. Signed-off-by: Zach Brown <zab@versity.com>	2018-03-15 09:23:57 -07:00
Zach Brown	c1311783d5	scoutfs: add tracking of online and offline blocks Signed-off-by: Zach Brown <zab@versity.com>	2018-02-21 09:36:44 -08:00
Zach Brown	4ff1e3020f	scoutfs: allocate inode numbers per directory Having an inode number allocation pool in the super block meant that all allocations across the mount are interleaved. This means that concurrent file creation in different directories will create overlapping inode numbers. This leads to lock contention as reasonable work loads will tend to distribute work by directories. The easy fix is to have per-directory inode number allocation pools. We take the opportunity to clean up the network request so that the caller gets the allocation instead of having it be fed back in via a weird callback. Signed-off-by: Zach Brown <zab@versity.com>	2018-02-09 17:58:19 -08:00
Zach Brown	a49061a7d9	scoutfs: remove the size index We aren't using the size index. It has runtime and code maintenance costs that aren't worth paying. Let's remove it. Removing it from the format and no longer maintaining it are straightforward. The bulk of this patch is actually the act of removing it from the index locking functions. We no longer have to predict the size that will be stored during the transaction to lock the index items that will be created during the transaction. A bunch of code to predict the size and then pass it into locking and transactions goes away. Like other inode fields we now update the size as it changes. Signed-off-by: Zach Brown <zab@versity.com>	2018-01-30 15:03:35 -08:00
Mark Fasheh	20a22ddc6b	scoutfs: provide ->setattr Simple attr changes are mostly handled by the VFS; we just have to mirror them into our inode. Truncates are done in a separate set of transactions. We use a flag to indicate an in-progress truncate. This allows us to detect and continue the truncate should the node crash. Index locking is a bit complicated, so we add a helper function to grab index locks and start a transaction. With this patch we now pass the following xfstests: generic/014 generic/101 generic/313 Signed-off-by: Mark Fasheh <mfasheh@versity.com>	2017-10-18 13:23:01 -07:00
Mark Fasheh	dd99a0127e	scoutfs: rename scoutfs_inode_index_lock_hold Call it scoutfs_inode_index_try_lock_hold since it may fail and unwind as part of normal (not an error) operation. This lets us re-use the name in an upcoming patch. Signed-off-by: Mark Fasheh <mfasheh@versity.com>	2017-10-18 13:23:01 -07:00
Zach Brown	856f257085	scoutfs: use locked getattr for all inodes We only set the .getattr method to our locked getattr filler for regular files. Set it for all files so that stat, etc, will see the current inode for all file types. Signed-off-by: Zach Brown <zab@versity.com>	2017-10-12 14:51:30 -07:00
Zach Brown	950436461a	scoutfs: add lock coverage for inode index items Add lock coverage for inode index items. Sadly, this isn't trivial. We have to predict the value of the indexed fields before the operation to lock those items. One value in particular we can't reliably predict: the sequence of the transaction we enter after locking. Also operations can create an absolute ton of index item updates -- rename can modify nr_inodes * items_per_inode * 2 items, so maybe 24 today. And these items can be arbitrarily positioned in the key space. So to handle all this we add functions to gather predicted item values we'll need to lock, sort and lock them all, then pass appropriate locks down to the item functions during inode updates. The trickiest bit of the index locking code is having to retry if the sequence number changes. Preparing locks has to guess the sequence number of its upcoming trans and then makes item update decisions based on that. If we enter and have a different sequence number then we need to back off and retry with the correct sequence number (we may find that we'll need to update the indexed meta seq and need to have it locked). The use of the functions is straightforward. Sites figure out the predicted sizes, lock, pass the locks to inode updates, and unlock. While we're at it we replace the individual item field tracking variables in the inode info with an array of indexed values. The code ends up a bit nicer. It also gets rid of the indexed time fields that were left behind and were unused. It's worth noting that we're getting exclusive locks on the index updates. Locking the meta/data seq updates results in complete global serialization of all changes. We'll need concurrent writer locks to get concurrency back. Signed-off-by: Zach Brown <zab@versity.com>	2017-10-09 15:31:29 -07:00
Zach Brown	aa70903154	scoutfs: add lock coverage for data paths Use per_task storage on the inode to pass locks from high level read and write lock holders down into the callbacks that operate under the locks so that the locks can then be passed to the item functions. Signed-off-by: Zach Brown <zab@versity.com>	2017-10-09 15:31:29 -07:00
Zach Brown	0535e249d1	scoutfs: add lock arg to scoutfs_update_inode_item Add a full lock argument to scoutfs_update_inode_item() and use it to pass the lock's end key into item_update(). This'll get changed into passing the full lock into _update soon. Signed-off-by: Zach Brown <zab@versity.com>	2017-10-09 15:31:29 -07:00
Zach Brown	32a68e84cf	scoutfs: add full lock coverage to _item_dirty() Add the full lock argument to _item_dirty() so that it can verify lock coverage in addition to limiting item cache population to the range covered by the lock. This also ropes in scoutfs_dirty_inode_item() which is a thin wrapper around _item_dirty(). Signed-off-by: Zach Brown <zab@versity.com>	2017-10-09 15:31:29 -07:00
Zach Brown	8735d319a3	scoutfs: fix inode lock inversions We lock multiple inodes by order of their inode number. This fixes the directory entry paths that hold parent dir and target inode locks. Link and unlink are easy because they just acquire the existing parent dir and target inode locks. Lookup is a little squirrely because we don't want to try and order the parent dir lock with locks down in iget. It turns out that it's safe to drop the dir lock before calling iget as long as iget handles racing the inode cache instantiation with inode deletion. Creation is the remaining pattern and it's a little weird because we want to lock the newly created inode before we create it and the items that store it. We add a function that correctly orders the locks, transaction, and inode cache instantiation. Signed-off-by: Zach Brown <zab@versity.com>	2017-08-30 10:38:00 -07:00
Mark Fasheh	c0d3f99a6e	scoutfs: Cluster coherent read/write With trylock implemented we can add locking in readpage. After that it's pretty easy to implement our own read/write functions which at this point are more or less wrapping the kernel helpers in the correct cluster locking. Data invalidation is a bit interesting. If the lock we are invalidating is an inode group lock, we use the lock boundaries to incrementally search our inode cache. When an inode struct is found, we sync and (optionally) truncate pages. Signed-off-by: Mark Fasheh <mfasheh@versity.com> [zab: adapted to newer lock call, fixed some error handling] Signed-off-by: Zach Brown <zab@versity.com>	2017-08-30 10:38:00 -07:00

77 Commits