Orphaned items haven't been deleted for quite a while -- the call to the
orphan inode scanner has been commented out for ages. The deletion of
the orphan item didn't take rid zone locking into account as we moved
deletion from being strictly local to being performed by whoever last
used the inode.
This reworks orphan item management and brings back orphan inode
scanning to correctly delete orphaned inodes.
We get rid of the rid zone that was always _WRITE locked by each mount.
That made it impossible for other mounts to get a _WRITE lock to delete
orphan items. Instead we rename it to the orphan zone and have orphan
item callers get _WRITE_ONLY locks inside their inode locks. Now all
nodes can create and delete orphan items as they have _WRITE locks on
the associated inodes.
Then we refresh the orphan inode scanning function. It now runs
regularly in the background of all mounts. It avoids creating cluster
lock contention by finding candidates with unlocked forest hint reads
and by testing inode caches locally and via the open map before properly
locking and trying to delete the inode's items.
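As a rough sketch, the scanner's filtering looks something like the
following (all helper names here are illustrative, not the real scoutfs
symbols): only orphans that pass the cheap local checks get the
expensive locked deletion attempt.

static int maybe_delete_orphan(struct super_block *sb, u64 ino)
{
        /* skip if this mount still has the inode cached and in use */
        if (inode_cached_locally(sb, ino))
                return 0;

        /* skip if the open map says another mount still has it open */
        if (ino_open_on_other_mounts(sb, ino))
                return 0;

        /* only now take the cluster lock and try to delete its items */
        return lock_and_delete_inode_items(sb, ino);
}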
Signed-off-by: Zach Brown <zab@versity.com>
Previously we added an ilookup variant that ignored I_FREEING inodes
to avoid a deadlock between lock invalidation (lock->I_FREEING) and
eviction (I_FREEING->lock).
Now we're seeing similar deadlocks between eviction (I_FREEING->lock)
and fh_to_dentry's iget (lock->I_FREEING).
I think it's reasonable to ignore all inodes with I_FREEING set when
we're using our _test callback in ilookup or iget. We can remove the
_nofreeing ilookup variant and move its I_FREEING test into the
iget_test callback provided to both ilookup and iget.
Callers will get the same result; it will just happen without waiting
for a previously I_FREEING inode to leave. They'll get NULL from
ilookup instead of waiting. They'll allocate and start to initialize a
new instance of the inode and insert it alongside the previous instance.
We don't have inode number re-use so we don't have the problem where a
newly allocated inode number is relying on inode cache serialization to
not find a previously allocated inode that is being evicted.
This change does allow for concurrent iget of an inode number that is
being deleted on a local node. This could happen in fh_to_dentry with a
raw inode number. But this was already a problem between mounts because
they don't have a shared inode cache to serialize them. Once we fix
that between nodes, we fix it on a single node as well.
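A minimal sketch of the shared test callback this describes;
scoutfs_ino() is assumed here and the rest of the names are
illustrative. Both ilookup and iget use the same callback, so freeing
inodes are skipped rather than waited on.

static int scoutfs_iget_test(struct inode *inode, void *arg)
{
        u64 *ino = arg;

        /* act as though freeing inodes aren't in the cache at all */
        if (inode->i_state & I_FREEING)
                return 0;

        return scoutfs_ino(inode) == *ino;
}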
Signed-off-by: Zach Brown <zab@versity.com>
We've had a long-standing deadlock between lock invalidation and
eviction. Invalidating a lock wants to lookup inodes and drop their
resources while blocking locks. Eviction wants to get a lock to perform
final deletion while the inode has I_FREEING set, which blocks lookups.
We only saw this deadlock a handful of times in all the time we've
run the code, but it's much more common now that we're acquiring
locks in iput to test whether nlink is zero instead of only when nlink is
zero. I see unmount hang regularly when testing final inode deletion.
This adds a lookup variant for invalidation which will refuse to
return freeing inodes so they won't be waited on. Once they're freeing
they can't be seen by future lock users so they don't need to be
invalidated. This keeps the lock invalidation promise and avoids
sleeping on freeing inodes which creates the deadlock.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we wouldn't try to remove cached dentries and inodes as
lock revocation removed cluster lock coverage. The next time
we tried to use the cached dentries or inodes we'd acquire
a lock and refresh them.
But now cached inodes prevent final inode deletion. If they linger
outside cluster locking then any final deletion will need to be deferred
until all its cached inodes are naturally dropped at some point in the
future across the cluster. It might take refreshing the dentries or for
memory pressure to push out the old cached inodes.
This tries to proactively drop cached dentries and inodes as we lose
cluster lock coverage, if they're not actively referenced. We need to be
careful not to perform final inode deletion during lock invalidation
because it would deadlock, so we defer any iput that could delete during
evict out to async work.
Now deletion can be done synchronously in the task that is performing
the unlink because previous use of the inode on remote mounts hasn't
left unused cached inodes sitting around.
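A minimal sketch of the deferral, assuming hypothetical iput_work and
inode fields on scoutfs_inode_info; the point is only that invalidation
queues the final iput instead of calling it.

static void iput_worker(struct work_struct *work)
{
        struct scoutfs_inode_info *si =
                container_of(work, struct scoutfs_inode_info, iput_work);

        /* may drop the last reference and evict, safely outside invalidation */
        iput(&si->inode);
}

static void defer_final_iput(struct inode *inode)
{
        struct scoutfs_inode_info *si = SCOUTFS_I(inode);

        /* iput_work is assumed to be INIT_WORK()ed at inode init time;
         * the caller's reference is handed to the async work */
        queue_work(system_unbound_wq, &si->iput_work);
}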
Signed-off-by: Zach Brown <zab@versity.com>
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount. This can delete inodes from
under other mounts which have opened the inode before it was unlinked on
another mount.
We fix this by adding cached inode tracking. Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.
This keeps the two fast paths, opening and closing linked files and
deleting a file that was unlinked locally, reasonably cheap: they pay
only the moderate cost of either maintaining the bitmap locally or
getting the open map once per lock group. Removing many files in a
group will only lock and get the open map once per group.
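The shape of the check is roughly the following sketch; the group size
and map layout here are assumptions for illustration, not the real
format.

#define OPEN_MAP_GROUP_NR       1024    /* assumed to match inode lock groups */

/* one bit per inode in the group, as returned by the server */
struct open_ino_map {
        __le64 group_nr;
        __le64 bits[OPEN_MAP_GROUP_NR / 64];
};

static bool ino_open_elsewhere(struct open_ino_map *map, u64 ino)
{
        return test_bit_le(ino % OPEN_MAP_GROUP_NR, map->bits);
}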
Signed-off-by: Zach Brown <zab@versity.com>
Add lock coverage which tracks if the inode has been refreshed and is
covered by the inode group cluster lock. This will be used by
drop_inode and evict_inode to discover that the inode is current and
doesn't need to be refreshed.
Signed-off-by: Zach Brown <zab@versity.com>
Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.
Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.
RH-compat: tmpfile support is actually backported by RH into the 3.10 kernel.
We need to use some of their kabi-maintaining wrappers to use it:
use a struct inode_operations_wrapper instead of base struct
inode_operations, set S_IOPS_WRAPPER flag in i_flags. This lets
RH's modified vfs_tmpfile() find our tmpfile fn pointer.
Add a test that tests both creating tmpfiles as well as moving their
contents into a destination file via MOVE_BLOCKS.
xfstests common/004 now runs because tmpfile is supported.
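Roughly, the wiring looks like the sketch below; the wrapper layout
follows RH's struct as we understand it, and the scoutfs function and
helper names are illustrative.

static const struct inode_operations_wrapper scoutfs_dir_iops_wrapper = {
        .ops = {
                .lookup = scoutfs_lookup,
                /* ... the rest of the existing dir inode_operations ... */
        },
        .tmpfile = scoutfs_tmpfile,
};

static void set_dir_ops(struct inode *inode)
{
        inode->i_op = &scoutfs_dir_iops_wrapper.ops;
        /* S_IOPS_WRAPPER tells RH's vfs_tmpfile() the wrapper is there */
        inode->i_flags |= S_IOPS_WRAPPER;
}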
Signed-off-by: Andy Grover <agrover@versity.com>
Each transaction maintains a global list of inodes to sync. It checks
the inode and adds it in each write_end call per OS page. Locking and
unlocking the global spinlock was showing up in profiles. At the very
least, we can get the lock just once per large file that's written
during a transaction. This will reduce spinlock traffic on the lock by
the number of pages written per file. We'll want a better solution in
the long run, but this helps for now.
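A minimal sketch of the cheaper path, with illustrative field names:
write_end only takes the global lock the first time an inode joins the
transaction's list, not once per page.

static void trans_track_inode(struct scoutfs_inode_info *si,
                              struct list_head *trans_inodes,
                              spinlock_t *trans_lock)
{
        /* unlocked fast path: already on this transaction's list */
        if (!list_empty(&si->trans_entry))
                return;

        spin_lock(trans_lock);
        if (list_empty(&si->trans_entry))
                list_add_tail(&si->trans_entry, trans_inodes);
        spin_unlock(trans_lock);
}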
Signed-off-by: Zach Brown <zab@versity.com>
Finally get rid of the last silly vestige of the ancient 'ci' name and
update the scoutfs_inode_info pointers to si. This is just a global
search and replace; nothing functional changes.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have full precision extents a writer with i_mutex and a page
lock can be modifying large extent items which cover much of the
surrounding pages in the file. Readers can be in a different page with
only the page lock and try to work with extent items as the writer is
deleting and creating them.
We add a per-inode rwsem which just protects file extent item
manipulation. We try to acquire it as close to the item use as possible
in data.c which is the only place we work with file extent items.
This stops rare read corruption we were seeing where get_block in a
reader was racing with extent item deletion in a stager at a further
offset in the file.
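A hedged sketch of the locking shape; the rwsem field and the extent
helpers are illustrative stand-ins for the calls in data.c.

static int read_extent_mapping(struct inode *inode, u64 iblock,
                               struct scoutfs_extent *ext)
{
        struct scoutfs_inode_info *si = SCOUTFS_I(inode);
        int ret;

        down_read(&si->extent_sem);     /* e.g. get_block walking items */
        ret = lookup_file_extent(inode, iblock, ext);
        up_read(&si->extent_sem);

        return ret;
}

static int drop_extent_items(struct inode *inode, u64 start, u64 end)
{
        struct scoutfs_inode_info *si = SCOUTFS_I(inode);
        int ret;

        down_write(&si->extent_sem);    /* e.g. truncate or stage rewriting items */
        ret = delete_file_extents(inode, start, end);
        up_write(&si->extent_sem);

        return ret;
}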
Signed-off-by: Zach Brown <zab@versity.com>
Audit code for structs allocated on stack without initialization, or
using kmalloc() instead of kzalloc().
- avl.c: zero padding in avl_node on insert.
- btree.c: Verify item padding is zero, or WARN_ONCE.
- inode.c: scoutfs_inode contains scoutfs_timespecs, which have padding.
- net.c: zero pad in net header.
- net.h: scoutfs_net_addr has padding, zero it in scoutfs_addr_from_sin().
- xattr.c: scoutfs_xattr has padding, zero it.
- forest.c: item_root in forest_next_hint() appears to either be
assigned-to or unused, so no need to zero it.
- key.h: Ensure padding is zeroed in scoutfs_key_set_{zeros,ones}
Signed-off-by: Andy Grover <agrover@versity.com>
Use the new item cache for all the item work in the fs instead of
calling into the forest of btrees. Most of this is mechanical
conversion from the _forest calls to the _item calls. The item cache
no longer supports the kvec argument for describing values so all the
callers pass in the value pointer and length directly.
The item cache doesn't support saving items as they're deleted and later
restoring them from an error unwinding path. There were only two users
of this. Directory entries can easily guarantee that deletion won't
fail by dirtying the items first in the item cache. Xattr updates were
a little trickier. They can combine dirtying, creating, updating, and
deleting to atomically switch between items that describe different
versions of a multi-item value. This also fixed a bug in the srch
xattrs where replacing an xattr would create a new id for the xattr and
leave existing srch items referencing a now deleted id. Replacing now
reuses the old id.
And finally we add back in the locking and transaction item cache
integration.
Signed-off-by: Zach Brown <zab@versity.com>
We had previously seen lock contention between mounts that were either
resolving paths by looking up entries in directories or writing xattrs
in file inodes as they did archiving work.
The previous attempt to avoid this contention was to give each directory
its own inode number allocator which ensured that inodes created for
entries in the directory wouldn't share lock groups with inodes in other
directories.
But this creates the problem of operating on few files per lock for
reasonably small directories. It also creates more server commits as
each new directory gets its inode allocation reservation.
The fix is to have mount-wide separate allocators for directories and
for everything else. This puts directories and files in separate groups
and locks, regardless of directory population.
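A minimal sketch of the split, with made-up allocator names: the mount
keeps two pools and picks one based on what's being created.

static struct ino_allocator *ino_alloc_for_mode(struct scoutfs_sb_info *sbi,
                                                umode_t mode)
{
        /* directories and everything else draw from separate pools so they
         * never share inode groups or their locks */
        return S_ISDIR(mode) ? &sbi->dir_ino_alloc : &sbi->ino_alloc;
}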
Signed-off-by: Zach Brown <zab@versity.com>
Introduce different constants for small and large metadata block
sizes.
The small 4KB size is used for the super block, quorum blocks, and as
the granularity of file data block allocation. The larger 64KB size is
used for the radix, btree, and forest bloom metadata block structures.
The bulk of this is obvious transitions from the old single constant to
the appropriate new constant, but there are a few more involved
changes, though just barely.
The block crc calculation now needs the caller to pass in the size of
the block. The radix function to return free bytes instead returns free
blocks and the caller is responsible for knowing how big its managed
blocks are.
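Illustratively, the split and the new crc signature look something like
this; the real constant names and the crc helper may differ.

#define SCOUTFS_BLOCK_SM_SHIFT  12                              /* 4KB */
#define SCOUTFS_BLOCK_SM_SIZE   (1 << SCOUTFS_BLOCK_SM_SHIFT)
#define SCOUTFS_BLOCK_LG_SHIFT  16                              /* 64KB */
#define SCOUTFS_BLOCK_LG_SIZE   (1 << SCOUTFS_BLOCK_LG_SHIFT)

/* callers now tell the crc helper how large the block is */
__le32 scoutfs_block_calc_crc(struct scoutfs_block_header *hdr, u32 size);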
Signed-off-by: Zach Brown <zab@versity.com>
Convert fs callers to work with the btree forest calls instead of the
lsm item cache calls. This is mostly a mechanical syntax conversion.
The inode dirtying path does now update the item rather than simply
dirtying it.
Signed-off-by: Zach Brown <zab@versity.com>
Use the mount's generated random id in persistent items and the lock
that protects them instead of the assigned node_id.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can be used by userspace to restore a file to its
offline state. To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl, it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
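A hedged sketch of the lock/check/unlock/wait loop; the locking calls
are shown in roughly their scoutfs shape, but every name here should be
read as illustrative.

static int lock_and_wait_for_online(struct inode *inode, u64 iblock, u64 last,
                                    struct scoutfs_lock **lock_ret)
{
        struct scoutfs_inode_info *si = SCOUTFS_I(inode);
        struct scoutfs_lock *lock;
        int seq;
        int ret;

        for (;;) {
                ret = scoutfs_lock_inode(inode->i_sb, SCOUTFS_LOCK_READ, 0,
                                         inode, &lock);
                if (ret)
                        return ret;

                /* sample before checking so a wake between the check and
                 * the sleep isn't lost */
                seq = atomic_read(&si->data_wait_seq);

                if (!extents_offline(inode, iblock, last)) {
                        *lock_ret = lock;       /* caller proceeds under the lock */
                        return 0;
                }

                /* record the wait so the ioctl can report it, then drop the
                 * lock before sleeping so that staging can make progress */
                track_data_waiter(inode, iblock, last);
                scoutfs_unlock(inode->i_sb, lock, SCOUTFS_LOCK_READ);

                ret = wait_event_interruptible(si->data_waitq,
                                atomic_read(&si->data_wait_seq) != seq);
                untrack_data_waiter(inode, iblock, last);
                if (ret)
                        return ret;
        }
}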
Signed-off-by: Zach Brown <zab@versity.com>
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.
The client code gets some shims to send and receive lock messages to and
from the server. Callers use our lock mode constants instead of the
DLM's.
Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.
The biggest change is in the client lock state machine. Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing. We don't have everything
come through a per-lock work queue. Instead we send requests either
from the blocking lock caller or from a shrink work queue. Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.
The different processing contexts lead to a slightly different lock
life cycle. We refactor and separate allocation and freeing from
tracking and removing locks in data structures. We add a _get and _put
to track active use of locks, and then async references to locks by
holders and requests are tracked separately.
Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time. We do have to do a bit of work to make sure we process back to
back grant responses and invalidation requests from the server.
As of this change the lock setup and destruction paths are a little
wobbly. They'll be shored up as we add lock recovery between the client
and server.
Signed-off-by: Zach Brown <zab@versity.com>
The inode deletion path had bit rotted. Delete the ifdefs that were
stopping it from deleting all the items associated with an inode. There
can be a lot of xattr and data mapping items so we have them manage
their own transactions (data already did). The xattr deletion code was
trying to get a lock while the caller already held it so delete that.
Then we accurately account for the small number of remaining items that
finally delete the inode.
Signed-off-by: Zach Brown <zab@versity.com>
The addition of fallocate() means that offline extents can now be
unwritten and allocated and that extents can be found outside of
i_size.
Truncating needs to know about the possible flag combinations, writing
preallocation needs to know to update an existing extent or allocate up
to the next extent, get_block can't map unwritten extents for read,
extent conversion needs to also clear offline, and truncate needs to
drop extents outside i_size even if truncating to the existing file
size.
Signed-off-by: Zach Brown <zab@versity.com>
There was a typo in the addition of i_blocks tracking that would set
online blocks to the value of offline blocks when reading an existing
inode into memory.
Signed-off-by: Zach Brown <zab@versity.com>
Remove the functions that operate on online and offline blocks
independently now that the file data mapping code isn't using it any
more.
Signed-off-by: Zach Brown <zab@versity.com>
Add functions that atomically change and query the online and offline
block counts as a pair. They're semantically linked and we shouldn't
present counts that don't match if they're in the process of being
updated.
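A minimal sketch of the pair, assuming a hypothetical spinlock on the
inode info so the two counts can't be observed mid-update.

static void inode_add_onoff(struct inode *inode, s64 online, s64 offline)
{
        struct scoutfs_inode_info *si = SCOUTFS_I(inode);

        spin_lock(&si->blocks_lock);
        si->online_blocks += online;
        si->offline_blocks += offline;
        spin_unlock(&si->blocks_lock);
}

static void inode_get_onoff(struct inode *inode, s64 *online, s64 *offline)
{
        struct scoutfs_inode_info *si = SCOUTFS_I(inode);

        spin_lock(&si->blocks_lock);
        *online = si->online_blocks;
        *offline = si->offline_blocks;
        spin_unlock(&si->blocks_lock);
}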
Signed-off-by: Zach Brown <zab@versity.com>
Inode allocations come from batches that are reserved for directories.
As the batch is exhausted a new one is acquired and allocated from.
The batch size was arbitrarily set to the human friendly 10000. This
doesn't interact well with the lock group size being a power of two.
Each allocation batch will straddle an inode group with its previous and
next inode batch.
This often doesn't matter because directories very rarely have more than
9000 entries. But as entries pass 10000 they'd see surprising
contention with other inode ranges in directories.
Tweak the allocation size to be a multiple of the lock group size to
stop this from happening.
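Illustratively, with an assumed power-of-two group size:

/* assumed: inode lock groups cover 1024 inode numbers */
#define SCOUTFS_LOCK_INODE_GROUP_NR     1024

/* an exact multiple of the group size so a batch never straddles groups */
#define SCOUTFS_INO_ALLOC_BATCH         (8 * SCOUTFS_LOCK_INODE_GROUP_NR)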
Signed-off-by: Zach Brown <zab@versity.com>
Variable length keys lead to having a key struct point to the buffer
that contains the key. With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.
We no longer have a separate generic key buf struct that points to
specific per-type key storage. All items use the key struct and fill
out the appropriate fields. All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.
Each key user now has an init function that fills out its fields. It looks a
lot like the old pattern but we no longer have separate key storage that
the buf points to.
A bunch of code now takes the address of static key storage instead of
managing allocated keys. Conversely, swapping now uses the full keys
instead of pointers to the keys.
We don't need all the functions that worked on the generic key buf
struct because they had different lengths. Copy, clone, length init,
memcpy, all of that goes away.
The item API had some functions that tested the length of keys and
values. The key length tests vanish, and that gets rid of the _same()
call. The _same_min() call only had one user who didn't also test for
the value length being too large. Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.
We no longer have to track the number of key bytes when calculating if
an item population will fit in segments. This removes the key length
from reservations, transactions, and segment writing.
The item cache key querying ioctls no longer have to deal with variable
length keys. They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.
The segment no longer has to store the key length. It stores the key
struct in the item header.
The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct. The SK_ wrappers
that bracketed calls to use preempt safe per cpu buffers can turn back
into their normal calls.
Manifest entries are now a fixed size. We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq. They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap. This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.
Signed-off-by: Zach Brown <zab@versity.com>
Originally the item interfaces were written with full support for
vectored keys and values. Callers constructed keys and values made up
of header structs and data buffers. Segments supported much larger
values which could span pages when stored in memory.
But over time we've pulled that support back. Keys are described by a
key struct instead of a multi-element kvec. Values are now much smaller
and don't span pages. The item interfaces still use the kvec arrays but
everyone only uses a single element.
So let's make the world a whole lot less awful by having the item
interfaces support only a single value buffer specified by a kvec. A
bunch of code disappears and the result is much easier to understand.
Signed-off-by: Zach Brown <zab@versity.com>
Every caller of scoutfs_item_lookup_exact() provided a size that matches
the value buffer. Let's remove the redundant arg and use the value
buffer length as the exact size to match.
Signed-off-by: Zach Brown <zab@versity.com>
Honoring the XATTR_REMOVE flag in xattr deletion exposed an interesting
bug in getxattr(). We were unconditionally returning the max xattr value
size when someone tried to probe an existing xattr's value size by
calling getxattr with size == 0. Some kernel paths did this to probe
the existence of xattrs. They expected to get an error if the xattr
didn't exist, but we were giving them the max possible size. This
kernel path then tried to remove the xattrs with XATTR_REMOVE and that
now failed and caused a bunch of errors in xfstests.
The fix is to return the real xattr value size when getxattr is called
with size == 0. To do that with the old format we'd have to iterate
over all the items which happened to be pretty awkward in the current
code paths.
So we're taking this opportunity to land a change that had been brewing
for a while. We now form the xattr keys from the hash of the name and
the item values now store a logically contiguous header, the name, and the
value. This makes it very easy for us to have the full xattr value
length in the header and return it from getxattr when size == 0.
Now all tests pass while honoring the XATTR_CREATE and XATTR_REMOVE
flags.
And the code is a whole lot easier to follow. And we've removed another
barrier for moving to small fixed size keys.
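A hedged sketch of the new layout and the size == 0 probe; the struct
fields here are illustrative, not the exact on-disk format.

struct xattr_value_hdr {
        __u8    name_len;
        __le16  val_len;
        __u8    data[];         /* name, then value, logically contiguous */
} __packed;

static ssize_t xattr_value_size(struct xattr_value_hdr *hdr, size_t size)
{
        size_t val_len = le16_to_cpu(hdr->val_len);

        if (size == 0)          /* probe: return the real value length */
                return val_len;
        if (size < val_len)
                return -ERANGE;
        return val_len;
}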
Signed-off-by: Zach Brown <zab@versity.com>
We weren't doing anything with the inode blocks field. We weren't even
initializing it which explains why we'd sometimes see garbage i_blocks
values in scoutfs inodes in segments.
The logical blocks field reflects the contents of the file regardless of
whether it's online or not. It's the sum of our online and offline block
tracking.
So we can initialize it to our persistent online and offline counts and
then keep it in sync as blocks are allocated and freed.
Signed-off-by: Zach Brown <zab@versity.com>
We had an excessive number of layers between scoutfs and the dlm code in
the kernel. We had dlmglue, the scoutfs locks, and task refs. Each
layer had structs that track the lifetime of the layer below it. We
were about to add another layer to hold on to locks just a bit longer so
that we can avoid down conversion and transaction commit storms under
contention.
This collapses all those layers into a simple state machine in lock.c that
manages the mode of dlm locks on behalf of the file system.
The users of the lock interface are mainly unchanged. We did change
from a heavier trylock to a lighter nonblock lock attempt and have to
change the single rare readpage use. Lock fields change so a few
external users of those fields change.
This not only removes a lot of code, it also contains functional
improvements. For example, it can now convert directly to CW locks with
a single lock request instead of having to use two by first converting
to NL.
It introduces the concept of an unlock grace period. Locks won't be
dropped on behalf of other nodes soon after being unlocked so that tasks
have a chance to batch up work before the other node gets a chance.
This can result in two orders of magnitude improvements in the time it
takes to, say, change a set of xattrs on the same file population from
two nodes concurrently.
There are significant changes to trace points, counters, and debug files
that follow the implementation changes.
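A hedged sketch of the grace period, with an assumed per-lock deadline
field that's refreshed at each unlock:

static void note_unlocked(struct scoutfs_lock *lck)
{
        /* the grace window length is an assumption for illustration */
        lck->grace_deadline = jiffies + msecs_to_jiffies(10);
}

/* don't downconvert for another node while local tasks can still batch work */
static bool lock_in_grace_period(struct scoutfs_lock *lck)
{
        return time_before(jiffies, lck->grace_deadline);
}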
Signed-off-by: Zach Brown <zab@versity.com>
Having an inode number allocation pool in the super block meant that all
allocations across the mount are interleaved. This means that
concurrent file creation in different directories will create
overlapping inode numbers. This leads to lock contention as reasonable
work loads will tend to distribute work by directories.
The easy fix is to have per-directory inode number allocation pools. We
take the opportunity to clean up the network request so that the caller
gets the allocation instead of having it be fed back in via a weird
callback.
Signed-off-by: Zach Brown <zab@versity.com>
We aren't using the size index. It has runtime and code maintenance
costs that aren't worth paying. Let's remove it.
Removing it from the format and no longer maintaining it are
straightforward.
The bulk of this patch is actually the act of removing it from the index
locking functions. We no longer have to predict the size that will be
stored during the transaction to lock the index items that will be
created during the transaction. A bunch of code to predict the size and
then pass it into locking and transactions goes away. Like other inode
fields we now update the size as it changes.
Signed-off-by: Zach Brown <zab@versity.com>
We have to map many index item keys down to a lock that then has a start
and end key range. We also use this mapping over in index item locking
to avoid trying to acquire locks multiple times.
We were duplicating the mapping calculation in these two places. This
refactors these functions to use one range calculation function. It's
going to be used in future patches to fix the mapping of the size index
items.
This should result in no functional changes.
Signed-off-by: Zach Brown <zab@versity.com>
This will give us concurrency yet still allow our ioctls to drive cache
syncing/invalidation on other nodes. Our lock_coverage() checks evolve
to handle direct dlm modes, allowing us to verify correct usage of CW
locks.
As a test, we can run createmany on two nodes at the same time, each
working in their own directory. The following commands were run on each
node:
$ mkdir /scoutfs/`uname -n`
$ cd /scoutfs/`uname -n`
$ /root/createmany -o ./file_$i 100000
Before this patch that test wouldn't finish in any reasonable amount of
time and I would kill it after some number of hours.
After this patch, we make swift progress through the test:
[root@fstest3 fstest3.site]# /root/createmany -o ./file_$i 100000
- created 10000 (time 1509394646.11 total 0.31 last 0.31)
- created 20000 (time 1509394646.38 total 0.59 last 0.28)
- created 30000 (time 1509394646.81 total 1.01 last 0.43)
- created 40000 (time 1509394647.31 total 1.51 last 0.50)
- created 50000 (time 1509394647.82 total 2.02 last 0.51)
- created 60000 (time 1509394648.40 total 2.60 last 0.58)
- created 70000 (time 1509394649.06 total 3.26 last 0.66)
- created 80000 (time 1509394649.72 total 3.93 last 0.66)
- created 90000 (time 1509394650.36 total 4.56 last 0.64)
total: 100000 creates in 35.02 seconds: 2855.80 creates/second
[root@fstest4 fstest4.fstestnet]# /root/createmany -o ./file_$i 100000
- created 10000 (time 1509394647.35 total 0.75 last 0.75)
- created 20000 (time 1509394647.89 total 1.28 last 0.54)
- created 30000 (time 1509394648.46 total 1.86 last 0.58)
- created 40000 (time 1509394648.96 total 2.35 last 0.49)
- created 50000 (time 1509394649.51 total 2.90 last 0.55)
- created 60000 (time 1509394650.07 total 3.46 last 0.56)
- created 70000 (time 1509394650.79 total 4.19 last 0.72)
- created 80000 (time 1509394681.26 total 34.66 last 30.47)
- created 90000 (time 1509394681.63 total 35.03 last 0.37)
total: 100000 creates in 35.50 seconds: 2816.76 creates/second
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Simple attr changes are mostly handled by the VFS; we just have to mirror
them into our inode. Truncates are done in a separate set of transactions.
We use a flag to indicate an in-progress truncate. This allows us to
detect and continue the truncate should the node crash.
Index locking is a bit complicated, so we add a helper function to grab
index locks and start a transaction.
With this patch we now pass the following xfstests:
generic/014
generic/101
generic/313
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
Call it scoutfs_inode_index_try_lock_hold since it may fail and unwind
as part of normal (not an error) operation. This lets us re-use the
name in an upcoming patch.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
We only set the .getattr method to our locked getattr filler for regular
files. Set it for all files so that stat, etc, will see the current
inode for all file types.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_item_create() hasn't been working with lock coverage. It
wouldn't return -ENOENT if it didn't have the lock cached. It would
create items outside lock coverage so they wouldn't be invalidated and
re-read if another node modified the item.
Add a lock arg and teach it to populate the cache so that it's correctly
consistent.
Signed-off-by: Zach Brown <zab@versity.com>
Add lock coverage for inode index items.
Sadly, this isn't trivial. We have to predict the value of the indexed
fields before the operation to lock those items. One value in
particular we can't reliably predict: the sequence of the transaction we
enter after locking. Also operations can create an absolute ton of
index item updates -- rename can modify nr_inodes * items_per_inode * 2
items, so maybe 24 today. And these items can be arbitrarily positioned
in the key space.
So to handle all this we add functions to gather the predicted item values
we'll need to lock, sort and lock them all, then pass appropriate locks
down to the item functions during inode updates.
The trickiest bit of the index locking code is having to retry if the
sequence number changes. Preparing locks has to guess the sequence
number of its upcoming trans and then make item update decisions based
on that. If we enter with a different sequence number then we need
to back off and retry with the correct sequence number (we may find that
we'll need to update the indexed meta seq and need to have it locked).
The use of the functions is straightforward. Sites figure out the
predicted sizes, lock, pass the locks to inode updates, and unlock.
While we're at it we replace the individual item field tracking
variables in the inode info with an array of indexed values. The code
ends up a bit nicer. It also gets rid of the indexed time fields that
were left behind and were unused.
It's worth noting that we're getting exclusive locks on the index
updates. Locking the meta/data seq updates results in complete global
serialization of all changes. We'll need concurrent writer locks to get
concurrency back.
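A hedged sketch of the retry, with illustrative names for the
prediction, trans, and index lock helpers:

static int lock_index_and_hold_trans(struct super_block *sb,
                                     struct inode *inode,
                                     struct index_locks *ind_locks)
{
        u64 pred_seq;
        int ret;

retry:
        pred_seq = predicted_trans_seq(sb);
        prepare_index_locks(ind_locks, inode, pred_seq);

        ret = lock_index_items(sb, ind_locks);
        if (ret)
                return ret;

        ret = scoutfs_hold_trans(sb);
        if (ret) {
                unlock_index_items(sb, ind_locks);
                return ret;
        }

        if (current_trans_seq(sb) != pred_seq) {
                /* guessed wrong: back off and retry with the seq we'll
                 * actually be updating index items under */
                scoutfs_release_trans(sb);
                unlock_index_items(sb, ind_locks);
                goto retry;
        }

        return 0;
}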
Signed-off-by: Zach Brown <zab@versity.com>
Use per_task storage on the inode to pass locks from high level read and
write lock holders down into the callbacks that operate under the locks
so that the locks can then be passed to the item functions.
Signed-off-by: Zach Brown <zab@versity.com>
Add a full lock argument to scoutfs_update_inode_item() and use it to
pass the lock's end key into item_update(). This'll get changed into
passing the full lock into _update soon.
Signed-off-by: Zach Brown <zab@versity.com>
Add the full lock argument to _item_dirty() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.
This also ropes in scoutfs_dirty_inode_item() which is a thin wrapper
around _item_dirty().
Signed-off-by: Zach Brown <zab@versity.com>
Add the full lock argument to _item_next*() so that it can verify lock
coverage in addition to limiting item cache population to the range
covered by the lock.
Signed-off-by: Zach Brown <zab@versity.com>