scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-07-23 08:23:15 +00:00

Author	SHA1	Message	Date
Auke Kok	a14da52cbb	kernel_getsockname and kernel_getpeername dropped addrlen arg. v4.16-rc1-1-g9b2c45d479d0 This interface now returns (sizeof (addr)) on success, instead of 0. Therefore, we have to change the error condition detection. The compat for older kernels handles the addrlen check internally. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:35:48 -04:00
Auke Kok	f367e485a6	xattr functions are now passed flags through struct xattr_handler Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:35:48 -04:00
Auke Kok	8a7bc0cdfa	Remove the use of backing_dev_info pt from address_space. Instead, use the new inline inode_to_bdi from <backing-dev.h> to fill in the task's backing_dev_info. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:35:48 -04:00
Auke Kok	e81d16f8db	Do not use MS_* flags anymore in kernel space. MS_* flags from <linux/mount.h> should not be used in the kernel anymore from 4.x onwards. Instead, we need to use the SB_* versions Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:35:48 -04:00
Zach Brown	bad0455e28	Use count/scan objects shrinking interface Move to the more recent interfaces for counting and scanning cached objects to shrink. Signed-off-by: Zach Brown <zab@versity.com> Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:35:48 -04:00
Auke Kok	0a30c0b926	Use page->lru instead of page->list With v3.14-rc1-10-g34bf6ef94a83, page->list is removed Instead, use the union member ->lru. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:35:48 -04:00
Zach Brown	84a4000c85	Use more modern bio interfaces Move towards modern bio interfaces, while unfortunately carrying along a bunch of compat functions that let us still work with the old incompatible interfaces. Signed-off-by: Zach Brown <zab@versity.com> Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:35:48 -04:00
Zach Brown	859f63e49b	Use memalloc_nofs_save memalloc_nofs_save() was introduced as preferential to trying to use GFP flags to indicate that a task should not recurse during reclaim. We use it instead of the _noio_ we were using before. Signed-off-by: Zach Brown <zab@versity.com>	2023-08-01 16:35:48 -04:00
Zach Brown	588bdb7969	Use percpu_counter_add_batch __percpu_counter_add_batch was renamed to make it clear that the __ doesn't mean it's less safe, as it means in other calls in the API, but just that it takes an additional parameter. Signed-off-by: Zach Brown <zab@versity.com> Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:35:48 -04:00
Auke Kok	b894f6b04c	Use __posix_acl_create/_chmod and add backwards compatibility There are new interfaces available but the old one has been retained for us to use. In case of older kernels, we will need to fall back to the previous name of these functions. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:35:46 -04:00
Auke Kok	e26573ae8e	Fix argument test for __posix_acl_valid. The argument is fixed to be user_namespace, instead of user_ns. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:34:50 -04:00
Auke Kok	3f6b98496f	Use setattr_prepare() as inode_change_ok() was removed in v4.8-rc1 Instead, we can call setattr_prepare() directly. We provide a fallback for older kernels. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:34:49 -04:00
Auke Kok	b8a378ede7	Use the new inode->i_version manipulation methods. Provide fallback in degraded mode for kernels pre-v4.15-rc3 by directly manipulating the member as needed. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:33:28 -04:00
Auke Kok	4b08e79988	inode->i_mutex has been replaced with inode->i_rwsem. Since v4.6-rc3-27-g9902af79c01a, inode->i_mutex has been replaced with ->i_rwsem. However, long since whenever, inode_lock() and related functions already worked as intended and provided fully exclusive locking to the inode. To avoid a name clash on pre-rhel8 kernels, we have to rename a stack variable in `src/file.c`. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:33:28 -04:00
Auke Kok	2ac28c4969	New inode->i_version API requires <iversion.h> Since v4.15-rc3-4-gae5e165d855d, <linux/iversion.h> contains a new inode->i_version API and it is not included by default. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:33:28 -04:00
Auke Kok	3608d1aae1	use $(MAKE) to allow passing jobserver flags. With this, we can `make -jX` to speed up compiles a bit from the kmod folder. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:33:28 -04:00
Auke Kok	f13757f0af	module_init/_exit should have a semicolon at eol. In the past this was not needed but since el7 onwards these macros should require the semicolon. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:33:28 -04:00
Auke Kok	34e6efd39c	Adjust for new augmented rbtree compute callback function signature The new variant of the code that recomputes the augmented value is designed to handle non-scalar types and to facilitate that, it has new semantics for the _compute callback. It is now passed a boolean flag `exit` that indicates that if the value isn't changed, it should exit and halt propagation. The callback function now shall return whether that propagation should stop or not, and not the computed new value. The callback can now directly update the new computed value in the node. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 16:30:16 -04:00
Auke Kok	b452ca3d23	Add include <blkdev.h>. Fixes: Error: implicit declaration of function ‘blkdev_put’ Previously this was an `extern` in <fs.h> and included implicitly, hence the need to hard include it now. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 13:40:59 -04:00
Auke Kok	090c795b7e	preempt_mask.h is removed entirely. v4.1-rc4-22-g92cf211874e9 merges this into preempt.h, and on rhel7 kernels we don't need this include anymore either. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 13:40:59 -04:00
Auke Kok	d9394cb084	page_cache_release() is removed. put_page() instead. Even in 3.x, this already was equivalent. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 13:40:59 -04:00
Auke Kok	67ae352618	flush_work_sync is equivalent to flush_work. v3.15-rc1-6-g1a56f2aa4752 removes flush_work_sync entirely, but ever since v3.6-rc1-25-g606a5020b9bd which made all workqueues non-reentrant, it has been equivalent to flush_work. This is safe because in all cases only one server->work can be in flight at a time. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 13:40:59 -04:00
Auke Kok	38bb5a8254	d_materialise_unique replaced with d_splice_alias. Note argument order reversal. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 13:40:59 -04:00
Auke Kok	2510688a36	READ_ONCE() replaces ACCESS_ONCE() v3.18-rc3-2-g230fa253df63 forces us to replace ACCESS_ONCE() with READ_ONCE(), but it is probably the better interface and works with non-scalar types. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 13:40:59 -04:00
Auke Kok	15a5dca8c6	PAGE_CACHE_SIZE was removed, replace with PAGE_SIZE. PAGE_CACHE_SIZE was previously defined to be equivalent to PAGE_SIZE. This symbol was removed in v4.6-rc1-32-g1fa64f198b9f. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 13:40:59 -04:00
Auke Kok	c3996cb021	Include kernel.h and fs.h at the top of kernelcompat.h Because we `-include src/kernelcompat.h` from the command line, this header gets included before any of the kernel includes in most .c and .h files. We should at least make sure we pull in <fs> and <kernel> since they're required. Signed-off-by: Auke Kok <auke.kok@versity.com>	2023-08-01 13:40:59 -04:00
Zach Brown	1672b3ecec	Merge pull request #130 from versity/zab/noncontig_alloc_einval Fix partial preallocation when _contig_only = 0	2023-07-17 10:21:18 -07:00
Zach Brown	55f9435fad	Fix partial preallocation when _contig_only = 0 Data preallocation attempts to allocate large aligned regions of extents. It tried to fill the hole around a write offset that didn't contain an extent. It missed the case where there can be multiple extents between the start of the region and the hole. It could try to overwrite these additional existing extents and writes could return EINVAL. We fix this by trimming the preallocation to start at the write offset if there are any extents in the region before the write offset. The data preallocation test output has to be updated now that allocation extents won't grow towards the start of the region when there are existing extents. Signed-off-by: Zach Brown <zab@versity.com>	2023-07-17 09:36:09 -07:00
Zach Brown	8a64b46a2f	Process log merge splicing in many commits Log merge completions were spliced in one server commit. It's possible to get enough completion work pending that it all can't be completed in one server commit. Operations fail with ENOSPC and because these changes can't be unwound cleanly the server asserts. This allows the completion splicing to break the work up into multiple commits. Processing completions in multiple commits means that request creation can observe the merge status in states that weren't possible before. Splicing is careful to maintain an elevated nr_complete count while the client can't get requests because the tree is rebalancing. Signed-off-by: Zach Brown <zab@versity.com>	2023-07-14 13:28:29 -07:00
Zach Brown	a9da27444f	Merge pull request #128 from versity/zab/prealloc_fragmentation Zab/prealloc fragmentation	2023-06-29 09:57:32 -07:00
Zach Brown	49fe89741d	Merge pull request #125 from versity/zab/get_referring_entries Zab/get referring entries	2023-06-29 09:57:06 -07:00
Zach Brown	847916860d	Advance move_blocks extent search offset The move_blocks ioctl finds extents to move in the source file by searching from the starting block offset of the region to move. Logically, this is fine. After each extent item is deleted the next search will find the next extent. The problem is that deleted items still exist in the item cache. The next iteration has to skip over all the deleted extents from the start of the region. This is fine with large extents, but with heavily fragmented extents this creates a huge amplification of the number of items to traverse when moving the fragmented extents in a large file. (It's not quite O(n^2)/2 for the total extents, since deleted items are purged as we write out the dirty items in each transaction, but it's still immense.) The fix is to simply start searching for the next extent after the one we just moved. Signed-off-by: Zach Brown <zab@versity.com>	2023-06-28 16:54:28 -07:00
Zach Brown	3d99fda0f6	Preallocate data around iblock when noncontig If the _contig_only option isn't set then we try to preallocate aligned regions of files. The initial implementation naively only allowed one preallocation attempt in each aligned region. If it got a small allocation that didn't fill the region then every future allocation in the region would be a single block. This changes every preallocation in the region to attempt to fill the hole in the region that iblock fell in. It uses an extra extent search (item cache search) to try and avoid thousands of single block allocations. Signed-off-by: Zach Brown <zab@versity.com>	2023-06-28 12:21:25 -07:00
Zach Brown	acafb869e7	Avoid deadlock from block reclaim in rht resize The RCU hash table uses deferred work to resize the hash table. There's a time during resize when hash table iteration will return EAGAIN until resize makes more progress. During this time resize can perform GFP_KERNEL allocations. Our shrinker tries to iterate over its RCU hash table to find blocks to reclaim. It tries to restart iteration if it gets EAGAIN on the assumption that it will be usable again soon. Combine the two and our shrinker can get stuck retrying iteration indefinitely because it's shrinking on behalf of the hash table resizing that is trying to allocate the next table before making iteration work again. We have to stop shrinking in this case so that the resizing caller can proceed. Signed-off-by: Zach Brown <zab@versity.com>	2023-06-15 14:45:26 -07:00
Zach Brown	707752a7bf	Add get_referring_entries ioctl Add an ioctl that gives the callers all entries that refer to an inode. It's like a backwards readdir. It's a light bit of translation between the internal _add_next_linkrefs() list of entries and the ioctl interface of a buffer of entry structs. Signed-off-by: Zach Brown <zab@versity.com>	2023-06-14 14:12:10 -07:00
Zach Brown	0316c22026	Extend scoutfs_dir_add_next_linkrefs Extend scoutfs_dir_add_next_linkref() to be able to return multiple backrefs under the lock for each call and have it take an argument to limit the number of backrefs that can be added and returned. Its return code changes a bit in that it returns 1 on success instead of 0 so we have to be a little careful with callers who were expecting 0. It still returns -ENOENT when no entries are found. We break up its tracepoint into one that records each entry added and one that records the result of each call. This will be used by an ioctl to give callers just the entries that point to an inode instead of assembling full paths from the root. Signed-off-by: Zach Brown <zab@versity.com>	2023-06-14 14:12:10 -07:00
Zach Brown	2b72c57cb0	Fix crash in quorum_heartbeat_timeout_ms parsing Mount option parsing runs early enough that the rest of the option read/write serialization infrastructure isn't set up yet. The quorum_heartbeat_timeout_ms mount option tried to use a helper that updated the stored option but it wasn't initialized yet so it crashed. The helper was really only to have the option validity test in one place. It's reworked to only verify the option and the actual setting is left to the callers. Signed-off-by: Zach Brown <zab@versity.com>	2023-05-22 16:29:56 -07:00
Zach Brown	15de0c21c1	Have quorum drop messages on force unmount Forced unmount is supposed to isolate the mount from the world. The net.c TCP messaging returns errors when sending during forced unmount. The quorum code has its own UDP messaging and wasn't taking forced unmount into account. This led to quorum still being able to send resignation messages to other quorum peers during forced unmount, making it hard to test heartbeat timeouts with forced unmount. The quorum messaging is already unreliable so we can easily make it drop messages during forced unmount. Now forced unmount more fully isolates the quorum code and it becomes easier to test. Signed-off-by: Zach Brown <zab@versity.com>	2023-05-18 10:01:19 -07:00
Zach Brown	7b65767803	Track and log quorum heartbeat delays Add tracking and reporting of delays in sending or receiving quorum heartbeat messages. We measure the time between back to back sends or receives of heartbeat messages. We record these delays truncated down to second granularity in the quorum sysfs status file. We log messages to the console for each longest measured delay up to the maximum configurable heartbeat timeout. Signed-off-by: Zach Brown <zab@versity.com>	2023-05-17 14:44:27 -07:00
Zach Brown	46640e4ff9	Add counter for quorum heartbeat send failures Add a counter which tracks the number of heartbeat message send attempts which fail. Signed-off-by: Zach Brown <zab@versity.com>	2023-05-17 14:44:27 -07:00
Zach Brown	912906f050	Make quorum heartbeat timeout tunable Add mount and sysfs options for changing the quorum heartbeat timeout. This allows setting a longer delay in taking over for failed hosts that has a greater chance of surviving temporary non-fatal delays. We also double the existing default timeout to 10s which is still reasonably responsive. Signed-off-by: Zach Brown <zab@versity.com>	2023-05-17 14:44:27 -07:00
Zach Brown	ec02cf442b	Use lower latency allocation in quorum socket The quorum udp socket allocation still allowed starting io which can trigger longer latencies trying to free memory. We change the flags to prefer dipping into emergency pools and then failing rather than blocking trying to satisfy an allocation. We'd much rather have a given heartbeat attempt fail and have the opportunity to succeed at the next interval rather than running the risk of blocking across multiple intervals. Signed-off-by: Zach Brown <zab@versity.com>	2023-05-17 14:44:27 -07:00
Zach Brown	0e9cd1eea5	Use specific work queue for quorum work The quorum work was using the system workq. While that's mostly fine, we can create a dedicated workqueue with the specific flags that we need. The quorum work needs to run promptly to avoid fencing so we set it to high priority. Signed-off-by: Zach Brown <zab@versity.com>	2023-05-17 14:44:27 -07:00
Zach Brown	e18ea24561	Move quorum recv that sets timeout before check In the quorum work loop some message receive actions extend the timeout after the timeout expiration is checked. This is usually fine when the work runs soon after the messages are received and before the timeout expires. But under load the work might not schedule until long after both the message has been received and the timeout has expired. If the message was a heartbeat message then the wakeup delay would be mistaken for lack of activity on the server and it would try to take over for an otherwise active server. This moves the extension of the heartbeat on message receive to before the timeout is checked. In our case of a delayed heartbeat message it would still find it in the recv queue and extend the timeout, avoiding fencing an active server. Signed-off-by: Zach Brown <zab@versity.com>	2023-05-17 09:56:53 -07:00
Zach Brown	bb01a3990f	Set sb->s_time_gran to support nsecs We missed initializing sb->s_time_gran which controls how some parts of the kernel truncate the granularity of nsec in timespec. Some paths don't use it at all so time would be maintained at full precision. But other paths, particularly setattr_copy() from userspace and notify_change() from the kernel use it to truncate as times are set. Setting s_time_gran to 1 maintains full nsec precision. Signed-off-by: Zach Brown <zab@versity.com>	2023-03-24 10:50:34 -07:00
Zach Brown	a61b8d9961	Fix renaming into root directory The VFS performs a lot of checks on renames before calling the fs method. We acquire locks and refresh inodes in the rename method so we have to duplicate a lot of the vfs checks. One of the checks involves loops with ancestors and subdirectories. We missed the case where the root directory is the destination and doesn't have any parent directories. The backref walker it calls returns -ENOENT instead of 0 with an empty set of parents and that error bubbled up to rename. The fix is to notice when we're asking for ancestors of the one directory that can't have ancestors and short circuit the test. Signed-off-by: Zach Brown <zab@versity.com>	2023-03-08 11:00:59 -08:00
Zach Brown	2e2ccb6f61	Allow replaying srch file rotation When a client no longer needs to append to a srch file, for whatever reason, we move the reference from the log_trees item into a specific srch file btree item in the server's srch file tracking btree. Zeroing the log_trees item and inserting the server's btree item are done in a server commit and should be written atomically. But commit_log_trees had an error handling case that could leave the newly inserted item dirty in memory without zeroing the srch file reference in the existing log_trees item. Future attempts to rotate the file reference, perhaps by retrying the commit or by reclaiming the client's rid, would get EEXIST and fail. This fixes the error handling path to ensure that we'll keep the dirty srch file btree and log_trees item in sync. The desynced items can still exist in the world so we'll tolerate getting EEXIST on insertion. After enough time has passed, or if repair zeroed the duplicate reference, we could remove this special case from insertion. Signed-off-by: Zach Brown <zab@versity.com>	2023-01-17 14:33:27 -08:00
Zach Brown	01c8bba56d	Merge pull request #109 from versity/zab/server_statfs_stable_blocks Zab/server statfs stable blocks	2023-01-12 09:58:48 -08:00
Zach Brown	17cb1fe84b	Merge pull request #110 from versity/zab/partial_alloc_move Allow partial extent motion	2023-01-12 09:58:12 -08:00
Zach Brown	a23e7478a0	Fix move_blocks loop exit conditions The move_blocks ioctl intends to only move extents whose bytes fall inside i_size. This is easy except for a final extent that straddles an i_size that isn't aligned to 4K data blocks. The code that either checked for an extent being entirely past i_size or for limiting the number of blocks to move by i_size clumsily compared i_size offsets in bytes with extent counts in 4KB blocks. In just the right circumstances, probably with the help of a byte length to move that is much larger than i_size, the length calculation could result in trying to move 0 blocks. Once this hit the loop would keep finding that extent and calculating 0 blocks to move and would be stuck. We fix this by clamping the count of blocks in extents to move in terms of byte offsets at the start of the loop. This gets rid of the extra size checks and byte offset use in the loop. We also add a sanity check to make sure that we can't get stuck if, say, corruption resulted in an otherwise impossible zero length extent. Signed-off-by: Zach Brown <zab@versity.com>	2023-01-10 09:34:52 -08:00
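
The clamping idea described in a23e7478a0 can be illustrated with a small standalone sketch. This is not the scoutfs code: `clamp_move_blocks()`, `BLOCK_SHIFT`, and the parameter names are hypothetical, and the sketch only assumes 4KB data blocks as stated in the commit message. It does the i_size comparison once, in bytes, and only then converts to a block count, so a final extent that straddles an unaligned i_size still yields a non-zero count.

```c
#include <stdint.h>

/* Assumption: 4KB data blocks, as described in the commit message. */
#define BLOCK_SHIFT 12

/*
 * Hypothetical helper, not the scoutfs implementation: given a requested
 * byte range and the file's i_size, return how many blocks (counted from
 * the block containing byte_off) overlap bytes that fall inside i_size.
 * Returns 0 when the range starts at or beyond i_size, which gives the
 * caller an unambiguous loop exit condition.
 */
static uint64_t clamp_move_blocks(uint64_t byte_off, uint64_t byte_len,
                                  uint64_t i_size)
{
    uint64_t end;

    /* Saturate instead of wrapping when the requested length is huge. */
    if (byte_len > UINT64_MAX - byte_off)
        end = UINT64_MAX;
    else
        end = byte_off + byte_len;

    /* Clamp in byte units first, before any conversion to blocks. */
    if (end > i_size)
        end = i_size;

    /* Nothing inside i_size to move. */
    if (end <= byte_off)
        return 0;

    /* Count every block touched by [byte_off, end), including a partial last block. */
    return ((end - 1) >> BLOCK_SHIFT) - (byte_off >> BLOCK_SHIFT) + 1;
}
```

With this shape, a caller would compute the count once before its loop and stop when it reaches zero, rather than mixing byte offsets and block counts inside the loop.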


1244 Commits