The k-way merge function at the core of the srch file entry merging had
some bookkeeping math (calculating the number of parents) that couldn't
handle merging a single incoming entry stream, so it threw a warning and
returned an error. In refusing to handle that case it assumed that the
caller was trying to merge down a single log file, which doesn't make
any sense.
But in the case of multiple small unsorted logs we can absolutely end up
with their entries stored in one sorted page. We have one sorted input
page that's merging multiple log files. The merge function is also the
path that writes to the output file so we absolutely need to handle this
case.
We now calculate the number of parents more carefully, clamping to one
parent where we'd otherwise get "(roundup(1) -> 1) - 1 == 0" from the
number of inputs. The warning and error are relaxed so they only refuse
to merge nothing at all.
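Roughly, and with hypothetical names rather than the exact srch code, the
calculation now looks something like:

    /*
     * Sketch: derive the number of parent nodes from the number of
     * input streams, clamping to at least one parent so that a single
     * sorted input still gets a valid tree to merge through.
     */
    static unsigned int calc_nr_parents(unsigned int nr_inputs)
    {
            unsigned int nr = DIV_ROUND_UP(nr_inputs, MERGE_FANOUT) - 1;

            return nr ? nr : 1;
    }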
The test triggers this case by putting single search entries in the log
files for mounts and unmounting them to force rotation of the mount log
files into mergeable rotated log files.
Signed-off-by: Zach Brown <zab@versity.com>
Our statfs implementation had clients reading the super block and using
the next free inode number to guess how many inodes there might be. We
are very aggressive with giving directories private pools of inode
numbers to allocate from. They're often not used at all, creating huge
gaps in allocated inode numbers. The ratio of the average number of
allocations per directory to the batch size given to each directory is
the factor that the used inode count can be off by.
Now that we have a precise count of active inodes we can use that to
return accurate counts of inodes in the files fields in the statfs
struct. We still don't have static inode allocation so the fields don't
make a ton of sense. We fake the total to give a reasonable estimate of
the total files that doesn't change, while the free count is calculated
from the correct count of used inodes.
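As a sketch of the idea, with a hypothetical constant standing in for the
faked total:

    /* statfs inode fields: a fixed round estimate for the total so it
     * doesn't wander as inode number batches are handed out, and a
     * free count derived from the precise used count */
    kst->f_files = FAKE_TOTAL_INODES;
    kst->f_ffree = FAKE_TOTAL_INODES - used_inodes;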
While we're at it we add a request to get the summed fields that the
server can cheaply discover in cache rather than having the client
always perform read IOs.
Signed-off-by: Zach Brown <zab@versity.com>
Add an alloc_foreach variant which uses the caller's super to walk the
allocators rather than always reading it off the device.
Signed-off-by: Zach Brown <zab@versity.com>
Add a count of used inodes to the super block and a change in the inode
count to the log_trees struct. Client transactions track the change in
inode count as they create and delete inodes. The log_trees delta is
added to the count in the super as finalized log_trees are deleted.
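In rough terms, with hypothetical field names standing in for the format
structs:

    /* the server folds the client's signed delta into the super's
     * total as it deletes each finalized log_trees */
    total = le64_to_cpu(super->inode_count) +
            (s64)le64_to_cpu(lt->inode_count_delta);
    super->inode_count = cpu_to_le64(total);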
Signed-off-by: Zach Brown <zab@versity.com>
We had previously started on a relatively simple notion of an
interoperability version which wasn't quite right. This fleshes out
support for a more functional format version. The super blocks have a
single version that defines behaviour of the running system. The code
supports a range of versions and we add some initial interfaces for
updating the version while the system is offline. All of this together
should let us safely change the underlying format over time.
Signed-off-by: Zach Brown <zab@versity.com>
Add a write_nr field to the quorum block header which is incremented
with every write. Each event also gets a write_nr field that is set to
the incremented value from the header. This gives us a history of the
order of event updates that isn't sensitive to misconfigured time.
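Roughly, with hypothetical struct names:

    /* every quorum block write bumps the header counter and stamps the
     * event being recorded with the new value */
    le64_add_cpu(&blk->write_nr, 1);
    event->write_nr = blk->write_nr;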
Signed-off-by: Zach Brown <zab@versity.com>
We're adding another command that does block IO so move some block
reading and writing functions out of mkfs. We also grow a few function
variants and call the write_sync variant from mkfs instead of having it
manually sync.
Signed-off-by: Zach Brown <zab@versity.com>
The code that shows the note sections as files uses the section size to
define the size of the notes payload. We don't need to null terminate
the strings to define their lengths. Doing so puts a null in the notes
file which isn't appreciated by many readers.
Signed-off-by: Zach Brown <zab@versity.com>
The test harness might as well use all cpus when building. It's
reasonably safe to assume both that the test systems are otherwise idle
and that the build is likely to succeed.
Signed-off-by: Zach Brown <zab@versity.com>
TCP keepalive probes only work when the connection is idle. They're not
sent when there's unacked send data being retransmitted. If the server
fails while we're retransmitting we don't break the connection and try
to elect and connect to a new server until the very long default
connection timeouts fire or the server comes back and the stale connection is
aborted.
We can set TCP_USER_TIMEOUT to break an unresponsive connection when
there's unacked written data. It changes the behavior of the keepalive probes
so we rework them a bit to clearly apply our timeout consistently
between the two mechanisms.
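In userspace terms the combination looks something like the following; our
kernel socket code sets the equivalent in-kernel options, and the specific
values here are just illustrative:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static void set_timeouts(int fd, unsigned int timeout_secs)
    {
            int ka = 1;
            int idle = 10, intvl = 10, cnt = timeout_secs / 10;
            unsigned int user_timeout_ms = timeout_secs * 1000;

            /* keepalive covers the idle case */
            setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &ka, sizeof(ka));
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));

            /* TCP_USER_TIMEOUT covers unacked written data, so both
             * mechanisms break the connection after the same interval */
            setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &user_timeout_ms,
                       sizeof(user_timeout_ms));
    }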
Signed-off-by: Zach Brown <zab@versity.com>
As the server comes up it needs to fence any previous servers before it
assumes exclusive access to the device. If fencing fails it can leave
fence requests behind. The error path for these very early failures
didn't shut down fencing so we'd have lingering fence requests span the
life cycle of server startup and shutdown. The next time the server
starts up in this mount it can try to create the fence request again,
get an error because a lingering one already exists, and immediately
shut down.
The result is that fencing errors that hit that initial attempt during
server startup can become persistent fencing errors for the lifetime of
that mount, preventing it from ever successfully starting the server.
Moving the fence stop call so that it's hit by all the exiting error paths
consistently cleans up fence requests and avoids this problem. The next server
instance will get a chance to process the fence request again. It might
well hit the same error, but at least it gets a chance.
Signed-off-by: Zach Brown <zab@versity.com>
The current script gets stuck in an infinite loop when the test suite is
started with 1 mount point. The problem is in the part of the script
that advances the ops for each mount. The while loop detects when
op_mnt wraps by checking if it equals 0, but we set each of the op_mnts
to 0 during the advancement, so after wrapping it still equals 0 and the
loop never exits. The fix is to check at the end of the loop whether
the last op's mount number wrapped and, if so, break out.
Signed-off-by: Bryant Duffy-Ly <bduffyly@versity.com>
In some of the allocation paths there are goto statements
that end up calling kfree(). That is fine, but in cases
where the pointer is not initially set to NULL we can end up freeing an
uninitialized pointer, which is undefined behavior. kfree() on a NULL
pointer does nothing, so these changes shouldn't change behavior in
practice, but they make the error paths clearer.
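The pattern being cleaned up looks roughly like this (hypothetical
structs):

    struct foo *a = NULL;
    struct bar *b = NULL;
    int ret;

    a = kzalloc(sizeof(*a), GFP_NOFS);
    if (!a) {
            ret = -ENOMEM;
            goto out;
    }

    b = kzalloc(sizeof(*b), GFP_NOFS);
    if (!b) {
            ret = -ENOMEM;
            goto out;
    }

    ret = 0;
    out:
            /* kfree(NULL) is a no-op, so the shared error path can
             * free pointers that were never allocated */
            kfree(b);
            kfree(a);
            return ret;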
Signed-off-by: Bryant Duffy-Ly <bduffyly@versity.com>
Unfortunately, we're back in kernels that don't yet have d_op->d_init.
We allocate our dentry info manually as we're given dentries. The
recent verification work forgot to consistently make sure the info was
allocated before using it. Fix that up, and while we're at it be a bit
more robust in how we check to see that it's been initialized without
grabbing the d_lock.
Signed-off-by: Zach Brown <zab@versity.com>
This adds i_version to our inode and maintains it as we allocate, load,
modify, and store inodes. We set the flag in the superblock so
in-kernel users can use i_version to see changes in our inodes.
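Roughly (the exact flag and helper names vary across the kernel versions
we support):

    /* advertise i_version support on the super block ... */
    sb->s_flags |= MS_I_VERSION;    /* SB_I_VERSION on newer kernels */

    /* ... and bump it whenever we modify an inode we're logging */
    inode_inc_iversion(inode);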
Signed-off-by: Zach Brown <zab@versity.com>
More recent gcc notices that ret in delete_files can be undefined if nr
is 0 while missing that we won't call delete_files in that case. Seems
worth fixing, regardless.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quick test to make sure that create is validating stale dentries
before deciding if it should create or return -EEXIST.
Signed-off-by: Zach Brown <zab@versity.com>
Add the .totl. xattr tag. When the tag is set the end of the name
specifies a total name with 3 encoded u64s separated by dots. The value
of the xattr is a u64 that is added to the named total. An ioctl is
added to read the totals.
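For example, from userspace, adding 100 to a total might look something
like this (the full xattr name prefix and path here are illustrative, not
the literal format):

    #include <stdint.h>
    #include <sys/xattr.h>

    uint64_t value = 100;

    /* the three dotted u64s after the .totl. tag name the total that
     * the value is added to */
    setxattr("/mnt/fs/file", "scoutfs.totl.1.2.3", &value, sizeof(value), 0);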
Signed-off-by: Zach Brown <zab@versity.com>
The fs log btrees have values that start with a header that stores the
item's seq and flags. There's a lot of sketchy code that manipulates
the value header as items are passed around.
This adds the seq and flags as core item fields in the btree. They're
only set by the interfaces that are used to store fs items: _insert_list
and _merge. The rest of the btree items that use the main interface
don't work with the fields.
This was done to help delta items discover when logged items have been
merged before the finalized log btrees are deleted, and the code ends up
being quite a bit cleaner.
Signed-off-by: Zach Brown <zab@versity.com>
Add an inode creation time field. It's created for all new inodes.
It's visible to stat_more. setattr_more can set it during
restore.
Signed-off-by: Zach Brown <zab@versity.com>
Our dir methods were trusting dentry args. The vfs code paths use
i_mutex to protect dentries across revalidate or lookup and method
calls. But that doesn't protect methods running in other mounts.
Multiple nodes can interleave between the initial lookup or revalidate
and the actual method call.
Rename got this right. It is very paranoid about verifying inputs after
acquiring all the locks it needs.
We extend this pattern to the rest of the methods that need to use the
mapping of name to inode (and our hash and pos) in dentries. Once we
acquire the parent dir lock we verify that the dentry is still current,
returning -EEXIST or -ENOENT as appropriate.
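The shape of the check in each method is roughly the following, with
hypothetical helper names:

    /* after taking the dir's cluster lock, make sure the dentry still
     * maps (or still doesn't map) the name to the inode the vfs saw */
    ret = lock_dir(dir, &lock);
    if (ret)
            return ret;

    if (!dentry_still_current(dir, dentry, lock)) {
            /* -EEXIST or -ENOENT depending on the method */
            ret = stale_ret;
            goto unlock;
    }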
Along these lines, we tighten up dentry info correctness a bit by
updating our dentry info (recording lock coverage and hash/pos) for
negative dentries produced by lookup or as the result of unlink.
Signed-off-by: Zach Brown <zab@versity.com>
Client lock invalidation handling was very strict about not receiving
duplicate invalidation requests from the server because it could only
track one pending request. The promise to only send one invalidate at a
time is made by a single server; it can't be enforced across server
failover, particularly because invalidation processing can require quite
a lot of work with the server as it tears down the state associated with
the lock.
We fix this by recording and processing each individual incoming
invalidation request on the lock.
The code that handled reordering of incoming grant responses and
invalidation requests waited for the lock's mode to match the old mode
in the invalidation request before proceeding. That would have
prevented duplicate invalidation requests from making forward progress.
To fix this we make lock client receive processing synchronous instead
of going through async work which can reorder. Now grant responses are
processed as they're received and will always be resolved before all the
invalidation requests are queued and processed in order.
Signed-off-by: Zach Brown <zab@versity.com>
The forest reader reads items from the fs_root and all log btrees and
gives them to the caller who tracks them to resolve version differences.
The reads can run into stale blocks which have been overwritten. The
forest reader was implementing the retry under the item state in the
caller. This can corrupt items that are only seen first in an old fs
root before a merge and then only seen in the fs_root after a merge. In
this case the item won't have any versioning and the existing version
from the old fs_root is preferred. This is particularly bad when the
new version was deleted -- in that case we have no metadata which would
tell us to drop the old item that was read from the old fs_root.
This is fixed by pushing the retry up to callers who wipe the item state
before each retry. Now each set of items is related to a single
snapshot of the fs_root and logs at one point in time.
I haven't seen definitive evidence of this happening in practice. I
found this problem after putting on my craziest thinking toque and
auditing the code for places where we could lose item updates.
Signed-off-by: Zach Brown <zab@versity.com>
Btree merging attempted to build an rbtree of the input roots with only
one version of an item present in the rbtree at a time. It really
messed this up by completely dropping an input root when a root with a
newer version of its item tried to take its place in the rbtree. What
it should have done is advance to the next item in the older root, which
itself could have required advancing some other older root. Dropping
the root entirely is catastrophically wrong because it hides the rest of
the items in the root from merging. This has been manifesting as
occasional mysterious item loss during tests where memory pressure, item
update patterns, and merging all lined up just so.
This fixes the problem by more clearly keeping the next item from each
root in the rbtree. We sort by newest to oldest version so that once we
merge the most recent version of an item it's easy to skip all the older
versions of that item in the following rbtree entries from the rest of
the input roots.
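The ordering boils down to a comparison along these lines (hypothetical
types and helpers):

    /* sort merge positions by item key, then newest seq first, so the
     * first entry merged for a key is the newest and the older
     * duplicates behind it can be skipped as their roots advance */
    static int merge_pos_cmp(struct merge_pos *a, struct merge_pos *b)
    {
            int cmp = compare_keys(&a->key, &b->key);

            if (cmp)
                    return cmp;
            if (a->seq != b->seq)
                    return a->seq > b->seq ? -1 : 1;
            return 0;
    }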
While we're at it we work with references to the static cached input
btree blocks. The old code was a first pass that used an expensive
btree walk per item and copied the value payload.
Signed-off-by: Zach Brown <zab@versity.com>
When the xattr inode searches fail the test will eventually fail when the
output differs, but that could take a while. Have it fail much sooner
so that we can have tighter debugging iterations and trace ring buffer
contents that are likely to be a lot closer to the first failure.
Signed-off-by: Zach Brown <zab@versity.com>
The current orphan scan uses the forest_next_hint to look for candidate
orphan items to delete. It doesn't skip deleted items and checks the
forest of log btrees so it'd return hints for every single item that
existed in all the log btrees across the system. And we make that hint
call once per item.
When the system is deleting a lot of files we end up generating a huge
load where all mounts are constantly getting the btree roots from the
server, reading all the newest log btree blocks, finding deleted orphan
items for inodes that have already been deleted, and moving on to the
next deleted orphan item.
The fix is to use a read-only traversal of only one version of the fs
root for all the items in one scan. This avoids all the deleted orphan
items that exist in the log btrees which will disappear when they're
merged. It lets the item iteration happen in a single read-only cached
btree instead of constantly reading in the most recently written root
block of every log btree.
The result is an enormous speedup of large deletions. I don't want to
describe exactly how enormous.
Signed-off-by: Zach Brown <zab@versity.com>
We can be performing final deletion as inodes are evicted during
unmount. We have to keep full locking, transactions, and networking up
and running for the evict_inodes() call in generic_shutdown_super().
Unfortunately, this means that workers can be using inode references
during evict_inodes() which prevents them from being evicted. Those
workers can then remain running as we tear down the system, causing
crashes and deadlocks as the final iputs try to use resources that have
been destroyed.
The fix is to first properly stop orphan scanning, which can instantiate
new cached inodes, before the call to kill_block_super ends up trying
to evict all inodes. Then we just need to wait for any pending iput and
invalidate work to finish and perform the final iput, which will always
evict because generic_shutdown_super has cleared MS_ACTIVE.
Signed-off-by: Zach Brown <zab@versity.com>
Add some simple tracking of message counts for each lock in the lock
server so that we can start to see where conflicts may be happening in a
running system.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quick helper that can be used to avoid doing work if we know that
we're already shutting down. This can be a single coarser indicator
than adding functions to each subsystem to track that we're shutting
down.
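Something along these lines, with hypothetical names:

    static inline bool shutting_down(struct sb_info *sbi)
    {
            return test_bit(FLAG_SHUTTING_DOWN, &sbi->flags);
    }

    /* callers can cheaply skip optional work */
    if (shutting_down(sbi))
            return 0;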
Signed-off-by: Zach Brown <zab@versity.com>
Currently the first inode number that can be allocated directly follows
the root inode. This means the first batch of allocated inodes are in
the same lock group as the root inode.
The root inode is a bit special. It is always hot as absolute path
lookups and inode-to-path resolution always read directory entries from
the root.
Let's try aligning the first free inode number to the next inode lock
group boundary. This will stop work in those inodes from necessarily
conflicting with work in the root inode.
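The change amounts to rounding up, roughly (hypothetical constant names):

    /* start the free inode space at the next lock group boundary past
     * the root inode so the first allocations land in their own group */
    first_free_ino = roundup(ROOT_INO + 1, INODES_PER_LOCK_GROUP);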
Signed-off-by: Zach Brown <zab@versity.com>
We had some logic to try and delay lock invalidation while the lock was
still actively in use. This was trying to reduce the cost of
pathological lock conflict cases but it had some severe fairness
problems.
It was first introduced to deal with bad patterns in userspace that no
longer exist and it was built on top of the LSM transaction machinery
that also no longer exists. It hasn't aged well.
Instead of introducing invalidation latency in the hopes that it leads
to more batched work, which it can't always, let's aim more towards
reducing latency in all parts of the write-invalidate-read path and
also aim towards reducing contention in the first place.
Signed-off-by: Zach Brown <zab@versity.com>
We have a problem where items can appear to go backwards in time because
of the way we chose which log btrees to finalize and merge.
Because we don't have versions in items in the fs_root, and even might
not have items at all if they were deleted, we always assume items in
log btrees are newer than items in the fs root.
This creates the requirement that we can't merge a log btree if it has
items that are also present in older versions in other log btrees which
are not being merged. The unmerged old item in the log btree would take
precedence over the newer merged item in the fs root.
We weren't enforcing this requirement at all. We used the max_item_seq
to ensure that all items were older than the current stable seq but that
says nothing about the relationship between older items in the finalized
and active log btrees. Nothing at all stops an active btree from having
an old version of a newer item that is present in another mount's
finalized log btree.
To reliably fix this we create a strict item seq discontinuity between
all the finalized merge inputs and all the active log btrees. Once any
log btree is naturally finalized, the server forces all the clients to
group up and finalize all their open log btrees. A merge operation can
then safely operate on all the finalized trees before any new trees are
given to clients, who would start using increasing item seqs.
Signed-off-by: Zach Brown <zab@versity.com>
Add a command for the server to request that clients commit their open
transaction. This will be used to create groups of finalized log
btrees for consistent merging.
Signed-off-by: Zach Brown <zab@versity.com>
We were checking that quorum_slot_nr was within the range of possible
slots allowed by the format as it was parsed. We weren't checking that
it referenced a configured slot. Make sure, and give a nice error
message that shows the configured slots.
Signed-off-by: Zach Brown <zab@versity.com>
During rough forced unmount testing we saw a seemingly mysterious
concurrent election. It could be explained if mounts coming up don't
start with the same term. Let's try having mounts initialize their term
to the greatest of all the terms they can see in the quorum blocks.
This will prevent the situation where some new quorum actors with
greater terms start out ignoring all the messages from others.
Signed-off-by: Zach Brown <zab@versity.com>
Nothing interesting here, just a minor convenience to use test and set
instead of testing and then setting.
Signed-off-by: Zach Brown <zab@versity.com>
The server doesn't give us much to go on when it gets an error handling
requests to work with log trees from the client. This adds a lot of
specific error messages so we can get a better understanding of
failures.
Signed-off-by: Zach Brown <zab@versity.com>
We were trusting the rid in the log trees struct that the client sent.
Compare it to our recorded rid on the connection and fail if the client
sent the wrong rid.
Signed-off-by: Zach Brown <zab@versity.com>
The locking protocol only allows one outstanding invalidation request
for a lock at a time. The client invalidation state is a bit hairy and
involves removing the lock from the invalidation list while it is being
processed which includes sending the response. This means that another
request can arrive while the lock is not on the invalidation list. We
have fields in the lock to record another incoming request which puts
the lock back on the list.
But the invalidation work wasn't always queued again in this case. It
*looks* like the incoming request path would queue the work, but by
definition the lock isn't on the invalidation list during this race. If
it's the only lock in play then the invalidation list will be empty and
the work won't be queued. The lock can get stuck with a pending
invalidation if nothing else kicks the invalidation worker. We saw this
in testing when the root inode lock group missed the wakeup.
The fix is to have the work requeue itself after putting the lock back
on the invalidation list when it notices that another request came in.
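Roughly, with hypothetical names:

    /* after finishing an invalidation, notice that another request
     * arrived while the lock was off the list, put it back, and
     * requeue the work ourselves instead of relying on the request
     * path to do it */
    spin_lock(&linfo->lock);
    if (lock->another_request_pending) {
            list_add_tail(&lock->inv_entry, &linfo->inv_list);
            queue_work(linfo->wq, &linfo->inv_work);
    }
    spin_unlock(&linfo->lock);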
Signed-off-by: Zach Brown <zab@versity.com>