Add a mount option to set the delay between scans of the orphan list.
The sysfs file for the option is writable so this option can be set at
run time.
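A minimal sketch of the shape of such a writable sysfs attribute, with a
hypothetical option struct and attribute name rather than the real
scoutfs code:
    #include <linux/kernel.h>
    #include <linux/kobject.h>
    #include <linux/spinlock.h>
    #include <linux/sysfs.h>

    /* hypothetical per-mount option state, not the real scoutfs layout */
    struct orphan_opts {
        spinlock_t lock;
        u64 orphan_scan_delay_ms;
        struct kobject kobj;
    };

    static ssize_t orphan_scan_delay_ms_store(struct kobject *kobj,
                                              struct kobj_attribute *attr,
                                              const char *buf, size_t count)
    {
        struct orphan_opts *opts = container_of(kobj, struct orphan_opts, kobj);
        u64 ms;
        int ret;

        ret = kstrtou64(buf, 0, &ms);
        if (ret)
            return ret;

        spin_lock(&opts->lock);
        opts->orphan_scan_delay_ms = ms;    /* seen by the next orphan scan */
        spin_unlock(&opts->lock);

        return count;
    }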
Signed-off-by: Zach Brown <zab@versity.com>
The mount options code is some of the oldest in the tree and is weirdly
split between options.c and super.c. This cleans up the options code,
moves it all to options.c, and reworks it to be more in line with the
modern subsystem convention of storing state in an allocated info
struct.
Rather than putting the parsed options in the super for everyone to
directly reference we put them in the private options info struct and
add a locked read function. This will let us add sysfs files to change
mount options while safely serializing with readers.
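A sketch of the read-a-copy pattern, with illustrative type and function
names; callers take a consistent snapshot rather than dereferencing
fields that a sysfs write could change underneath them:
    #include <linux/spinlock.h>
    #include <linux/types.h>

    /* illustrative option fields, not the real scoutfs layout */
    struct mount_options {
        u64 quorum_slot_nr;
        u64 orphan_scan_delay_ms;
    };

    struct options_info {
        spinlock_t lock;
        struct mount_options opts;
    };

    static void options_read(struct options_info *optinf,
                             struct mount_options *opts)
    {
        spin_lock(&optinf->lock);
        *opts = optinf->opts;    /* copy out a consistent snapshot */
        spin_unlock(&optinf->lock);
    }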
All the users of mount options that used to directly reference the
parsed struct now call the read function to get a copy. They're all
small local changes except for quorum which saves a static copy of the
quorum slot number because it references it in so many places and relies
on it not changing.
Finally, we remove the empty debugfs "options" directory.
Signed-off-by: Zach Brown <zab@versity.com>
The inode caller of omap was manually calculating the group and bits,
which isn't fantastic. Export the little helper to calculate it so
the inode caller doesn't have to.
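A sketch of the sort of calculation the exported helper encapsulates,
assuming a hypothetical per-group inode count; the real names and
constants differ:
    #include <linux/types.h>

    #define INOS_PER_GROUP    (64 * 1024)    /* illustrative, not the real value */

    static void omap_group_and_bit(u64 ino, u64 *group_nr, u32 *bit_nr)
    {
        *group_nr = ino / INOS_PER_GROUP;    /* which open map group */
        *bit_nr = ino % INOS_PER_GROUP;      /* bit within that group */
    }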
Signed-off-by: Zach Brown <zab@versity.com>
You can almost feel the editing mistake that brought the delay
calculation into the conditional and forgot to remove the initial
calculation at declaration.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing ABBA deadlocks on the dio_count wait and extent_sem
between fallocate and reads. It turns out that fallocate got lock
ordering wrong.
This brings fallocate in line with the rest of the adherents to the lock
hierarchy. Most importantly, the extent_sem is used after the
dio_count. While we're at it we bring the i_mutex down to just before
the cluster lock for consistency.
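A sketch of the resulting ordering in fallocate, with the
scoutfs-specific pieces stubbed as comments and the extent_sem passed in
rather than found in the real inode info; the ordering, not the names,
is the point:
    #include <linux/fs.h>
    #include <linux/rwsem.h>

    static long fallocate_order_sketch(struct inode *inode,
                                       struct rw_semaphore *extent_sem)
    {
        inode_lock(inode);          /* i_mutex, just before the cluster lock */
        /* ... acquire the cluster lock ... */
        inode_dio_wait(inode);      /* wait for the dio count to drain */
        down_write(extent_sem);     /* extent_sem only after the dio wait */

        /* ... modify extents ... */

        up_write(extent_sem);
        /* ... release the cluster lock ... */
        inode_unlock(inode);
        return 0;
    }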
Signed-off-by: Zach Brown <zab@versity.com>
The server's log merge complete request handler was treating the
absence of the client's original request as a failure. Unfortunately,
this case is possible if a previous server successfully completed the
client's request but the response was lost because it stopped for
whatever reason.
The failure was being logged as a hard error to the console which was
causing tests to occasionally fail during server failover that hit just
as the log merge completion was being processed.
The error was being sent to the client as a response; we just need
silence the message for these expected but rare errors.
We also fix the related case where the server printed the even more
harsh WARN_ON if there was a next original request but it wasn't the one
we expected to find from our requesting client.
Signed-off-by: Zach Brown <zab@versity.com>
The net _cancel_request call hasn't been used or tested in approximately
a bazillion years. Best to get rid of it and have to add and test it
if we think we need it again.
Signed-off-by: Zach Brown <zab@versity.com>
Our open by handle functions didn't care that the inode wasn't
referenced and let tasks open unlinked inodes by number. This
interacted badly with the inode deletion mechanisms which required that
inodes couldn't be cached on other nodes after the transaction which
removed their final reference.
If a task did accidentally open a file by inode while it was being
deleted it could see the inode items in an inconsistent state and return
very confusing errors that look like corruption.
The fix is to give the handle iget callers a flag to tell iget to only
get the inode if it has a positive nlink. If iget sees that the inode
has been unlinked it returns -ENOENT.
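A sketch of the check the flag enables, with a hypothetical flag name;
handle-based lookups pass it so an unlinked inode looks like a missing
one:
    #include <linux/err.h>
    #include <linux/fs.h>

    #define IGET_LINKED_ONLY    (1 << 0)    /* hypothetical flag */

    static struct inode *iget_check_nlink(struct inode *inode, int flags)
    {
        if ((flags & IGET_LINKED_ONLY) && inode->i_nlink == 0) {
            iput(inode);                /* don't expose inodes mid-deletion */
            return ERR_PTR(-ENOENT);
        }
        return inode;
    }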
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can give some indication of inodes that have inode
items. We're exposing this for tests that verify the handling of open
unlinked inodes.
Signed-off-by: Zach Brown <zab@versity.com>
We're adding an ioctl that wants to build inode item keys so let's
export the private inode key initializer.
Signed-off-by: Zach Brown <zab@versity.com>
This reverts commit 61ad844891.
This fix was trying to ensure that lock recovery response handling
can't run after farewell calls reclaim_rid() by jumping through a bunch
of hoops to tear down locking state as the first farewell request
arrived.
It introduced a very slippery use-after-free during shutdown. It appears
that it was from drain_workqueue() previously being able to stop
chaining work. That's no longer possible when you're trying to drain
two workqueues that can queue work in each other.
We found a much clearer way to solve the problem so we can toss this.
Signed-off-by: Zach Brown <zab@versity.com>
We recently found that the server can send a farewell response and try
to tear down a client's lock state while it was still in lock recovery
with the client. The lock recovery response could add a lock
for the client after farewell's reclaim_rid() had thought the client was
gone forever and tore down its locks.
This left a lock in the lock server that wasn't associated with any
clients and so could never be invalidated. Attempts to acquire
conflicting locks with it would hang forever, which we saw as hangs in
testing with lots of unmounting.
We tried to fix it by serializing incoming request handling and
forcefully clobbering the client's lock state as we first got
the farewell request. That went very badly.
This takes another approach of trying to explicitly wait for lock
recovery to finish before sending farewell responses. It's more in
line with the overall pattern of having the client be up and functional
until farewell tears it down.
With this in place we can revert the other attempted fix that was
causing so many problems.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_recov_shutdown() tried to move the recovery tracking structs off
the shared list and into a private list so they could be freed. But
then it went and walked the now empty shared list to free entries. It
should walk the private list.
This would leak a small amount of memory in the rare cases where the
server was shutdown while recovery was still pending.
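The shape of the fix, with illustrative names: entries are spliced onto
a private list and that list, not the now-empty shared one, is walked to
free them:
    #include <linux/list.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct pending_recov {              /* illustrative tracking struct */
        struct list_head head;
    };

    static void free_pending(spinlock_t *lock, struct list_head *shared)
    {
        struct pending_recov *pend;
        struct pending_recov *tmp;
        LIST_HEAD(private);

        spin_lock(lock);
        list_splice_init(shared, &private);    /* shared list is now empty */
        spin_unlock(lock);

        list_for_each_entry_safe(pend, tmp, &private, head)
            kfree(pend);                /* walk the private list, not shared */
    }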
Signed-off-by: Zach Brown <zab@versity.com>
The server's little set_shutting_down() helper accidentally used a read
barrier instead of a write barrier.
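The shape of the fix, with an illustrative struct; the helper publishes
a store, so it wants the write barrier:
    #include <asm/barrier.h>
    #include <linux/types.h>

    struct server_info_sketch {         /* illustrative */
        bool shutting_down;
    };

    static void set_shutting_down(struct server_info_sketch *server, bool val)
    {
        server->shutting_down = val;
        smp_wmb();    /* order the store; smp_rmb() wouldn't order it at all */
    }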
Signed-off-by: Zach Brown <zab@versity.com>
Tear down client lock server state and set a boolean so that
there is no race between client/server processing lock recovery
at the same time as farewell.
Currently there is a bug where if the server and clients are unmounted
then work from the client is processed out of order, which leaves
behind a server_lock for a RID that no longer exists.
In order to fix this we need to serialize SCOUTFS_NET_CMD_FAREWELL
in recv_worker.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
The max_seq and active reader mechanisms in the item cache stop readers
from reading old items and inserting them in the cache after newer items
have been reclaimed by memory pressure. The max_seq field in the pages
must reflect the greatest seq of the items in the page so that reclaim
knows that the page contains items newer than old readers and must not
be removed.
We update the page max_seq as items are inserted or as they're dirtied
in the page. There's an additional subtle effect that the max_seq can
also protect items which have been erased. Deletion items are erased
from the pages as a commit completes. The max_seq in that page will
still protect it from being reclaimed even though no items have that seq
value themselves.
That protection fails if the range of keys containing the erased item is
moved to another page with a lower max_seq. The item mover only
updated the destination page's max_seq for each item that was moved. It
missed that the empty space between the items might have a larger
max_seq from an erased item. We don't know where the erased item is so
we have to assume that a larger max_seq in the source page must be set
on the destination page.
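A sketch of the extra propagation, with illustrative structures: the
mover still takes the max over each moved item's seq, but now also folds
in the source page's max_seq so the protection for erased deletion items
travels with the key range:
    #include <linux/kernel.h>
    #include <linux/list.h>
    #include <linux/types.h>

    struct cached_item {                /* illustrative */
        struct list_head head;
        u64 seq;
    };

    struct cached_page {                /* illustrative */
        struct list_head items;
        u64 max_seq;
    };

    static void move_update_max_seq(struct cached_page *dst,
                                    struct cached_page *src)
    {
        struct cached_item *item;

        list_for_each_entry(item, &src->items, head)
            dst->max_seq = max(dst->max_seq, item->seq);

        /* erased items left no item to inherit from, carry the page's max_seq */
        dst->max_seq = max(dst->max_seq, src->max_seq);
    }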
This could explain very rare item cache corruption where nodes were
seeing deleted directory entry items reappearing. It would take a
specific sequence of events involving large directories with an isolated
removal, a delayed item cache reader, a commit, and then enough
insertions to split the page all happening in precisely the wrong
sequence.
Signed-off-by: Zach Brown <zab@versity.com>
Support the generic renameat2 syscall, then add support for the
RENAME_NOREPLACE flag. To support the flag we need to check the
existence of both entries and return -EEXIST.
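A sketch of the flag handling, with the scoutfs-specific entry lookup
stubbed; only NOREPLACE is accepted and an existing destination returns
-EEXIST:
    #include <linux/dcache.h>
    #include <linux/fs.h>

    static int rename2_flags_sketch(struct dentry *new_dentry, unsigned int flags)
    {
        if (flags & ~RENAME_NOREPLACE)
            return -EINVAL;             /* only NOREPLACE is supported */

        if ((flags & RENAME_NOREPLACE) && d_really_is_positive(new_dentry))
            return -EEXIST;             /* destination already exists */

        /* ... continue with the existing rename path ... */
        return 0;
    }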
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
We do not want to short-circuit btree_walk early, it is
better to handle the force unmount on the caller side.
Therefore, remove this from btree_walk.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
If there is a forced unmount we call _net_shutdown from
umount_begin in order to tell the server and clients to
break out of pending network replies. We then add the call
to abort within the shutdown_worker since most of the mucking
with the send and resend queues is done there.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
Only BUG_ON for inconsistency, not for commit errors or failure to
delete the original request.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
In scoutfs_server_worker we do not properly handle the cleanup of
_block_writer_init and alloc_init. On error paths, if either of those
contexts is initialized, we can call alloc_prepare_commit or
writer_forget_all to ensure we drop the block references and clear the
dirty status of all the blocks in the writer.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
In order to safely free blocks we need to first dirty the work. This
allows the work to be resumed later on without a double free.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
As we update xattrs we need to update any existing old items with the
contents of the new xattr that uses those items. The loop that updated
existing items only took the old xattr size into account and assumed
that the new xattr would use those items. If the new xattr size used
fewer parts then the attempt to update all the old parts that weren't
covered by the new size would go very wrong. The length of the region
in the new xattr would be negative so it'd try to use the max part
length. Worse, it'd copy these max part length regions outside the
input new xattr buffer. Typically this would land in addressable memory
and copy garbage into the unused old items before they were later
deleted.
However, it could access so far outside the input buffer that it could
cross a page boundary into inaccessible memory and fault. We saw this in
the field while trying to repeatedly incrementally shrink a large xattr.
This fixes the loop that updates overlapping items between the new and
old xattr to start with the smaller of their two item counts. Now it
will only update items that are actually used by both xattrs and will
only safely access the new xattr input buffer.
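A sketch of the corrected bound, with hypothetical names and a made-up
part size; only the items used by both xattrs are updated and every copy
stays inside the new value buffer:
    #include <linux/kernel.h>
    #include <linux/types.h>

    #define XATTR_PART_SIZE    4096    /* hypothetical per-item payload size */

    static void update_shared_items(void *new_val, size_t new_size,
                                    unsigned int old_nr, unsigned int new_nr)
    {
        unsigned int nr = min(old_nr, new_nr);    /* smaller of the item counts */
        unsigned int i;

        for (i = 0; i < nr; i++) {
            size_t off = (size_t)i * XATTR_PART_SIZE;
            size_t part = min_t(size_t, new_size - off, XATTR_PART_SIZE);

            /* ... update item i from new_val + off, part bytes ... */
        }
    }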
Signed-off-by: Zach Brown <zab@versity.com>
From now on if we make incompatible changes to structures or messages
then we update the format version and ensure that the code can deal with
all the versions in its supported range.
Signed-off-by: Zach Brown <zab@versity.com>
We had arbitrarily chosen an ioctl code 's' to match scoutfs, but of
course that conflicts. This chooses an arbitrary hole in the upstream
reservations from ioctl-number.rst.
Then we make sure to have our _IO[WR] usage reflect the direction of the
final type parameter. For most of our ioctls userspace is writing an
argument parameter to perform an operation (that often has side
effects). Most of our ioctls should be _IOW because userspace is
writing the parameter, not _IOR (though the operation tends to read
state). A few ioctls copy output back to userspace in the parameter so
they're _IOWR.
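A sketch of the convention with a made-up magic value and argument
struct; the real scoutfs numbers come from the hole chosen above:
    #include <linux/ioctl.h>
    #include <linux/types.h>

    #define EXAMPLE_IOCTL_MAGIC    0xa1    /* illustrative, not the real choice */

    struct example_args {
        __u64 ino;
        __u64 result;
    };

    /* userspace only writes the argument; the operation has side effects */
    #define EXAMPLE_IOC_DO_OP    _IOW(EXAMPLE_IOCTL_MAGIC, 1, struct example_args)
    /* userspace writes the argument and reads results back through it */
    #define EXAMPLE_IOC_QUERY    _IOWR(EXAMPLE_IOCTL_MAGIC, 2, struct example_args)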
Signed-off-by: Zach Brown <zab@versity.com>
The idea here was that we'd expand the size of the struct and
valid_bytes would tell the kernel which fields were present in
userspace's struct. That doesn't combine well with the ioctl convention
of having the size of the type baked into the ioctl number. We'll
remove this to make the world less surprising. If we expand the
interface we'd add additional ioctls and types.
Signed-off-by: Zach Brown <zab@versity.com>
While checking in on some other code I noticed that we have lingering
allocator and writer contexts over in the lock server. The lock server
used to manage its own client state and recovery. We've since moved
that into shared recov functionality in the server. The lock server no
longer manipulates its own btrees and doesn't need these unused
references to the server's contexts.
Signed-off-by: Zach Brown <zab@versity.com>
Introduce some space between the current key zone and type values so
that we have room to insert new keys amongst the current keys if we need
to. A spacing of 4 is arbitrarily chosen as small enough to still give
us intuitively small numbers while leaving enough room to grow, given
how long it's taken to come to the current number of keys.
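An illustration of the spacing with made-up type names; stepping by 4
leaves three free values between each pair of neighbours:
    enum {
        EXAMPLE_TYPE_INODE       = 4,
        EXAMPLE_TYPE_XATTR       = 8,
        EXAMPLE_TYPE_DIRENT      = 12,
        EXAMPLE_TYPE_DATA_EXTENT = 16,
        /* 5-7, 9-11, 13-15, ... stay free for new keys at any sort position */
    };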
Signed-off-by: Zach Brown <zab@versity.com>
The code that updates inode index items on behalf of indexed fields uses
an array to track changes in the fields. Those array indexes were the
raw key type values.
We're about to introduce some sparse space between all the key values so
that we have some room to add keys in the future at arbitrary sort
positions amongst the previous keys.
We don't want the inode index item updating code to keep using raw types
as array indices when the type values are no longer small dense values.
We introduce indirection from type values to array indices to keep the
tracking array in the in-memory inode struct small.
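A sketch of the indirection with hypothetical names and type values; the
in-memory tracking array is sized by the number of indexed fields, not
by the sparse on-disk type values:
    enum index_field {                  /* dense array indices */
        INDEX_FIELD_META_SEQ,
        INDEX_FIELD_DATA_SEQ,
        NR_INDEX_FIELDS,
    };

    /* map a sparse on-disk key type to its dense tracking slot */
    static int type_to_index_field(unsigned int type)
    {
        switch (type) {
        case 8:  return INDEX_FIELD_META_SEQ;    /* illustrative type values */
        case 12: return INDEX_FIELD_DATA_SEQ;
        default: return -1;
        }
    }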
Signed-off-by: Zach Brown <zab@versity.com>
As we freeze the format let's remove this old experiment to try and make
it easier to line up traces from different mounts. It never worked
particularly well and I think it could be argued that trying to merge
trace logs on different machines isn't a particularly meaningful thing
to do. You care about how they interact, not what they were doing at the
same time with their independent resources.
Signed-off-by: Zach Brown <zab@versity.com>
There are a few bad corner cases in the state machine that governs how
client transactions are opened, modified, and committed.
The worst problem is on the server side. All server request handlers
need to cope with resent requests without causing bad side effects.
Both get_log_trees and commit_log_trees would try to fully process
resent requests. _get_log_trees() looks safe because it works with the
log_trees that was stored previously. _commit_log_trees() is not safe
because it can rotate out the srch log file referenced by the sent
log_trees every time it's processed. This could create extra srch
entries which would delete the first instance of entries. Worse still,
by injecting the same block structure into the system multiple times it
ends up causing multiple frees of the blocks that make up the srch file.
The client side problems are slightly different, but related. There
aren't strong constraints which guarantee that we'll only send a commit
request after a get request succeeds. In crazy circumstances the
commit request in the write worker could come before the first get in
mount succeeds. Far worse is that we can send multiple commit requests
for one transaction if it changes as we get errors during multiple
queued write attempts, particularly if we get errors from get_log_trees
after having successfully committed.
This hardens all these paths to ensure a strict sequence of
get_log_trees, transaction modification, and commit_log_trees.
On the server we add *_trans_seq fields to the log_trees struct so that
both get_ and commit_ can see that they've already prepared a commit to
send or have already committed the incoming commit, respectively. We
can use the get_trans_seq field as the trans_seq of the open transaction
and get rid of the entire separate mechanism we used to have for
tracking open trans seqs in the clients. We can get the same info by
walking the log_trees and looking at their *_trans_seq fields.
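A sketch of the comparison with illustrative fields; a resent commit
whose seq the stored log_trees already recorded is acknowledged rather
than re-applied:
    #include <asm/byteorder.h>
    #include <linux/types.h>

    struct log_trees_sketch {           /* illustrative fields */
        __le64 get_trans_seq;           /* seq handed out by get_log_trees */
        __le64 commit_trans_seq;        /* seq of the last applied commit */
    };

    static bool commit_already_applied(struct log_trees_sketch *lt, u64 seq)
    {
        return le64_to_cpu(lt->commit_trans_seq) >= seq;
    }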
In the client we have the write worker immediately return success if
mount hasn't opened the first transaction. Then we don't have the
worker return to allow further modification until it has gotten success
from get_log_trees.
Signed-off-by: Zach Brown <zab@versity.com>
The transaction code was built a million years ago and put all of its
data in our core super block info. This finally moves the rest of the
private transaction fields out of the core super block and into the
transaction info. This makes it clear that it's private to trans.c and
brings it in line with the rest of the subsystems in the tree.
Signed-off-by: Zach Brown <zab@versity.com>
Add tracking in the alloc functions that the server uses to move extents
between allocator structures on behalf of client mounts.
Signed-off-by: Zach Brown <zab@versity.com>
The srch compaction worker will wait a bit before attempting another
compaction as it finishes a compaction that failed.
Unfortunately, it clobbered the errors it got during compaction with the
result of sending the commit to the server with the error flag. If the
commit is successful then it thinks there were no errors and immediately
re-queues itself to try the next compaction.
If the error is persistent, as it was with a bug in how we merged log
files with a single page's worth of entries, then we can spin
indefinitely getting an error, clobbering the error with the commit
result, and immediately queueing our work to do it all over again.
This fix preserves existing errors when getting the result of the commit
and will correctly back off. If we get persistent merge errors at least
they won't consume significant resources. We add a counter for the
commit errors so we can get some visibility if this happens.
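The shape of the fix, with stubbed and hypothetical helpers; an error
from the compaction itself wins over the result of sending the commit:
    #include <linux/fs.h>

    int compact_log_files(struct super_block *sb);        /* hypothetical */
    int send_commit(struct super_block *sb, bool error);  /* hypothetical */

    static int compact_and_commit_sketch(struct super_block *sb)
    {
        int ret;
        int commit_ret;

        ret = compact_log_files(sb);            /* compaction step */
        commit_ret = send_commit(sb, ret != 0); /* tell the server if we failed */
        if (ret == 0)
            ret = commit_ret;    /* preserve the earlier error, still back off */
        return ret;
    }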
Signed-off-by: Zach Brown <zab@versity.com>
The k-way merge function at the core of the srch file entry merging had
some bookkeeping math (calculating number of parents) that couldn't
handle merging a single incoming entry stream, so it threw a warning and
returned an error. When refusing to handle that case, it was assuming
that the caller was trying to merge down a single log file, which doesn't
make any sense.
But in the case of multiple small unsorted logs we can absolutely end up
with their entries stored in one sorted page. We have one sorted input
page that's merging multiple log files. The merge function is also the
path that writes to the output file so we absolutely need to handle this
case.
We more carefully calculate the number of parents, clamping it to one
parent when we'd otherwise get "(roundup(1) -> 1) - 1 == 0" when
calculating the number of parents from the number of inputs. We can
relax the warning and error to refuse to merge nothing.
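A sketch of the clamped bookkeeping with an assumed helper; the exact
parent calculation in scoutfs differs, but the point is the floor of one
parent and the relaxed check that only refuses to merge nothing:
    #include <linux/bug.h>
    #include <linux/errno.h>
    #include <linux/kernel.h>

    unsigned int calc_nr_parents(unsigned int nr_inputs);    /* hypothetical */

    static int kway_parents_sketch(unsigned int nr_inputs)
    {
        unsigned int nr_parents;

        if (WARN_ON_ONCE(nr_inputs == 0))    /* only refuse to merge nothing */
            return -EINVAL;

        nr_parents = calc_nr_parents(nr_inputs);
        nr_parents = max_t(unsigned int, nr_parents, 1);    /* one input, one parent */

        return nr_parents;
    }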
The test triggers this case by putting single search entries in the log
files for mounts and unmounting them to force rotation of the mount log
files into mergeable rotated log files.
Signed-off-by: Zach Brown <zab@versity.com>
Our statfs implementation had clients reading the super block and using
the next free inode number to guess how many inodes there might be. We
are very aggressive with giving directories private pools of inode
numbers to allocate from. They're often not used at all, creating huge
gaps in allocated inode numbers. The ratio of the average number of
allocations per directory to the batch size given to each directory is
the factor that the used inode count can be off by.
Now that we have a precise count of active inodes we can use that to
return accurate counts of inodes in the files fields in the statfs
struct. We still don't have static inode allocation so the fields don't
make a ton of sense. We fake the total and free count to give a
reasonable estimate of the total files that doesn't change while the
free count is calculated from the correct count of used inodes.
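A sketch of the statfs arithmetic with an assumed fake total; only the
used count is real:
    #include <linux/statfs.h>
    #include <linux/types.h>

    #define FAKE_TOTAL_FILES    (1ULL << 48)    /* illustrative constant */

    static void fill_statfs_files(struct kstatfs *kst, u64 inodes_used)
    {
        kst->f_files = FAKE_TOTAL_FILES;                  /* stable-looking total */
        kst->f_ffree = FAKE_TOTAL_FILES - inodes_used;    /* from the real used count */
    }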
While we're at it we add a request to get the summed fields that the
server can cheaply discover in cache rather than having the client
always perform read IOs.
Signed-off-by: Zach Brown <zab@versity.com>
Add an alloc_foreach variant which uses the caller's super to walk the
allocators rather than always reading it off the device.
Signed-off-by: Zach Brown <zab@versity.com>
Add a count of used inodes to the super block and a change in the inode
count to the log_trees struct. Client transactions track the change in
inode count as they create and delete inodes. The log_trees delta is
added to the count in the super as finalized log_trees are deleted.
Signed-off-by: Zach Brown <zab@versity.com>
We had previously started on a relatively simple notion of an
interoperability version which wasn't quite right. This fleshes out
support for a more functional format version. The super blocks have a
single version that defines behaviour of the running system. The code
supports a range of versions and we add some initial interfaces for
updating the version while the system is offline. All of this together
should let us safely change the underlying format over time.
Signed-off-by: Zach Brown <zab@versity.com>
Add a write_nr field to the quorum block header which is incremented
with every write. Each event also gets a write_nr field that is set to
the incremented value from the header. This gives us a history of the
order of event updates that isn't sensitive to misconfigured time.
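A sketch of the bookkeeping with illustrative layouts; the header
counter is bumped for every write and copied into the recorded event:
    #include <asm/byteorder.h>
    #include <linux/types.h>

    struct quorum_hdr_sketch {          /* illustrative */
        __le64 write_nr;
    };

    struct quorum_event_sketch {        /* illustrative */
        __le64 write_nr;
        __le64 ts_sec;                  /* wall clock, can be misconfigured */
    };

    static void stamp_event(struct quorum_hdr_sketch *hdr,
                            struct quorum_event_sketch *ev)
    {
        le64_add_cpu(&hdr->write_nr, 1);    /* incremented with every write */
        ev->write_nr = hdr->write_nr;       /* orders events without trusting time */
    }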
Signed-off-by: Zach Brown <zab@versity.com>
The code that shows the note sections as files uses the section size to
define the size of the notes payload. We don't need to null terminate
the strings to define their lengths. Doing so puts a null in the notes
file which isn't appreciated by many readers.
Signed-off-by: Zach Brown <zab@versity.com>
TCP keepalive probes only work when the connection is idle. They're not
sent when there's unacked send data being retransmitted. If the server
fails while we're retransmitting we don't break the connection and try
to elect and connect to a new server until the very long default
connection timeouts expire or the server comes back and the stale connection is
aborted.
We can set TCP_USER_TIMEOUT to break an unresponsive connection when
there's written data. It changes the behavior of the keepalive probes
so we rework them a bit to clearly apply our timeout consistently
between the two mechanisms.
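A userspace illustration of the two mechanisms being lined up behind one
timeout; the module sets the equivalent options on its kernel sockets,
and the 60 second figure is an example rather than scoutfs's value:
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int set_conn_timeouts(int fd)
    {
        int one = 1;
        int idle = 10, intvl = 10, cnt = 5;          /* ~60s of idle silence */
        unsigned int user_timeout_ms = 60 * 1000;    /* same 60s for unacked data */

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &one, sizeof(one)) ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) ||
            setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &user_timeout_ms,
                       sizeof(user_timeout_ms)))
            return -1;
        return 0;
    }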
Signed-off-by: Zach Brown <zab@versity.com>