Commit Graph

1051 Commits

Author SHA1 Message Date
Zach Brown
96f2ad29dc Add inode crtime creation time
Add an inode creation time field.  It's created for all new inodes.
It's visible to stat_more.  setattr_more can set it during
restore.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-08 11:00:30 -07:00
Zach Brown
b4ede2ac6a Allow omap responses to disconnected originators
The omap message lifecycle is a little different than the server's usual
handling, which sends a response from the request handler.  The response
is sent long after the initial receive handler, which pins the connection
to the client, has returned.  It's fine for the response to be dropped.

The main server request handler handled this case but other response
senders didn't.  Put this error handling in the server response sender
itself so that all callers are covered.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-08 09:36:07 -07:00
Zach Brown
cbe8d77f78 Prevent duplicate inode item deletion
We hide I_FREEING inodes from inode lookup to avoid inversions with
cluster locking.  This can result in duplicate inode structs for a
given inode number.  They can then both race to try and delete the same items
for their shared inode number.  This leads to error messages from
evict_inode and could lead to corruption if they, for example, both try
and free the same data extents.

This adds very basic serialization so only one instance can try to
delete items at a time.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
5f682dabb5 Item cache invalidation uses seqs to avoid readers
The item cache has to be careful not to insert stale read items when
previously dirty items have been written and invalidated while a read
was in flight.

This was previously done by recording the possible range of items that a
reader could see based on the key range of its lock.  This is
disastrous when a workload operates entirely within one lock.  I ran
into this when testing a small number of files with massive amounts of
xattrs.  While any reader is in flight none of the pages can be
invalidated because they all intersect with the one lock that covers all
the items in use.

The fix is to more naturally reflect the problem by tracking the
greatest item seq in pages and the earliest seq that any readers
can't see.  This lets invalidate only skip pages with items
that weren't visible to the earliest reader.

This more naturally reflects that the problem is due to the age of the
items, not their position in the key space.  Now only a few of the most
recently modified pages could be skipped and they'll be at the end
of the LRU and won't typically be visited.  As an added benefit it's
now much cheaper to add, delete, and test the active readers.

This fix took rm -rf of a full system's worth of xattrs from minutes of
constantly spinning and skipping every page in the LRU down to seconds
of doing real removal work.
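
In rough standalone C the resulting check looks something like this; the
names and structures are illustrative, not the actual item cache code:

    #include <stdbool.h>
    #include <stdint.h>

    struct cached_page {
            uint64_t max_item_seq;     /* greatest seq of any item in the page */
    };

    struct reader {
            uint64_t first_unseen_seq; /* items at or above this weren't visible */
    };

    /* a page may be invalidated unless it holds items the oldest reader missed */
    static bool can_invalidate(const struct cached_page *pg,
                               const struct reader *readers, int nr)
    {
            uint64_t earliest = UINT64_MAX;
            int i;

            for (i = 0; i < nr; i++)
                    if (readers[i].first_unseen_seq < earliest)
                            earliest = readers[i].first_unseen_seq;

            return pg->max_item_seq < earliest;
    }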

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
29cfa81574 Remove unused leftovers from quorum changes
These forward declarations were for interfaces that have since been
removed or changed and are no longer needed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
73bf916182 Return ENOSPC as space gets low
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole and we use COW transactions
so we need to be able to allocate to free.  This adds support for
returning ENOSPC to client posix allocators as free space gets low.

For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space.  The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks.  In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing).  When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.
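
A minimal sketch of the client-side decision, assuming invented names and
a made-up threshold; only the low flag and the allocating argument come
from the change itself:

    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct meta_alloc {
            uint64_t avail_blocks;
            bool server_set_low;            /* server dipped into reserved blocks */
    };

    #define TRANS_LOW_BLOCKS 64             /* assumed "running low" threshold */

    static int enter_transaction(struct meta_alloc *ma, bool allocating)
    {
            /* frees and overwrites must still make progress under COW */
            if (!allocating)
                    return 0;

            if (ma->server_set_low && ma->avail_blocks < TRANS_LOW_BLOCKS)
                    return -ENOSPC;

            return 0;
    }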

Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.

For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.

The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the two
meta_avail and meta_free allocators.

We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when ENOSPC is
going to be returned for metadata allocations.

We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.

And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-07 14:13:14 -07:00
Zach Brown
9db3b475c0 Stop log merge work earlier during unmount
The forest log merge work calls into the client to send commit requests
to the server.  The forest is usually destroyed relatively late in the
sequence and can still be running after the client is destroyed.

Adding a _forest_stop call lets us stop the log merging work
before the client is destroyed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00
Zach Brown
2957f3e301 Avoid warnings when evict has signals pending
Killing a task can end up in evict and break out of acquiring the locks
to perform final inode deletion.  This isn't necessarily fatal.  The
orphan task will come around and will delete the inode when it is truly
no longer referenced.

So let's silence the error and keep track of how many times it happens.

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:54:56 -07:00
Zach Brown
07210b5734 Reliably delete orphaned inodes
Orphaned items haven't been deleted for quite a while -- the call to the
orphan inode scanner has been commented out for ages.  The deletion of
the orphan item didn't take rid zone locking into account as we moved
deletion from being strictly local to being performed by whoever last
used the inode.

This reworks orphan item management and brings back orphan inode
scanning to correctly delete orphaned inodes.

We get rid of the rid zone that was always _WRITE locked by each mount.
That made it impossible for other mounts to get a _WRITE lock to delete
orphan items.  Instead we rename it to the orphan zone and have orphan
item callers get _WRITE_ONLY locks inside their inode locks.  Now all
nodes can create and delete orphan items as they have _WRITE locks on
the associated inodes.

Then we refresh the orphan inode scanning function.  It now runs
regularly in the background of all mounts.  It avoids creating cluster
lock contention by finding candidates with unlocked forest hint reads
and by testing inode caches locally and via the open map before properly
locking and trying to delete the inode's items.
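
The per-inode decision has roughly this shape; every helper below is a
hypothetical stand-in for the real checks, stubbed so the sketch stands
on its own:

    #include <stdbool.h>
    #include <stdint.h>

    static bool inode_cached_locally(uint64_t ino) { (void)ino; return false; }
    static bool inode_open_elsewhere(uint64_t ino) { (void)ino; return false; } /* open map */
    static int lock_and_delete_inode_items(uint64_t ino) { (void)ino; return 0; }

    static int try_delete_orphan(uint64_t ino)
    {
            /* cheap unlocked checks first so scanning doesn't create contention */
            if (inode_cached_locally(ino) || inode_open_elsewhere(ino))
                    return 0;       /* still in use somewhere, try again later */

            /* only now take the inode and orphan locks and delete the items */
            return lock_and_delete_inode_items(ino);
    }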

Signed-off-by: Zach Brown <zab@versity.com>
2021-07-02 10:52:46 -07:00
Zach Brown
28759f3269 Rotate srch files as log trees items are reclaimed
The log merging work deletes log trees items once their item roots are
merged back into the fs root.  Those deleted items could still have
populated srch files that would be lost.  We force rotation of the srch
files in the items as they're reclaimed to turn them into rotated srch
files that can be compacted.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:37:45 -07:00
Zach Brown
5c3fdb48af Fix btree join item movement
Refilling a btree block by moving items from its siblings as it falls
under the join threshold had some pretty serious mistakes.  It used the
target block's total item count instead of the sibling's when deciding
how many items to move.  It didn't take item moving overruns into
account when deciding to compact so it could run out of contiguous free
space as it moved the last item.  And once it compacted it returned
without moving because the return was meant to be in the error case.

This is all fixed by correctly examining the sibling block to determine
if we should join a block up to 75% full or move a big chunk over,
compacting if the free space doesn't have room for an excessive worst
case overrun, and fixing the compaction error checking return typo.
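
One way to read the corrected logic, with invented names and sizes; only
the 75% figure comes from the description above, and "a big chunk" is
arbitrarily taken as half the sibling:

    #include <stdbool.h>
    #include <stdint.h>

    struct blk {
            uint32_t total_bytes;   /* bytes of items currently in the block */
            uint32_t contig_free;   /* contiguous free bytes for insertions */
    };

    #define BLOCK_SIZE      4096u
    #define JOIN_LIMIT      ((BLOCK_SIZE * 3) / 4)  /* the 75% threshold */
    #define WORST_OVERRUN   256u                    /* assumed overrun margin */

    /* size the move from the sibling's totals and decide whether to compact */
    static uint32_t refill_plan(const struct blk *dst, const struct blk *sib,
                                bool *compact_first)
    {
            uint32_t want = (dst->total_bytes + sib->total_bytes <= JOIN_LIMIT) ?
                            sib->total_bytes :      /* join: take everything */
                            sib->total_bytes / 2;   /* otherwise a big chunk */

            *compact_first = dst->contig_free < want + WORST_OVERRUN;
            return want;
    }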

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
a7828a6410 Add log merge item allocators to alloc detail
The alloc iterator needs to find and include the totals of the avail and
freed allocator list heads in the log merge items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
d67db6662b Fix item cache val_len alignment math
Some item_val_len() callers were applying alignment twice, which isn't
needed.

And additions to erased_bytes as value lengths change didn't take
alignment into account.  They could end up double counting if val_len
changes within the alignment are then accounted for again as the full
item and alignment is later deleted.  Additions to erased_bytes based on
val_len should always take alignment into account.
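
As a hedged illustration of that rule (the 8-byte alignment and names are
assumed, not the exact code):

    #include <stdint.h>

    #define VAL_ALIGN(x)    (((x) + 7u) & ~7u)      /* assumed value alignment */

    static void account_val_shrink(uint64_t *erased_bytes,
                                   uint32_t old_len, uint32_t new_len)
    {
            /* count in aligned units so a later whole-item deletion, which is
             * also counted aligned, can't double count the padding */
            if (VAL_ALIGN(new_len) < VAL_ALIGN(old_len))
                    *erased_bytes += VAL_ALIGN(old_len) - VAL_ALIGN(new_len);
    }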

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
c5c050bef0 Item cache might free null page on alloc error
The item cache allocates a page and a little tracking struct for each
cached page.  If the page allocation fails it might try to free a null
page pointer, which isn't allowed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
96d286d6e5 Zero btree item padding as items are created
Item creation, which fills out a new item at the end of the array of
item structs at the start of the block, didn't explicitly zero the item
struct padding.  It would only have been zero if the memory was
already zero, which is likely for new blocks, but isn't necessarily true
if the memory had previously been used by deleted values.
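
A small sketch of the fix, with a made-up item struct just to show where
the padding hides:

    #include <stdint.h>
    #include <string.h>

    struct btree_item {
            uint64_t seq;           /* assumed field, forces 8-byte alignment */
            uint16_t key_off;
            uint16_t val_off;
            uint16_t val_len;
            /* 2 bytes of tail padding added by the compiler */
    };

    static void init_item(struct btree_item *it)
    {
            memset(it, 0, sizeof(*it));     /* covers the padding bytes too */
            /* ... then fill in the offsets and lengths ... */
    }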

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
9febc6b5dc Update btree block validator for 8byte alignment
The change to aligning values didn't update the btree block verifier's
total length calculation, and while we're in there we can also check
that values are correctly aligned.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
045b3ca8d4 Expand unused btree verifying walker
Previously we had an unused function that could be flipped on to verify
btree blocks during traversal.  This refactors the block verifier a bit
to be called by a verifying walker.  This will let callers walk paths to
leaves to verify the tree around operations, rather than verification
being performed during the next walk.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
ff882a4c4f Add btree total_above_join_low_water() test
Take the condition used to decide if a btree block needs to be joined
and put it in total_above_join_low_water() so that btree_merging will be
able to call it to see if the leaf block it's merging into needs to be
joined.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
3d1a0f06c0 Add scoutfs_btree_free_blocks
Add a btree function for freeing all the blocks in a btree without
having to cow the blocks to track which refs have been freed.  We use a
key from the caller to track which portions of the tree have been freed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
91acf92666 Add client btree merge processing
Add the client work which is regularly scheduled to ask the server for
log merging work to do.  The relatively simple client work gets a
request from the server, finds the log roots to merge given the request
seq, performs the merge with a btree call and callbacks, and commits the
result to the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
9c2122f7de Add server btree merge processing
This adds the server processing side of the btree merge functionality.
The client isn't yet sending the log_merge messages so no merging will
be performed.

The bulk of the work happens as the server processes a get_log_merge
message to build a merge request for the client.  It starts a log merge
if one isn't in flight.  If one is in flight it checks to see if it
should be spliced and maybe finished.  In the common case it finds the
next range to be merged and sends the request to the client to process.

The commit_log_merge handler is the completion side of that request.  If
the request failed then we unwind its resources based on the stored
request item.  If it succeeds we record it in an item for get_
processing to splice eventually.

Then we modify two existing server code paths.

First, get_log_tree doesn't just create or use a single existing log
btree for a client mount.  If the existing log btree is large enough it
sets its finalized flag and advances the nr to use a new log btree.
That makes the old finalized log btree available for merging.

Then we need to be a bit more careful when reclaiming the open log btree
for a client.  We can't use next to find the only open log btree; we use
prev to find the last one and make sure that it isn't already finalized.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
4d3ea3b59b Add format support for log btree merging
Add the format specification for the upcoming btree merging.  Log btrees
gain a finalized field, we add the super btree root and all the items
that the server will use to coordinate merging amongst clients, and we
add the two client net messages which the server will implement.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
298a6a8865 Add server get_stable_trans_seq()
Extract part of the get_last_seq handler into a call that finds the last
stable client transaction seq.  Log merging needs this to determine a
cutoff for stable items in log btrees.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
082924df1a Add scoutfs_key_is_ones()
Add a quick inline for testing that a key is all ones.
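
Something along these lines, written here as standalone C rather than the
kernel inline:

    #include <stdbool.h>
    #include <stddef.h>

    static bool key_is_ones(const unsigned char *key, size_t len)
    {
            size_t i;

            for (i = 0; i < len; i++)
                    if (key[i] != 0xff)
                            return false;
            return true;
    }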

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
d8478ed6f1 Add scoutfs_btree_rebalance()
Add a btree call that just dirties its way down to a leaf block, joining and splitting
along the way so that the blocks in the path satisfy the balance
constraints.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
0538c882bc Add btree_merge()
Add a btree function for merging the items in a range from a number of
read-only input btrees into a destination btree.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:36:00 -07:00
Zach Brown
3a03a6a20c Add SUBTREE btree walk flag to restrict join/merge
Add a BTW_SUBTREE flag to btree_walk() to restrict splitting or joining
of the root block.  When clients are merging into the root built from a
reference to the last parent in the fs tree we want to be careful that
we maintain a single root block that can be spliced back into the fs
tree.  We specifically check that the root block remains within the
split/join thresholds.  If it falls out of compliance we return an error
so that it can be spliced back into the fs tree and then split/joined
with its siblings.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
b6d0a45f6d Add btree_{get,set}_parent
Add calls for working with subtrees built around references to blocks in
the last level of parents.  This will let the server farm out btree
merging work where concurrency is built around safely working with all
the items and leaves that fall under a given parent block.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
d7f8896fac Add scoutfs_btree_parent_range
Add a btree helper for finding the range of keys which are found in
leaves referenced by the last parent block when searching for a given
key.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
65c39e5f97 Item seq is max of trans and lock write_seq
Rename the item version to seq and set it to the max of the transaction
seq and the lock's write_seq.  This lets btree item merging choose a seq
at which all dirty items written in future commits must have greater
seqs.  It can drop the seqs from items written to the fs tree during
btree merging knowing that there aren't any older items out in
transactions that could be mistaken for newer items.
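
The rule itself is just a max; in isolation, with assumed names:

    #include <stdint.h>

    static uint64_t item_seq(uint64_t trans_seq, uint64_t lock_write_seq)
    {
            /* a dirty item carries the newer of the transaction and lock seqs */
            return trans_seq > lock_write_seq ? trans_seq : lock_write_seq;
    }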

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:25:14 -07:00
Zach Brown
3c69861c03 Use core seq for lock write_seq
Rename the write_version lock field to write_seq and get it from the
core seq in the super block.

We're doing this to create a relationship between a client transaction's
seq and a lock's write_seq.  New transactions will have a greater seq
than all previously granted write locks and new write locks will have a
greater seq than all open transactions.  This will be used to resolve
ambiguities in item merging as transaction seqs are written out of order
and write locks span transactions.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-15 15:24:23 -07:00
Zach Brown
05ae756b74 Get trans seq from core seq
Get the next seq for a client transaction from the core seq in the super
block.  Remove its specific next_trans_seq field.

While making this change we switch to only using le64 in the network
message payloads, the rest of the processing now uses natural u64s.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-01 13:46:19 -07:00
Zach Brown
9051ceb6fc Add core seq to the super block
Add a new seq field to the super block which will be the source of all
incremented seqs throughout the system.  We give out incremented seqs to
callers with an atomic64_t in memory which is synced back to the super
block as we commit transactions in the server.
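
A simplified sketch of that lifecycle using C11 atomics in place of the
kernel's atomic64_t; the names are illustrative:

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint64_t core_seq;       /* loaded from the super at startup */

    static uint64_t next_seq(void)
    {
            return atomic_fetch_add(&core_seq, 1) + 1;
    }

    static void sync_super_seq(uint64_t *super_seq_field)
    {
            /* persist the highest seq handed out so far at server commit time */
            *super_seq_field = atomic_load(&core_seq);
    }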

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-01 13:33:30 -07:00
Zach Brown
bad1c602f9 server hold_commit returns void
When we moved to the current allocator we fixed up the server commit
path to initialize the pair of allocators as a commit is finished rather
than before it starts.  This removed all the error cases from
hold_commit.  Remove the error handling from hold_commit calls to make
the system just a bit simpler.

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-01 13:32:26 -07:00
Zach Brown
38a4a56741 Stop writing to other quorum slot blocks
The core quorum work loop assumes that it has exclusive access to its
slot's quorum block.  It uniquely marks blocks it writes and verifies
the marks on read to discover if another mount has written to its slot
under the assumption that this must be a configuration error that put
two mounts in the same slot.

But the design of the leader bit in the block violates the invariant
that a slot's block is only written by that slot's mount.  As the server
comes up and fences previous leaders it writes to their blocks to clear
their leader bits.

The final hole in the design is that because we're fencing mounts, not
slots, each slot can have two mounts in play.  An active mount can be
using the slot and there can still be a persistent record of a previous
mount that crashed in the slot and needs to be fenced.

All this comes together to have the server fence an old mount in a slot
while a new mount is coming up.  The new mount sees the mark change and
freaks out and stops participating in quorum.

The fix is to rework the quorum blocks so that each slot only writes to
its own block.  Instead of the server writing to each fenced mount's
slot, it writes a fence event to its block once all previous mounts have
been fenced.  We add a bit of bookkeeping so that the server can
discover when all block leader fence operations have completed.  Each
event gets its own term so we can compare events to discover live
servers.

We get rid of the write marks and instead have an event that is written
as a quorum agent starts up and is then checked on every read to make
sure it still matches.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-31 13:10:45 -07:00
Zach Brown
1199bac91d Fix quorum server shutdown
If the server shuts down it calls into quorum to tell it that the
server has exited.  This stops quorum from sending heartbeats that
suppress other leader elections.

The function that did this got the logic wrong.  It was setting the bit
instead of clearing it, having been initially written to set a bit when
the server exited.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
877e30d60f Add client address to mounted_client item
Add the peername of the client's connected socket to its mounted_client
item as it mounts.  If the client doesn't recover then fencing can use
the IP to find the host to fence.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:39 -07:00
Zach Brown
0706669047 Clean up quorum block read error messages
The error messages from reading quorum blocks were confusing.  The mark
was being checked when the block had already seen an error, and we got
multiple messages for some errors.

This cleans it up a bit so we only get one error message for each error
source and each message contains relevant context.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
76cef6fdfc Let _recov_next_pending iterate over rids
Currently the server's recovery timeout work synchronously reclaims
resources for each client whose recovery timed out.
scoutfs_recov_next_pending() can always return the head of the pending
list because its caller will always remove it from the list as it
iterates.

As we move to real fencing the server will be creating fence requests
for all the timed out clients concurrently.  It will need to iterate
over all the rids for clients in recovery.

So we sort recovery's pending list by rid and change _recov_next_pending
to return the next pending rid after a rid argument.  This lets the
server iterate over all the pending rids at once.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
933fc687c3 omap remove_rid might not find entry
Client recovery in the server doesn't add the omap rid for all the
clients that it's waiting for.  It only adds the rid as they connect.  A
client whose recovery timeout expires and is evicted will try to have
its omap rid removed without being added.

Today this triggers a warning and returns an error from a time when the
omap rid lifecycle was more rigid.  Now that it's being called by the
server's reclaim_rid, along with a bunch of other functions that succeed
if called for non-existent clients, let's have the omap remove_rid do
the same.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
ab5466a771 Protect server shutting down with smp barriers
I saw a confusing hang that looked like a lack of ordering between
a waker setting shutting_down and a wait event testing it after
being woken up.  Let's see if more barriers help.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
f3764b873b Save previous connected client address
Our connection state spans sockets that can disconnect and reconnect.
While sockets are connected we store the socket's remote address in the
connection's peername and we clear it as sockets disconnect.

Fencing wants to know the last connected address of the mount.  It's a
bit of metadata we know about the mount that can be used to find it and
fence it.  As we store the peer address we also stash it away as the
last known peer address for the socket.  Fencing can then use that
instead of the current socket peer address which is guaranteed to be
uninitialized because there's no socket connected.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
9ebc9d0f66 Manage client reconnect delay
The client currently always queues immediate connect work when its
notify_down is called.  It was assuming that notify_down is only called
from a healthy established connection.  But it's also called for
unsuccessful connect attempts that might not have timed out.  Say the
host is up but the port isn't listening.

This results in spamming connection attempts at the address in an old
stale leader block until a new server is elected, fences the previous
leader, and updates their quorum block.

The fix is to explicitly manage the connection work queueing delay.  We
only set it to immediately queue on mount and when we see a greeting
reply from the server.  We always set it to a longer timeout as we start
a connection attempt.  This means we'll always have a long reconnect
delay unless we really connected to a server.
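
Sketched as a delay policy with made-up names and an arbitrary timeout:

    #define RECONNECT_DELAY_MS 5000         /* assumed long retry delay */

    struct conn_state {
            unsigned long connect_delay_ms;
    };

    /* on mount and on seeing a greeting reply from a real server */
    static void allow_immediate_reconnect(struct conn_state *cs)
    {
            cs->connect_delay_ms = 0;
    }

    /* every time we start a connection attempt, assume it will fail */
    static void starting_connect_attempt(struct conn_state *cs)
    {
            cs->connect_delay_ms = RECONNECT_DELAY_MS;
    }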

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
943351944a Call fencing from the server
The server is responsible for calling the fencing subsystem.  It is the
source of fencing requests as it decides that previous mounts are
unresponsive.  It is responsible for reclaiming resources for fenced
mounts and freeing their associated fence request.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:28 -07:00
Zach Brown
b060eb4f5d Add fencing subsystem
Add the subsystem which tracks pending fence requests and exposes them
to userspace for processing.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:25 -07:00
Zach Brown
2dde729791 Add sysfs create attr w/ parent
Add sysfs attribute creation that can provide the parent dir kobject
instead of always creating the sysfs object dir off of the main
per-mount dir.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:19 -07:00
Zach Brown
ccb7c0bf4b Add rw sysfs attr wrapper
Add a wrapper around __ATTR_RW so that callers can add attributes with a
_store function.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:18:07 -07:00
Zach Brown
e9d04dcf8d Add forced unmount support
Add super_ops->umount_begin so that we can implement a forced unmount
which tries to avoid issuing any more network or storage ops.  It can
return errors and lose unsynchronized data.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-26 14:02:20 -07:00
Zach Brown
54644a5074 Add data_alloc_zone_blocks volume option
Add the data_alloc_zone_blocks volume option.  This changes the
behaviour of the server to try and give mounts free data extents which
fall in exclusive fixed-size zones.

We add the field to the scoutfs_volume_options struct and add it to the
set_volopt server handler which enforces constrains on the size of the
zones.

We then add fields to the log_trees struct which records the size of the
zones and sets bits for the zones that contain free extents in the
data_avail allocator root.  The get_log_trees handler is changed to read
all the zone bitmaps from all the items, pass those bitmaps in to
_alloc_move to direct data allocations, and finally update the bitmaps
in the log_trees items to cover the newly allocated extents.  The
log_trees data_alloc_zone fields are cleared as the mount's logs are
reclaimed to indicate that the mount is no longer writing to the zone.

The policy mechanism of finding free extents based on the bitmaps is
implemented down in _data_alloc_move().
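
The zone bookkeeping amounts to bitmap math along these lines; the sizes
and names here are illustrative, not the on-disk format:

    #include <stdint.h>

    #define ZONE_BITS 64            /* assumed bitmap capacity */

    /* mark the zones touched by a free data extent [start, start + len) */
    static void set_extent_zones(uint64_t bitmap[ZONE_BITS / 64],
                                 uint64_t zone_blocks, uint64_t start, uint64_t len)
    {
            uint64_t z;

            for (z = start / zone_blocks;
                 len && z <= (start + len - 1) / zone_blocks; z++) {
                    if (z >= ZONE_BITS)
                            break;
                    bitmap[z / 64] |= 1ULL << (z % 64);
            }
    }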

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
52c2a465db Add zone awareness to scoutfs_alloc_move()
Add parameters so that scoutfs_alloc_move() can first search for source
extents in specified zones.  It uses relatively cheap searches through
the order items to find extents that intersect with the regions
described by the zone bitmaps.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00