Commit Graph

25 Commits

Author SHA1 Message Date
Auke Kok
1d150da3f0 Use page->lru instead of page->list
With v3.14-rc1-10-g34bf6ef94a83, page->list is removed.  Instead,
use the union member ->lru.
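
A sketch of the shape of the change (the surrounding list code is
illustrative):

    /* before: page->list, removed in 34bf6ef94a83 */
    list_add_tail(&page->list, &pages);

    /* after: the ->lru union member serves the same role */
    list_add_tail(&page->lru, &pages);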

Signed-off-by: Auke Kok <auke.kok@versity.com>
2023-10-09 15:35:40 -04:00
Zach Brown
2e2ccb6f61 Allow replaying srch file rotation
When a client no longer needs to append to a srch file, for whatever
reason, we move the reference from the log_trees item into a specific
srch file btree item in the server's srch file tracking btree.

Zeroing the log_trees item and inserting the server's btree item are
done in a server commit and should be written atomically.

But commit_log_trees had an error handling case that could leave the
newly inserted item dirty in memory without zeroing the srch file
reference in the existing log_trees item.  Future attempts to rotate the
file reference, perhaps by retrying the commit or by reclaiming the
client's rid, would get EEXIST and fail.

This fixes the error handling path to ensure that we'll keep the dirty
srch file btree item and log_trees item in sync.  The desynced items can
still exist in the world, so we'll tolerate getting EEXIST on insertion.
After enough time has passed, or if repair zeroed the duplicate
reference, we could remove this special case from insertion.
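
A minimal sketch of the tolerant insertion, with illustrative btree
call and variable names rather than the exact scoutfs code:

    ret = scoutfs_btree_insert(sb, &srch_root, &key, &sfl, sizeof(sfl));
    if (ret == -EEXIST) {
        /* an old desynced commit already inserted this reference */
        ret = 0;
    }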

Signed-off-by: Zach Brown <zab@versity.com>
2023-01-17 14:33:27 -08:00
Zach Brown
fff07ce19c Use stale block read retrying helper
Transition from manual checking for persistent ESTALE to the shared
helper that we just added.  This should not change behavior.
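
The helper wraps the retry loop that callers had been open-coding;
roughly, with hypothetical helper names:

    init_estale_retry(&retry);                    /* hypothetical */
    do {
        ret = read_block_ref(sb, ref, &bl);       /* hypothetical */
    } while (ret == -ESTALE && retry_estale(&retry));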

Signed-off-by: Zach Brown <zab@versity.com>
2022-12-12 14:59:22 -08:00
Zach Brown
b477604339 Don't clobber srch compact errors
The srch compaction worker is supposed to wait a bit before attempting
another compaction when the compaction it just finished failed.

Unfortunately, it clobbered the errors it got during compaction with the
result of sending the commit to the server with the error flag.  If the
commit is successful then it thinks there were no errors and immediately
re-queues itself to try the next compaction.

If the error is persistent, as it was with a bug in how we merged log
files with a single page's worth of entries, then we can spin
indefinitely: getting an error, clobbering the error with the commit
result, and immediately queueing our work to do it all over again.

This fix preserves existing errors when getting the result of the
commit and will correctly back off.  If we get persistent merge errors,
at least they won't consume significant resources.  We add a counter
for the commit errors so we can get some visibility if this happens.
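
The fix follows the usual pattern of only letting the commit result
overwrite a zero ret (a sketch; names are illustrative):

    err = commit_compact_result(sb, sc, ret);     /* illustrative */
    if (ret == 0)
        ret = err;
    /* only a fully successful pass immediately re-queues the work */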

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
75f9aabe75 Allow compacting logs down to a single page
The k-way merge function at the core of the srch file entry merging had
some bookkeeping math (calculating the number of parents) that couldn't
handle merging a single incoming entry stream, so it threw a warning and
returned an error.  In refusing to handle that case, it was assuming
that the caller was trying to merge down a single log file, which
doesn't make any sense.

But in the case of multiple small unsorted logs we can absolutely end
up with all of their entries stored in one sorted page: a single sorted
input page that merges multiple log files.  The merge function is also
the path that writes to the output file, so we absolutely need to
handle this case.

We now calculate the number of parents more carefully, clamping it to
one parent where deriving it from the number of inputs would otherwise
give "(roundup(1) -> 1) - 1 == 0".  The warning and error are relaxed
to only refuse merging nothing at all.
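
A sketch of the corrected bookkeeping, assuming the parent count is
derived by rounding the input count up to a power of two:

    if (WARN_ON_ONCE(nr == 0))      /* only refuse merging nothing */
        return -EINVAL;
    /* roundup_pow_of_two(1) - 1 would be 0; clamp to one parent */
    nr_parents = max_t(unsigned int, roundup_pow_of_two(nr) - 1, 1);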

The test triggers this case by putting single search entries in the log
files for mounts and unmounting them to force rotation of the mount log
files into mergable rotated log files.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
d5eec7d001 Fix uninitialized srch ret that won't happen
More recent gcc notices that ret in delete_files can be used
uninitialized if nr is 0, while missing that we never call delete_files
in that case.  Seems worth fixing regardless.
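
The fix is simply initializing ret; a sketch with the arguments and
body trimmed:

    static int delete_files(struct super_block *sb, unsigned int nr)
    {
        int ret = 0;    /* gcc can't see that callers ensure nr > 0 */
        unsigned int i;

        for (i = 0; i < nr; i++)
            ret = delete_one_file(sb, i);    /* hypothetical */
        return ret;
    }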

Signed-off-by: Zach Brown <zab@versity.com>
2021-09-13 14:41:07 -07:00
Zach Brown
28759f3269 Rotate srch files as log trees items are reclaimed
The log merging work deletes log trees items once their item roots are
merged back into the fs root.  Those deleted items could still have
populated srch files that would be lost.  We force rotation of the srch
files in the items as they're reclaimed to turn them into rotated srch
files that can be compacted.
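
A sketch of the hook in reclaim, with illustrative field and function
names:

    /* before reclaiming the log_trees item, preserve its srch log */
    if (lt.srch_file.ref.blkno) {
        ret = rotate_srch_file(sb, &super->srch_root, &lt.srch_file);
        if (ret < 0)
            goto out;
    }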

Signed-off-by: Zach Brown <zab@versity.com>
2021-06-17 09:37:45 -07:00
Zach Brown
1259f899a3 srch compaction needs to prepare alloc for commit
The srch client compaction work initializes allocators, dirties blocks,
and writes them out as its transaction.  It forgot to call the
pre-commit allocator prepare function.

The prepare function drops block references used by the meta allocator
during the transaction.  Skipping it leaked block references, which kept
blocks from being freed by the shrinker under memory pressure.
Eventually memory was full of leaked blocks and the shrinker walked all
of them looking for blocks to free, resulting in an effective livelock
that ground the system to a crawl.
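
A sketch of the missing call in the compaction transaction (function
names approximate the pattern and may not match the code exactly):

    ret = scoutfs_alloc_prepare_commit(sb, &alloc, &wri);
    if (ret == 0)
        ret = scoutfs_block_writer_write(sb, &wri);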

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-01 13:04:40 -07:00
Zach Brown
6237f0adc5 Add _block_dirty_ref to dirty blocks in one place
To create dirty blocks in memory each block type caller currently gets a
reference on a created block and then dirties it.  The reference it gets
could be an existing cached block that stale readers are currently
using.  This creates a problem with our block consistency protocol where
writers can dirty and modify cached blocks that readers are currently
reading in memory, leading to read corruption.

This commit is the first step in addressing that problem.  We add a
scoutfs_block_dirty_ref() call which returns a reference to a dirtied
block from the block core in one call.  We're only changing the callers
in this patch but we'll be reworking the dirtying mechanism in an
upcoming patch to avoid corrupting readers.
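
The shape of the change at call sites, as a sketch (the exact
signature may differ):

    /* before: two steps that could dirty a block readers still hold */
    ret = block_read_ref(sb, ref, &bl);          /* illustrative */
    if (ret == 0)
        block_mark_dirty(sb, bl);                /* illustrative */

    /* after: one call returns a reference to an already-dirty block */
    ret = scoutfs_block_dirty_ref(sb, ref, &bl);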

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-01 09:49:17 -08:00
Zach Brown
0969a94bfc Check one block_ref struct in block core
Each of the different block types had a reading function that read a
block and then checked their reference struct for their block type.

This gets rid of each block reference type and has a single block_ref
type which is then checked by a single ref reading function in the block
core.  By putting ref checking in the core we no longer have to export
checking the block header crc, verifying headers, invalidating blocks,
or even reading raw blocks themselves.  Everyone reads refs and leaves
the checking up to the core.
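
The single shared reference type looks roughly like this (field names
are assumed from the description):

    /* one ref for every block type: location plus expected version */
    struct scoutfs_block_ref {
        __le64 blkno;   /* where the block lives */
        __le64 seq;     /* sequence checked by the core on read */
    };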

The changes don't have a significant functional effect.  This is mostly
just changing types and moving code around.  (There are some changes to
visible counters.)

This shares code, which is nice, but this is putting the block reference
checking in one place in the block core so that in a few patches we can
fix problems with writers dirtying blocks that are being read.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-01 09:49:17 -08:00
Zach Brown
d39268bbc1 Fix spurious EIO from scoutfs_srch_get_compact
scoutfs_srch_get_compact() is building up a compaction request which has
a list of srch files to read and sort and write into a new srch file.
It finds input files by searching for a sufficient number of similar
files: first any unsorted log files and then sorted log files that are
around the same size.

It finds the files by using btree next on the srch zone, which has key
types for unsorted srch log files and sorted srch files, but also for
pending and busy compaction items.

It was being far too cute about iterating over different key types.  It
was trying to adapt to whatever next key it found and was making
assumptions about the order of key types.  It didn't notice that the
pending and busy key types followed the log and sorted types, so it
would generate EIO when it ran into them and found that their value
length didn't match what it was expecting.

Rework the next item ref parsing so that it returns -ENOENT if it gets
an unexpected key type, then move on to the next key type when checking
for -ENOENT.
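
A sketch of the reworked iteration, with illustrative type and helper
names:

    ret = srch_next_ref(sb, &key, SCOUTFS_SRCH_LOG_TYPE, &sfl);
    if (ret == -ENOENT) {
        /* walked past log files into other types; try sorted files */
        ret = srch_next_ref(sb, &key, SCOUTFS_SRCH_SORTED_TYPE, &sfl);
    }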

Signed-off-by: Zach Brown <zab@versity.com>
2021-01-26 14:46:07 -08:00
Zach Brown
18aee0ebbd scoutfs: fix lost entries in resumed srch compact
Compacting very large srch files can use up all of a given operation's
metadata allocator.  When this happens we record the compaction's
position in the srch files in the pending item.

We could lose entries when this happens because the kway_next callback
would advance the srch file position as it read entries and put them in
the tournament tree leaves, not as it put them in the output file.
We'd resume from the entries that came after those in the tournament
leaves, losing the entries that were still sitting in the leaves.

This refactors the kway merge callbacks to differentiate between getting
entries at the position and advancing the positions.  We initialize the
tournament leaves by getting entries at the positions and only advance
the position as entries leave the tournament tree and are either stored
in the output srch files or are dropped.
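
The refactored callbacks separate reading an entry from consuming it,
roughly (a sketch with assumed names):

    struct kway_ops {
        /* return the entry at the current position, don't consume */
        struct scoutfs_srch_entry *(*peek)(void *arg);
        /* consume the entry once it has left the tournament tree */
        int (*advance)(void *arg);
    };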

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
c35f1ff324 scoutfs: inc end when search xattrs retries
In the rare case that searching for xattrs only finds deletions within
its window it retries the search past the window.  The end entry is
inclusive and is the last entry that can be returned.  When retrying the
search we need to start from the entry after that to ensure forward
progress.
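
In sketch form, using the entry increment helper added in a sibling
commit:

    /* end is inclusive; resume the next search one entry past it */
    start = end;
    srch_entry_inc(&start);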

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
6770a31683 scoutfs: consistently trim srch entry range
We have to limit the number of srch entries that we'll track while
performing a search for all the inodes that contain xattrs that match
the search hash value.

As we hit the limit on the number of entries to track we have to drop
entries.  As we drop entries we can't return any inodes for entries
past the dropped entries.  We were updating the end point of the search
as we dropped entries past the tracked set, but we weren't updating the
search end point if we dropped the last currently tracked entry.

And we were setting the end point to the dropped entry, not to the
entry before it.  This could lead us to spuriously return deleted
entries if we drop the creation entry and then allow tracking its
deletion later.

This fixes both those problems.  We now properly set the end point to
just before the dropped entry for all entries that we drop.
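
A sketch of the consistent trimming, using the entry decrement helper
added in a sibling commit:

    /* nothing at or past a dropped entry can be returned; trim the
     * search end to just before it, whether the drop was past the
     * tracked set or was the last tracked entry itself */
    end = dropped;
    srch_entry_dec(&end);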

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
9395360324 scoutfs: add srch entry inc/dec
We're going to need to increment and decrement srch entries in coming
fixes.
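
A sketch of the increment, treating the entry fields as one wide
number with carry (native u64 fields assumed for brevity):

    struct scoutfs_srch_entry { u64 hash, ino, id; };   /* sketch */

    static void srch_entry_inc(struct scoutfs_srch_entry *se)
    {
        /* carry from the least significant field upward */
        if (++se->id != 0)
            return;
        if (++se->ino != 0)
            return;
        se->hash++;
    }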

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
7c5823ad12 scoutfs: drop duplicate compacted srch entries
The k-way merge used by srch file compaction only dropped the second
entry in a pair of duplicate entries.  Duplicate entries are both
supposed to be removed so that entries for removed xattrs don't take up
space in the files.

This both drops the second entry and removes the first encoded entry.
As we encode entries we remember their starting offset and the previous
entry that they were encoded from.  When we hit a duplicate entry
we undo the encoding of the previous entry.

This only works within srch file blocks.  We can still have duplicate
entries that span blocks, but that's unlikely and relatively harmless.
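
A sketch of the rollback, with illustrative bookkeeping names:

    if (have_prev && srch_entry_cmp(&se, &prev) == 0) {
        /* duplicate pair: rewind over the first encoded copy and
         * don't encode the second */
        enc_off = prev_off;
        prev = before_prev;
        continue;
    }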

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
560c91a0e4 scoutfs: fix binary search for sorted srch block
The search_xattrs ioctl looks for srch entries in srch files that map
the caller's hashed xattr name to inodes.  As it searches it maintains a
range of entries that it is looking for.  When it searches sorted srch
files for entries it first performs a binary search for the start of the
range and then iterates over the blocks until it reaches the end of its
range.

The binary search for the start of the range was a bit wrong.  If the
start of the range was less than all the blocks then the binary search
could wrap the left index, try to get a file block at a negative index,
and return an error for the search.

This is relatively hard to hit in practice.  You have to search for
the xattr name with the smallest hashed value and have a sorted srch
file that's just the right size so that blk offset 0 is the last block
compared in the binary search, which sets the right index to -1.  With
lots of xattrs, or sorted files of a different length, the search works
fine.

This fixes the binary search so that it specifically records the first
block offset that intersects with the range and tests that the left and
right offsets haven't been inverted.  Now that we're not breaking out of
the binary search loop we can more obviously put each block reference
that we get.
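
The fixed search in sketch form; the block comparison helper is
hypothetical:

    s64 left = 0, right = last_blk, blk = -1;

    while (left <= right) {
        s64 mid = left + ((right - left) >> 1);

        if (block_before_range_start(mid, start)) {  /* hypothetical */
            left = mid + 1;
        } else {
            blk = mid;        /* first block known to intersect */
            right = mid - 1;  /* can reach -1 without wrapping left */
        }
    }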

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
4647a6ccb2 scoutfs: fix srch btree iref puts
The srch code was putting btree item refs outside of success paths.
That's harmless, but refs only need to be put when btree ops return
success and have set the reference.
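
The resulting pattern, sketched (signatures approximate):

    ret = scoutfs_btree_next(sb, root, &key, &iref);
    if (ret == 0) {
        memcpy(&sfl, iref.val, sizeof(sfl));   /* use the item */
        scoutfs_btree_put_iref(&iref);         /* put only on success */
    }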

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-03 09:58:35 -08:00
Zach Brown
ae286bf837 scoutfs: update srch _alloc_meta_low callers
The srch system checks that it has allocator space while deleting srch
files and while merging them and dirtying output blocks.  Update the
callers to check for the correct number of avail or freed blocks that
they need between each check.
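
Each caller now tests for what one pass of its loop actually consumes,
along these lines (the constant is illustrative):

    /* stop when a full deletion pass might not fit in the allocator */
    if (scoutfs_alloc_meta_low(sb, alloc, SRCH_DELETE_LOW_BLOCKS))
        break;   /* commit what we have and resume next transaction */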

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-02 09:25:13 -08:00
Zach Brown
a5d9ac5514 scoutfs: rework scoutfs_alloc_meta_low, takes arg
Previously, scoutfs_alloc_meta_lo_thresh() returned true when a small
static number of metadata blocks were either available to allocate or
had space for freeing.  This didn't make a lot of sense as the correct
number depends on how many allocations each caller will make during
their atomic transaction.

Rework the call to take an argument for the number of avail or freed
blocks available to test.  This first pass just uses the existing
number; we'll get to the callers next.
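
The reworked call, roughly:

    /* true if fewer than nr blocks can be allocated or freed */
    bool scoutfs_alloc_meta_low(struct super_block *sb,
                                struct scoutfs_alloc *alloc, u32 nr);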

Signed-off-by: Zach Brown <zab@versity.com>
2020-12-02 09:25:13 -08:00
Andy Grover
cf278f5fa0 scoutfs: Tidy some enum usage
Prefer named to anonymous enums. This helps readability a little.

Use enum as param type if possible (a couple spots).

Remove unused enum in lock_server.c.

Define enum spbm_flags using shift notation for consistency.

Rename get_file_block()'s "gfb" parameter to "flags" for consistency.
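
For example, the shift notation for the flags enum (member names are
illustrative):

    enum spbm_flags {
        SPBM_FLAG_EXAMPLE_A = (1 << 0),
        SPBM_FLAG_EXAMPLE_B = (1 << 1),
    };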

Signed-off-by: Andy Grover <agrover@versity.com>
2020-11-30 13:35:44 -08:00
Zach Brown
736d9d7df8 scoutfs: remove struct scoutfs_log_trees_val
The log_trees structs store the data that is used by client commits.
The primary struct is communicated over the wire so it includes the rid
and nr that identify the log.  The _val struct was stored in btree item
values and was missing the rid and nr because those were stored in the
item's key.

It's madness to duplicate the entire struct just to shave off those two
fields.  We can remove the _val struct and store the main struct in item
values, including the rid and nr.
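
After the change one struct serves both uses, shaped roughly like this
(other fields elided):

    struct scoutfs_log_trees {
        /* item and bloom roots, allocator heads, etc. */
        __le64 rid;    /* client identity, now stored in values too */
        __le64 nr;     /* log number, likewise */
    };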

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-30 11:14:10 -07:00
Zach Brown
7a3749d591 scoutfs: incremental srch compaction
Previously the srch compaction work would output the entire compacted
file and delete the input files in one atomic commit.  The server would
send the input files and an allocator to the client, and the client
would send back an output file and an allocator that included the
deletion of the input files.  The server would merge in the allocator
and replace the input file items with the output file item.

Doing it this way required giving an enormous allocation pool to the
client in a radix, which would deal with recursive operations
(allocating from and freeing to the radix that is being modified).  We
no longer have the radix allocator, and we use single block avail/free
lists instead of recursively modifying the btrees with free extent
items.  The compaction RPC needs to work with a finite amount of
allocator resources that can be stored in an alloc list block.

The compaction work now does a fixed amount of work and a compaction
operation spans multiple work iterations.

A single compaction struct is now sent between the client and server in
the get_compact and commit_compact messages.  The client records any
partial progress in the struct.  The server writes that position into
PENDING items.  It first searches for pending items to give to clients
before searching for files to start a new compaction operation.

The compact struct has flags to indicate whether the output file is
being written or the input files are being deleted.  The server manages
the flags and sets the input file deletion flag only once the result of
the compaction has been reflected in the btree items which record srch
files.

We added the progress fields to the compaction struct, making it even
bigger than it already was, so we take the time to allocate them rather
than declaring them on the stack.
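
The struct that shuttles between get_compact and commit_compact looks
roughly like this sketch, with fields assumed from the description:

    struct scoutfs_srch_compact {
        u8 flags;             /* writing output vs deleting inputs */
        u8 nr;                /* number of input files */
        struct {
            struct scoutfs_srch_file sfl;
            u64 blk;          /* resume position: block... */
            u64 pos;          /* ...and offset within it */
        } in[MAX_COMPACT_INPUTS];     /* illustrative bound */
        struct scoutfs_srch_file out;
        /* plus the allocator list heads for the finite allocator */
    };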

It's worth mentioning that each operation now taking a reasonably
bounded amount of time will make it feasible to decide that it has
failed and needs to be fenced.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
e60f4e7082 scoutfs: use full extents for data and alloc
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly.  That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.

By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.

Most of this change is churn from changing allocator function and struct
names.

File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity.  All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions.  This now means
that fallocate and especially restoring offline extents can use larger
extents.  Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
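
Data operations call the extent core through a small table of
callbacks, along these lines (a sketch; real signatures may differ):

    struct scoutfs_ext_ops {
        int (*next)(struct super_block *sb, void *arg, u64 start,
                    u64 len, struct scoutfs_extent *ext);
        int (*insert)(struct super_block *sb, void *arg, u64 start,
                      u64 len, u64 map, u8 flags);
        int (*remove)(struct super_block *sb, void *arg, u64 start,
                      u64 len);
    };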

The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing.  The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks.  This resulted in a lot of bugs.  Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction.  We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.

The server now only moves free extents into client allocators when they
fall below a low threshold.  This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.

Signed-off-by: Zach Brown <zab@versity.com>
2020-10-26 15:19:03 -07:00
Zach Brown
f8e1812288 scoutfs: add srch infrastructure
This introduces the srch mechanism that we'll use to accelerate finding
files based on the presence of a given named xattr.  This is an
optimized version of the initial prototype that was using locked btree
items for .indx. xattrs.

This is built around specific compressed data structures, having the
operation cost match the reality of orders of magnitude more writers
than readers, and adopting a relaxed locking model.  Combining all of
this, maintaining the xattrs no longer tanks creation rates while still
providing excellent search latencies, given that searches are defined
as rare and relatively expensive.

The core data type is the srch entry which maps a hashed name to an
inode number.  Mounts can append entries to the end of unsorted log
files during their transaction.  The server tracks these files and
rotates them into a list of files as they get large enough.  Mounts have
compaction work that regularly asks the server for a set of files to
read and combine into a single sorted output file.  The server only
initiates compactions when it sees a number of files of roughly the same
size.  Searches then walk all the committed srch files, both log files
and sorted compacted files, looking for entries that associate an xattr
name with an inode number.
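
The core entry, roughly (field names assumed; on-disk encoding
elided):

    /* maps a hashed xattr name to an inode; duplicates cancel out */
    struct scoutfs_srch_entry {
        __le64 hash;   /* hash of the xattr name */
        __le64 ino;    /* inode that had the xattr */
        __le64 id;     /* orders create/delete instances */
    };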

Signed-off-by: Zach Brown <zab@versity.com>
2020-08-26 14:39:12 -07:00