Each of the different block types had a reading function that read a
block and then checked its reference struct for its block type.
This gets rid of each block reference type and has a single block_ref
type which is then checked by a single ref reading function in the block
core. By putting ref checking in the core we no longer have to export
checking the block header crc, verifying headers, invalidating blocks,
or even reading raw blocks themselves. Everyone reads refs and leaves
the checking up to the core.
The changes don't have a significant functional effect. This is mostly
just changing types and moving code around. (There are some changes to
visible counters.)
This shares code, which is nice, but the real point is to put the block
reference checking in one place in the block core so that in a few
patches we can fix problems with writers dirtying blocks that are being
read.
Signed-off-by: Zach Brown <zab@versity.com>
The block cache wasn't safely handling the race between readers walking
the rcu radix_tree and the shrinker walking the LRU list. A reader
could get a reference
to a block that had been removed from the radix and was queued for
freeing. It'd clobber the free's llist_head union member by putting the
block back on the lru and both the read and free would crash as they
each corrupted each other's memory. We rarely saw this in heavy load
testing.
The fix is to clean up the use of rcu, refcounting, and freeing.
First, we get rid of the LRU list. Now we don't have to worry about
resolving racing accesses of blocks between two independent structures.
Instead of the shrinker walking the LRU list, we can mark blocks on access
such that shrinking can walk all blocks randomly and expect to quickly
find candidates to shrink.
To make it easier to concurrently walk all the blocks we switch to the
rhashtable instead of the radix tree. It also has nice per-bucket
locking so we can get rid of the global lock that protected the LRU list
and radix insertion. (And it isn't limited to 'long' keys so we can get
rid of the check for max meta blknos that couldn't be cached.)
Now we need to tighten up when read can get a reference and when shrink
can remove blocks. We have presence in the hash table hold a refcount
but we make it a magic high bit in the refcount so that it can be
differentiated from other references. Now lookup can atomically get a
reference to blocks that are in the hash table, and shrinking can
atomically remove blocks when it is the only other reference.
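A minimal sketch of that refcount scheme, with made-up names and an
arbitrary bit choice rather than the actual scoutfs code, could look
like:

    #include <linux/atomic.h>

    /* illustrative: this bit is held while the block is in the rhashtable */
    #define BLOCK_REF_HASHED        (1 << 30)

    /* lookup: only take a reference while the block is still hashed */
    static bool block_get_if_hashed(atomic_t *refcount)
    {
            int old, new;

            do {
                    old = atomic_read(refcount);
                    if (!(old & BLOCK_REF_HASHED))
                            return false;   /* mid-removal, caller retries lookup */
                    new = old + 1;
            } while (atomic_cmpxchg(refcount, old, new) != old);

            return true;
    }

    /* shrink: remove only when the hashed bit is the sole remaining reference */
    static bool block_remove_if_idle(atomic_t *refcount)
    {
            return atomic_cmpxchg(refcount, BLOCK_REF_HASHED, 0) == BLOCK_REF_HASHED;
    }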
We also clean up freeing a bit. It has to wait for the rcu grace period
to ensure that no other rcu readers can reference the blocks it's
freeing. It has to iterate over the list with _safe because it's
freeing as it goes.
Interestingly, when reworking the shrinker I noticed that we weren't
scaling the nr_to_scan from the pages we returned in previous shrink
calls back to blocks. We now divide the input from pages back into
blocks.
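As a rough sketch of the scaling (block_size is an assumed parameter
here, not the real plumbing):

    #include <linux/shrinker.h>
    #include <linux/types.h>

    static unsigned long scan_target_blocks(struct shrink_control *sc,
                                            u32 block_size)
    {
            unsigned long pages_per_block = block_size >> PAGE_SHIFT;

            /* e.g. 64KB blocks with 4KB pages: divide the page count by 16 */
            return sc->nr_to_scan / pages_per_block;
    }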
Signed-off-by: Zach Brown <zab@versity.com>
Previously the quorum configuration specified the number of votes
needed to elect the leader. This was an excessive amount of freedom in
the configuration of the cluster, creating all sorts of problems that
had to be designed around.
Most acutely, though, it required a probabilistic mechanism for mounts
to persistently record that they're starting a server so that future
servers could find and possibly fence them. They would write to a lot
of quorum blocks and trust that it was unlikely that future servers
would overwrite all of their written blocks. Overwriting was always
possible, which would be bad enough, but it also required so much IO
that we had to use long election timeouts to avoid spurious fencing.
These longer timeouts had already gone wrong on some storage
configurations, leading to hung mounts.
To fix this and other problems we see coming, like live membership
changes, we now specifically configure the number and identity of mounts
which will be participating in quorum voting. With specific identities,
mounts now have a corresponding specific block they can write to and
which future servers can read from to see if they're still running.
We change the quorum config in the super block from a single
quorum_count to an array of quorum slots which specify the address of
the mount that is assigned to that slot. The mount argument to specify
a quorum voter changes from "server_addr=$addr" to "quorum_slot_nr=$nr"
which specifies the mount's slot. The slot's address is used for UDP
election messages and TCP server connections.
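An illustrative sketch of the slot-based config (field names and sizes
are assumptions, not the real scoutfs on-disk format):

    #define QUORUM_MAX_SLOTS        15      /* illustrative */

    struct quorum_slot {
            __le32 addr;    /* IPv4 address of the mount assigned to this slot */
            __le16 port;
            __le16 flags;   /* e.g. marks the slot as in use */
    };

    struct quorum_config {
            struct quorum_slot slots[QUORUM_MAX_SLOTS];
    };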
Now that we have specifically configured unique IP addresses for all
the quorum members, we can use UDP messages to send and receive the
vote messages in the raft protocol to elect a leader. The quorum code
doesn't have to read and write disk block votes and is now a more
reasonable core loop that waits for either received network messages or
timeouts to advance the raft election state machine.
The quorum blocks are now used for slots to store their persistent raft
term and to set their leader state. We have event fields in the block
to record the timestamp of the most recent interesting events that
happened to the slot.
Now that raft doesn't use IO, we can leave the quorum election work
running in the background. The raft work in the quorum members is
always running so we can use a much more typical raft implementation
with heartbeats. Critically, this decouples the client and election
life cycles. Quorum is always running and is responsible for starting
and stopping the server. The client repeatedly tries to connect to a
server; it has nothing to do with deciding to participate in quorum.
Finally, we add a quorum/status sysfs file which shows the state of the
quorum raft protocol in a member mount and has the last messages that
were sent to or received from the other members.
Signed-off-by: Zach Brown <zab@versity.com>
Add a new distinguishable return value (ENOBUFS) from the allocator for
when the transaction cannot alloc space. This doesn't mean the
filesystem is full -- opening a new transaction may result in forward
progress.
Alter fallocate and get_blocks code to check for this err val and retry
with a new transaction. Handling actual ENOSPC can still happen, of
course.
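The retry pattern, sketched with placeholder helpers rather than the
real scoutfs entry points, looks roughly like:

    static int alloc_with_retry(struct super_block *sb, u64 count)
    {
            int ret;

            do {
                    ret = hold_transaction(sb);             /* placeholder */
                    if (ret)
                            return ret;

                    ret = alloc_data_blocks(sb, count);     /* placeholder */
                    release_transaction(sb);

                    /* -ENOBUFS means this transaction ran out of allocator
                     * space, not that the fs is full, so retry in a new
                     * transaction; a real -ENOSPC goes back to the caller */
            } while (ret == -ENOBUFS);

            return ret;
    }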
Add counter called "alloc_trans_retry" and increment it from both spots.
Signed-off-by: Andy Grover <agrover@versity.com>
[zab@versity.com: fixed up write_begin error paths]
We were using a trailing owner offset to iterate over btree item values
from the back of the block towards the front. We did this to reclaim
fragmented free space in a block to satisfy an allocation instead of
having to split the block, which is expensive mostly because it has to
allocate and free metadata blocks.
In the before times, we used to compact items by sorting items by their
offset, moving them, and then sorting them by their keys again. The
sorting by keys was expensive so we added these owner offsets to be able
to compact without sorting.
But the complexity of maintaining the owner metadata is not worth it.
We can avoid the expensive sorting by keys by allocating a temporary
array of item offsets and sorting only it by the value offset. That's
nice and quick, it was the key comparisons that were expensive. Then we
can remove the owner offset entirely, as well as the block header final
free region that compaction needed.
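A rough sketch of that compaction, with made-up types: sort a small
scratch array of item/value offsets by value offset and repack values
in that order, with no key comparisons needed.

    #include <linux/sort.h>
    #include <linux/types.h>

    struct off_pair {
            u16 item_off;   /* offset of the item header in the block */
            u16 val_off;    /* offset of the item's value in the block */
    };

    static int cmp_val_off(const void *a, const void *b)
    {
            const struct off_pair *pa = a;
            const struct off_pair *pb = b;

            return (int)pa->val_off - (int)pb->val_off;
    }

    /* fill pairs[] from the block's item headers, then sort cheaply by offset */
    static void sort_by_val_off(struct off_pair *pairs, unsigned int nr)
    {
            sort(pairs, nr, sizeof(pairs[0]), cmp_val_off, NULL);
    }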
And we also don't compact as often in the modern era because we do the
bulk of our work in the item cache instead of in the btree, and we've
changed the split/merge/compaction heuristics to avoid constantly
splitting/merging/compacting when an item population happens to hover
right around a shared threshold.
Signed-off-by: Zach Brown <zab@versity.com>
Previously the srch compaction work would output the entire compacted
file and delete the input files in one atomic commit. The server would
send the input files and an allocator to the client, and the client
would send back an output file and an allocator that included the
deletion of the input files. The server would merge in the allocator
and replace the input file items with the output file item.
Doing it this way required giving an enormous allocation pool to the
client in a radix, which would deal with recursive operations
(allocating from and freeing to the radix that is being modified). We
no longer have the radix allocator, and we use single block avail/free
lists instead of recursively modifying the btrees with free extent
items. The compaction RPC needs to work with a finite amount of
allocator resources that can be stored in an alloc list block.
The compaction work now does a fixed amount of work and a compaction
operation spans multiple work iterations.
A single compaction struct is now sent between the client and server in
the get_compact and commit_compact messages. The client records any
partial progress in the struct. The server writes that position into
PENDING items. It first searches for pending items to give to clients
before searching for files to start a new compaction operation.
The compact struct has flags to indicate whether the output file is
being written or the input files are being deleted. The server manages
the flags and sets the input file deletion flag only once the result of
the compaction has been reflected in the btree items which record srch
files.
We added the progress fields to the compaction struct, making it even
bigger than it already was, so we take the time to allocate them rather
than declaring them on the stack.
It's worth mentioning that each operation now takes a reasonably bounded
amount of time, which will make it feasible to decide that it has failed
and needs to be fenced.
Signed-off-by: Zach Brown <zab@versity.com>
Use alloc_foreach to count the free blocks in all the allocators instead
of sending an RPC to the server. We cache the results so that constant
df calls don't generate a constant stream of IO.
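A hedged sketch of the caching, with made-up field names and an
arbitrary expiry interval:

    static u64 cached_free_blocks(struct super_block *sb,
                                  struct count_cache *cache)
    {
            if (time_after(jiffies, cache->expires)) {
                    /* walk the allocators with alloc_foreach-style iteration */
                    cache->free_blocks = count_free_blocks(sb);     /* placeholder */
                    cache->expires = jiffies + msecs_to_jiffies(500);
            }

            return cache->free_blocks;
    }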
Signed-off-by: Zach Brown <zab@versity.com>
The first pass of the item cache didn't try to reclaim freed space at
all. It would leave behind very sparse pages, the oldest of which
would be reclaimed by memory pressure.
While this worked, it created much more stress on the system than is
necessary. Splitting a page with one key also makes it hard to
calculate the boundaries of the split pages, given that the start and
end keys could be the single item.
This adds a header field which tracks the free space in item cache
pages. Free space is created before the alloc offset by removing items
from the rbtree, but also from shrinking item values when updating or
deleting items.
If we try to split a page with sufficient free space to insert the
largest possible item then we compact the page instead of splitting it.
We copy the items into the front of an unused page and swap the pages.
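The decision amounts to something like the following, with placeholder
names and threshold:

    /* compact in place when the page could still hold the largest item */
    if (pg->free_bytes >= LARGEST_ITEM_BYTES) {
            /* copy live items into the front of a fresh page, swap the pages */
            compact_page(cache, pg);
    } else {
            split_page(cache, pg);
    }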
Signed-off-by: Zach Brown <zab@versity.com>
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly. That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.
By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.
Most of this change is churn from changing allocator function and struct
names.
File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity. All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions. This now means
that fallocate and especially restoring offline extents can use larger
extents. Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing. The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks. This resulted in a lot of bugs. Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction. We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.
The server now only moves free extents into client allocators when they
fall below a low threshold. This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add an allocator which uses btree items to store extents. Both the
client and server will use this for btree blocks, the client will use it
for srch blocks and data extents, and the server will move extents
between the core fs allocator btree roots and the clients' roots.
Signed-off-by: Zach Brown <zab@versity.com>
Add infrastructure for working with extents. Callers provide callbacks
which operate on their extent storage while this code performs the
fiddly splitting and merging of extents. This layer doesn't have any
persistent structures itself; it only operates on native structs in
memory.
Signed-off-by: Zach Brown <zab@versity.com>
The percpu_counter library merges the per-cpu counters with a shared
count when the per-cpu counter gets larger than a certain value. The
default is very small, so we often end up taking a shared lock to update
the count. Use a larger batch so that we take the lock less often.
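In kernels that provide percpu_counter_add_batch(), a minimal sketch
looks like this; the batch value is only illustrative:

    #include <linux/percpu_counter.h>

    #define COUNT_BATCH     1024    /* illustrative, larger than the tiny default */

    static void counter_add(struct percpu_counter *counter, s64 delta)
    {
            /* per-cpu deltas fold into the shared count (under its lock)
             * far less often with a larger batch */
            percpu_counter_add_batch(counter, delta, COUNT_BATCH);
    }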
Signed-off-by: Zach Brown <zab@versity.com>
Now that the item cache is bearing the load of high frequency item
calls, we can remove all the item granular work that the forest was
trying to do. The item cache amortizes the cost of the forest so its
remaining methods can go straight to the btrees and don't need
complicated state to reduce the overhead of item calls.
Signed-off-by: Zach Brown <zab@versity.com>
Add an item cache between fs callers and the forest of btrees. Calling
out to the btrees for every item operation was far too expensive. This
gives us a flexible in-memory structure for working with items that
isn't bound by the constraints of persistent block IO. We can rarely
stream large groups of items to and from the btrees and then use
efficient kernel memory structures for more frequent item operations.
This adds the infrastructure, nothing is calling it yet.
Signed-off-by: Zach Brown <zab@versity.com>
Add forest calls that the item cache will use. It needs to read all the
items in the leaf blocks of the forest btrees which could contain the
key, write dirty items to the log btree, and set bits in the bloom
block as items are dirtied.
Signed-off-by: Zach Brown <zab@versity.com>
In a merge where the input and source trees are the same, the input
block can be an initial pre-cow version of the dirty source block.
Dirtying blocks in the change will clear allocations in the dirty source
block but they will remain in the pre-cow input block. The merge can
then set these blocks in the dst, even though they were also used by
allocation, because they're still set in the pre-cow input block.
This fix is clumsy, but minimal and specific to this problem. A more
thorough fix is being worked on which introduces more staging allocator
trees and should stop calls from modifying the currently active avail
or free trees.
Signed-off-by: Zach Brown <zab@versity.com>
Lock invalidation has to make sure that changes are visible to future
readers. It was syncing if the current transaction is dirty. This was
never optimal, but it wasn't catastrophic when concurrent invalidation
work could all block on one sync in progress.
With the move to a single invalidation worker serially invalidating
locks it became unacceptable. Invalidation happening in the presence of
writers would constantly sync the current transaction while very old
unused write locks were invalidated. Their changes had long since been
committed in previous transactions.
We add a lock field to remember the transaction sequence which could
have been dirtied under the lock. If that transaction has already been
committed by the time we invalidate the lock it doesn't have to sync.
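A hedged sketch of the check, with illustrative field and helper names:

    static int maybe_sync_for_invalidation(struct super_block *sb,
                                           struct held_lock *lock)
    {
            /* only sync if the seq that could have been dirtied under
             * this lock hasn't already been committed */
            if (lock->dirty_trans_seq > last_committed_seq(sb))     /* placeholders */
                    return sync_current_transaction(sb);

            return 0;
    }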
Signed-off-by: Zach Brown <zab@versity.com>
The client lock network message processing callbacks were built to
simply perform the processing work for the message in the networking
work context that it was called in. This particularly makes sense for
invalidation because it has to interact with other components that
require blocking contexts (syncing commits, invalidating inodes,
truncating pages, etc).
The problem is that these messages are per-lock. With the right
workloads we can use all the capacity for executing work just in lock
invalidation work. There is no more work execution available for other
network processing. Critically, the blocked invalidation work is
waiting for the commit thread to get its network responses before
invalidation can make forward progress. I was easily reproducing
deadlocks by leaving behind a lot of locks and then triggering a flood
of invalidation requests on behalf of shrinking due to memory pressure.
The fix is to put locks on lists and have a small fixed number of work
contexts process all the locks pending for each message type. The
network callbacks don't block, they just put the lock on the list and
queue the work that will walk the lists. Invalidation now blocks one
work context, not the number of incoming requests.
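An illustrative sketch of the non-blocking callback side, with made-up
names:

    static void recv_invalidate_request(struct lock_info *linfo,
                                        struct held_lock *lck)
    {
            spin_lock(&linfo->lock);
            if (list_empty(&lck->invalidate_entry))
                    list_add_tail(&lck->invalidate_entry, &linfo->invalidate_list);
            spin_unlock(&linfo->lock);

            /* one worker walks the list and does the blocking invalidation */
            queue_work(linfo->workq, &linfo->invalidate_work);
    }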
There were some wait conditions in work that used to use the lock workq.
Other paths that change those conditions now have to know to queue the
work specifically, not just wake tasks which included blocked work
executors.
The other subtle impact of the change is that we can no longer rely on
networking to shut down message processing work that was happening in
its callbacks. We have to specifically stop our work queues in
_shutdown.
Signed-off-by: Zach Brown <zab@versity.com>
This introduces the srch mechanism that we'll use to accelerate finding
files based on the presence of a given named xattr. This is an
optimized version of the initial prototype that was using locked btree
items for .indx. xattrs.
This is built around specific compressed data structures, matching the
operation cost to the reality of orders of magnitude more writers than
readers, and adopting a relaxed locking model. With all of this
combined, maintaining the xattrs no longer tanks creation rates while
still delivering excellent search latencies, given that searches are
defined as rare and relatively expensive.
The core data type is the srch entry which maps a hashed name to an
inode number. Mounts can append entries to the end of unsorted log
files during their transaction. The server tracks these files and
rotates them into a list of files as they get large enough. Mounts have
compaction work that regularly asks the server for a set of files to
read and combine into a single sorted output file. The server only
initiates compactions when it sees a number of files of roughly the same
size. Searches then walk all the committed srch files, both log files
and sorted compacted files, looking for entries that associate an xattr
name with an inode number.
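Logically an entry carries little more than the hash and the inode; the
following is only a sketch of the contents, not the compressed on-disk
encoding:

    struct srch_entry_sketch {
            __le64 hash;    /* hash of the xattr name */
            __le64 ino;     /* inode that has the xattr */
    };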
Signed-off-by: Zach Brown <zab@versity.com>
The radix allocator has to be careful to not get lost in recursion
trying to allocate metadata blocks for its dirty radix blocks while
allocating metadata blocks for others.
The first pass had used path data structures to record the references to
all the blocks we'd need to modify to reflect the frees and allocations
performed while dirtying radix blocks. Once it had all the path blocks
it moved the old clean blocks into new dirty locations so that the
dirtying couldn't fail.
This had two very bad performance implications. First, it meant that
trying to read clean versions of dirtied trees would always read the old
blocks again because their clean version had been moved to the dirty
version. Typically this wouldn't happen but the server does exactly
this every time it tries to merge freed blocks back into its avail
allocator. This created a significant IO load on the server. Secondly,
that block cache move not being allowed to fail motivated us to move to
a locked rbtree for the block cache instead of the lockless rcu
radix_tree.
This changes the recursion avoidance to use per-block private metadata
to track every block that we allocate and cow rather than move. Each
dirty block knows its parent ref and the blknos it would clear and set.
If dirtying fails we can walk back through all the blocks we dirty and
restore their original references before dropping all the dirty blocks
and returning an error. This lets us get rid of the path structure
entirely and results in a much cleaner system.
This change meant tracking free blocks without clearing them as they're
used to satisfy dirty block allocations. The code now uses a cursor
that walks the avail metadata tree without modifying it. While building
this it became clear that tracking the first set bits of refs doesn't
provide any value if we're always searching from a cursor. The cursor
ends up providing the same benefit of avoiding constantly searching
empty initial bits and refs. Maintaining the first metadata was just
overhead.
Signed-off-by: Zach Brown <zab@versity.com>
The forest item operations were reading the super block to find the
roots that they should read items from.
This was easiest to implement to start, but it is too expensive. We
have to find the roots for every newly acquired lock and every call to
walk the inode seq indexes.
To avoid all these reads we first send the current stable versions of
the fs and logs btrees roots along with root grants. Then we add a net
command to get the current stable roots from the server. This is used
to refresh the roots if stale blocks are encountered and on the seq
index queries.
Signed-off-by: Zach Brown <zab@versity.com>
File data allocations come from radix allocators which are populated by
the server before each client transaction. It's possible to fully
consume the data allocator within one transaction if the number of dirty
metadata blocks is kept low. This could result in premature ENOSPC.
This was happening to the archive-light-cycle test. If the transactions
performed by previous tests lined up just right then the creation of the
initial test files could see ENOSPC and cause all sorts of nonsense in
the rest of the test, culminating in cmp commands stuck in offline
waits.
This introduces high and low data allocator water marks for
transactions. The server tries to fill data allocators for each
transaction to the high water mark and the client forces the commit of a
transaction if its data allocator falls below the low water mark.
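Sketched with placeholder names and thresholds, the two checks look
roughly like:

    /* server: fill the client's data allocator back up to the high
     * water mark before its next transaction */
    static void server_fill_data_alloc(struct server_info *server,
                                       struct data_alloc *da)
    {
            if (data_alloc_free(da) < DATA_ALLOC_HIGH_WATER)
                    move_free_extents(server, da,
                                      DATA_ALLOC_HIGH_WATER - data_alloc_free(da));
    }

    /* client: force a commit (counted by trans_commit_data_alloc_low)
     * once the allocator falls below the low water mark mid-transaction */
    static bool client_should_force_commit(struct data_alloc *da)
    {
            return data_alloc_free(da) < DATA_ALLOC_LOW_WATER;
    }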
The archive-light-cycle test now passes easily and we see the
trans_commit_data_alloc_low counter increasing during the test.
Signed-off-by: Zach Brown <zab@versity.com>
Remove a bunch of unused counters which have accumulated over time as
we've worked on the code and forgotten to remove counters.
Signed-off-by: Zach Brown <zab@versity.com>
The btree forest item storage doesn't have as much item granular state
as the item cache did. The item cache could tell if a cached item was
populated from persistent storage or was created in memory. It could
simply remove created items rather than leaving behind a deletion item.
The cached btree blocks in the btree forest item storage mechanism can't
do this. It has to create deletion items when deleting newly created
items because it doesn't know if the item already exists in the
persistent record or not.
This created a problem with the extent storage we were using. The
individual extent items were stored with a key set to the last logical
block of their extent. As extents grew or shrank they often were
deleted and created at different key values during a transaction. In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent. Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.
Streaming writes ended up doing O(n) work for every extent operation.
It got to be out of hand. This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.
For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.
Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items. The client is no longer working with free extent
items managed by the forest, it's working with free block bitmap btrees
directly. It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.
Previously the client and server would exchange extents with network
messages. Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction. The client doesn't need to send free extents back to
the server so we can remove those tasks and rpcs.
The server no longer has to manage free extents. It transfers block
bitmap items between trees around commits. All of its extent
manipulation can be removed.
The item size portion of transaction item counts is removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed sized
segments.
Signed-off-by: Zach Brown <zab@versity.com>
Transaction commit now has to ask the forest to write the btrees during
a transaction commit instead of writing dirty items in segments. It
also determines if holds fit in the dirty transaction by looking at
dirty btree blocks instead of item counts.
Locking no longer has to invalidate a private item cache because the
forest paths use the btree block cache where inconsistency is discovered
and invalidated as blocks are read.
Signed-off-by: Zach Brown <zab@versity.com>
Previous versions of the system had a simple block cache. This brings
it back with support for blocks that are larger than page size, a more
efficient LRU, and an explicit writer context.
Signed-off-by: Zach Brown <zab@versity.com>
The current quorum voting implementation had some rough edges that
increased the complexity of the system and introduced undesirable
failure modes. We can keep the same basic pattern but move
functionality around a few places, and rethink the quorum voting, to end
up with a meaningfully simpler system.
The motivation for this work was to remove the need to provide a
uniq_name option for every mount instance.
The first big change is to remove the idea of static configuration slots
for mounts. This removes the use of uniq_name. Mounts now simply have
a server_addr mount option instead of using their uniq_name to find
their address in the configuration.
The server can no longer check the configuration to see if a given
connected client's name is found in the quorum config. Instead, clients
set a flag in
their sent greeting which indicates that they're a voter. This removes
the uniq_name from the greeting and mounted client records.
Without a static configuration mounts no longer have dedicated block
locations to write to. We increase the size of the region of quorum
blocks and have voters simply write to a random block. Overwriting vote
blocks is OK because we move from heartbeating design patterns to a
protocol strongly based on raft's election. We're using quorum blocks
to communicate votes instead of network messages and overwriting blocks
is analogous to lossy networks dropping vote messages in the raft
election protocol.
We were using the dedicated per-mount quorum blocks to track mounts that
had been elected and needed to be fenced. We no longer have that
storage so instead we add the idea of an election log that is stored in
every voting block. Readers merge the logs from all the blocks they
read and write the resulting merged log in their block.
With no static quorum configuration we no longer have to worry about the
complexity of changing the slot configurations while they're in use.
The only persistent configuration is the number of votes a candidate
needs to be elected by a quorum.
It was a mistake to use quorum voting blocks to communicate state
between the server and the quorum voters. We can easily move the
unmount_barrier, server address, and fencing state from the quorum
blocks into the super block. The server no longer needs the quorum
election info struct to be able to later write its quorum block. It
instead writes a few fields in the super. There's only one place where
clients need to look to find out who they should connect to or if they
can finish unmount.
Signed-off-by: Zach Brown <zab@versity.com>
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO. As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.
This implements lock recovery by having the lock service recover locks
from clients as it starts up.
First the lock service stores records of connected clients in a btree
off the super block. Records are added as the server receives their
greeting and are removed as the server receives their farewell.
Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.
We add lock recover request and response messages that are used to
communicate locks from the clients to the server.
Signed-off-by: Zach Brown <zab@versity.com>
The current networking code has loose reliability guarantees. If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection. The client resends
requests but no responses are resent. A client's requests could be
processed twice on the same server. The server throws away disconnected
client state.
This was fine, sort of, for the simple requests we had implemented so
far. It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.
This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.
The server keeps track of disconnected clients and restores state if the
same client reconnects. This required some work around the greetings so
that clients and servers can recognize each other. Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.
Now that connections between the client and server are preserved we can
resend responses across reconnection. We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
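An illustrative sketch of the receive-side check, with assumed struct
and field names:

    static bool should_process_recv(struct conn_seqs *seqs, u64 msg_seq)
    {
            if (msg_seq <= seqs->recv_seq)
                    return false;   /* duplicate resent across a reconnect */

            /* recv_seq is echoed back to the sender so it can free its
             * copies of responses up to and including this sequence */
            seqs->recv_seq = msg_seq;
            return true;
    }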
When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.
This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.
Signed-off-by: Zach Brown <zab@versity.com>
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.
The client code gets some shims to send and receive lock messages to and
from the server. Callers use our lock mode constants instead of the
DLM's.
Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.
The biggest change is in the client lock state machine. Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing. We don't have everything
come through a per-lock work queue. Instead we send requests either
from the blocking lock caller or from a shrink work queue. Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.
The different processing contexts leads to a slightly different lock
life cycle. We refactor and separate allocation and freeing from
tracking and removing locks in data structures. We add a _get and _put
to track active use of locks and then async references to locks by
holders and requests are tracked separately.
Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time. We do have to do a bit of work to make sure we process back to
back grant responses and invalidation requests from the server.
As of this change the lock setup and destruction paths are a little
wobbly. They'll be shored up as we add lock recovery between the client
and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quorum election implementation. The mounts that can participate
in the election are specified in a quorum config array in the super
block. Each configured participant is assigned a preallocated block
that it can write to.
All mounts read the quorum blocks to find the member who was elected the
leader and should be running the server. The voting mounts loop reading
voting blocks and writing their vote block until someone is elected with
a majority.
Nothing calls this code yet; this adds the initial implementation and
format.
Signed-off-by: Zach Brown <zab@versity.com>
Currently compaction is only performed by one thread running in the
server. Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.
This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server. This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.
The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight. It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.
A server thread still coordinates which segments are compacted. The
search for a candidate compaction operation is largely unchanged. It
now has to deal with being unable to process a compaction because its
segments are busy. We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests. If there are none at the level we move up to the next level.
The server will only issue a given number of compaction requests to a
client at a time. When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
If a client disconnects the server forgets the compactions it had sent
to that client. If those compactions still need to be processed they'll
be sent to the next client.
The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes. This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.
The server needs to block as it does work for compaction in the
notify_up and response callbacks. We move them out from under spin
locks.
The server needs to clean up allocated segnos for a compaction request
that fails. We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.
Signed-off-by: Zach Brown <zab@versity.com>
It was a bit of an overreach to try and limit duplicate request
processing in the network layer. It introduced acks and the necessity
to resync last_processed_id on reconnect.
In testing compaction requests we saw that request processing stopped if
a client reconnected to a new server. The new server sent low request
ids which the client dropped because they were lower than the ids it got
from the last server. To fix this we'd need to add smarts to reset
ids when connecting to new servers but not existing servers.
In thinking about this, though, there's a bigger problem. Duplicate
request processing protection only works up in memory in the networking
connections. If the server makes persistent changes, then crashes, the
client will resend the request to the new server. It will need to
discover that the persistent changes have already been made.
So while we protected duplicate network request processing between nodes
that reconnected, we didn't protect duplicate persistent side-effects
of request processing when reconnecting to a new server. Once you see
that the request implementations have to take this into account then
duplicate request delivery becomes a simpler instance of this same case
and will be taken care of already. There's no need to implement the
complexity of protecting duplicate delivery between running nodes.
This removes the last_processed_id on the server. It removes resending
of responses and acks. Now that ids can be processed out of order we
remove the special known ID of greeting commands. They can be processed
as usual. When there's only request and response packets we can
differentiate them with a flag instead of a u8 message type.
Signed-off-by: Zach Brown <zab@versity.com>
We had fields in the segment header for the crc but weren't using them.
This calculates the crc on write and verifies it on read. The crc
covers the used bytes in the segment as indicated by the total_bytes
field.
Signed-off-by: Zach Brown <zab@versity.com>
The client and server networking code was a bit too rudimentary.
The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to. We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.
This refactors sending and receiving in both the client and server code
into shared networking code. It's built around a connection struct that
then holds the message state. Both peers on the connection can send
requests and send responses.
The existing code only retransmitted requests down newly established
connections. Requests could be processed twice.
This adds robust reliability guarantees. Requests are resent until
their response is received. Requests are only processed once by a given
peer, regardless of the connection's transport socket. Responses are
reliably resent until acknowledged.
This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal. A following commit will remove all
the unused code.
Signed-off-by: Zach Brown <zab@versity.com>
Add an extent function for iterating backwards through extents. We add
the wrapper and have the extent IO functions call their storage _prev
functions. Data extent IO can now call the new scoutfs_item_prev().
Signed-off-by: Zach Brown <zab@versity.com>
Add an fallocate operation.
This changes the possible combinations of flags in extents and makes it
possible to create extents beyond i_size. This will confuse the rest of
the code in a few places and that will be fixed up next.
Signed-off-by: Zach Brown <zab@versity.com>
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata. The more reasonable
strategy is to warn about the corruption, act accordingly, and leave it
to corrective measures to resolve it. In this case we
continue returning the error that caused us to try and clean up.
Signed-off-by: Zach Brown <zab@versity.com>
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.
We add a client request to allocate an extent of a given length. The
existing segment alloc and free now work with a segment's worth of
blocks.
The server maintains counters in the super block of free blocks instead
of free segments. We maintain an allocation cursor so that allocation
results tend to cycle through the device. It's stored in the super so
that it is maintained across server instances.
This doesn't remove unused dead code to keep the commit from getting too
noisy. It'll be removed in a future commit.
Signed-off-by: Zach Brown <zab@versity.com>
Add a file of extent functions that callers will use to manipulate and
store extents in different persistent formats.
Signed-off-by: Zach Brown <zab@versity.com>