Write locks are given an increasing version number as they're granted
which makes its way into items in the log btrees and is used to find the
most recent version of an item.
The initialization of the lock server's next write_version for granted
locks dates back to the initial prototype of the forest of log btrees.
It is only initialized to zero as the module is loaded. This means that
reloading the module, perhaps by rebooting, resets all the item versions
to 0 and can lead to newly written items being ignored in favour of
older existing items with greater versions from a previous mount.
To fix this we initialize the lock server's write_version to the
greatest of all the versions in items in log btrees. We add a field to
the log_trees struct which records the greatest version which is
maintained as we write out items in transactions. These are read by the
server as it starts.
Then lock recovery needs to include the write_version so that the
lock_server can be sure to set the next write_version past the greatest
version in the currently granted locks.
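The seeding can be sketched in userspace C (the struct and function names here are illustrative, not the real scoutfs identifiers): the server's next write_version must exceed both the greatest version recorded in the log btrees and the greatest version in recovered locks.

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical names; the real scoutfs structures differ */
struct log_trees_sample { uint64_t max_item_vers; };
struct recovered_lock { uint64_t write_version; };

/* the next write_version must be greater than every version already
 * stored in log btree items and every version in recovered locks */
static uint64_t init_next_write_version(const struct log_trees_sample *lt,
                                        int nr_lt,
                                        const struct recovered_lock *lk,
                                        int nr_lk)
{
        uint64_t max = 0;
        int i;

        for (i = 0; i < nr_lt; i++)
                if (lt[i].max_item_vers > max)
                        max = lt[i].max_item_vers;
        for (i = 0; i < nr_lk; i++)
                if (lk[i].write_version > max)
                        max = lk[i].write_version;

        return max + 1;
}
```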
Signed-off-by: Zach Brown <zab@versity.com>
The log_trees structs store the data that is used by client commits.
The primary struct is communicated over the wire so it includes the rid
and nr that identify the log. The _val struct was stored in btree item
values and was missing the rid and nr because those were stored in the
item's key.
It's madness to duplicate the entire struct just to shave off those two
fields. We can remove the _val struct and store the main struct in item
values, including the rid and nr.
Signed-off-by: Zach Brown <zab@versity.com>
Previously the srch compaction work would output the entire compacted
file and delete the input files in one atomic commit. The server would
send the input files and an allocator to the client, and the client
would send back an output file and an allocator that included the
deletion of the input files. The server would merge in the allocator
and replace the input file items with the output file item.
Doing it this way required giving an enormous allocation pool to the
client in a radix, which would deal with recursive operations
(allocating from and freeing to the radix that is being modified). We
no longer have the radix allocator, and we use single block avail/free
lists instead of recursively modifying the btrees with free extent
items. The compaction RPC needs to work with a finite amount of
allocator resources that can be stored in an alloc list block.
The compaction work now does a fixed amount of work and a compaction
operation spans multiple work iterations.
A single compaction struct is now sent between the client and server in
the get_compact and commit_compact messages. The client records any
partial progress in the struct. The server writes that position into
PENDING items. It first searches for pending items to give to clients
before searching for files to start a new compaction operation.
The compact struct has flags to indicate whether the output file is
being written or the input files are being deleted. The server manages
the flags and sets the input file deletion flag only once the result of
the compaction has been reflected in the btree items which record srch
files.
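A minimal model of the flag handling might look like the following (the flag names and fields are assumptions for illustration, not the real on-wire format):

```c
#include <assert.h>
#include <stdint.h>

/* illustrative only; real scoutfs flag names and fields differ */
#define COMPACT_FL_WRITING_OUTPUT  (1u << 0)
#define COMPACT_FL_DELETING_INPUT  (1u << 1)

struct compact_progress {
        uint64_t pos;           /* resume position recorded by the client */
        uint32_t flags;         /* managed by the server */
};

/* the server only sets the input-deletion flag once the compaction
 * result is reflected in the btree items that record srch files */
static void server_advance(struct compact_progress *cp, int output_committed)
{
        if ((cp->flags & COMPACT_FL_WRITING_OUTPUT) && output_committed) {
                cp->flags &= ~COMPACT_FL_WRITING_OUTPUT;
                cp->flags |= COMPACT_FL_DELETING_INPUT;
        }
}
```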
We added the progress fields to the compaction struct, making it even
bigger than it already was, so we now allocate compaction structs
rather than declaring them on the stack.
It's worth mentioning that each operation now takes a reasonably
bounded amount of time, which will make it feasible to decide that it
has failed and needs to be fenced.
Signed-off-by: Zach Brown <zab@versity.com>
Remove the statfs RPC from the client and server now that we're using
allocator iteration to calculate free blocks.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we'd avoided full extents in file data mapping items because
we were deleting items from forest btrees directly. That created
deletion items for every version of file extents as they were modified.
Now we have the item cache which can remove deleted items from memory
when deletion items aren't necessary.
By layering file data extents on an extent layer, we can also transition
allocators to use extents and fix a lot of problems in the radix block
allocator.
Most of this change is churn from changing allocator function and struct
names.
File data extents no longer have to manage loading and storing from and
to packed extent items at a fixed granularity. All those loops are torn
out and data operations now call the extent layer with their callbacks
instead of calling its packed item extent functions. This now means
that fallocate and especially restoring offline extents can use larger
extents. Small file block allocation now comes from a cached extent
which reduces item calls for small file data streaming writes.
The big change in the server is to use more root structures to manage
recursive modification instead of relying on the allocator to notice and
do the right thing. The radix allocator tried to notice when it was
actively operating on a root that it was also using to allocate and free
metadata blocks. This resulted in a lot of bugs. Instead we now double
buffer the server's avail and freed roots so that the server fills and
drains the stable roots from the previous transaction. We also double
buffer the core fs metadata avail root so that we can increase the time
to reuse freed metadata blocks.
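The double buffering can be sketched as a pair of roots with an active index that flips at each commit. This is a simplified userspace model, not the real server structures:

```c
#include <assert.h>
#include <stdint.h>

/* minimal model: the current transaction dirties one root while the
 * server fills and drains the stable root from the previous one */
struct alloc_root { uint64_t blocks; };

struct double_buf {
        struct alloc_root roots[2];
        int active;             /* index dirtied by the current commit */
};

static struct alloc_root *stable_root(struct double_buf *db)
{
        return &db->roots[db->active ^ 1];
}

static void commit_swap(struct double_buf *db)
{
        db->active ^= 1;        /* last commit's dirty root becomes stable */
}
```

Keeping modifications off the stable root is what avoids the recursive allocate-from-the-root-being-modified problem described above.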
The server now only moves free extents into client allocators when they
fall below a low threshold. This reduces the shared modification of the
client's allocator roots which requires cold block reads on both the
client and server.
Signed-off-by: Zach Brown <zab@versity.com>
While checking for lost server commit holds, I noticed that the
advance_seq request path had obviously incorrect unwinding after getting
an error. Fix it up so that it always unlocks and applies its commit.
Signed-off-by: Zach Brown <zab@versity.com>
This introduces the srch mechanism that we'll use to accelerate finding
files based on the presence of a given named xattr. This is an
optimized version of the initial prototype that was using locked btree
items for .indx. xattrs.
This is built around specific compressed data structures, having the
operation cost match the reality of orders of magnitude more writers
than readers, and adopting a relaxed locking model. Combining all of
this, maintaining the xattrs no longer tanks creation rates while
searches retain excellent latencies, given that searches are defined as
rare and relatively expensive.
The core data type is the srch entry which maps a hashed name to an
inode number. Mounts can append entries to the end of unsorted log
files during their transaction. The server tracks these files and
rotates them into a list of files as they get large enough. Mounts have
compaction work that regularly asks the server for a set of files to
read and combine into a single sorted output file. The server only
initiates compactions when it sees a number of files of roughly the same
size. Searches then walk all the committed srch files, both log files
and sorted compacted files, looking for entries that associate an xattr
name with an inode number.
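A toy version of the entry format and an unsorted log file search could look like this (the field names are assumptions, not the real srch item layout):

```c
#include <assert.h>
#include <stdint.h>

/* a srch entry maps a hashed xattr name to an inode number */
struct srch_entry {
        uint64_t hash;
        uint64_t ino;
};

/* writers append entries to unsorted log files during their
 * transaction; searches have to scan those files in full */
static int log_search(const struct srch_entry *ents, int nr,
                      uint64_t hash, uint64_t *inos, int max)
{
        int found = 0, i;

        for (i = 0; i < nr && found < max; i++)
                if (ents[i].hash == hash)
                        inos[found++] = ents[i].ino;
        return found;
}
```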
Signed-off-by: Zach Brown <zab@versity.com>
The get_fs_roots rpc and server interfaces were built around individual
roots. Rebuild them around passing a struct so that we can add roots
without impacting the current users.
Signed-off-by: Zach Brown <zab@versity.com>
The conversion of the super block metadata block counters to units of
large metadata blocks forgot to scale back to the small block size when
filling out the block count fields in the statfs rpc. This resulted in
the free and total metadata use being off by the factor of large to
small block size (default of ~16x at the moment).
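The missing step amounts to multiplying large-block counts by the size ratio before filling in the statfs fields; a sketch with the default sizes:

```c
#include <assert.h>
#include <stdint.h>

/* super counters are in large metadata blocks; statfs reports small
 * blocks, so counts are scaled by the ratio (16 with the defaults) */
#define SMALL_BLOCK_SIZE   4096u
#define LARGE_BLOCK_SIZE   65536u
#define LG_PER_SM          (LARGE_BLOCK_SIZE / SMALL_BLOCK_SIZE)

static uint64_t statfs_small_blocks(uint64_t large_blocks)
{
        return large_blocks * LG_PER_SM;
}
```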
Signed-off-by: Zach Brown <zab@versity.com>
The forest item operations were reading the super block to find the
roots that they should read items from.
This was easiest to implement to start, but it is too expensive. We
have to find the roots for every newly acquired lock and every call to
walk the inode seq indexes.
To avoid all these reads we first send the current stable versions of
the fs and logs btrees roots along with root grants. Then we add a net
command to get the current stable roots from the server. This is used
to refresh the roots if stale blocks are encountered and on the seq
index queries.
Signed-off-by: Zach Brown <zab@versity.com>
Introduce different constants for small and large metadata block
sizes.
The small 4KB size is used for the super block, quorum blocks, and as
the granularity of file data block allocation. The larger 64KB size is
used for the radix, btree, and forest bloom metadata block structures.
The bulk of this is obvious transitions from the old single constant to
the appropriate new constant. But there are a few more involved
changes, though just barely.
The block crc calculation now needs the caller to pass in the size of
the block. The radix function that returned free bytes now returns
free blocks, and the caller is responsible for knowing how big its
managed blocks are.
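A userspace sketch of a checksum helper that takes the block size from the caller; a toy additive sum stands in for the real crc:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* with two block sizes the helper can no longer assume a single
 * constant; callers pass the size of the block they're checking */
struct block_hdr { uint32_t crc; /* ... */ };

static uint32_t block_sum(const void *blk, size_t size)
{
        const uint8_t *p = blk;
        uint32_t sum = 0;
        size_t i;

        /* skip the crc field itself at the start of the block */
        for (i = sizeof(uint32_t); i < size; i++)
                sum += p[i];
        return sum;
}
```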
Signed-off-by: Zach Brown <zab@versity.com>
The btree currently uses variable length big-endian buffers that are
compared with memcmp() as keys. This is a historical relic of the time
when keys could be very large. We had dirent keys that included the
name and manifest entries that included those fs keys.
But now all the btree callers are jumping through hoops to translate
their fs keys into big-endian btree keys. And the memcmp() of the
keys is showing up in profiles.
This makes the btree take native scoutfs_key structs as its key. The
forest callers which are working with fs keys can just pass their keys
straight through. The server btree callers with their private btrees
get key fields defined for their use instead of having individual
big-endian key structs.
A nice side-effect of this is that splitting parents doesn't have to
assume that a maximal key will be inserted by a child split. We can
have more keys in parents and wider trees.
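A field-by-field compare over a native key struct avoids both the big-endian translation and the memcmp() that showed up in profiles. The fields below are simplified stand-ins for the real scoutfs_key:

```c
#include <assert.h>
#include <stdint.h>

/* simplified stand-in for the native scoutfs_key struct */
struct key { uint64_t zone, ino, type, off; };

/* compare native fields in precedence order; no byte swapping needed */
static int key_cmp(const struct key *a, const struct key *b)
{
        if (a->zone != b->zone)
                return a->zone < b->zone ? -1 : 1;
        if (a->ino != b->ino)
                return a->ino < b->ino ? -1 : 1;
        if (a->type != b->type)
                return a->type < b->type ? -1 : 1;
        if (a->off != b->off)
                return a->off < b->off ? -1 : 1;
        return 0;
}
```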
Signed-off-by: Zach Brown <zab@versity.com>
The calls for holding and applying commits in the server are currently
private. The lock server is a server component that has been separated
out into its own file. Most of the time the server calls it during
commits so the btree changes made in the lock server are protected by
the commits. But there are btree calls in the lock server that happen
outside of calls from the server.
Exporting these calls will let the lock server make all its btree
changes in server commits.
Signed-off-by: Zach Brown <zab@versity.com>
File data allocations come from radix allocators which are populated by
the server before each client transaction. It's possible to fully
consume the data allocator within one transaction if the number of dirty
metadata blocks is kept low. This could result in premature ENOSPC.
This was happening to the archive-light-cycle test. If the transactions
performed by previous tests lined up just right then the creation of the
initial test files could see ENOSPC and cause all sorts of nonsense in
the rest of the test, culminating in cmp commands stuck in offline
waits.
This introduces high and low data allocator water marks for
transactions. The server tries to fill data allocators for each
transaction to the high water mark and the client forces the commit of a
transaction if its data allocator falls below the low water mark.
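The watermark behavior can be modeled with two thresholds; the values here are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* illustrative thresholds; the real values differ */
#define DATA_ALLOC_HIGH 1024u
#define DATA_ALLOC_LOW   128u

/* server: top the client's data allocator up to the high water mark */
static uint64_t server_fill(uint64_t avail)
{
        return avail >= DATA_ALLOC_HIGH ? 0 : DATA_ALLOC_HIGH - avail;
}

/* client: force a commit when the allocator drops below the low mark */
static int client_should_commit(uint64_t avail)
{
        return avail < DATA_ALLOC_LOW;
}
```

Forcing the commit lets the server refill the allocator before the transaction can hit a premature ENOSPC.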
The archive-light-cycle test now passes easily and we see the
trans_commit_data_alloc_low counter increasing during the test.
Signed-off-by: Zach Brown <zab@versity.com>
Add specific error messages for failures that can happen as the server
commits log trees from the client. These are severe enough that we'd
like to know about them.
Signed-off-by: Zach Brown <zab@versity.com>
The first pass at the radix allocator wasn't paying a lot of attention
to the allocation cursors.
This more carefully manages them. They're only advanced after
allocating. Previously the metadata alloc cursor was advanced as it
searched through leaves that it might allocate from. We test for
wrapping past the specific final allocatable bit, rather than the limit
of what the radix height can store. This required pushing knowledge of
metadata or data allocs down through some of the code paths.
Signed-off-by: Zach Brown <zab@versity.com>
Reclaim freed metadata blocks in the server by merging the stable freed
tree into the allocator as a commit opens and we can trust that the
stable version of the freed allocator in the super is a strict subset of
the allocator's dirty freed tree.
Signed-off-by: Zach Brown <zab@versity.com>
Server processing paths had open coded management of holding and
applying transactions. Refactor that into hold_commit() and
apply_commit() helpers. It makes the code a whole lot clearer and gives
us a place in hold_commit() to add code that needs to be run before
anything is modified in a commit on the server.
Signed-off-by: Zach Brown <zab@versity.com>
The server now consistently reclaims free space in client allocator
radix trees. It merges the client's freed trees as the client
opens a new transaction. And it reclaims all the client's trees
when it is removed.
Signed-off-by: Zach Brown <zab@versity.com>
The removal of extent allocators in the server removed the tracking of
total free blocks in the system as extents were allocated and freed.
This restores tracking of total free blocks by observing the difference
in each allocator's sm_total count as a new version is stored during a
commit on the server.
We change the single free_blocks counter in the super to separate counts
of free metadata and data blocks to reflect the metadata and data
allocators. The statfs net command is updated.
Signed-off-by: Zach Brown <zab@versity.com>
Convert metadata block and file data extent allocations to use the radix
allocator.
Most of this is simple transitions between types and calls. The server
no longer has to initialize blocks because mkfs can write a single
radix parent block with fully set parent refs to initialize a full
radix. We remove the code and fields that were responsible for adding
uninitialized data and metadata.
The rest of the unused block allocator code is only ifdefed out. It'll
be removed in a separate patch to reduce noise here.
Signed-off-by: Zach Brown <zab@versity.com>
The btree forest item storage doesn't have as much item granular state
as the item cache did. The item cache could tell if a cached item was
populated from persistent storage or was created in memory. It could
simply remove created items rather than leaving behind a deletion item.
The btree forest item storage mechanism, with its cached btree blocks,
can't do this. It has to create deletion items when deleting newly
created items because it doesn't know whether the item already exists
in the persistent record.
This created a problem with the extent storage we were using. The
individual extent items were stored with a key set to the last logical
block of their extent. As extents grew or shrank they often were
deleted and created at different key values during a transaction. In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent. Then searches for
an extent covering a block would have to skip over all these deleted
items before hitting the current stored extent.
Streaming writes would perform O(n) work for every extent operation.
It got out of hand. This large change solves the problem by using
more coarse and stable item storage to track free blocks and blocks
mapped into file data.
For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.
Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed size items. The client is no longer working with free extent
items managed by the forest, it's working with free block bitmap btrees
directly. It needs access to the client's metadata block allocator and
block write contexts so we move those two out of the forest code and up
into the transaction.
Previously the client and server would exchange extents with network
messages. Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction. The client doesn't need to send free extents back to
the server so we can remove those tasks and rpcs.
The server no longer has to manage free extents. It transfers block
bitmap items between trees around commits. All of its extent
manipulation can be removed.
The item size portion of transaction item counts are removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items we pack into fixed sized
segments.
Signed-off-by: Zach Brown <zab@versity.com>
Add a simple start of a command that the client will use to commit its
dirty trees. This'll be expanded in the future to include more trees
and block allocation.
Signed-off-by: Zach Brown <zab@versity.com>
Teach the server to maintain and use its block allocator and writer
contexts when operating on its btrees.
The manifest tree operations aren't updated because they're about to be
removed.
Signed-off-by: Zach Brown <zab@versity.com>
Use the client's rid in networking instead of the node_id.
The node_id no longer has to be allocated by the server and sent in the
greeting. Instead the client sends its rid to the server in its
greeting.
The server then uses the client's announced rid just like it used to use
its node_id. It's used to record clients in the btree and to
identify clients in send and receive processing.
The use of the rid in networking calls makes its way to locking and
compaction which now use the rid to identify clients instead of the
node_id.
Signed-off-by: Zach Brown <zab@versity.com>
The current quorum voting implementation had some rough edges that
increased the complexity of the system and introduced undesirable
failure modes. We can keep the same basic pattern but move
functionality around a few places, and rethink the quorum voting, to end
up with a meaningfully simpler system.
The motivation for this work was to remove the need to provide a
uniq_name option for every mount instance.
The first big change is to remove the idea of static configuration slots
for mounts. This removes the use of uniq_name. Mounts now simply have
a server_addr mount option instead of using their uniq_name to find
their address in the configuration.
The server can't check the configuration to see if a given connected
client's name is found in the quorum config. Clients can set a flag in
their sent greeting which indicates that they're a voter. This removes
the uniq_name from the greeting and mounted client records.
Without a static configuration mounts no longer have dedicated block
locations to write to. We increase the size of the region of quorum
blocks and have voters simply write to a random block. Overwriting vote
blocks is OK because we move from heartbeating design patterns to a
protocol strongly based on raft's election. We're using quorum blocks
to communicate votes instead of network messages and overwriting blocks
is analogous to lossy networks dropping vote messages in the raft
election protocol.
We were using the dedicated per-mount quorum blocks to track mounts that
had been elected and needed to be fenced. We no longer have that
storage so instead we add the idea of an election log that is stored in
every voting block. Readers merge the logs from all the blocks they
read and write the resulting merged log in their block.
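Merging election logs can be sketched as a union keyed by term; the entry layout below is an assumption for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* illustrative entry; the real log records elected leaders that may
 * still need to be fenced */
struct elect_ent { uint64_t term; uint64_t rid; };

/* union of two logs, keeping one entry per distinct term; a tiny
 * O(n*m) scan is enough for a sketch */
static int merge_logs(const struct elect_ent *a, int na,
                      const struct elect_ent *b, int nb,
                      struct elect_ent *out, int max)
{
        int n = 0, i, j, dup;

        for (i = 0; i < na && n < max; i++)
                out[n++] = a[i];
        for (j = 0; j < nb && n < max; j++) {
                dup = 0;
                for (i = 0; i < na; i++)
                        if (a[i].term == b[j].term)
                                dup = 1;
                if (!dup)
                        out[n++] = b[j];
        }
        return n;
}
```

Because every reader writes the merged result back, log entries survive individual blocks being overwritten by later random-block votes.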
With no static quorum configuration we no longer have to worry about the
complexity of changing the slot configurations while they're in use.
The only persistent configuration is the number of votes a candidate
needs to be elected by a quorum.
It was a mistake to use quorum voting blocks to communicate state
between the server and the quorum voters. We can easily move the
unmount_barrier, server address, and fencing state from the quorum
blocks into the super block. The server no longer needs the quorum
election info struct to be able to later write its quorum block. It
instead writes a few fields in the super. There's only one place where
clients need to look to find out who they should connect to or if they
can finish unmount.
Signed-off-by: Zach Brown <zab@versity.com>
The pattern of advancing and writing a "dirty super" comes from the time
when the format had two persistent super blocks. One was kept in memory
and modified as changes were made. Advancing it changed which of the
two supers would be eventually written.
This no longer makes sense now that we only have one super block.
Remove the idea of advancing and writing an implicit dirty super block
that's stored in the super block info. Instead use a single
scoutfs_write_super() which takes the super block struct to write.
We still store and increment the hdr.gen in the super block. It used to
be used to tell which of the two super blocks are more recent, now it is
just some information that can tell us something about the life of the
super block.
Signed-off-by: Zach Brown <zab@versity.com>
An elected leader writes a quorum block showing that it's elected before
it assumes exclusive access to the device and starts bringing up the
server. This lets another later elected leader find and fence it if
something happens.
Other mounts were trying to connect to the server once this elected
quorum block was written and before the server was listening. They'd
get connection refused, decide to elect a new leader, and try to fence
the server that's still running.
Now, they should have tried much harder to connect to the elected leader
instead of taking a single failed attempt as fatal. But that's a
problem for another day that involves more work in balancing timeouts
and retries.
But mounts should not try to connect to the server until it's
listening. That's easy to signal by adding a simple listening flag to
the quorum block. Now mounts will only try to connect once they see the
listening flag and don't see these racy refused connections.
Signed-off-by: Zach Brown <zab@versity.com>
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount. We can't
let unmounting clients leave the remaining mounted clients without
quorum.
The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests. It only sends responses to voting
mounts while quorum remains or once all the voting clients are trying
to unmount.
We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to reestablish quorum.
The commit introduces and maintains the unmount_barrier field in the
quorum blocks. It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.
The commit then has the clients send their unique name to the server
who stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.
Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shutdown and re-established. This also makes it easier to
make global decisions based on the count of pending farewell requests.
Signed-off-by: Zach Brown <zab@versity.com>
We were relying on a cute (and probably broken) trick of defining
pointers to unaligned base types with __packed. Modern versions of gcc
warn about this.
Instead we either directly access unaligned types with get_ and
put_unaligned, or we copy unaligned data into aligned copies before
working with it.
Signed-off-by: Zach Brown <zab@versity.com>
Currently the server tracks the outstanding transaction sequence numbers
that clients have open in a simple list in memory. It's not properly
cleaned up if a client unmounts, and a new server that takes over
after a crash won't know about open transaction sequence numbers.
This stores open transaction sequence numbers in a shared persistent
btree instead of in memory. It removes tracking for clients as they
send their farewell during unmount. A new server that starts up will
see existing entries for clients that were created by old servers.
This fixes a bug where a client who unmounts could leave behind a
pending sequence number that would never be cleaned up and would
indefinitely limit the visibility of index items that came after it.
Signed-off-by: Zach Brown <zab@versity.com>
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO. As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.
This implements lock recovery by having the lock service recover locks
from clients as it starts up.
First the lock service stores records of connected clients in a btree
off the super block. Records are added as the server receives their
greeting and are removed as the server receives their farewell.
Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.
We add lock recover request and response messages that are used to
communicate locks from the clients to the server.
Signed-off-by: Zach Brown <zab@versity.com>
The current networking code has loose reliability guarantees. If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection. The client resends
requests but no responses are resent. A client's requests could be
processed twice on the same server. The server throws away disconnected
client state.
This was fine, sort of, for the simple requests we had implemented so
far. It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.
This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.
The server keeps track of disconnected clients and restores state if the
same client reconnects. This required some work around the greetings so
that clients and servers can recognize each other. Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.
Now that connections between the client and server are preserved we can
resend responses across reconnection. We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
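The duplicate suppression reduces to tracking the greatest sequence number processed from the peer; a minimal model, not the real net code:

```c
#include <assert.h>
#include <stdint.h>

/* each side tags outgoing messages with an increasing seq and
 * remembers the greatest seq it has processed from its peer */
struct recv_state { uint64_t greatest_seen; };

static int should_process(struct recv_state *rs, uint64_t seq)
{
        if (seq <= rs->greatest_seen)
                return 0;       /* duplicate resent across reconnect */
        rs->greatest_seen = seq;
        return 1;
}
```

Sending the greatest-seen seq back to the peer is what lets it free responses that are known to have been received.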
When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.
This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.
Signed-off-by: Zach Brown <zab@versity.com>
The network greeting exchange was mistakenly using the global super
block magic number instead of the per-volume fsid to identify the
volumes that the endpoints are working with. This prevented the check
from doing its only job: to fail when clients in one volume try to
connect to a server in another.
Signed-off-by: Zach Brown <zab@versity.com>
Currently all mounts try to get a dlm lock which gives them exclusive
access to become the server for the filesystem. That isn't going to
work if we're moving to locking provided by the server.
This uses quorum election to determine who should run the server. We
switch from long running server work blocked trying to get a lock to
calls which start and stop the server.
Signed-off-by: Zach Brown <zab@versity.com>
Add the core lock server code for providing a lock service from our
server. The lock messages are wired up but nothing calls them.
Signed-off-by: Zach Brown <zab@versity.com>
The server forgot to initialize ret to 0 and might return
undefined errnos if a client asked it to free zero extents.
Signed-off-by: Zach Brown <zab@versity.com>
Currently compaction is only performed by one thread running in the
server. Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.
This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server. This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.
The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight. It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.
A server thread still coordinates which segments are compacted. The
search for a candidate compaction operation is largely unchanged. It
now has to deal with being unable to process a compaction because its
segments are busy. We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests. If there are none at the level we move up to the next level.
The server will only issue a given number of compaction requests to a
client at a time. When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
If a client disconnects the server forgets the compactions it had sent
to that client. If those compactions still need to be processed they'll
be sent to the next client.
The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes. This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.
The server needs to block as it does work for compaction in the
notify_up and response callbacks. We move them out from under spin
locks.
The server needs to clean up allocated segnos for a compaction request
that fails. We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.
Signed-off-by: Zach Brown <zab@versity.com>
We had gotten a bit sloppy with the workqueue flags. We needed _UNBOUND
in some workqueues where we wanted concurrency by scheduling across cpus
instead of waiting for the current (very long running) work on a cpu to
finish. We add NON_REENTRANT out of an abundance of caution. It has
gone away in modern kernels and is probably not needed here, but
according to the docs we would want it so we at least document that fact
by using it.
Signed-off-by: Zach Brown <zab@versity.com>
This extends the notify up and down calls to let the server keep track
of connected clients.
It adds the notion of per-connection info that is allocated for each
connection. It's passed to the notification callbacks so that callers
can have per-client storage without having to manage allocations in the
callbacks.
It adds the node_id argument to the notification callbacks to indicate
if the call is for the listening socket itself or an accepted client
connection on that listening socket.
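The shape of the per-connection info and the node_id argument can be modeled in user-space C; the struct layout, names, and the node_id convention shown in the comment are assumptions for illustration:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: the net layer allocates info_size bytes of
 * per-connection storage and hands it to the notify callback along
 * with a node_id, so callers get per-client state without managing
 * allocations themselves.  Here node_id 0 stands in for the listening
 * socket itself and nonzero for an accepted client connection. */
typedef void (*notify_t)(void *info, unsigned long node_id, int up);

struct model_conn {
	void *info;
	unsigned long node_id;
	notify_t notify;
};

static struct model_conn *model_accept(size_t info_size,
				       unsigned long node_id,
				       notify_t notify)
{
	struct model_conn *conn = calloc(1, sizeof(*conn));

	if (!conn)
		return NULL;

	/* callers never allocate this; the net layer owns it */
	conn->info = calloc(1, info_size);
	conn->node_id = node_id;
	conn->notify = notify;
	if (conn->notify)
		conn->notify(conn->info, conn->node_id, 1);
	return conn;
}
```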
Signed-off-by: Zach Brown <zab@versity.com>
The current sending interfaces only send a message to the peer of a
given connection. For the server to send to a specific connected client
it'd have to track connections itself and send to them.
This adds a sending interface that uses the node_id to send to a
specific connected client. The conn argument is the listening socket
and its accepted sockets are searched for the destination node_id.
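The lookup over accepted sockets can be sketched as a simple list search; the structs and names here are invented for the sketch, and 0/-1 stand in for success and the real error return:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: the listening connection keeps a list of its
 * accepted connections, and sending to a node_id searches that list
 * rather than making callers track connections themselves. */
struct accepted {
	unsigned long node_id;
	struct accepted *next;
};

struct listener {
	struct accepted *accepted;
};

/* Returns 0 if a connection for node_id was found (where the real
 * code would queue the message on it), or -1 if the client isn't
 * currently connected. */
static int send_to_node(struct listener *l, unsigned long node_id)
{
	struct accepted *acc;

	for (acc = l->accepted; acc; acc = acc->next) {
		if (acc->node_id == node_id)
			return 0;
	}

	return -1;
}
```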
Signed-off-by: Zach Brown <zab@versity.com>
Today node_ids are randomly assigned. This risks failure in random
number generation and still allows collisions.
Switch to assigning strictly advancing node_ids on the server during the
initial connection greeting message exchange. This simplifies the
system and allows us to derive information from the relative values of
node_ids in the system.
To do this we refactor the greeting code from internal to the net layer
to proper client and server request and response processing. This lets
the server manage persistent node_id storage and allows the client to
wait for a node_id during mount.
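The assignment itself is just a strictly advancing counter handed out during the greeting; a minimal sketch, with made-up names and ignoring the persistence the real server does:

```c
#include <assert.h>

/* Hypothetical model of assigning strictly advancing node_ids during
 * the greeting exchange.  The real server persists the next value so
 * ids keep advancing across restarts; here it's just a counter. */
struct id_server {
	unsigned long next_node_id;
};

static unsigned long assign_node_id(struct id_server *srv)
{
	return srv->next_node_id++;
}
```

Because ids strictly advance, a greater node_id always means a later greeting, which is the kind of relative ordering information the system can now derive.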
Now that net_connect is sync in the client we don't need the notify_up
callback anymore. The client can perform those duties when the connect
returns.
The net code still has to snoop on request and response processing to
see when the greetings have been exchanged and allow messages to flow.
Signed-off-by: Zach Brown <zab@versity.com>
The free_blocks counter in the super is meant to track the number of
total blocks in the primary free extent index. Callers of extent
manipulation were trying to keep it in sync with the extents.
Segment allocation was allocating extents manually using a cursor and
forgot to update free_blocks. Segment freeing then freed the segment as
an extent, which did update free_blocks. The counter accumulated free
blocks over time, eventually exceeding the total block count and
causing df to report negative usage.
This updates the free_blocks count in server extent io which is the only
place we update the extent items themselves. This ensures that we'll
keep the count in sync with the extent items. Callers don't have to
worry about it.
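The invariant is simply that the one path which modifies extent items also adjusts the counter; a trivial user-space sketch with invented names:

```c
#include <assert.h>

/* Hypothetical model: the server extent io path is the single place
 * that adjusts the super's free_blocks, so the counter always matches
 * the extent items and callers never touch it. */
struct model_super {
	unsigned long long free_blocks;
};

static void extent_io_insert(struct model_super *super,
			     unsigned long long len)
{
	super->free_blocks += len;
}

static void extent_io_remove(struct model_super *super,
			     unsigned long long len)
{
	super->free_blocks -= len;
}
```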
Signed-off-by: Zach Brown <zab@versity.com>
The previous commit added shared networking code and disabled the old
unused code. This removes all that unused client and server code that
was refactored to become the shared networking code.
Signed-off-by: Zach Brown <zab@versity.com>
The client and server networking code was a bit too rudimentary.
The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to. We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.
This refactors sending and receiving in both the client and server code
into shared networking code. It's built around a connection struct that
then holds the message state. Both peers on the connection can send
requests and send responses.
The existing code only retransmitted requests down newly established
connections. Requests could be processed twice.
This adds robust reliability guarantees. Requests are resent until
their response is received. Requests are only processed once by a given
peer, regardless of the connection's transport socket. Responses are
reliably resent until acknowledged.
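The process-once guarantee can be modeled with increasing request ids; this is a simplified sketch (the real code tracks more state per peer), and all names are made up:

```c
#include <assert.h>

/* Hypothetical model: request ids are assigned in increasing order and
 * resent until acknowledged, so a peer can drop any id at or below the
 * greatest id it has already processed, regardless of which transport
 * socket carried the retransmission. */
struct peer_recv {
	unsigned long greatest_processed_id;
};

/* Returns 1 if the request should be processed, 0 if it's a duplicate
 * retransmission that was already processed. */
static int should_process(struct peer_recv *peer, unsigned long id)
{
	if (id <= peer->greatest_processed_id)
		return 0;
	peer->greatest_processed_id = id;
	return 1;
}
```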
This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal. A following commit will remove all
the unused code.
Signed-off-by: Zach Brown <zab@versity.com>
Freed file data extents are tracked in free extent items in each node.
They could only be re-used in the future for file data extent allocation
on that node. Allocations on other nodes or, critically, segment
allocation on the server could never see those free extents. With the
right allocation patterns, particularly allocating on node X and freeing
on node Y, all the free extents can build up on a node and starve other
allocations.
This adds a simple high water mark after which nodes start returning
free extents to the server. From there they can satisfy segment
allocations or be sent to other nodes for file data extent allocation.
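The high water mark policy reduces to a simple threshold check; the threshold value and function name here are invented for illustration:

```c
#include <assert.h>

/* Hypothetical model of the high water mark: once a node's count of
 * free extent blocks exceeds the mark, the excess is returned to the
 * server where it can satisfy segment allocations or other nodes.
 * The threshold value is made up. */
#define FREE_HIGH_WATER 1024

/* Returns how many free blocks the node should send back to the
 * server after a free pushes it past the mark. */
static unsigned long blocks_to_return(unsigned long node_free_blocks)
{
	if (node_free_blocks <= FREE_HIGH_WATER)
		return 0;
	return node_free_blocks - FREE_HIGH_WATER;
}
```

Keeping free extents below the mark on each node bounds how much free space any one node can strand, which is exactly the starvation the commit describes.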
Signed-off-by: Zach Brown <zab@versity.com>
The code that works with the super block had drifted a bit. We still
had two super blocks from an old design and we weren't doing anything
with the crc.
Move to only using one super block at a fixed blkno and store and verify
its crc field by sharing code with the btree block checksumming.
Signed-off-by: Zach Brown <zab@versity.com>
The server send_reply interface is confusing. It uses errors to shut
down the connection, but a client hitting ENOSPC needs to be
communicated through the message reply payload instead.
The segno allocation server processing needs to set the segno to 0 so
that the client gets it and translates that into -ENOSPC.
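The client-side translation is a one-line check; a sketch with invented names, using segno 0 as the in-payload "no space" signal the commit describes:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model: rather than shutting down the connection on
 * error, the server replies with segno 0 and the client translates
 * that into -ENOSPC for its caller. */
static int client_alloc_segno(unsigned long long reply_segno,
			      unsigned long long *segno)
{
	if (reply_segno == 0)
		return -ENOSPC;

	*segno = reply_segno;
	return 0;
}
```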
Signed-off-by: Zach Brown <zab@versity.com>