scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-05-02 02:45:43 +00:00

Author	SHA1	Message	Date
Zach Brown	ab7bde9e2c	scoutfs: replace node_id with rid in networking Use the client's rid in networking instead of the node_id. The node_id no longer has to be allocated by the server and sent in the greeting. Instead the client sends it to the server in its greeting. The server then uses the client's announced rid just like it used to use the its node_id. It's used to record clients in the btree and to identify clients in sending and receive processing. The use of the rid in networking calls makes its way to locking and compaction which now use the rid to identify clients intead of the node_id. Signed-off-by: Zach Brown <zab@versity.com>	2019-08-20 15:52:13 -07:00
Zach Brown	5b258cee3b	scoutfs: refine quorum voting The current quorum voting implementatoin had some rough edges that increased the complexity of the system and introduced undesirable failure modes. We can keep the same basic pattern but move functionality around a few places, and rethink the quorum voting, to end up with a meaningfully simpler system. The motivation for this work was to remove the need to provide a uniq_name option for every mount instance. The first big change is to remove the idea of static configuration slots for mounts. This removes the use of uniq_name. Mounts now simply have a server_addr mount option instead of using their uniq_name to find their address in the configuration. The server can't check the configuration to see if a given connected client's name is found in the quorum config. Clients can set a flag in their sent greeting which indicates that they're a voter. This removes the uniq_name from the greeting and mounted client records. Without a static configuration mounts no longer have dedicated block locations to write to. We increase the size of the region of quorum blocks and have voters simply write to a random block. Overwriting vote blocks is OK because we move from heartbeating design patterns to a protocol strongly based on raft's election. We're using quorum blocks to communicate votes instead of network messages and overwriting blocks is analagous to lossy networks droping vote messages in the raft election protocol. We were using the dedicated per-mount quorum blocks to track mounts that had been elected and needed to be fenced. We no longer have that storage so instead we add the idea of an election log that is stored in every voting block. Readers merge the logs from all the blocks they read and write the resulting merged log in their block. With no static quorum configuration we no longer have to worry about the complexity of changing the slot configurations while they're in use. The only persistent configuration is the number of votes a candidate needs to be elected by a quorum. It was a mistake to use quorum voting blocks to communicate state between the server and the quorum voters. We can easily move the unmount_barrier, server address, and fencing state from the quorum blocks into the super block. The server no longer needs the quorum election info struct to be able to later write its quorum block. It instead writes a few fields in the super. There's only one place where clients need to look to find out who they should connect to or if they can finish unmount. Signed-off-by: Zach Brown <zab@versity.com>	2019-08-20 15:52:13 -07:00
Zach Brown	36b0df336b	scoutfs: add unmount barrier Now that a mount's client is responsible for electing and starting a server we need to be careful about coordinating unmount. We can't let unmounting clients leave the remaining mounted clients without quorum. The server carefully tracks who is mounted and who is unmounting while it is processing farewell requests. It only sends responses to voting mounts while quorum remains or once all the voting clients are all trying to unmount. We use a field in the quorum blocks to communicate to the final set of unmounting voters that their farewells have been processed and that they can finish unmounting without trying to restablish quorum. The commit introduces and maintains the unmount_barrier field in the quorum blocks. It is passed to the server from the election, the server sends it to the client and writes new versions, and the client compares what it received with what it sees in quorum blocks. The commit then has the clients send their unique name to the server who stores it in persistent mounted client records and compares the names to the quorum config when deciding which farewell reqeusts can be responded to. Now that farewell response processing can block for a very long time it is moved off into async work so that it doesn't prevent net connections from being shutdown and re-established. This also makes it easier to make global decisions based on the count of pending farewell requests. Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	a546bd0aab	scoutfs: check for newlines in msg.h wrappers The message formatter adds a newline so callers don't have to. But sometimes they do and we get double newlines. Add a build check that the format string doesn't end in a newline so that we stop adding these. And fix up all the current offenders. Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	ec0fb5380a	scoutfs: implement lock recovery When a server crashes all the connected clients still have operational locks and can be using them to protect IO. As a new server starts up its lock service needs to account for those outstanding locks before granting new locks to clients. This implements lock recovery by having the lock service recover locks from clients as it starts up. First the lock service stores records of connected clients in a btree off the super block. Records are added as the server receives their greeting and are removed as the server receives their farewell. Then the server checks for existing persistent records as it starts up. If it finds any it enters recovery and waits for all the old clients to reconnect before resuming normal processing. We add lock recover request and response messages that are used to communicate locks from the clients to the server. Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	74366f0df1	scoutfs: make networking more reliable The current networking code has loose reliability guarantees. If a connection between the client and server is broken then the client reconnects as though its an entirely new connection. The client resends requests but no responses are resent. A client's requests could be processed twice on the same server. The server throws away disconnected client state. This was fine, sort of, for the simple requests we had implemented so far. It's not good enough for the locking service which would prefer to let networking worry about reliable message delivery so it doesn't have to track and replay partial state across reconnection between the same client and server. This adds the infrastructure to ensure that requests and responses between a given client and server will be delivered across reconnected sockets and will only be processed once. The server keeps track of disconnected clients and restores state if the same client reconnects. This required some work around the greetings so that clients and servers can recognize each other. Now that the server remembers disconnected clients we add a farewell request so that servers can forget about clients that are shutting down and won't be reconnecting. Now that connections between the client and server are preserved we can resend responses across reconnection. We add outgoing message sequence numbers which are used to drop duplicates and communicate the received sequence back to the sender to free responses once they're received. When the client is reconnecting to a new server it resets its receive state that was dependent on the old server and it drops responses which were being sent to a server instance which no longer exists. This stronger reliable messaging guarantee will make it much easier to implement lock recovery which can now rewind state relative to requests that are in flight and replay existing state on a new server instance. Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	675275fbf1	scoutfs: use hdr.fsid in greeting instead of id The network greeting exchange was mistakenly using the global super block magic number instead of the per-volume fsid to identify the volumes that the endpoints are working with. This prevented the check from doing its only job: to fail when clients in one volume try to connect to a server in another. Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	288d781645	scoutfs: start and stop server with quorum Currently all mounts try to get a dlm lock which gives them exclusive access to become the server for the filesystem. That isn't going to work if we're moving to locking provided by the server. This uses quorum election to determine who should run the server. We switch from long running server work blocked trying to get a lock to calls which start and stop the server. Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	08a140c8b0	scoutfs: use our locking service Convert client locking to call the server's lock service instead of using a fs/dlm lockspace. The client code gets some shims to send and receive lock messages to and from the server. Callers use our lock mode constants instead of the DLM's. Locks are now identified by their starting key instead of an additional scoped lock name so that we don't have more mapping structures to track. The global rename lock uses keys that are defined by the format as only used for locking. The biggest change is in the client lock state machine. Instead of calling the dlm and getting callbacks we send messages to our server and get called from incoming message processing. We don't have everything come through a per-lock work queue. Instead we send requests either from the blocking lock caller or from a shrink work queue. Incoming messages are called in the net layer's blocking work contexts so we don't need to do any more work to defer to other contexts. The different processing contexts leads to a slightly different lock life cycle. We refactor and seperate allocation and freeing from tracking and removing locks in data structures. We add a _get and _put to track active use of locks and then async references to locks by holders and requests are tracked seperately. Our lock service's rules are a bit simpler in that we'll only ever send one request at a time and the server will only ever send one request at a time. We do have to do a bit of work to make sure we process back to back grant reponses and invalidation requests from the server. As of this change the lock setup and destruction paths are a little wobbly. They'll be shored up as we add lock recovery between the client and server. Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	2cc990406a	scoutfs: compact using net requests Currently compaction is only performed by one thread running in the server. Total metadata throughput of the system is limited by only having one compaction operation in flight at a time. This refactors the compaction code to have the server send compaction requests to clients who then perform the compaction and send responses to the server. This spreads compaction load out amongst all the clients and greatly increases total compaction throughput. The manifest keeps track of compactions that are in flight at a given level so that we maintain segment count invariants with multiple compactions in flight. It also uses the sparse bitmap to lock down segments that are being used as inputs to avoid duplicating items across two concurrent compactions. A server thread still coordinates which segments are compacted. The search for a candidate compaction operation is largely unchanged. It now has to deal with being unable to process a compaction because its segments are busy. We add some logic to keep searching in a level until we find a compaction that doesn't intersect with current compaction requests. If there are none at the level we move up to the next level. The server will only issue a given number of compaction requests to a client at a time. When it needs to send a compaction request it rotates through the current clients until it finds one that doesn't have the max in flight. If a client disconnects the server forgets the compactions it had sent to that client. If those compactions still need to be processed they'll be sent to the next client. The segnos that are allocated for compaction are not reclaimed if a client disconnects or the server crashes. This is a known deficiency that will be addressed with the broader work to add crash recovery to the multiple points in the protocol where the server and client trade ownership of persistent state. The server needs to block as it does work for compaction in the notify_up and response callbacks. We move them out from under spin locks. The server needs to clean up allocated segnos for a compaction request that fails. We let the client send a data payload along with an error response so that it can give the server the id of the compaction that failed. Signed-off-by: Zach Brown <zab@versity.com>	2018-08-28 15:34:30 -07:00
Zach Brown	07eec357ee	scoutfs: simplify reliable request delivery It was a bit of an overreach to try and limit duplicate request processing in the network layer. It introduced acks and the necessity to resync last_processed_id on reconnect. In testing compaction requests we saw that request processing stopped if a client reconnected to a new server. The new server sent low request ids which the client dropped because they were lower than the ids it got from the last server. To fix this we'd need to add smarts to reset ids when connecting to new servers but not existing servers. In thinking about this, though, there's a bigger problem. Duplicate request processing protection only works up in memory in the networking connections. If the server makes persistent changes, then crashes, the client will resend the request to the new server. It will need to discover that the persistent changes have already been made. So while we protected duplicate network request processing between nodes that reconnected, we didn't protect duplicate persistent side-effects of request processing when reconnecting to a new server. Once you see that the request implementations have to take this into account then duplicate request delivery becomes a simpler instance of this same case and will be taken care of already. There's no need to implement the complexity of protecting duplicate delivery between running nodes. This removes the last_processed_id on the server. It removes resending of responses and acks. Now that ids can be processed out of order we remove the special known ID of greeting commands. They can be processed as usual. When there's only request and response packets we can differentiate them with a flag instead of a u8 message type. Signed-off-by: Zach Brown <zab@versity.com>	2018-08-28 15:34:30 -07:00
Zach Brown	0adbd7e439	scoutfs: have server track connected clients This extends the notify up and down calls to let the server keep track of connected clients. It adds the notion of per-connection info that is allocated for each connection. It's passed to the notification callbacks so that callers can have per-client storage without having to manage allocations in the callbacks. It adds the node_id argument to the notification callbacks to indicate if the call is for the listening socket itself or an accepted client connection on that listening socket. Signed-off-by: Zach Brown <zab@versity.com>	2018-08-28 15:34:30 -07:00
Zach Brown	8b3193ea72	scoutfs: server allocates node_id Today node_ids are randomly assigned. This adds the risk of failure from random number generation and still allows for the risk of collisions. Switch to assigning strictly advancing node_ids on the server during the initial connection greeting message exchange. This simplifies the system and allows us to derive information from the relative values of node_ids in the system. To do this we refactor the greeting code from internal to the net layer to proper client and server request and response processing. This lets the server manage persistent node_id storage and allows the client to wait for a node_id during mount. Now that net_connect is sync in the client we don't need the notify_up callback anymore. The client can perform those duties when the connect returns. The net code still has to snoop on request and response processing to see when the greetings have been exchange and allow messages to flow. Signed-off-by: Zach Brown <zab@versity.com>	2018-08-28 15:34:30 -07:00
Zach Brown	d708421cfb	scoutfs: remove unused client and server code The previous commit added shared networking code and disabled the old unused code. This removes all that unused client and server code that was refactored to become the shared networking code. Signed-off-by: Zach Brown <zab@versity.com>	2018-07-27 09:50:21 -07:00
Zach Brown	17dec65a52	scoutfs: add bidirectional network messages The client and server networking code was a bit too rudimentary. The existing code only had support for the client synchronously and actively sending requests that the server could only passively respond to. We're going to need the server to be able to send requests to connected clients and it can't block waiting for responses from each one. This refactors sending and receiving in both the client and server code into shared networking code. It's built around a connection struct that then holds the message state. Both peers on the connection can send requests and send responses. The existing code only retransmitted requests down newly established connections. Requests could be processed twice. This adds robust reliability guarantees. Requests are resend until their response is received. Requests are only processed once by a given peer, regardless of the connection's transport socket. Responses are reiably resent until acknowledged. This only adds the new refactored code and disables the old unused code to keep the diff foot print minmal. A following commit will remove all the unused code. Signed-off-by: Zach Brown <zab@versity.com>	2018-07-27 09:50:21 -07:00
Zach Brown	295bf6b73b	scoutfs: return free extents to server Freed file data extents are tracked in free extent items in each node. They could only be re-used in the future for file data extent allocation on that node. Allocations on other nodes or, critically, segment allocation on the server could never see those free extents. With the right allocation patterns, particularly allocating on node X and freeing on node Y, all the free extents can build up on a node and starve other allocations. This adds a simple high water mark after which nodes start returning free extents to the server. From there they can satisfy segment allocations or be sent to other nodes for file data extent allocation. Signed-off-by: Zach Brown <zab@versity.com>	2018-07-05 16:19:31 -07:00
Zach Brown	e19716a0f2	scoutfs: clean up super block use The code that works with the super block had drifted a bit. We still had two from an old design and we weren't doing anything with its crc. Move to only using one super block at a fixed blkno and store and verify its crc field by sharing code with the btree block checksumming. Signed-off-by: Zach Brown <zab@versity.com>	2018-06-29 15:56:42 -07:00
Zach Brown	2efba47b77	scoutfs: satisfy large allocs with smaller extents The previous fallocate and get_block allocators only looked for free extents larger than the requested allocation size. This prematurely returns -ENOSPC if a very large allocation is attempted. Some xfstests stress low free space situations by fallocating almost all the free space in the volume. This adds an allocation helper function that finds the biggest free extent to satisfy an allocation, psosibly after trying to get more free extents from the server. It looks for previous extents in the index of extents by length. This builds on the previously added item and extent _prev operations. Allocators need to then know the size of the allocation they got instead of assuming they got what they asked for. The server can also return a smaller extent so it needs to communicate the extent length, not just its start. Signed-off-by: Zach Brown <zab@versity.com>	2018-06-29 14:42:06 -07:00
Zach Brown	1b3645db8b	scoutfs: remove dead server allocator code Remove the bitmap segno allocator code that the server used to use to manage allocations. Signed-off-by: Zach Brown <zab@versity.com>	2018-06-29 14:42:06 -07:00
Zach Brown	c01a715852	scoutfs: use extents in the server allocator Have the server use the extent core to maintain free extent items in the allocation btree instead of the bitmap items. We add a client request to allocate an extent of a given length. The existing segment alloc and free now work with a segment's worth of blocks. The server maintains counters in the super block of free blocks instead of free segments. We maintain an allocation cursor so that allocation results tend to cycle through the device. It's stored in the super so that it is maintained across server instances. This doesn't remove unused dead code to keep the commit from getting too noisy. It'll be removed in a future commit. Signed-off-by: Zach Brown <zab@versity.com>	2018-06-29 14:42:06 -07:00
Zach Brown	ac259c82a0	scoutfs: allow interrupting client sends Waiting for replies to sent requests wasn't interruptible. This was preventing ctl-c from breaking out of mount when a server wasn't yet around to accept connections. The only complication was that the receive thread was accessing the sender's struct outside of the lock. An interrupted sender could remove their struct while receive was processing it. We rework recv processing so that it only uses the sender struct under the lock. This introduces a cpu copy of the payload but they're small and relatively infrequent control messages. Signed-off-by: Zach Brown <zab@versity.com>	2018-04-13 15:49:14 -07:00
Zach Brown	9148f24aa2	scoutfs: use single small key struct Variable length keys lead to having a key struct point to the buffer that contains the key. With dirents and xattrs now using small keys we can convert everyone to using a single key struct and significantly simplify the system. We no longer have a seperate generic key buf struct that points to specific per-type key storage. All items use the key struct and fill out the appropriate fields. All the code that paired a generic key buf struct and a specific key type struct is collapsed down to a key struct. There's no longer the difference between a key buf that shares a read-only key, has it's own precise allocation, or has a max size allocation for incrementing and decrementing. Each key user now has an init function fills out its fields. It looks a lot like the old pattern but we no longer have seperate key storage that the buf points to. A bunch of code now takes the address of static key storage instead of managing allocated keys. Conversely, swapping now uses the full keys instead of pointers to the keys. We don't need all the functions that worked on the generic key buf struct because they had different lengths. Copy, clone, length init, memcpy, all of that goes away. The item API had some functions that tested the length of keys and values. The key length tests vanish, and that gets rid of the _same() call. The _same_min() call only had one user who didn't also test for the value length being too large. Let's leave caller key constraints in callers instead of trying to hide them on the other side of a bunch of item calls. We no longer have to track the number of key bytes when calculating if an item population will fit in segments. This removes the key length from reservations, transactions, and segment writing. The item cache key querying ioctls no longer have to deal with variable length keys. The simply specify the start key, the ioctls return the number of keys copied instead of bytes, and the caller is responsible for incrementing the next search key. The segment no longer has to store the key length. It stores the key struct in the item header. The fancy variable length key formatting and printing can be removed. We have a single format for the universal key struct. The SK_ wrappers that bracked calls to use preempt safe per cpu buffers can turn back into their normal calls. Manifest entries are now a fixed size. We can simply split them between btree keys and values and initialize them instead of allocating them. This means that level 0 entries don't have their own format that sorts by the seq. They're sorted by the key like all the other levels. Compaction needs to sweep all of them looking for the oldest and read can stop sweeping once it can no longer overlap. This makes rare compaction more expensive and common reading less expensive, which is the right tradeoff. Signed-off-by: Zach Brown <zab@versity.com>	2018-04-04 09:15:27 -05:00
Zach Brown	4ff1e3020f	scoutfs: allocate inode numbers per directory Having an inode number allocation pool in the super block meant that all allocations across the mount are interleaved. This means that concurrent file creation in different directories will create overlapping inode numbers. This leads to lock contention as reasonable work loads will tend to distribute work by directories. The easy fix is to have per-directory inode number allocation pools. We take the opportunity to clean up the network request so that the caller gets the allocation instead of having it be fed back in via a weird callback. Signed-off-by: Zach Brown <zab@versity.com>	2018-02-09 17:58:19 -08:00
Zach Brown	cb879d9f37	scoutfs: add network greeting message Add a network greeting message that's exchanged between the client and server on every connection to make sure that we have the correct file system and format hash. Signed-off-by: Zach Brown <zab@versity.com>	2017-10-12 13:57:31 -07:00
Zach Brown	ca78757ca5	scoutfs: more careful client connect timeouts The client connection loop was a bit of a mess. It only slept between retries in one particular case. Other failures to connect would spin and livelock. It would spin forever. This fixed loop now has a much more orderly reconnect procedure. Each connecting sender always tries once. Then retry attempts backoff exponentially, settling at a nice long timeout. After long enough it'll return errors. This fixes livelocks in the xfstests that mount and unmount around dm-flakey config. generic/{034,039,040} would easily livelock before this fix. Signed-off-by: Zach Brown <zab@versity.com>	2017-08-30 10:38:00 -07:00
Zach Brown	87ab27beb1	scoutfs: add statfs network message The ->statfs method was still using the super_block in the super_info that was read during mount. This will get progressively more out of date. We add a network message to ask the server for the current fields that impact statfs. This is always racy and the fields are mostly nonsense, but we try our best. Signed-off-by: Zach Brown <zab@versity.com>	2017-08-11 10:43:35 -07:00
Zach Brown	c1b2ad9421	scoutfs: separate client and server net processing The networking code was really suffering by trying to combine the client and server processing paths into one file. The code can be a lot simpler by giving the client and server their own processing paths that take their different socket lifecysles into account. The client maintains a single connection. Blocked senders work on the socket under a sending mutex. The recv path runs in work that can be canceled after first shutting down the socket. A long running server work function acquires the listener lock, manages the listening socket, and accepts new sockets. Each accepted socket has a single recv work blocked waiting for requests. That then spawns concurrent processing work which sends replies under a sending mutex. All of this is torn down by shutting down sockets and canceling work which frees its context. All this restructuring makes it a lot easier to track what is happening in mount and unmount between the client and server. This fixes bugs where unmount was failing because the monolithic socket shutdown function was queueing other work while running while draining. Signed-off-by: Zach Brown <zab@versity.com>	2017-08-04 10:47:42 -07:00

27 Commits