scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-05-01 10:25:43 +00:00

Author	SHA1	Message	Date
Zach Brown	44f38a31ec	Make server commit access private again There was a brief time where we exported the ability to hold and apply commits outside of the main server code. That wasn't a great idea, and the few users have since been reworked to not require directly manipulating server transactions, so we can reduce risk and make these functions private again. Signed-off-by: Zach Brown <zab@versity.com>	2022-04-01 15:21:43 -07:00
Zach Brown	bb3db7e272	Send quorum heartbeats while fencing Quorum members will try to elect a new leader when they don't receive heartbeats from the currently elected leader. This timeout is short to encourage restoring service promptly. Heartbeats are sent from the quorum worker thread and are delayed while it synchronously starts up the server, which includes fencing previous servers. If fence requests take too long then heartbeats will be delayed long enough for remaining quorum members to elect a new leader while the recently elected server is still busy fencing. To fix this we decouple server startup from the quorum main thread. Server starting and stopping becomes asynchronous so the quorum thread is able to send heartbeats while the server work is off starting up and fencing. The server used to call into quorum to clear a flag as it exited. We remove that mechanism and have the server maintain a running status that quorum can query. We add some state to the quorum work to track the asynchronous state of the server. This lets the quorum protocol change roles immediately as needed while remembering that there is a server running that needs to be acted on. The server used to also call into quorum to update quorum blocks. This is a read-modify-write operation that has to be serialized. Now that we have both the server starting up and the quorum work running they both can't perform these read-modify-write cycles. Instead we have the quorum work own all the block updates and it queries the server status to determine when it should update the quorum block to indicate that the server has fenced or shut down. Signed-off-by: Zach Brown <zab@versity.com>	2022-03-31 10:29:43 -07:00
Zach Brown	29cfa81574	Remove unused leftovers from quorum changes These forward declarations were for interfaces that have since been removed or changed and are no longer needed. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	73bf916182	Return ENOSPC as space gets low Returning ENOSPC is challenging because we have clients working on allocators which are a fraction of the whole and we use COW transactions so we need to be able to allocate to free. This adds support for returning ENOSPC to client posix allocators as free space gets low. For metadata, we reserve a number of free blocks for making progress with client and server transactions which can free space. The server sets the low flag in a client's allocator if we start to dip into reserved blocks. In the client we add an argument to entering a transaction which indicates if we're allocating new space (as opposed to just modifying existing data or freeing). When an allocating transaction runs low and the server low flag is set then we return ENOSPC. Adding an argument to transaction holders and having it return ENOSPC gave us the opportunity to clean it up and make it a little clearer. More work is done outside the wait_event function and it now specifically waits for a transaction to cycle when it forces a commit rather than spinning until the transaction worker acquires the lock and stops it. For data the same pattern applies except there are no reserved blocks and we don't COW data so it's a simple case of returning the hard ENOSPC when the data allocator flag is set. The server needs to consider the reserved count when refilling the client's meta_avail allocator and when swapping between the two meta_avail and meta_free allocators. We add the reserved metadata block count to statfs_more so that df can subtract it from the free meta blocks and make it clear when enospc is going to be returned for metadata allocations. We increase the minimum device size in mkfs so that small testing devices provide sufficient reserved blocks. And finally we add a little test that makes sure we can fill both metadata and data to ENOSPC and then recover by deleting what we filled.
Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	9051ceb6fc	Add core seq to the super block Add a new seq field to the super block which will be the source of all incremented seqs throughout the system. We give out incremented seqs to callers with an atomic64_t in memory which is synced back to the super block as we commit transactions in the server. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-01 13:33:30 -07:00
Zach Brown	bad1c602f9	server hold_commit returns void When we moved to the current allocator we fixed up the server commit path to initialize the pair of allocators as a commit is finished rather than before it starts. This removed all the error cases from hold_commit. Remove the error handling from hold_commit calls to make the system just a bit simpler. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-01 13:32:26 -07:00
Zach Brown	22371fe5bd	Fully destroy inodes after all mounts evict Today an inode's items are deleted once its nlink reaches zero and the final iput is called in a local mount. This can delete inodes from under other mounts which have opened the inode before it was unlinked on another mount. We fix this by adding cached inode tracking. Each mount maintains groups of cached inode bitmaps at the same granularity as inode locking. As a mount performs its final iput it gets a bitmap from the server which indicates if any other mount has inodes in the group open. This makes the two fast paths of opening and closing linked files and of deleting a file that was unlinked locally only pay a moderate cost of either maintaining the bitmap locally or only getting the open map once per lock group. Removing many files in a group will only lock and get the open map once per group. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-21 12:17:33 -07:00
Zach Brown	592f472a1c	Use recov in server to recover client greetings The server starts recovery when it finds mounted client items as it starts up. The clients are done recovering once they send their greeting. If they don't recover in time then they'll be fenced. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 12:10:35 -07:00
Andy Grover	6406f05350	cleanup: Remove struct net_lock_grant_response We're not using the roots member of this struct, so we can just use struct scoutfs_net_lock directly. Signed-off-by: Andy Grover <agrover@versity.com>	2021-04-07 10:13:56 -07:00
Zach Brown	8e34c5d66a	Use quorum slots and background election work Previously quorum configuration specified the number of votes needed to elect the leader. This was an excessive amount of freedom in the configuration of the cluster which created all sorts of problems which had to be designed around. Most acutely, though, it required a probabilistic mechanism for mounts to persistently record that they're starting a server so that future servers could find and possibly fence them. They would write to a lot of quorum blocks and trust that it was unlikely that future servers would overwrite all of their written blocks. Overwriting was always possible, which would be bad enough, but it also required so much IO that we had to use long election timeouts to avoid spurious fencing. These longer timeouts had already gone wrong on some storage configurations, leading to hung mounts. To fix this and other problems we see coming, like live membership changes, we now specifically configure the number and identity of mounts which will be participating in quorum voting. With specific identities, mounts now have a corresponding specific block they can write to and which future servers can read from to see if they're still running. We change the quorum config in the super block from a single quorum_count to an array of quorum slots which specify the address of the mount that is assigned to that slot. The mount argument to specify a quorum voter changes from "server_addr=$addr" to "quorum_slot_nr=$nr" which specifies the mount's slot. The slot's address is used for udp election messages and tcp server connections. Now that we specifically have configured unique IP addresses for all the quorum members, we can use UDP messages to send and receive the vote messages in the raft protocol to elect a leader.
The quorum code doesn't have to read and write disk block votes and is a more reasonable core loop that either waits for received network messages or timeouts to advance the raft election state machine. The quorum blocks are now used for slots to store their persistent raft term and to set their leader state. We have event fields in the block to record the timestamp of the most recent interesting events that happened to the slot. Now that raft doesn't use IO, we can leave the quorum election work running in the background. The raft work in the quorum members is always running so we can use a much more typical raft implementation with heartbeats. Critically, this decouples the client and election life cycles. Quorum is always running and is responsible for starting and stopping the server. The client repeatedly tries to connect to a server, it has nothing to do with deciding to participate in quorum. Finally, we add a quorum/status sysfs file which shows the state of the quorum raft protocol in a member mount and has the last messages that were sent to or received from the other members. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-18 12:57:30 -08:00
Zach Brown	cca83b1758	scoutfs: rework get_fs_roots to get_roots The get_fs_roots rpc and server interfaces were built around individual roots. Rebuild it around passing around a struct so that we can add roots without impacting all the current users. Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:12 -07:00
Zach Brown	b7943c5412	scoutfs: avoid reading forest roots with block IO The forest item operations were reading the super block to find the roots that it should read items from. This was easiest to implement to start, but it is too expensive. We have to find the roots for every newly acquired lock and every call to walk the inode seq indexes. To avoid all these reads we first send the current stable versions of the fs and logs btrees roots along with root grants. Then we add a net command to get the current stable roots from the server. This is used to refresh the roots if stale blocks are encountered and on the seq index queries. Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:12 -07:00
Zach Brown	ff9386faba	scoutfs: export server commit holds The calls for holding and applying commits in the server are currently private. The lock server is a server component that has been separated out into its own file. Most of the time the server calls it during commits so the btree changes made in the lock server are protected by the commits. But there are btree calls in the lock server that happen outside of calls from the server. Exporting these calls will let the lock server make all its btree changes in server commits. Signed-off-by: Zach Brown <zab@versity.com>	2020-06-18 14:07:43 -07:00
Zach Brown	edd8fe075c	scoutfs: remove lsm code Remove all the now unused code that deals with lsm: segment IO, the item cache, and the manifest. Signed-off-by: Zach Brown <zab@versity.com>	2020-01-17 11:21:36 -08:00
Zach Brown	ab7bde9e2c	scoutfs: replace node_id with rid in networking Use the client's rid in networking instead of the node_id. The node_id no longer has to be allocated by the server and sent in the greeting. Instead the client sends it to the server in its greeting. The server then uses the client's announced rid just like it used to use its node_id. It's used to record clients in the btree and to identify clients in sending and receive processing. The use of the rid in networking calls makes its way to locking and compaction which now use the rid to identify clients instead of the node_id. Signed-off-by: Zach Brown <zab@versity.com>	2019-08-20 15:52:13 -07:00
Zach Brown	5b258cee3b	scoutfs: refine quorum voting The current quorum voting implementation had some rough edges that increased the complexity of the system and introduced undesirable failure modes. We can keep the same basic pattern but move functionality around a few places, and rethink the quorum voting, to end up with a meaningfully simpler system. The motivation for this work was to remove the need to provide a uniq_name option for every mount instance. The first big change is to remove the idea of static configuration slots for mounts. This removes the use of uniq_name. Mounts now simply have a server_addr mount option instead of using their uniq_name to find their address in the configuration. The server can't check the configuration to see if a given connected client's name is found in the quorum config. Clients can set a flag in their sent greeting which indicates that they're a voter. This removes the uniq_name from the greeting and mounted client records. Without a static configuration mounts no longer have dedicated block locations to write to. We increase the size of the region of quorum blocks and have voters simply write to a random block. Overwriting vote blocks is OK because we move from heartbeating design patterns to a protocol strongly based on raft's election. We're using quorum blocks to communicate votes instead of network messages and overwriting blocks is analogous to lossy networks dropping vote messages in the raft election protocol. We were using the dedicated per-mount quorum blocks to track mounts that had been elected and needed to be fenced. We no longer have that storage so instead we add the idea of an election log that is stored in every voting block. Readers merge the logs from all the blocks they read and write the resulting merged log in their block. With no static quorum configuration we no longer have to worry about the complexity of changing the slot configurations while they're in use.
The only persistent configuration is the number of votes a candidate needs to be elected by a quorum. It was a mistake to use quorum voting blocks to communicate state between the server and the quorum voters. We can easily move the unmount_barrier, server address, and fencing state from the quorum blocks into the super block. The server no longer needs the quorum election info struct to be able to later write its quorum block. It instead writes a few fields in the super. There's only one place where clients need to look to find out who they should connect to or if they can finish unmount. Signed-off-by: Zach Brown <zab@versity.com>	2019-08-20 15:52:13 -07:00
Zach Brown	36b0df336b	scoutfs: add unmount barrier Now that a mount's client is responsible for electing and starting a server we need to be careful about coordinating unmount. We can't let unmounting clients leave the remaining mounted clients without quorum. The server carefully tracks who is mounted and who is unmounting while it is processing farewell requests. It only sends responses to voting mounts while quorum remains or once all the voting clients are trying to unmount. We use a field in the quorum blocks to communicate to the final set of unmounting voters that their farewells have been processed and that they can finish unmounting without trying to reestablish quorum. The commit introduces and maintains the unmount_barrier field in the quorum blocks. It is passed to the server from the election, the server sends it to the client and writes new versions, and the client compares what it received with what it sees in quorum blocks. The commit then has the clients send their unique name to the server who stores it in persistent mounted client records and compares the names to the quorum config when deciding which farewell requests can be responded to. Now that farewell response processing can block for a very long time it is moved off into async work so that it doesn't prevent net connections from being shutdown and re-established. This also makes it easier to make global decisions based on the count of pending farewell requests. Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	3d82dd3a46	scoutfs: fix bad octet in tracing ipv4 address The macro for producing trace args for an ipv4 address had a typo when shifting the third octet down before masking. Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	ec0fb5380a	scoutfs: implement lock recovery When a server crashes all the connected clients still have operational locks and can be using them to protect IO. As a new server starts up its lock service needs to account for those outstanding locks before granting new locks to clients. This implements lock recovery by having the lock service recover locks from clients as it starts up. First the lock service stores records of connected clients in a btree off the super block. Records are added as the server receives their greeting and are removed as the server receives their farewell. Then the server checks for existing persistent records as it starts up. If it finds any it enters recovery and waits for all the old clients to reconnect before resuming normal processing. We add lock recover request and response messages that are used to communicate locks from the clients to the server. Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	74366f0df1	scoutfs: make networking more reliable The current networking code has loose reliability guarantees. If a connection between the client and server is broken then the client reconnects as though it's an entirely new connection. The client resends requests but no responses are resent. A client's requests could be processed twice on the same server. The server throws away disconnected client state. This was fine, sort of, for the simple requests we had implemented so far. It's not good enough for the locking service which would prefer to let networking worry about reliable message delivery so it doesn't have to track and replay partial state across reconnection between the same client and server. This adds the infrastructure to ensure that requests and responses between a given client and server will be delivered across reconnected sockets and will only be processed once. The server keeps track of disconnected clients and restores state if the same client reconnects. This required some work around the greetings so that clients and servers can recognize each other. Now that the server remembers disconnected clients we add a farewell request so that servers can forget about clients that are shutting down and won't be reconnecting. Now that connections between the client and server are preserved we can resend responses across reconnection. We add outgoing message sequence numbers which are used to drop duplicates and communicate the received sequence back to the sender to free responses once they're received. When the client is reconnecting to a new server it resets its receive state that was dependent on the old server and it drops responses which were being sent to a server instance which no longer exists. This stronger reliable messaging guarantee will make it much easier to implement lock recovery which can now rewind state relative to requests that are in flight and replay existing state on a new server instance.
Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	288d781645	scoutfs: start and stop server with quorum Currently all mounts try to get a dlm lock which gives them exclusive access to become the server for the filesystem. That isn't going to work if we're moving to locking provided by the server. This uses quorum election to determine who should run the server. We switch from long running server work blocked trying to get a lock to calls which start and stop the server. Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	34b8950bca	scoutfs: initial lock server core Add the core lock server code for providing a lock service from our server. The lock messages are wired up but nothing calls them. Signed-off-by: Zach Brown <zab@versity.com>	2019-04-12 10:54:07 -07:00
Zach Brown	2cc990406a	scoutfs: compact using net requests Currently compaction is only performed by one thread running in the server. Total metadata throughput of the system is limited by only having one compaction operation in flight at a time. This refactors the compaction code to have the server send compaction requests to clients who then perform the compaction and send responses to the server. This spreads compaction load out amongst all the clients and greatly increases total compaction throughput. The manifest keeps track of compactions that are in flight at a given level so that we maintain segment count invariants with multiple compactions in flight. It also uses the sparse bitmap to lock down segments that are being used as inputs to avoid duplicating items across two concurrent compactions. A server thread still coordinates which segments are compacted. The search for a candidate compaction operation is largely unchanged. It now has to deal with being unable to process a compaction because its segments are busy. We add some logic to keep searching in a level until we find a compaction that doesn't intersect with current compaction requests. If there are none at the level we move up to the next level. The server will only issue a given number of compaction requests to a client at a time. When it needs to send a compaction request it rotates through the current clients until it finds one that doesn't have the max in flight. If a client disconnects the server forgets the compactions it had sent to that client. If those compactions still need to be processed they'll be sent to the next client. The segnos that are allocated for compaction are not reclaimed if a client disconnects or the server crashes. This is a known deficiency that will be addressed with the broader work to add crash recovery to the multiple points in the protocol where the server and client trade ownership of persistent state. 
The server needs to block as it does work for compaction in the notify_up and response callbacks. We move them out from under spin locks. The server needs to clean up allocated segnos for a compaction request that fails. We let the client send a data payload along with an error response so that it can give the server the id of the compaction that failed. Signed-off-by: Zach Brown <zab@versity.com>	2018-08-28 15:34:30 -07:00
Zach Brown	07eec357ee	scoutfs: simplify reliable request delivery It was a bit of an overreach to try and limit duplicate request processing in the network layer. It introduced acks and the necessity to resync last_processed_id on reconnect. In testing compaction requests we saw that request processing stopped if a client reconnected to a new server. The new server sent low request ids which the client dropped because they were lower than the ids it got from the last server. To fix this we'd need to add smarts to reset ids when connecting to new servers but not existing servers. In thinking about this, though, there's a bigger problem. Duplicate request processing protection only works up in memory in the networking connections. If the server makes persistent changes, then crashes, the client will resend the request to the new server. It will need to discover that the persistent changes have already been made. So while we protected duplicate network request processing between nodes that reconnected, we didn't protect duplicate persistent side-effects of request processing when reconnecting to a new server. Once you see that the request implementations have to take this into account then duplicate request delivery becomes a simpler instance of this same case and will be taken care of already. There's no need to implement the complexity of protecting duplicate delivery between running nodes. This removes the last_processed_id on the server. It removes resending of responses and acks. Now that ids can be processed out of order we remove the special known ID of greeting commands. They can be processed as usual. When there's only request and response packets we can differentiate them with a flag instead of a u8 message type. Signed-off-by: Zach Brown <zab@versity.com>	2018-08-28 15:34:30 -07:00
Zach Brown	bafa4a6720	scoutfs: add net header printk args We have macros for creating and printing trace arguments for our network header struct. Add a macro for making simple printk call args for normal formatted output callers. Signed-off-by: Zach Brown <zab@versity.com>	2018-07-27 09:50:21 -07:00
Zach Brown	17dec65a52	scoutfs: add bidirectional network messages The client and server networking code was a bit too rudimentary. The existing code only had support for the client synchronously and actively sending requests that the server could only passively respond to. We're going to need the server to be able to send requests to connected clients and it can't block waiting for responses from each one. This refactors sending and receiving in both the client and server code into shared networking code. It's built around a connection struct that then holds the message state. Both peers on the connection can send requests and send responses. The existing code only retransmitted requests down newly established connections. Requests could be processed twice. This adds robust reliability guarantees. Requests are resent until their response is received. Requests are only processed once by a given peer, regardless of the connection's transport socket. Responses are reliably resent until acknowledged. This only adds the new refactored code and disables the old unused code to keep the diff footprint minimal. A following commit will remove all the unused code. Signed-off-by: Zach Brown <zab@versity.com>	2018-07-27 09:50:21 -07:00
Zach Brown	5d9ad0923a	scoutfs: trace net structs The userspace trace event printing code has trouble with arguments that refer to fields in entries. Add macros to make entries for all the fields and use them as the formatted arguments. Signed-off-by: Zach Brown <zab@versity.com>	2018-06-29 14:42:06 -07:00
Zach Brown	c01a715852	scoutfs: use extents in the server allocator Have the server use the extent core to maintain free extent items in the allocation btree instead of the bitmap items. We add a client request to allocate an extent of a given length. The existing segment alloc and free now work with a segment's worth of blocks. The server maintains counters in the super block of free blocks instead of free segments. We maintain an allocation cursor so that allocation results tend to cycle through the device. It's stored in the super so that it is maintained across server instances. This doesn't remove unused dead code to keep the commit from getting too noisy. It'll be removed in a future commit. Signed-off-by: Zach Brown <zab@versity.com>	2018-06-29 14:42:06 -07:00
Zach Brown	9148f24aa2	scoutfs: use single small key struct Variable length keys lead to having a key struct point to the buffer that contains the key. With dirents and xattrs now using small keys we can convert everyone to using a single key struct and significantly simplify the system. We no longer have a separate generic key buf struct that points to specific per-type key storage. All items use the key struct and fill out the appropriate fields. All the code that paired a generic key buf struct and a specific key type struct is collapsed down to a key struct. There's no longer the difference between a key buf that shares a read-only key, has its own precise allocation, or has a max size allocation for incrementing and decrementing. Each key user now has an init function that fills out its fields. It looks a lot like the old pattern but we no longer have separate key storage that the buf points to. A bunch of code now takes the address of static key storage instead of managing allocated keys. Conversely, swapping now uses the full keys instead of pointers to the keys. We don't need all the functions that worked on the generic key buf struct because they had different lengths. Copy, clone, length init, memcpy, all of that goes away. The item API had some functions that tested the length of keys and values. The key length tests vanish, and that gets rid of the _same() call. The _same_min() call only had one user who didn't also test for the value length being too large. Let's leave caller key constraints in callers instead of trying to hide them on the other side of a bunch of item calls. We no longer have to track the number of key bytes when calculating if an item population will fit in segments. This removes the key length from reservations, transactions, and segment writing. The item cache key querying ioctls no longer have to deal with variable length keys.
They simply specify the start key, the ioctls return the number of keys copied instead of bytes, and the caller is responsible for incrementing the next search key. The segment no longer has to store the key length. It stores the key struct in the item header. The fancy variable length key formatting and printing can be removed. We have a single format for the universal key struct. The SK_ wrappers that bracketed calls to use preempt safe per cpu buffers can turn back into their normal calls. Manifest entries are now a fixed size. We can simply split them between btree keys and values and initialize them instead of allocating them. This means that level 0 entries don't have their own format that sorts by the seq. They're sorted by the key like all the other levels. Compaction needs to sweep all of them looking for the oldest and read can stop sweeping once it can no longer overlap. This makes rare compaction more expensive and common reading less expensive, which is the right tradeoff. Signed-off-by: Zach Brown <zab@versity.com>	2018-04-04 09:15:27 -05:00
Zach Brown	c1b2ad9421	scoutfs: separate client and server net processing The networking code was really suffering by trying to combine the client and server processing paths into one file. The code can be a lot simpler by giving the client and server their own processing paths that take their different socket lifecycles into account. The client maintains a single connection. Blocked senders work on the socket under a sending mutex. The recv path runs in work that can be canceled after first shutting down the socket. A long running server work function acquires the listener lock, manages the listening socket, and accepts new sockets. Each accepted socket has a single recv work blocked waiting for requests. That then spawns concurrent processing work which sends replies under a sending mutex. All of this is torn down by shutting down sockets and canceling work which frees its context. All this restructuring makes it a lot easier to track what is happening in mount and unmount between the client and server. This fixes bugs where unmount was failing because the monolithic socket shutdown function was queueing other work while running while draining. Signed-off-by: Zach Brown <zab@versity.com>	2017-08-04 10:47:42 -07:00

30 Commits