Commit Graph

688 Commits

Author SHA1 Message Date
Zach Brown
74366f0df1 scoutfs: make networking more reliable
The current networking code has loose reliability guarantees.  If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection.  The client resends
requests but no responses are resent.  A client's requests could be
processed twice on the same server.  The server throws away disconnected
client state.

This was fine, sort of, for the simple requests we had implemented so
far.  It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.

This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.

The server keeps track of disconnected clients and restores state if the
same client reconnects.  This required some work around the greetings so
that clients and servers can recognize each other.  Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.

Now that connections between the client and server are preserved we can
resend responses across reconnection.  We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.

When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.

This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.
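
A rough sketch of the sequence number idea (names here are illustrative,
not the actual scoutfs structures): each outgoing message is stamped with
an advancing sequence, duplicates at or below the last processed sequence
are dropped, and the acknowledged sequence lets the sender free saved
responses.

    #include <stdbool.h>
    #include <stdint.h>

    /* illustrative per-peer reliability state, not the real scoutfs structs */
    struct peer_seq_state {
            uint64_t next_send_seq; /* assigned to each outgoing message */
            uint64_t recv_seq;      /* highest sequence we have processed */
    };

    /* stamp an outgoing message with the next sequence number */
    static uint64_t assign_send_seq(struct peer_seq_state *st)
    {
            return st->next_send_seq++;
    }

    /*
     * Resent messages arrive after reconnection with sequence numbers we
     * have already processed; they're recognized and dropped here.
     */
    static bool should_process(struct peer_seq_state *st, uint64_t seq)
    {
            if (seq <= st->recv_seq)
                    return false;   /* duplicate, already processed */
            st->recv_seq = seq;
            return true;
    }

    /*
     * The receiver sends recv_seq back to the sender, which can then free
     * any saved response at or below the acknowledged sequence.
     */
    static bool can_free_response(uint64_t response_seq, uint64_t acked_seq)
    {
            return response_seq <= acked_seq;
    }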

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
20f4e1c338 scoutfs: put magic value in block header
The super block had a magic value that was used to identify that the
block should contain our data structure.  But it was called an 'id'
which was confused with the header fsid in the past.  Also, the btree
blocks aren't using a similar magic value at all.

This moves the magic value into the header and creates values for the
super block and btree blocks.  Both are written but the btree block
reads don't check the value.
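
Roughly, the change amounts to an on-disk layout like the following sketch
(the magic values and field names here are illustrative; the real
definitions live in format.h):

    #include <linux/types.h>

    /* example magic values only; the real constants live in format.h */
    #define BLOCK_MAGIC_SUPER       0x103c428bU
    #define BLOCK_MAGIC_BTREE       0xe597f96dU

    /* shared header at the start of every metadata block */
    struct example_block_header {
            __le32 crc;
            __le32 magic;   /* identifies the structure the block holds */
            __le64 fsid;
            __le64 seq;
            __le64 blkno;
    } __packed;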

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
675275fbf1 scoutfs: use hdr.fsid in greeting instead of id
The network greeting exchange was mistakenly using the global super
block magic number instead of the per-volume fsid to identify the
volumes that the endpoints are working with.  This prevented the check
from doing its only job: to fail when clients in one volume try to
connect to a server in another.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
288d781645 scoutfs: start and stop server with quorum
Currently all mounts try to get a dlm lock which gives them exclusive
access to become the server for the filesystem.  That isn't going to
work if we're moving to locking provided by the server.

This uses quorum election to determine who should run the server.  We
switch from long running server work blocked trying to get a lock to
calls which start and stop the server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
08a140c8b0 scoutfs: use our locking service
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.

The client code gets some shims to send and receive lock messages to and
from the server.  Callers use our lock mode constants instead of the
DLM's.

Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.

The biggest change is in the client lock state machine.  Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing.  We don't have everything
come through a per-lock work queue.  Instead we send requests either
from the blocking lock caller or from a shrink work queue.  Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.

The different processing contexts leads to a slightly different lock
life cycle.  We refactor and separate allocation and freeing from
tracking and removing locks in data structures.  We add a _get and _put
to track active use of locks and then async references to locks by
holders and requests are tracked separately.
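
Roughly, the separation looks like the following sketch (names and
locking are simplified; the real code protects these counts with the
lock's spinlock):

    #include <linux/atomic.h>
    #include <linux/slab.h>

    /* simplified client lock life cycle, not the actual scoutfs structures */
    struct example_lock {
            atomic_t users;         /* active use by lookups and messages */
            unsigned int holders;   /* callers currently holding the mode */
            unsigned int requests;  /* outstanding requests referencing it */
    };

    static struct example_lock *lock_get(struct example_lock *lck)
    {
            atomic_inc(&lck->users);
            return lck;
    }

    static void lock_put(struct example_lock *lck)
    {
            /* a lock is only freed once nothing references or uses it */
            if (atomic_dec_and_test(&lck->users) &&
                lck->holders == 0 && lck->requests == 0)
                    kfree(lck);
    }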

Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time.  We do have to do a bit of work to make sure we process back to
back grant responses and invalidation requests from the server.

As of this change the lock setup and destruction paths are a little
wobbly.  They'll be shored up as we add lock recovery between the client
and server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
7c8383eddd scoutfs: add scoutfs_lock_rename()
Add a specific lock method for locking the global rename lock instead of
having the caller specify it as a global lock.  We're getting rid of the
notion of lock scopes and requiring all locks to be related to keys.
The rename lock will use magic keys at the end of the volume.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
34b8950bca scoutfs: initial lock server core
Add the core lock server code for providing a lock service from our
server.  The lock messages are wired up but nothing calls them.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
f472c0bc87 scoutfs: add scoutfs_net_response_node()
Today a response can only be sent down the connection that delivered the
request, and only while the request is being processed.  We'll be adding
subsystems that need to send responses asynchronously after initial
request processing.  Give them a call to send a response to a node id
instead of to a node's connection.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
c34dd452a7 scoutfs: add quorum voting
Add a quorum election implementation.  The mounts that can participate
in the election are specified in a quorum config array in the super
block.  Each configured participant is assigned a preallocated block
that it can write to.

All mounts read the quorum blocks to find the member who was elected the
leader and should be running the server.  The voting mounts loop reading
voting blocks and writing their vote block until someone is elected with
a majority.
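
In rough pseudocode the voting loop looks something like this (the helper
names, member limit, and block layout are illustrative, not the real
quorum format):

    /* illustrative voting loop; real code adds timeouts and random backoff */
    static int vote_until_elected(int our_slot, int nr_members)
    {
            int votes[MAX_QUORUM_MEMBERS];  /* votes[i]: member i's candidate */
            int counts[MAX_QUORUM_MEMBERS];
            int i;

            for (;;) {
                    read_all_quorum_blocks(votes, nr_members);  /* hypothetical */

                    memset(counts, 0, sizeof(counts));
                    for (i = 0; i < nr_members; i++)
                            counts[votes[i]]++;

                    /* a majority of configured members elects the leader */
                    for (i = 0; i < nr_members; i++) {
                            if (counts[i] > nr_members / 2)
                                    return i;   /* member i runs the server */
                    }

                    /* no majority yet: cast our vote and try again */
                    write_our_quorum_block(our_slot,
                                           pick_candidate(votes, nr_members));
                    sleep_for_vote_interval();
            }
    }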

Nothing calls this code yet, this adds the initial implementation and
format.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
d57b8232ee scoutfs: move base types in format.h
We had scattered some base types throughout the format file which made
them annoying to reference in higher level structs.  Let's put them at
the top so we can use them without declarations or moving things around
in unrelated commits.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
f75e1e1322 scoutfs: reformat Makefile to one object per line
Reformat the scoutfs-y object list so that there's one object per line.
Diffs now clearly demonstrate what is changing instead of having word
wrapping constantly obscuring changes in the built objects.

(Did everyone spot the scoutfs_trace sorting mistake?  Another reason
not to mash everything into wrapped lines :)).

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
6caa87458b scoutfs: add scoutfs_net_client_node_id()
Some upcoming network request processing paths need access to the
connected client's node_id.  We could add it to the arguments but that'd
be a lot of churn so we'll add an accessor function for now.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
e9f6e79d67 scoutfs: add uniq_name mount option
Each mount is given a unique name specified as a mount option.  When a
mount reconnects with the same unique name it indicates that the old
instance can no longer exist and doesn't need to be fenced.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
8fedfef1cc scoutfs: remove stale net response data comment
There was a time when responding with an error wouldn't include the
caller's data payload.  That hasn't been the case since we added
compaction network requests which include a reference to the compaction
operation with the error response.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
91d190622d scoutfs: remove scoutfs.md file
The current plan is to maintain a nice paper describing the system in
the scoutfs-utils repository.

Signed-off-by: Zach Brown <zab@versity.com>
2018-09-25 13:02:11 -07:00
Brandon Philips
9bb0c60c63 README: add whitepaper link
The white paper is helpful and not linked from the Github README, which will be a primary landing spot for folks discovering the project.
2018-09-19 11:03:11 -07:00
Zach Brown
f8d1489415 scoutfs: add README.md
Add a README.md for github.

Signed-off-by: Zach Brown <zab@versity.com>
2018-09-14 15:18:27 -07:00
Zach Brown
5616175041 scoutfs: update rpm building infrastructure
Update the makefile and spec to our current method of building rpms.

Signed-off-by: Zach Brown <zab@versity.com>
2018-09-14 15:07:10 -07:00
Zach Brown
7e9d40d65a scoutfs: init ret when freeing zero extents
The server forgot to initialize ret to 0 and might return
undefined errnos if a client asked it to free zero extents.

Signed-off-by: Zach Brown <zab@versity.com>
2018-09-12 15:37:45 -07:00
Zach Brown
2cc990406a scoutfs: compact using net requests
Currently compaction is only performed by one thread running in the
server.  Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.

This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server.  This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.

The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight.  It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.

A server thread still coordinates which segments are compacted.  The
search for a candidate compaction operation is largely unchanged.  It
now has to deal with being unable to process a compaction because its
segments are busy.  We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests.  If there are none at the level we move up to the next level.

The server will only issue a given number of compaction requests to a
client at a time.  When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
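
A sketch of that selection (the structure, list handling, and limit are
illustrative, not the real server code):

    #include <linux/list.h>
    #include <linux/types.h>

    #define MAX_CLIENT_COMPACTS     2       /* illustrative per-client limit */

    struct compact_client {
            struct list_head head;          /* entry on the server's client list */
            u64 node_id;
            unsigned int nr_in_flight;      /* compaction requests outstanding */
    };

    /* rotate through connected clients, skipping any at the in-flight limit */
    static struct compact_client *next_compact_client(struct list_head *clients)
    {
            struct compact_client *client;

            list_for_each_entry(client, clients, head) {
                    if (client->nr_in_flight < MAX_CLIENT_COMPACTS) {
                            /* rotate so the next search starts after this one */
                            list_move_tail(&client->head, clients);
                            client->nr_in_flight++;
                            return client;
                    }
            }

            return NULL;    /* every client is at its limit */
    }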

If a client disconnects the server forgets the compactions it had sent
to that client.  If those compactions still need to be processed they'll
be sent to the next client.

The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes.  This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.

The server needs to block as it does work for compaction in the
notify_up and response callbacks.  We move them out from under spin
locks.

The server needs to clean up allocated segnos for a compaction request
that fails.  We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
07eec357ee scoutfs: simplify reliable request delivery
It was a bit of an overreach to try and limit duplicate request
processing in the network layer.  It introduced acks and the necessity
to resync last_processed_id on reconnect.

In testing compaction requests we saw that request processing stopped if
a client reconnected to a new server.  The new server sent low request
ids which the client dropped because they were lower than the ids it got
from the last server.  To fix this we'd need to add smarts to reset
ids when connecting to new servers but not existing servers.

In thinking about this, though, there's a bigger problem.  Duplicate
request processing protection only works up in memory in the networking
connections.  If the server makes persistent changes, then crashes, the
client will resend the request to the new server.  It will need to
discover that the persistent changes have already been made.

So while we protected duplicate network request processing between nodes
that reconnected, we didn't protect duplicate persistent side-effects
of request processing when reconnecting to a new server.  Once you see
that the request implementations have to take this into account then
duplicate request delivery becomes a simpler instance of this same case
and will be taken care of already.  There's no need to implement the
complexity of protecting duplicate delivery between running nodes.

This removes the last_processed_id on the server.  It removes resending
of responses and acks.  Now that ids can be processed out of order we
remove the special known ID of greeting commands.  They can be processed
as usual.  When there are only request and response packets we can
differentiate them with a flag instead of a u8 message type.
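
The resulting header is roughly the following (fields and flag value are
illustrative; the point is that a single flag bit now distinguishes
requests from responses):

    #include <linux/types.h>

    #define NET_FLAG_RESPONSE       (1 << 0)        /* illustrative flag bit */

    struct example_net_header {
            __le64 id;      /* matches a response to its request */
            __le16 data_len;
            __u8 cmd;
            __u8 flags;
    } __packed;

    static inline bool hdr_is_response(struct example_net_header *nh)
    {
            return (nh->flags & NET_FLAG_RESPONSE) != 0;
    }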

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
62d6c11e3c scoutfs: clean up workqueue flags
We had gotten a bit sloppy with the workqueue flags.  We needed _UNBOUND
in some workqueues where we wanted concurrency by scheduling across cpus
instead of waiting for the current (very long running) work on a cpu to
finish.  We add NON_REENTRANT out of an abundance of caution.  It has
gone away in modern kernels and is probably not needed here, but
according to the docs we would want it so we at least document that fact
by using it.
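
For example, a workqueue created with those flags would look roughly like
this (the name and error handling are illustrative, and WQ_NON_REENTRANT
only exists on older kernels):

    #include <linux/workqueue.h>
    #include <linux/errno.h>

    static struct workqueue_struct *example_wq;

    static int example_workqueue_init(void)
    {
            /*
             * WQ_UNBOUND runs works concurrently across cpus instead of
             * queueing behind a long running work on one cpu.
             * WQ_NON_REENTRANT documents that a work item shouldn't run on
             * two cpus at once; it became a no-op and was later removed in
             * modern kernels.
             */
            example_wq = alloc_workqueue("scoutfs_example",
                                         WQ_UNBOUND | WQ_NON_REENTRANT, 0);
            if (!example_wq)
                    return -ENOMEM;
            return 0;
    }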

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
30d5471e4a scoutfs: call net response func outside lock
Today response processing calls a request's response callback from
inside the net spinlock.  This happened to work for the synchronous
blocking request handler who only had to record the result and wake
their waiter.

It doesn't work for server compact response processing which needs to
use IO to commit the result of the compaction.

This lifts the call to the response function out of complete_send() and
into the response processing work function.  Other complete_send()
callers now won't trigger the response function call and can't see
errors, which they all ignored anyway.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
00adbd31be scoutfs: add sparse bitmap library
Add a quick library for maintaining a very large bitmap with sparse
allocation.
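
A rough sketch of the approach (chunk size and helper names are
illustrative): the bitmap is a tree of small fixed-size chunks that are
only allocated once a bit in their range is set.

    #include <linux/rbtree.h>
    #include <linux/slab.h>
    #include <linux/bitops.h>
    #include <linux/errno.h>

    #define CHUNK_BITS      512     /* illustrative chunk size */

    /* one allocated chunk of an otherwise empty, very large bitmap */
    struct sbm_chunk {
            struct rb_node node;    /* indexed by start in an rbtree */
            u64 start;              /* first bit covered, CHUNK_BITS aligned */
            unsigned long bits[CHUNK_BITS / BITS_PER_LONG];
    };

    /* set a bit, allocating its chunk on first use */
    static int sbm_set(struct rb_root *root, u64 bit)
    {
            u64 start = bit - (bit % CHUNK_BITS);
            struct sbm_chunk *chunk;

            chunk = sbm_find_chunk(root, start);    /* hypothetical lookup */
            if (!chunk) {
                    chunk = kzalloc(sizeof(*chunk), GFP_NOFS);
                    if (!chunk)
                            return -ENOMEM;
                    chunk->start = start;
                    sbm_insert_chunk(root, chunk);  /* hypothetical insert */
            }

            set_bit(bit - chunk->start, chunk->bits);
            return 0;
    }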

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
1ed0c6017f scoutfs: remove unused keys manifest field
Keys used to be variable length so the manifest struct on the wire ended
in key payloads.  The keys are now fixed size so that field is no longer
necessary or used.  It's an artifact that should have been removed when
the keys were made fixed length.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
0adbd7e439 scoutfs: have server track connected clients
This extends the notify up and down calls to let the server keep track
of connected clients.

It adds the notion of per-connection info that is allocated for each
connection.  It's passed to the notification callbacks so that callers
can have per-client storage without having to manage allocations in the
callbacks.

It adds the node_id argument to the notification callbacks to indicate
if the call is for the listening socket itself or an accepted client
connection on that listening socket.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
746293987c scoutfs: let server send msg to specific node_id
The current sending interfaces only send a message to the peer of a
given connection.  For the server to send to a specific connected client
it'd have to track connections itself and send to them.

This adds a sending interface that uses the node_id to send to a
specific connected client.  The conn argument is the listening socket
and its accepted sockets are searched for the destination node_id.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
8b3193ea72 scoutfs: server allocates node_id
Today node_ids are randomly assigned.  This adds the risk of failure
from random number generation and still allows for the risk of
collisions.

Switch to assigning strictly advancing node_ids on the server during the
initial connection greeting message exchange.  This simplifies the
system and allows us to derive information from the relative values of
node_ids in the system.
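
Conceptually the assignment is as simple as the following sketch (the
struct and field names are illustrative; the real code persists the new
value with a server commit before replying to the greeting):

    /* sketch: hand out strictly advancing node_ids during the greeting */
    static u64 alloc_node_id(struct example_server *server)
    {
            u64 node_id;

            spin_lock(&server->lock);
            node_id = le64_to_cpu(server->super.next_node_id);
            server->super.next_node_id = cpu_to_le64(node_id + 1);
            spin_unlock(&server->lock);

            return node_id;
    }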

To do this we refactor the greeting code from internal to the net layer
to proper client and server request and response processing.  This lets
the server manage persistent node_id storage and allows the client to
wait for a node_id during mount.

Now that net_connect is sync in the client we don't need the notify_up
callback anymore.  The client can perform those duties when the connect
returns.

The net code still has to snoop on request and response processing to
see when the greetings have been exchanged and allow messages to flow.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
f06b39cd7e scoutfs: destroy items after locks
We were destroying the item subsystem before shutting down locking.
This is wrong because locking shutdown invalidates items covered by the
locks.  It can walk into freed memory and crash or corrupt other memory.

The fix is to tear down the item subsystem after tearing down locks.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-24 15:40:53 -07:00
Zach Brown
ed9f4b6a22 scoutfs: calculate and enforce segment csum
We had fields in the segment header for the crc but weren't using them.
This calculates the crc on write and verifies it on read.  The crc
covers the used bytes in the segment as indicated by the total_bytes
field.
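
Schematically the calculation looks like this (struct and field names are
illustrative; total_bytes gives the used length and the crc field itself
is skipped):

    #include <linux/crc32c.h>

    static u32 example_segment_crc(struct example_segment_header *hdr)
    {
            u32 len = le32_to_cpu(hdr->total_bytes) - sizeof(hdr->crc);

            /* checksum the used bytes of the segment, skipping the crc field */
            return crc32c(~0, (char *)hdr + sizeof(hdr->crc), len);
    }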

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-21 13:28:36 -07:00
Zach Brown
a25b6324d2 scoutfs: maintain free_blocks in one place
The free_blocks counter in the super is meant to track the number of
total blocks in the primary free extent index.  Callers of extent
manipulation were trying to keep it in sync with the extents.

Segment allocation was allocating extents manually using a cursor.  It
forgot to update free_blocks.  Segment freeing then freed the segment as
an extent which did update free_blocks.  This inflated the free_blocks
count over time, eventually pushing it past the total block count and
causing df to report negative usage.

This updates the free_blocks count in server extent io which is the only
place we update the extent items themselves.  This ensures that we'll
keep the count in sync with the extent items.  Callers don't have to
worry about it.
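
In effect the accounting collapses to one helper along these lines (names
are illustrative), called only where free extent items are inserted or
removed:

    /* adjust the super's free_blocks count alongside the extent items */
    static void account_free_extent(struct example_super *super,
                                    u64 blocks, bool inserting)
    {
            if (inserting)
                    le64_add_cpu(&super->free_blocks, blocks);
            else
                    le64_add_cpu(&super->free_blocks, -blocks);
    }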

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-21 13:25:05 -07:00
Zach Brown
a72b7a9001 scoutfs: convert locks seq to trivial seq
Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
07df8816e3 scoutfs: add trivial seq file for net messages
Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
bafa4a6720 scoutfs: add net header printk args
We have macros for creating and printing trace arguments for our network
header struct.  Add a macro for making simple printk call args for
normal formatted output callers.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
8ff3ef3131 scoutfs: add trivial seq file for net connections
Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
c4cb5c0651 scoutfs: add trivial seq file wrapper
Add a seq file wrapper which lets callers track objects easily.
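
A minimal sketch of what such a wrapper builds on (the object and its
fields are hypothetical; the kernel's single_open() seq_file helpers do
the heavy lifting):

    #include <linux/seq_file.h>
    #include <linux/fs.h>

    struct example_obj {
            u64 id;
            unsigned int count;
    };

    /* the caller only supplies a show function for one tracked object */
    static int example_show(struct seq_file *m, void *unused)
    {
            struct example_obj *obj = m->private;

            seq_printf(m, "id %llu count %u\n", obj->id, obj->count);
            return 0;
    }

    static int example_open(struct inode *inode, struct file *file)
    {
            return single_open(file, example_show, inode->i_private);
    }

    static const struct file_operations example_fops = {
            .open           = example_open,
            .read           = seq_read,
            .llseek         = seq_lseek,
            .release        = single_release,
    };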

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
d708421cfb scoutfs: remove unused client and server code
The previous commit added shared networking code and disabled the old
unused code.  This removes all that unused client and server code that
was refactored to become the shared networking code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
17dec65a52 scoutfs: add bidirectional network messages
The client and server networking code was a bit too rudimentary.

The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to.  We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.

This refactors sending and receiving in both the client and server code
into shared networking code.  It's built around a connection struct that
then holds the message state.  Both peers on the connection can send
requests and send responses.

The existing code only retransmitted requests down newly established
connections.  Requests could be processed twice.

This adds robust reliability guarantees.  Requests are resent until
their response is received.  Requests are only processed once by a given
peer, regardless of the connection's transport socket.  Responses are
reliably resent until acknowledged.

This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal.  A following commit will remove all
the unused code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
295bf6b73b scoutfs: return free extents to server
Freed file data extents are tracked in free extent items in each node.
They could only be re-used in the future for file data extent allocation
on that node.  Allocations on other nodes or, critically, segment
allocation on the server could never see those free extents.  With the
right allocation patterns, particularly allocating on node X and freeing
on node Y, all the free extents can build up on a node and starve other
allocations.

This adds a simple high water mark after which nodes start returning
free extents to the server.  From there they can satisfy segment
allocations or be sent to other nodes for file data extent allocation.
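
The policy is roughly the following (the threshold, struct, and helper
are hypothetical):

    #define FREE_BLOCKS_HIGH_WATER  (64 * 1024)     /* illustrative threshold */

    /* once a node's free extents exceed the mark, return some to the server */
    static int maybe_return_free_extents(struct example_node_info *nfi)
    {
            int ret = 0;

            while (nfi->node_free_blocks > FREE_BLOCKS_HIGH_WATER && ret == 0)
                    ret = return_extent_to_server(nfi);     /* hypothetical */

            return ret;
    }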

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-05 16:19:31 -07:00
Zach Brown
784cda9bee scoutfs: more carefully set lock bast mode
Locks get a bast call from the dlm when a remote node is blocked waiting
for the mode of a lock to change.  We'd set the mode that we need to
convert to and kick off lock work to make forward progress.

The bast calls can happen at any old time.  If a call came in as we were
unlocking a lock we'd set its bast mode even though it was being
unlocked and would not need to be down converted.

Usually this bad mode would be fine because the lock was idle and would
just be freed after being unlocked.

But if someone was actively waiting for the lock it would get stuck in
an unlocked state.  The bad bast mode would prevent it from being
upconverted, but the waiters would stop it from being freed.

We fix this by only setting the mode from the bast call if there is
really work to do.  This avoids setting the bast for unlocked locks
which will let the lock state machine re-acquire them and make forward
progress on behalf of the waiters.
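
The shape of the fix is roughly the following (the modes, locking, and
helpers are illustrative):

    /* only record a bast mode when there's really a conversion to perform */
    static void example_bast(struct example_lock *lck, int blocked_mode)
    {
            bool kick = false;

            spin_lock(&lck->spinlock);
            if (lck->granted_mode != MODE_UNLOCKED &&
                modes_conflict(lck->granted_mode, blocked_mode)) {
                    lck->bast_mode = blocked_mode;
                    kick = true;
            }
            spin_unlock(&lck->spinlock);

            if (kick)
                    queue_lock_work(lck);   /* hypothetical: kick the state machine */
    }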

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-02 14:16:50 -07:00
Zach Brown
e19716a0f2 scoutfs: clean up super block use
The code that works with the super block had drifted a bit.  We still
had two super blocks from an old design and we weren't doing anything
with the crc.

Move to only using one super block at a fixed blkno and store and verify
its crc field by sharing code with the btree block checksumming.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 15:56:42 -07:00
Zach Brown
5d9ad0923a scoutfs: trace net structs
The userspace trace event printing code has trouble with arguments that
refer to fields in entries.  Add macros to make entries for all the
fields and use them as the formatted arguments.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
53e8ab0f7b scoutfs: trace extent struct
The userspace trace event printing code has trouble with arguments that
refer to fields in entries.  Add macros to make entries for all the
fields and use them as the formatted arguments.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
dfac36a9aa scoutfs: trace key struct
The userspace trace event printing code has trouble with arguments that
refer to fields in entries.  Add macros to make entries for all the
fields and use them as the formatted arguments.

We also remove the mapping of zone and type to strings.  It's smaller to
print the values directly and gets rid of some silly code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
5935a3f43e scoutfs: remove unused trace events
These trace events were all orphaned long ago by commits which removed
their callers but forgot to remove their definitions.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
fddc3a7a75 scoutfs: minimize commit writeback latencies
Our simple transaction machinery causes high commit latencies if we let
too much dirty file data accumulate.

Small files have a natural limit on the amount of dirty data because
they have more dirty items per dirty page.  They fill up the single
segment sooner and kick off a commit which finds a relatively small
amount of dirty file data.

But large files can reference quite a lot of dirty data with a small
amount of extent items which don't fill up the transaction's segment.
During large streaming writes we can fill up memory with dirty file data
before filling a segment with mapping extent metadata.  This can lead to
high commit latencies when memory is full of dirty file pages.

Regularly kicking off background writeback behind streaming write
positions reduces the amount of dirty data that commits will find and
have to write out.
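
A sketch of the idea (the window size and trigger are illustrative; the
real code tunes when and how much to write back):

    #include <linux/fs.h>

    #define WRITEBACK_BEHIND_BYTES  (16 * 1024 * 1024)      /* illustrative */

    /* kick off writeback well behind a streaming write position */
    static void example_kick_writeback(struct address_space *mapping, loff_t pos)
    {
            if (pos > WRITEBACK_BEHIND_BYTES)
                    filemap_fdatawrite_range(mapping, 0,
                                             pos - WRITEBACK_BEHIND_BYTES);
    }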

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
59170f41b1 scoutfs: revive item deletion path
The inode deletion path had bit rotted.  Delete the ifdefs that were
stopping it from deleting all the items associated with an inode.  There
can be a lot of xattr and data mapping items so we have them manage
their own transactions (data already did).  The xattr deletion code was
trying to get a lock while the caller already held it so delete that.
Then we accurately account for the small number of remaining items that
finally delete the inode.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
0c7ea66f57 scoutfs: add SIC_EXACT
Add an item count call that lets the caller give the exact item count
instead of basing it on the operation they're performing.

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
002daf3c1c scoutfs: return -ENOSPC to client alloc segno
The server send_reply interface is confusing.  It uses errors to shut
down the connection.  Clients need to get -ENOSPC through the message
reply payload instead.

The segno allocation server processing needs to set the segno to 0 so
that the client gets it and translates that into -ENOSPC.
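
On the client side that translation amounts to something like this (the
request helper and structs are hypothetical):

    /* sketch: a zero segno in the server's reply means no space */
    static int example_alloc_segno(struct example_client *client, u64 *segno)
    {
            __le64 lesegno;
            int ret;

            ret = send_alloc_segno_request(client, &lesegno);   /* hypothetical */
            if (ret)
                    return ret;

            *segno = le64_to_cpu(lesegno);
            if (*segno == 0)
                    return -ENOSPC; /* the server had no free segments */

            return 0;
    }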

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00
Zach Brown
876414065b scoutfs: warn if we try IO outside the device
We've had bugs in allocators that return success and crazy block
numbers.  The bad block numbers eventually make their way down to the
context-free kernel warning that IO was attempted outside the device.
This at least gives us a stack trace to help find where it's coming
from.
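
The check boils down to something like this (function name illustrative;
get_capacity() returns the device size in 512 byte sectors):

    #include <linux/genhd.h>
    #include <linux/bug.h>

    /* warn with a stack trace before submitting IO past the end of the device */
    static bool example_io_in_device(struct block_device *bdev, sector_t sector,
                                     unsigned int nr_sectors)
    {
            return !WARN_ON_ONCE(sector + nr_sectors >
                                 get_capacity(bdev->bd_disk));
    }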

Signed-off-by: Zach Brown <zab@versity.com>
2018-06-29 14:42:06 -07:00