Commit Graph

52 Commits

Andy Grover
820b7295f0 cleanup: Unused LIST_HEADs
Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-05 16:23:41 -07:00
Zach Brown
3de703757f Fix weird comment editing error
That comment looked very weird indeed until I recognized that I must
have forgotten to delete the first two attempts at starting the
sentence.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-16 12:02:05 -07:00
Andy Grover
cf278f5fa0 scoutfs: Tidy some enum usage
Prefer named to anonymous enums. This helps readability a little.

Use enum as param type if possible (a couple spots).

Remove unused enum in lock_server.c.

Define enum spbm_flags using shift notation for consistency (see the
sketch below).

Rename get_file_block()'s "gfb" parameter to "flags" for consistency.
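
A minimal sketch of the shift notation style; the flag names here are
hypothetical:

    enum spbm_flags {
            SPBM_FLAG_FIRST         = (1 << 0),
            SPBM_FLAG_SECOND        = (1 << 1),
            SPBM_FLAG_THIRD         = (1 << 2),
    };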

Signed-off-by: Andy Grover <agrover@versity.com>
2020-11-30 13:35:44 -08:00
Andy Grover
e6228ead73 scoutfs: Ensure padding in structs remains zeroed
Audit code for structs allocated on stack without initialization, or
using kmalloc() instead of kzalloc().  A minimal sketch of the hazard
follows the list below.

- avl.c: zero padding in avl_node on insert.
- btree.c: Verify item padding is zero, or WARN_ONCE.
- inode.c: scoutfs_inode contains scoutfs_timespecs, which have padding.
- net.c: zero pad in net header.
- net.h: scoutfs_net_addr has padding, zero it in scoutfs_addr_from_sin().
- xattr.c: scoutfs_xattr has padding, zero it.
- forest.c: item_root in forest_next_hint() appears to either be
    assigned-to or unused, so no need to zero it.
- key.h: Ensure padding is zeroed in scoutfs_key_set_{zeros,ones}
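
A minimal sketch of the hazard, with a hypothetical struct:

    struct example {
            __le64 val;
            __u8 flag;
            /* 7 bytes of implicit tail padding follow "flag" */
    };

    struct example ex;
    memset(&ex, 0, sizeof(ex));             /* stack: zero explicitly */

    struct example *p = kzalloc(sizeof(*p), GFP_NOFS); /* heap: not kmalloc() */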

Signed-off-by: Andy Grover <agrover@versity.com>
2020-10-29 14:15:33 -07:00
Zach Brown
edd8fe075c scoutfs: remove lsm code
Remove all the now unused code that deals with lsm: segment IO, the item
cache, and the manifest.

Signed-off-by: Zach Brown <zab@versity.com>
2020-01-17 11:21:36 -08:00
Zach Brown
feaf17c3a5 scoutfs: add conn destroy workq
Lockdep gets angry when we try to destroy an accepted conn workqueue
from within work in a listening conn's workqueue.  It doesn't recognize
that they have a hierarchical relationship that maintains a consistent
order and we can't get at the workqueue lockdep_map to set subclasses.
We add a destroy workqueue which will have its own class.
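
A sketch of the approach, with assumed names; the dedicated workqueue
gets its own lockdep class, so destroying an accepted conn's workqueue
from within it doesn't look like recursion:

    static void destroy_conn_worker(struct work_struct *work)
    {
            struct conn *conn = container_of(work, struct conn,
                                             destroy_work);

            destroy_workqueue(conn->workq);
            kfree(conn);
    }

    /* instead of calling destroy_workqueue() from the listener's work */
    queue_work(destroy_wq, &conn->destroy_work);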

Signed-off-by: Zach Brown <zab@versity.com>
2019-08-20 15:52:13 -07:00
Zach Brown
ec7f60bebb scoutfs: net conn lifetime tracing
Add trace events for network connections.

Signed-off-by: Zach Brown <zab@versity.com>
2019-08-20 15:52:13 -07:00
Zach Brown
ab7bde9e2c scoutfs: replace node_id with rid in networking
Use the client's rid in networking instead of the node_id.

The node_id no longer has to be allocated by the server and sent in the
greeting.  Instead the client sends its rid to the server in its greeting.

The server then uses the client's announced rid just like it used to use
its node_id.  It's used to record clients in the btree and to
identify clients in send and receive processing.

The use of the rid in networking calls makes its way to locking and
compaction which now use the rid to identify clients instead of the
node_id.

Signed-off-by: Zach Brown <zab@versity.com>
2019-08-20 15:52:13 -07:00
Zach Brown
36b0df336b scoutfs: add unmount barrier
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount.  We can't
let unmounting clients leave the remaining mounted clients without
quorum.

The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests.  It only sends responses to voting
mounts while quorum remains or once all the voting clients are
trying to unmount.

We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to re-establish quorum.

The commit introduces and maintains the unmount_barrier field in the
quorum blocks.  It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.
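
A hypothetical sketch of the client-side check, names assumed:

    /* the server advanced the barrier past what it sent us, so our
     * farewell was processed and we can skip re-establishing quorum */
    if (le64_to_cpu(blk->unmount_barrier) > client->greeting_unmount_barrier)
            finish_unmount();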

The commit then has the clients send their unique name to the server
which stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.

Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shut down and re-established.  This also makes it easier to
make global decisions based on the count of pending farewell requests.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
fa3e0a31c7 scoutfs: use SO_REUSEADDR for server socket
The server's listening address is fixed by the raft config in the super
block.  If it shuts down and rapidly starts back up it needs to bind to
the currently lingering address.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
0bc0ff9300 scoutfs: add clock sync trace events
Generate unique trace events on the send and recv side of each message
sent between nodes.  This can be used to reasonably efficiently
synchronize the monotonic clock in trace events between nodes given only
their captured trace events.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
a546bd0aab scoutfs: check for newlines in msg.h wrappers
The message formatter adds a newline so callers don't have to.  But
sometimes they do and we get double newlines.  Add a build check that
the format string doesn't end in a newline so that we stop adding these.
And fix up all the current offenders.
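
A sketch of such a check, assuming the wrappers take string literals;
indexing a literal is a compile-time constant, so BUILD_BUG_ON() can
reject a trailing newline:

    #define SCOUTFS_FMT_CHECK(fmt)                                  \
            BUILD_BUG_ON(__builtin_constant_p(fmt) &&               \
                         sizeof(fmt) > 1 &&                         \
                         (fmt)[sizeof(fmt) - 2] == '\n')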

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
74366f0df1 scoutfs: make networking more reliable
The current networking code has loose reliability guarantees.  If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection.  The client resends
requests but no responses are resent.  A client's requests could be
processed twice on the same server.  The server throws away disconnected
client state.

This was fine, sort of, for the simple requests we had implemented so
far.  It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.

This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.

The server keeps track of disconnected clients and restores state if the
same client reconnects.  This required some work around the greetings so
that clients and servers can recognize each other.  Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.

Now that connections between the client and server are preserved we can
resend responses across reconnection.  We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
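
A sketch of the recv-side duplicate check, field names assumed:

    u64 seq = le64_to_cpu(nh->seq);

    if (seq <= conn->last_recv_seq)
            return 0;               /* duplicate of a resent message */
    conn->last_recv_seq = seq;      /* echoed back so the peer frees */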

When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.

This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
f472c0bc87 scoutfs: add scoutfs_net_response_node()
Today a response can only be sent down the connection that delivered
its request, and only while the request is being processed.  We'll be adding
subsystems that need to send responses asynchronously after initial
request processing.  Give them a call to send a response to a node id
instead of to a node's connection.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
6caa87458b scoutfs: add scoutfs_net_client_node_id()
Some upcoming network request processing paths need access to the
connected client's node_id.  We could add it to the arguments but that'd
be a lot of churn so we'll add an accessor function for now.
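
The accessor might look something like this, assuming the connection
records the node_id during the greeting exchange:

    u64 scoutfs_net_client_node_id(struct scoutfs_net_connection *conn)
    {
            return conn->node_id;
    }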

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
8fedfef1cc scoutfs: remove stale net response data comment
There was a time when responding with an error wouldn't include the
caller's data payload.  That hasn't been the case since we added
compaction network requests which include a reference to the compaction
operation with the error response.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
2cc990406a scoutfs: compact using net requests
Currently compaction is only performed by one thread running in the
server.  Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.

This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server.  This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.

The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight.  It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.

A server thread still coordinates which segments are compacted.  The
search for a candidate compaction operation is largely unchanged.  It
now has to deal with being unable to process a compaction because its
segments are busy.  We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests.  If there are none at the level we move up to the next level.

The server will only issue a given number of compaction requests to a
client at a time.  When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
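
A sketch of the selection loop, with assumed names; a real version
would also remember where it left off so the rotation is fair:

    list_for_each_entry(client, &server->clients, head) {
            if (client->compacts_in_flight < MAX_CLIENT_COMPACTS)
                    return client;
    }
    return NULL;            /* everyone's at the limit, retry later */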

If a client disconnects the server forgets the compactions it had sent
to that client.  If those compactions still need to be processed they'll
be sent to the next client.

The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes.  This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.

The server needs to block as it does work for compaction in the
notify_up and response callbacks.  We move them out from under spin
locks.

The server needs to clean up allocated segnos for a compaction request
that fails.  We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
07eec357ee scoutfs: simplify reliable request delivery
It was a bit of an overreach to try and limit duplicate request
processing in the network layer.  It introduced acks and the necessity
to resync last_processed_id on reconnect.

In testing compaction requests we saw that request processing stopped if
a client reconnected to a new server.  The new server sent low request
ids which the client dropped because they were lower than the ids it got
from the last server.  To fix this we'd need to add smarts to reset
ids when connecting to new servers but not existing servers.

In thinking about this, though, there's a bigger problem.  Duplicate
request processing protection only works up in memory in the networking
connections.  If the server makes persistent changes, then crashes, the
client will resend the request to the new server.  It will need to
discover that the persistent changes have already been made.

So while we protected duplicate network request processing between nodes
that reconnected, we didn't protect duplicate persistent side-effects
of request processing when reconnecting to a new server.  Once you see
that the request implementations have to take this into account then
duplicate request delivery becomes a simpler instance of this same case
and will be taken care of already.  There's no need to implement the
complexity of protecting duplicate delivery between running nodes.

This removes the last_processed_id on the server.  It removes resending
of responses and acks.  Now that ids can be processed out of order we
remove the special known ID of greeting commands.  They can be processed
as usual.  When there are only request and response packets we can
differentiate them with a flag instead of a u8 message type.
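
A sketch of the resulting header, layout assumed:

    #define SCOUTFS_NET_FLAG_RESPONSE (1 << 0)

    struct net_header_sketch {
            __le64 id;              /* matches a response to its request */
            __le16 data_len;
            __u8 cmd;
            __u8 flags;             /* request if clear, response if set */
    };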

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
62d6c11e3c scoutfs: clean up workqueue flags
We had gotten a bit sloppy with the workqueue flags.  We needed _UNBOUND
in some workqueues where we wanted concurrency by scheduling across cpus
instead of waiting for the current (very long running) work on a cpu to
finish.  We add NON_REENTRANT out of an abundance of caution.  It has
gone away in modern kernels and is probably not needed here, but
according to the docs we would want it so we at least document that fact
by using it.
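
A sketch of the flags in use; WQ_NON_REENTRANT only exists on the older
kernels this targets, as noted above:

    wq = alloc_workqueue("scoutfs_net", WQ_UNBOUND | WQ_NON_REENTRANT, 0);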

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
30d5471e4a scoutfs: call net response func outside lock
Today response processing calls a request's response callback from
inside the net spinlock.  This happened to work for the synchronous
blocking request handler who only had to record the result and wake
their waiter.

It doesn't work for server compact response processing which needs to
use IO to commit the result of the compaction.

This lifts the call to the response function out of complete_send() and
into the response processing work function.  Other complete_send()
callers now won't trigger the response function call and can't see
errors, which they all ignored anyway.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
0adbd7e439 scoutfs: have server track connected clients
This extends the notify up and down calls to let the server keep track
of connected clients.

It adds the notion of per-connection info that is allocated for each
connection.  It's passed to the notification callbacks so that callers
can have per-client storage without having to manage allocations in the
callbacks.

It adds the node_id argument to the notification callbacks to indicate
if the call is for the listening socket itself or an accepted client
connection on that listening socket.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
746293987c scoutfs: let server send msg to specific node_id
The current sending interfaces only send a message to the peer of a
given connection.  For the server to send to a specific connected client
it'd have to track connections itself and send to them.

This adds a sending interface that uses the node_id to send to a
specific connected client.  The conn argument is the listening socket
and its accepted sockets are searched for the destination node_id.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
8b3193ea72 scoutfs: server allocates node_id
Today node_ids are randomly assigned.  This adds the risk of failure
from random number generation and still allows for the risk of
collisions.

Switch to assigning strictly advancing node_ids on the server during the
initial connection greeting message exchange.  This simplifies the
system and allows us to derive information from the relative values of
node_ids in the system.

To do this we refactor the greeting code from internal to the net layer
to proper client and server request and response processing.  This lets
the server manage persistent node_id storage and allows the client to
wait for a node_id during mount.

Now that net_connect is sync in the client we don't need the notify_up
callback anymore.  The client can perform those duties when the connect
returns.

The net code still has to snoop on request and response processing to
see when the greetings have been exchanged and allow messages to flow.

Signed-off-by: Zach Brown <zab@versity.com>
2018-08-28 15:34:30 -07:00
Zach Brown
07df8816e3 scoutfs: add trivial seq file for net messages
Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
8ff3ef3131 scoutfs: add trivial seq file for net connections
Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
17dec65a52 scoutfs: add bidirectional network messages
The client and server networking code was a bit too rudimentary.

The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to.  We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.

This refactors sending and receiving in both the client and server code
into shared networking code.  It's built around a connection struct that
then holds the message state.  Both peers on the connection can send
requests and send responses.

The existing code only retransmitted requests down newly established
connections.  Requests could be processed twice.

This adds robust reliability guarantees.  Requests are resent until
their response is received.  Requests are only processed once by a given
peer, regardless of the connection's transport socket.  Responses are
reliably resent until acknowledged.

This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal.  A following commit will remove all
the unused code.

Signed-off-by: Zach Brown <zab@versity.com>
2018-07-27 09:50:21 -07:00
Zach Brown
c1b2ad9421 scoutfs: separate client and server net processing
The networking code was really suffering by trying to combine the client
and server processing paths into one file.  The code can be a lot
simpler by giving the client and server their own processing paths that
take their different socket lifecycles into account.

The client maintains a single connection.  Blocked senders work on the
socket under a sending mutex.  The recv path runs in work that can be
canceled after first shutting down the socket.

A long running server work function acquires the listener lock, manages
the listening socket, and accepts new sockets.  Each accepted socket has
a single recv work blocked waiting for requests.  That then spawns
concurrent processing work which sends replies under a sending mutex.
All of this is torn down by shutting down sockets and canceling work
which frees its context.

All this restructuring makes it a lot easier to track what is happening
in mount and unmount between the client and server.  This fixes bugs
where unmount was failing because the monolithic socket shutdown
function was queueing other work while it was draining.

Signed-off-by: Zach Brown <zab@versity.com>
2017-08-04 10:47:42 -07:00
Mark Fasheh
a65b28d440 scoutfs: lock impossible ino group for listen lock
Otherwise we get into a problem where the listen lock is conflicting with
regular inode group requests. Since we never drop the listen lock and it (by
design) blocks progress on another node, those inode group requests may
hang.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-19 19:04:41 -05:00
Zach Brown
8a42a4d75a scoutfs: introduce lock names
Instead of locking one resource with ranges we'll have callers map their
logical resources to a tuple name that we'll store in lock resources.
The names still map to ranges for cache reading and cache invalidation
but the ranges aren't exposed to the DLM.  This lets us use the stock
DLM and distribute resources across masters.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
6de2bfc1c5 scoutfs: use the dlm mode/levels directly
We intend to use more of the dlm lock levels.  Let's use its modes
directly so we don't have to maintain a mental map from differently
named modes.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
8d29c82306 scoutfs: sort keys by zone, then inode, then type
Holding a DLM lock protects a range of the key space.  The DLM locks
span inodes or regions of inodes.  We need the sort order in LSM items
to match the DLM range keys so that we can read all the items covered by
a lock into the cache from a region of LSM segments.  If their orders
differed then we'd have to jump around segments to find all the items
covered by a given DLM lock.

Previously we were sorting by type then, within types, by inode.  Now we
want to sort by inode then by type.  But there are structures which
previously had a type but weren't then sorted by inode.  We introduce
zones as the primary sort key.  Inode index and node zones are sorted by
the inode fields and node ids respectively.  Then comes the fs zone
sorted first by inode and then by key type.
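
A sketch of the resulting sort order, with assumed field names and
widths:

    struct key_sketch {
            __u8 zone;              /* primary: inode index, node, or fs */
            __le64 ino;             /* then the inode (or node id) */
            __u8 type;              /* then the item type */
            __le64 offset;          /* finally type-specific fields */
    };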

The bulk of this is the mechanical introduction of the zone field to the
keys, moving the type field down, and a bulk rename of _KEY to _TYPE.
But there are some more substantial changes.

The orphan keys needed to be put in a zone.   They fit in the NODE zone
which is all about resources that nodes hold and would need to be
cleaned up if the node went away.

The key formatting is significantly changed to match the new sort order.
Formatted keys are now generally of the form "zone.primary.type..."

And finally with the keys now properly sorted by inodes we can correctly
construct a single range of item cache keys to invalidate when unlocking
the inode group locks.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
690049c293 scoutfs: add GET_MANIFEST_ROOT network op
We're going to need to be able to sample the current stable manifest
root occasionally.  We're adding it now because we don't yet
have the lock plumbing that would provide the lvb.  Eventually
this call will bubble up into the locking and the root will be
stored in the lock instead of always requested.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-19 13:30:03 -07:00
Zach Brown
c2f13ccf24 scoutfs: have net.c commit btree blocks
Convert the net server metadata dirtying and committing code to use the
btree instead of the ring.  It has to be careful to set up and tear down
the btree info as it starts up and shuts down the server.

This fixes up some questionable setup/teardown changes made in the
previous patches to convert the manifest and allocator.  We could rebase
the patches to merge those together.  But given that the previous
patches don't work at all without the net updates it might not be worth
the trouble.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Zach Brown
ff5a094833 scoutfs: store allocator regions in btree
Convert the segment allocator to store its free region bitmaps in the
btree.

This is a very straightforward mechanical transformation.  We split the
allocator region into a big-endian index key and the bitmap value
payload.  We're careful to operate on aligned copies of the bitmaps so
that they're long aligned.

We can remove all the funky functions that were needed when writing the
ring.  All we're left with is a call to apply the pending allocations to
dirty btree blocks before writing the btree.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Zach Brown
fc50072cf9 scoutfs: store manifest entries in the btree
Convert the manifest to store entries in persistent btree keys and
values instead of using the rbtree in memory from the ring.

The btree doesn't have a sort function.  It just compares variable
length keys.  The most complicated part of this transformation is
dealing with the fallout of this.  The compare function can't compare
different search keys and item keys so searches need to construct full
synthetic btree keys to search.  It also can't return different
comparisons, like overlapping, so the caller needs to do a bit more work
to use key comparisons to find overlapping segments.  And it can't
compare differently depending on the level of the manifest so we store
the manifest in keys differently depending on whether it's in level 0 or
not.

All mount clients can now see the manifest blocks.  They can query the
manifest directly when trying to find segments to read.  We can get rid
of all the networking calls that were finding the segments for readers.

We change the manifest functions that relied on the ring to
make changes in the manifest persistent.  We don't touch the allocator
or the rest of the manifest server, though, so this commit breaks the
world.  It'll be restored in future patches as we update the segment
allocator and server to work with the btree.

Signed-off-by: Zach Brown <zab@versity.com>
2017-07-08 10:59:40 -07:00
Mark Fasheh
e6f3b3ca8f scoutfs: add lock caching
We refcount our locks and hold them across system calls. If another node
wants access to a given lock we'll mark it as blocking in the bast and queue
a work item so that the lock can later be released. Otherwise locks are
free'd under memory pressure, unmount or after a timer fires.
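
A sketch of the blocking-AST path described above, with assumed names:

    static void scoutfs_bast(void *astarg, int mode)
    {
            struct scoutfs_lock *lck = astarg;

            lck->flags |= SCOUTFS_LOCK_BLOCKING;
            queue_work(lck->sbi->lock_wq, &lck->release_work);
    }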

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-07-06 18:15:11 -05:00
Zach Brown
f7701177d2 scoutfs: throttle addition of level 0 segments
Writers can add level 0 segments much faster (~20x) than compaction can
compact them down into the lower levels.  Without a limit on the number
of level 0 segments, item reading can try to read an extraordinary number
of level 0 segments and wedge the box with nonreclaimable page allocations.
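
A sketch of the throttle, names assumed:

    /* writers wait for compaction to drain level 0 below the limit */
    wait_event(sbi->level0_waitq,
               level0_segment_count(sbi) < LEVEL0_SEGMENT_LIMIT);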

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Zach Brown
5f5729b2a4 scoutfs: add sticky compaction
As we write segments we're not limiting the number of segments they
intersect at the next level.  Compactions are limited to a fanout's
worth of overlapping segments.  This means that we can get a compaction
where the upper level segment overlaps more lower level segments than are
part of the compaction.  In this case we can't write the remaining upper
level items at the lower level because now we can have a level with
segments whose keys intersect.

Instead we detect this compaction case.  We call it sticky because after
merging with the lower level segments the remaining items in the upper
level need to stick to the upper level.  The next time compaction comes
around it'll compact the remaining items with the additional lower
overlapping segments.

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-27 14:04:38 -07:00
Mark Fasheh
e711c15acf scoutfs: use dlm for locking
To actually use it, we first have to copy symbols over from the dlm build
into the scoutfs source directory. Make that happen automatically for us in
the Makefile.

The only users of locking at the moment are mount, unmount and xattr
read/write. Adding more locking calls should be a straight-forward endeavor.

The LVB based server ip communication didn't work out, and LVBs as they are
written don't make sense in a range locking world. So instead, we record the
server ip address in the superblock. This is protected by the listen lock,
which also arbitrates which node will be the manifest server.

We take and drop the dlm lock on each lock/unlock call. Lock caching will
come in a future patch.

Signed-off-by: Mark Fasheh <mfasheh@versity.com>
2017-06-23 15:08:02 -05:00
Zach Brown
2bd698b604 scoutfs: set NODELAY and REUSEADDR on net sockets
Add a helper that creates a socket and sets nodelay for all sockets and
set reuseaddr in listening sockets.
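
A sketch of such a helper, assuming this era's four-argument
sock_create_kern() and kernel_setsockopt():

    static int scoutfs_create_sock(struct socket **sockp, bool listener)
    {
            int one = 1;
            int ret;

            ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP,
                                   sockp);
            if (ret)
                    return ret;

            kernel_setsockopt(*sockp, SOL_TCP, TCP_NODELAY,
                              (char *)&one, sizeof(one));
            if (listener)
                    kernel_setsockopt(*sockp, SOL_SOCKET, SO_REUSEADDR,
                                      (char *)&one, sizeof(one));
            return 0;
    }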

Signed-off-by: Zach Brown <zab@versity.com>
2017-06-06 14:29:05 -07:00
Zach Brown
b7bbad1fba scoutfs: add precise transaction item reservations
We had a simple mechanism for ensuring that a transaction didn't create
more items than would fit in a single written segment.  We calculated
the most dirty items that a holder could generate and assumed that all
holders dirtied that much.

This had two big problems.

The first was that it wasn't accounting for nested holds.
write_begin/end calls the generic inode dirtying path while holding a
transaction.  This ended up deadlocking as the dirty inode waited to be
able to write while the trans hold it took back in write_begin prevented
writeout.

The second was that the worst case (full size xattr) item dirtying is
enormous and meaningfully restricts concurrent transaction holders.
With no currently dirty items you can have fewer than 16 full size xattr
writes.  This concurrency limit only gets worse as the transaction fills
up with dirty items.

This fixes those problems.  It adds precise accounting of the dirty
items that can be created while a transaction is held.  These
reservations are tracked in journal_info so that they can be used by
nested holds.  The precision allows much greater concurrency as
something like a create will try to reserve a few hundred bytes instead
of 64k.  Normal sized xattr operations won't try to reserve the largest
possible space.
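
A sketch of how nested holds might share a reservation through
journal_info, names assumed:

    struct trans_hold {
            unsigned int items;     /* remaining reserved items */
            unsigned int vals;      /* remaining reserved value bytes */
            int count;              /* nesting depth */
    };

    /* an inner hold reuses the outer reservation instead of deadlocking */
    struct trans_hold *hold = current->journal_info;

    if (hold) {
            hold->count++;
            return 0;
    }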

We add some feedback from the item cache to the transaction to issue
warnings if a holder dirties more items than it reserved.

Now that we have precise item/key/value counts (segment space
consumption is a function of all three :/) we can't have a single atomic
counter track transaction holders.  We add a long-overdue trans_info and
put a
proper lock and fields there and much more clearly track transaction
serialization amongst the holders and writer.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:15:13 -07:00
Zach Brown
5f11cdbfe5 scoutfs: add and index inode meta and data seqs
For each transaction we send a message to the server asking for a
unique sequence number to associate with the transaction.  When we
change metadata or data of an inode we store the current transaction seq
in the inode and we index it with index items like the other inode
fields.

The server remembers the sequences it gives out.  When we go to walk the
inode sequence indexes we ask the server for the largest stable seq and
limit results to that seq.  This ensures that we never return seqs that
belong to still-dirty items, so inodes and seqs never appear in the past.

Nodes use the sync timer to regularly cycle through seqs and ensure that
inode seq index walks don't get stuck on their otherwise idle seq.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-23 12:12:24 -07:00
Zach Brown
373def02f0 scoutfs: remove trade_time message
This was mostly just a demonstration for how to add messages.  We're
about to add a message that we always send on mount so this becomes
completely redundant.

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-18 10:52:04 -07:00
Zach Brown
c678923401 scoutfs: don't try to sync on mount errors
kill_sb tries to sync before calling kill_block_super.   It shouldn't do
this on mount errors that wouldn't have initialized the higher level
systems needed for syncing.
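
A sketch of the guard, with an assumed test for how far mount got:

    static void scoutfs_kill_sb(struct super_block *sb)
    {
            /* mount errors can leave the higher levels uninitialized */
            if (SCOUTFS_SB(sb) && SCOUTFS_SB(sb)->trans_info)
                    scoutfs_sync_fs(sb, 1);
            kill_block_super(sb);
    }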

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 10:48:12 -07:00
Zach Brown
6afeb97802 scoutfs: reference file data with extent items
Our first attempt at storing file data put it in items.  This was easy
to implement but won't be acceptable in the long term.  The cost of the
power of LSM indexing is compaction overhead.  That's acceptable for
fine grained metadata but is totally unacceptable for bulk file data.

This switches to storing file data in separate block allocations which
are referenced by extent items.

The bulk of the change is the mechanics of working with extents.  We
have high level callers which add or remove logical extents and then
underlying mechanisms that insert, merge, or split the items that
the extents are stored in.

We have three types of extent items.  The primary type maps logical file
regions to physical block extents.  The next two store free extents
per-node so that clients don't create lock and LSM contention as they
try and allocate extents.
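
A sketch of the three item shapes, fields assumed:

    /* file extent: maps a logical region to a physical block extent */
    struct file_extent_val {
            __le64 blkno;
            __le64 blocks;
            __u8 flags;
    };

    /* per-node free extents, assumed to be indexed both by position
     * and by size so the allocator can search either way:
     *   key: node_id . FREE_BLKNO  . blkno   ->  blocks
     *   key: node_id . FREE_BLOCKS . blocks  ->  blkno
     */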

To fill those per-node free extents we add messages that communicate free
extents in the form of lists of segment allocations from the server.

We don't do any fancy multi-block allocation yet.  We only allocate
blocks in get_blocks as writes find unmapped blocks.  We do use some
per-task cursors to cache block allocation positions so that these
single block allocations are very likely to merge into larger extents as
tasks stream writes.

This is just the first chunk of the extent work that's coming.  A later
patch adds offline flags and fixes up the change nonsense that seemed
like a good idea here.

The final moving part is that we initiate writeback on all newly
allocated extents before we commit the metadata that references the new
blocks.  We do this with our own dirty inode tracking because the high
level vfs methods are unusably slow in some upstream kernels (they walk
all inodes, not just dirty inodes.)

Signed-off-by: Zach Brown <zab@versity.com>
2017-05-16 10:48:11 -07:00
Zach Brown
d5a2b0a6db Move towards compaction messages
The compaction code is still directly referencing the super block
and calling sync methods as though it was still standalone.  This is
mostly OK because only the server runs it.  But it isn't quite right
because the sync methods no longer make the rings persistent as they
write the item transaction.  The server is in control of that now.

Eventually we'll have compaction messages being sent between the mount
clients and the server.  Let's take a step in that direction by having
the compaction work call net methods to get its compaction parameters
and finish the compaction.  Eventually these would be marshalled through
request/process/reply code.

But in this first step we know that the compaction code is running on
the server so we can forgo all the messaging and just call in to and out
of compaction.  The net calls just holds the ring consistency locks in
the server and call into the manifest to do the work, commiting the
changes when its done.

This is more careful about segno allocation and freeing.  Compaction
doesn't call the allocator directly.  It gets allocations from the
messages and returns them if it doesn't use them.  We actually now
free segnos as they're removed from the manifest.

With the server controlling compaction we can tear all the fiddly level
count watching code out of the manifest.  Item transactions no longer care
about the level counts and the server always tries compaction after the
manifest is updated instead of having the manifest watch the level counts
and call compaction.

Now that the server owns the rings they should not be torn down as the
super is torn down; net does that now.  And we need to be more careful
to be sure that writes from dirtying and compaction are stable before
killing the super.

With all this in place moving to shared compaction involves adding the
messages and negotiating concurrent compactions in the manifest.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-24 14:02:18 -07:00
Zach Brown
cec3f9468a Further isolate rings and compaction
Each mount was still loading the manifest and allocator rings and
starting compaction, even if they were coordinating segment reads
and writes with the server.

This moves ring and compaction setup and teardown from on mount and
unmount to as the server starts up and shuts down.  Now only the server
has the rings resident and is running compaction.

We had to null some of the super info fields so that we can repeatedly
load and destroy the ring indices over the lifetime of a mount.

We also have to be careful not to call between item transactions and
compaction.   We'll restore this functionality with the server in the
future.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
5eefaf34f8 Server updates ring for level0 segment writes
Transaction commits currently directly modify the ring and super block
as segments are written.  As we introduce shared mounts only the server
can modify the ring and super blocks.

This adds network messages to let mounts write items in a level 0
segment while the server modifies the allocator and manifest.

The item transaction commit now sends a message to the server to get an
allocated segno for its new level0 segment and sends a manifest entry to
the server once the segment is written.  The request and reply handlers
for the functions are straight forward.  The processing paths are simple
wrappers around the allocation and update functions that transaction
writing used to call directly.

Now that the item transactions aren't updating the super, sync can't
work with the super sequence numbers.

The server needs to make both allocations and manifest updates
persistent before it sends replies to the client.  We add the ability
for the server processing paths to queue and wait for commits of the
rings and super block.  We can hopefully get reasonable batching by using
a work struct for the commit.  We update the other processing path
callers that modify the rings to use the new commit mechanism.

We add a few segment and manifest functions to work with manifest
entries that describe segments.  This creates a bit of similar looking
code throughout the segment and manifest code but we'll come back and
clean this up once we see what the final shared support looks like.

scoutfs_seg_alloc() now takes the segno from the caller for the segment
it's allocating and inserting into the cache.  Transaction commit uses
the segno it got from the server while compaction still allocates
locally.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
5487aee6a7 Read items with manifest entries from server
Item reading tries to directly walk the manifest to find segments to
read.  That doesn't work when only the server has read the ring and
loaded the manifest.

This adds a network message to ask the server for the manifest entries
that describe the segments that will be needed to read items.

Previously item reading would walk the manifest and build up native
manifest references in a list that it'd use to read.   To implement the
network message we add request sending, processing, and reply parsing
around those original functions.  Item reading now packs its key range
and sends it to the server.  The server walks the manifest and sends the
entries that intersect with the key range.  Then the reply function
builds up the native manifest references that item reading will use.

The net reply functions needed an argument so that the manifest reading
request could pass in the caller's list that the native manifest
references should be added to.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00
Zach Brown
b50de90196 Alloc inodes from pool from server
Inode allocation was always modifying the in-memory super block.  This
doesn't work when the server is solely responsible for modifying the
super blocks.  We add network messages to have mounts send a message to
the server to request inodes that they can use to satisfy allocation.

Signed-off-by: Zach Brown <zab@versity.com>
2017-04-18 13:51:10 -07:00