It was a bit of an overreach to try to limit duplicate request
processing in the network layer. It introduced acks and the need to
resync last_processed_id on reconnect.
In testing compaction requests we saw that request processing stopped if
a client reconnected to a new server. The new server sent low request
ids which the client dropped because they were lower than the ids it got
from the last server. To fix this we'd need to add smarts to reset
ids when connecting to new servers but not existing servers.
In thinking about this, though, there's a bigger problem. Duplicate
request processing protection only exists in memory in the networking
connections. If the server makes persistent changes and then crashes,
the client will resend the request to the new server, which will need
to discover that the persistent changes have already been made.
So while we protected against duplicate network request processing
between nodes that reconnected, we didn't protect against duplicate
persistent side-effects of request processing when reconnecting to a
new server. Once the request implementations have to take this into
account, duplicate request delivery becomes a simpler instance of the
same case and is already taken care of. There's no need to implement
the complexity of protecting against duplicate delivery between
running nodes.
This removes the last_processed_id on the server. It removes resending
of responses and acks. Now that ids can be processed out of order we
remove the special known ID of greeting commands; they can be processed
as usual. With only request and response packets we can differentiate
them with a flag instead of a u8 message type.
Signed-off-by: Zach Brown <zab@versity.com>
We had gotten a bit sloppy with the workqueue flags. We needed _UNBOUND
in some workqueues where we wanted concurrency by scheduling across cpus
instead of waiting for the current (very long running) work on a cpu to
finish. We add NON_REENTRANT out of an abundance of caution. It has
gone away in modern kernels and is probably not needed here, but
according to the docs we would want it so we at least document that fact
by using it.
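Roughly what that looks like at allocation time (the workqueue name
here is just illustrative):
  struct workqueue_struct *wq;
  /* unbound so works run concurrently across cpus; non-reentrant per
   * the documentation, even though modern kernels dropped the flag */
  wq = alloc_workqueue("scoutfs_example_wq",
                       WQ_UNBOUND | WQ_NON_REENTRANT, 0);
  if (!wq)
          return -ENOMEM;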
Signed-off-by: Zach Brown <zab@versity.com>
Today response processing calls a request's response callback from
inside the net spinlock. This happened to work for the synchronous
blocking request handler, which only had to record the result and wake
its waiter.
It doesn't work for server compaction response processing, which needs
to use IO to commit the result of the compaction.
This lifts the call to the response function out of complete_send() and
into the response processing work function. Other complete_send()
callers now won't trigger the response function call and can't see
errors, which they all ignored anyway.
Signed-off-by: Zach Brown <zab@versity.com>
Keys used to be variable length so the manifest struct on the wire ended
in key payloads. The keys are now fixed size so that field is no longer
necessary or used. It's an artifact that should have been removed when
the keys were made fixed length.
Signed-off-by: Zach Brown <zab@versity.com>
This extends the notify up and down calls to let the server keep track
of connected clients.
It adds the notion of per-connection info that is allocated for each
connection. It's passed to the notification callbacks so that callers
can have per-client storage without having to manage allocations in the
callbacks.
It adds the node_id argument to the notification callbacks to indicate
if the call is for the listening socket itself or an accepted client
connection on that listening socket.
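The resulting callback shape is roughly the following; the type and
argument names are illustrative, not the exact prototypes:
  /* info points at the per-connection allocation for this connection,
   * node_id identifies an accepted client connection or indicates the
   * listening socket itself */
  typedef void (*scoutfs_net_notify_t)(struct super_block *sb,
                                       void *info, u64 node_id);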
Signed-off-by: Zach Brown <zab@versity.com>
The current sending interfaces only send a message to the peer of a
given connection. For the server to send to a specific connected client
it'd have to track connections itself and send to them.
This adds a sending interface that uses the node_id to send to a
specific connected client. The conn argument is the listening socket
and its accepted sockets are searched for the destination node_id.
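An approximation of the new call (the name and argument list are a
sketch, not the exact interface):
  /* conn is the listening connection; its accepted sockets are
   * searched for the one whose greeting carried node_id */
  int scoutfs_net_send_to_node(struct super_block *sb,
                               struct scoutfs_net_connection *conn,
                               u64 node_id, u8 cmd,
                               void *arg, unsigned int arg_len);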
Signed-off-by: Zach Brown <zab@versity.com>
Today node_ids are randomly assigned. This adds the risk of failure
from random number generation and still allows for the risk of
collisions.
Switch to assigning strictly advancing node_ids on the server during the
initial connection greeting message exchange. This simplifies the
system and allows us to derive information from the relative values of
node_ids in the system.
To do this we refactor the greeting code from internal to the net layer
to proper client and server request and response processing. This lets
the server manage persistent node_id storage and allows the client to
wait for a node_id during mount.
Now that net_connect is synchronous in the client we don't need the
notify_up callback anymore. The client can perform those duties when
the connect returns.
The net code still has to snoop on request and response processing to
see when the greetings have been exchanged and allow messages to flow.
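The server side of the greeting then does something along these lines
(the lock and field names are approximate):
  /* hand out strictly advancing ids from persistent super storage so
   * they never collide and their relative order is meaningful */
  mutex_lock(&server->mutex);
  node_id = le64_to_cpu(super->next_node_id);
  le64_add_cpu(&super->next_node_id, 1);
  mutex_unlock(&server->mutex);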
Signed-off-by: Zach Brown <zab@versity.com>
We were destroying the item subsystem before shutting down locking.
This is wrong because locking shutdown invalidates items covered by the
locks. It can walk into freed memory and crash or corrupt other memory.
The fix is to tear down the item subsystem after tearing down locks.
Signed-off-by: Zach Brown <zab@versity.com>
We had fields in the segment header for the crc but weren't using it.
This calculates the crc on write and verifies it on read. The crc
covers the used bytes in the segment as indicated by the total_bytes
field.
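A sketch of the write-side calculation, assuming the crc field leads
the segment block header and crc32c is the checksum (both assumptions):
  static u32 calc_segment_crc(struct scoutfs_segment_block *sblk)
  {
          /* checksum only the used bytes, skipping the crc field */
          u32 len = le32_to_cpu(sblk->total_bytes) - sizeof(sblk->crc);

          return crc32c(~0, (char *)sblk + sizeof(sblk->crc), len);
  }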
Signed-off-by: Zach Brown <zab@versity.com>
The free_blocks counter in the super is meant to track the number of
total blocks in the primary free extent index. Callers of extent
manipulation were trying to keep it in sync with the extents.
Segment allocation was allocating extents manually using a cursor. It
forgot to update free_blocks. Segment freeing then freed the segment as
an extent which did update free_blocks. This created an
ever-accumulating free_blocks count which eventually grew greater than
the total block count and caused df to report negative usage.
This updates the free_blocks count in server extent io which is the only
place we update the extent items themselves. This ensures that we'll
keep the count in sync with the extent items. Callers don't have to
worry about it.
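Concretely, the counter is now adjusted right where the extent items
change, something like the following (names approximate):
  u64 len = le64_to_cpu(ext->len);

  /* inserting a free extent adds its blocks, removing one subtracts
   * them; callers never touch free_blocks directly */
  if (insert)
          le64_add_cpu(&super->free_blocks, len);
  else
          le64_add_cpu(&super->free_blocks, -len);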
Signed-off-by: Zach Brown <zab@versity.com>
We have macros for creating and printing trace arguments for our network
header struct. Add a macro for making simple printk call args for
normal formatted output callers.
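The idea, roughly (the macro and field names here are invented for
illustration):
  /* one macro for the format string and one for the matching args so
   * callers can write printk("nh "SNH_FMT"\n", SNH_ARG(nh)); */
  #define SNH_FMT "id %llu type %u data_len %u"
  #define SNH_ARG(nh) \
          le64_to_cpu((nh)->id), (nh)->type, le16_to_cpu((nh)->data_len)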
Signed-off-by: Zach Brown <zab@versity.com>
The previous commit added shared networking code and disabled the old
unused code. This removes all that unused client and server code that
was refactored to become the shared networking code.
Signed-off-by: Zach Brown <zab@versity.com>
The client and server networking code was a bit too rudimentary.
The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to. We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.
This refactors sending and receiving in both the client and server code
into shared networking code. It's built around a connection struct that
then holds the message state. Both peers on the connection can send
requests and send responses.
The existing code only retransmitted requests down newly established
connections. Requests could be processed twice.
This adds robust reliability guarantees. Requests are resent until
their response is received. Requests are only processed once by a given
peer, regardless of the connection's transport socket. Responses are
reliably resent until acknowledged.
This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal. A following commit will remove all
the unused code.
Signed-off-by: Zach Brown <zab@versity.com>
Freed file data extents are tracked in free extent items in each node.
They could only be re-used in the future for file data extent allocation
on that node. Allocations on other nodes or, critically, segment
allocation on the server could never see those free extents. With the
right allocation patterns, particularly allocating on node X and freeing
on node Y, all the free extents can build up on a node and starve other
allocations.
This adds a simple high water mark after which nodes start returning
free extents to the server. From there they can satisfy segment
allocations or be sent to other nodes for file data extent allocation.
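The policy amounts to a check like the following after freeing (the
threshold and function names are made up):
  /* once this node holds more free blocks than the high water mark,
   * send the excess back to the server instead of hoarding it */
  if (node_free_blocks > SCOUTFS_FREE_EXTENT_HI_WATER)
          ret = return_extents_to_server(sb, node_free_blocks -
                                         SCOUTFS_FREE_EXTENT_HI_WATER);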
Signed-off-by: Zach Brown <zab@versity.com>
Locks get a bast call from the dlm when a remote node is blocked waiting
for the mode of a lock to change. We'd set the mode that we need to
convert to and kick off lock work to make forward progress.
The bast calls can happen at any old time. If a call came in as we were
unlocking a lock we'd set its bast mode even though it was being
unlocked and would not need to be down converted.
Usually this bad mode would be fine because the lock was idle and would
just be freed after being unlocked.
But if someone was actively waiting for the lock it would get stuck in
an unlocked state. The bad bast mode would prevent it from being
upconverted, but the waiters would stop it from being freed.
We fix this by only setting the mode from the bast call if there is
really work to do. This avoids setting the bast for unlocked locks
which will let the lock state machine re-acquire them and make forward
progress on behalf of the waiters.
Signed-off-by: Zach Brown <zab@versity.com>
The code that works with the super block had drifted a bit. We still
had two super blocks from an old design and we weren't doing anything
with the crc.
Move to only using one super block at a fixed blkno and store and verify
its crc field by sharing code with the btree block checksumming.
Signed-off-by: Zach Brown <zab@versity.com>
The userspace trace event printing code has trouble with arguments that
refer to fields in entries. Add macros to make entries for all the
fields and use them as the formatted arguments.
Signed-off-by: Zach Brown <zab@versity.com>
The userspace trace event printing code has trouble with arguments that
refer to fields in entries. Add macros to make entries for all the
fields and use them as the formatted arguments.
Signed-off-by: Zach Brown <zab@versity.com>
The userspace trace event printing code has trouble with arguments that
refer to fields in entries. Add macros to make entries for all the
fields and use them as the formatted arguments.
We also remove the mapping of zone and type to strings. It's smaller to
print the values directly and gets rid of some silly code.
Signed-off-by: Zach Brown <zab@versity.com>
These trace events were all orphaned long ago by commits which removed
their callers but forgot to remove their definitions.
Signed-off-by: Zach Brown <zab@versity.com>
Our simple transaction machinery causes high commit latencies if we let
too much dirty file data accumulate.
Small files have a natural limit on the amount of dirty data because
they have more dirty items per dirty page. They fill up the single
segment sooner and kick off a commit which finds a relatively small
amount of dirty file data.
But large files can reference quite a lot of dirty data with a small
amount of extent items which don't fill up the transaction's segment.
During large streaming writes we can fill up memory with dirty file data
before filling a segment with mapping extent metadata. This can lead to
high commit latencies when memory is full of dirty file pages.
Regularly kicking off background writeback behind streaming write
positions reduces the amount of dirty data that commits will find and
have to write out.
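A sketch of the trigger, with a made-up interval and bookkeeping
field:
  /* after the write advances far enough past the last kick, start
   * async writeback on the region behind the current position */
  if (pos - si->last_writeback_pos >= STREAMING_WRITEBACK_BYTES) {
          filemap_fdatawrite_range(inode->i_mapping,
                                   si->last_writeback_pos, pos - 1);
          si->last_writeback_pos = pos;
  }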
Signed-off-by: Zach Brown <zab@versity.com>
The inode deletion path had bit rotted. Delete the ifdefs that were
stopping it from deleting all the items associated with an inode. There
can be a lot of xattr and data mapping items so we have them manage
their own transactions (data already did). The xattr deletion code was
trying to acquire a lock that the caller already held, so that locking
was deleted.
Then we accurately account for the small number of remaining items that
finally delete the inode.
Signed-off-by: Zach Brown <zab@versity.com>
Add an item count call that lets the caller give the exact item count
instead of basing it on the operation they're performing.
Signed-off-by: Zach Brown <zab@versity.com>
The server send_reply interface is confusing: its errors are used to
shut down the connection. Communicating -ENOSPC to clients needs to
happen in the message reply payload instead.
The segno allocation server processing needs to set the segno to 0 so
that the client gets it and translates that into -ENOSPC.
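On the client side that turns into a check along these lines (field
names approximate):
  /* a zero segno in the reply payload is the in-band signal for a
   * full device rather than a connection-fatal error */
  segno = le64_to_cpu(reply->segno);
  if (segno == 0)
          return -ENOSPC;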
Signed-off-by: Zach Brown <zab@versity.com>
We've had bugs in allocators that return success and crazy block
numbers. The bad block numbers eventually make their way down to the
context-free kernel warning that IO was attempted outside the device.
This at least gives us a stack trace to help find where it's coming
from.
Signed-off-by: Zach Brown <zab@versity.com>
The previous fallocate and get_block allocators only looked for free
extents larger than the requested allocation size. This prematurely
returns -ENOSPC if a very large allocation is attempted. Some xfstests
stress low free space situations by fallocating almost all the free
space in the volume.
This adds an allocation helper function that finds the biggest free
extent to satisfy an allocation, possibly after trying to get more free
extents from the server. It looks for previous extents in the index of
extents by length. This builds on the previously added item and extent
_prev operations.
Allocators need to then know the size of the allocation they got instead
of assuming they got what they asked for. The server can also return a
smaller extent so it needs to communicate the extent length, not just
its start.
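The helper's shape, roughly (function and field names are
approximate):
  /* search backwards through the by-length index for the largest free
   * extent, asking the server for more if nothing at all is found */
  ret = find_largest_free_extent(sb, &found);
  if (ret == -ENOENT) {
          ret = request_server_extents(sb);
          if (ret == 0)
                  ret = find_largest_free_extent(sb, &found);
  }
  /* callers use found.len, which may be less than they asked for */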
Signed-off-by: Zach Brown <zab@versity.com>
Add an extent function for iterating backwards through extents. We add
the wrapper and have the extent IO functions call their storage _prev
functions. Data extent IO can now call the new scoutfs_item_prev().
Signed-off-by: Zach Brown <zab@versity.com>
Add scoutfs_item_prev() for searching for an item before a given key.
This wasn't initially implemented because it's rarely needed and for a
long time the segment reading and item cache populating code had a
strong bias for iterating forward from the given search key.
Since we've limited item cache reading to the keys covered by locks
and now read in entire segments, it's very easy to iterate backwards
through keys just like scoutfs_item_next() iterates forwards.
The only remaining forward iteration bias was in check_range(). It had
to give callers the start of the cached range that it found.
Signed-off-by: Zach Brown <zab@versity.com>
The addition of fallocate() now means that offline extents can be
unwritten and allocated and that extents can now be found outside of
i_size.
Truncating needs to know about the possible flag combinations, writing
preallocation needs to know to update an existing extent or allocate up
to the next extent, get_block can't map unwritten extents for read,
extent conversion needs to also clear offline, and truncate needs to
drop extents outside i_size even if truncating to the existing file
size.
Signed-off-by: Zach Brown <zab@versity.com>
Add an fallocate operation.
This changes the possible combinations of flags in extents and makes it
possible to create extents beyond i_size. This will confuse the rest of
the code in a few places and that will be fixed up next.
Signed-off-by: Zach Brown <zab@versity.com>
The release ioctl forgot to update the inode item after truncating
online block mappings. This meant that the offline block count update
was lost when the inode was evicted and re-read, leading to inconsistent
offline block counts.
Signed-off-by: Zach Brown <zab@versity.com>
There was a typo in the addition of i_blocks tracking that would set
online blocks to the value of offline blocks when reading an existing
inode into memory.
Signed-off-by: Zach Brown <zab@versity.com>
Deleting an item from the item cache would always write a deletion
item to segments. We need to write deletion items so that compaction can
eventually combine them with the existing item and remove both. We
don't need them for items that were only created in the current
transaction. Writing a deletion item for them only results in a lot of
extra work compacting the item down to the final segment level so that
it can be removed.
The upcoming extent code really demonstrated the cost of this overhead.
It happens to create and delete quite a lot of temporary extent items
during the transaction as all the different kinds of indexed extents
change.
This change tracks whether a given item in the cache reflects an item
that is present in persistent storage. This lets us simply free deleted
items that have only ever existed in the current transaction instead of
writing deletion items for them.
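The deletion path then boils down to a check like this (the flag and
helper names are approximate):
  /* only items that also exist in written segments need a deletion
   * item; items created in this transaction can simply be freed */
  if (item->persistent)
          mark_item_deleted(cac, item);
  else
          free_item(cac, item);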
This made a meaningful difference when writing a 4MB file with the
current block mapping items, but it made an enormous difference when
writing that same file with the extent items. It went from writing 1024
deletion items for 11 real items to only writing those real items.
                          items   deletions
  block mappings before:     25           5
  block mappings after:      25           0
  extents before:            11        1024
  extents after:             11           0
Signed-off-by: Zach Brown <zab@versity.com>
The btree writes its blocks to a fixed ring of preallocated blocks. We
added a trigger to force the index to advance to the next half of the
ring to test conditions where the cached btree blocks are out of date
with respect to the blocks on disk.
We have to be careful to only advance the index once all the live blocks
are migrated out of the half that we're about to advance to. The
trigger tested that condition.
But it missed the case where the normal btree block allocation had
*just* advanced into the next half of the ring. In this case the
migration needs to occur
to make it safe to advance *again* to the previous half. But it missed
this case because the migration keys are reset after we test the
trigger.
This resulted in leaving live btree blocks in the half that we advance
to and start overwriting. The server got -ESTALE as it tried to read
through blocks that had been overwritten and hilarity ensued.
This precise condition of having the trigger fire just as we wrapped was
amazingly caught by scoutfs/505 in xfstests.
Signed-off-by: Zach Brown <zab@versity.com>
We limit the number of lower segments that a compaction will read. A
sticky compaction happens when the upper segment overlaps more lower
segments. The remaining items in the upper segment are written back to
the upper level -- they're stuck. A future compaction will attempt to
compact the remaining items with the next set of overlapping lower
segments.
Deletion items are rightly discarded as they're compacted to the lowest
level -- at that point they have no more matching items in lower
segments to destroy and are done.
Deletion items were being dropped instead of being written back into the
upper level of a sticky compaction. The test for discarding the
deletion items only considered the lowest level of the compaction, not
the level that the items were being written to. We need to be careful
to preserve the deletion items in the case of compaction to the lowest
level writing sticky items back to the upper segment.
Signed-off-by: Zach Brown <zab@versity.com>
Compaction has to find the oldest level 0 segment for compaction. It
iterates over the level 0 segments by their manifest entry's btree key.
It was incorrectly incrementing the btree search key. It was
incrementing the first key stored in the entry, but that's not the least
significant field. The seq is the least significant field so this
iteration could skip over segments written at different times with the
same first key.
The fix to have it visit all the entries is to increment the least
significant seq field.
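The iteration fix is essentially the following (the key field names
are approximate):
  /* advance the least significant field so entries that share the
   * same first key but have different seqs aren't skipped */
  le64_add_cpu(&key.seq, 1);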
Right now we have a single level 0 segment so this code doesn't
actually matter yet.
Signed-off-by: Zach Brown <zab@versity.com>
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata. The more reasonable
strategy is to warn about the corruption, act accordingly, and leave it
to corrective measures to resolve. In this case we continue returning
the error that caused us to try to clean up.
Signed-off-by: Zach Brown <zab@versity.com>
This is no longer used now that we allocate large extents for
concurrently extending files by preallocating unwritten extents.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have extents we can address the fragmentation of concurrent
writes with large preallocated unwritten extents instead of trying to
allocate from disjoint free space with cursors.
First we add support for unwritten extents. Truncate needs to make sure
it doesn't treat truncated unwritten blocks as online just because
they're not offline. If we try to write into them we convert them to
written extents. And fiemap needs to flag them as unwritten and be sure
to check for extents past i_size.
Then we allocate unwritten extents only when we're contiguously
extending a file. We try to preallocate an extent the size of the file,
capped at a meg.
This ends up with a power of two progression of preallocation sizes,
which nicely balances extent sizes and wasted allocation as file sizes
increase.
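The sizing rule amounts to something like this (the block shift and
size variable are illustrative):
  /* preallocate as many blocks as the file already contains, capped
   * at a megabyte's worth, giving the power of two progression */
  prealloc = min_t(u64, size_blocks, SZ_1M >> SCOUTFS_BLOCK_SHIFT);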
We need to be careful to truncate the preallocated regions if the entire
file is released. We take that as an indication that the user doesn't
want the file consuming any more space.
This removes most of the use of the cursor code. It will be completely
removed in a further patch.
Signed-off-by: Zach Brown <zab@versity.com>
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.
We add a client request to allocate an extent of a given length. The
existing segment alloc and free now work with a segment's worth of
blocks.
The server maintains counters in the super block of free blocks instead
of free segments. We maintain an allocation cursor so that allocation
results tend to cycle through the device. It's stored in the super so
that it is maintained across server instances.
This doesn't remove the now-unused dead code, to keep the commit from
getting too noisy. It'll be removed in a future commit.
Signed-off-by: Zach Brown <zab@versity.com>