Use the mount's generated random id in persistent items and the lock
that protects them instead of the assigned node_id.
Signed-off-by: Zach Brown <zab@versity.com>
Add a .indx. xattr tag which adds the inode to an index of inodes keyed
by the hash of xattr names. An ioctl is added which then returns all
the inodes which may contain an xattr of the given name. Dropping all
xattrs now has to parse the name to find out if it also has to delete an
index item.
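
As a rough illustration of the extra parsing that xattr deletion now
does, a minimal sketch follows; the exact tag prefix and helper name are
assumptions, the real tag format lives in the kernel's xattr code.

    #include <stdbool.h>
    #include <string.h>

    /*
     * Sketch: does this xattr name carry the index tag?  If so,
     * dropping the xattr also has to delete its index item.  The
     * "scoutfs.indx." prefix here is only a stand-in for the real
     * tag format.
     */
    static bool xattr_name_is_indexed(const char *name)
    {
            static const char prefix[] = "scoutfs.indx.";

            return strncmp(name, prefix, sizeof(prefix) - 1) == 0;
    }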
Signed-off-by: Zach Brown <zab@versity.com>
An elected leader writes a quorum block showing that it's elected before
it assumes exclusive access to the device and starts bringing up the
server. This lets another later elected leader find and fence it if
something happens.
Other mounts were trying to connect to the server once this elected
quorum block was written and before the server was listening. They'd
get connection refused, decide to elect a new leader, and try to fence
the server that's still running.
Now, they should have tried much harder to connect to the elected leader
instead of taking a single failed attempt as fatal. But that's a
problem for another day that involves more work in balancing timeouts
and retries.
But mounts should not have tried to connect to the server until it's
listening. That's easy to signal by adding a simple listening flag to
the quorum block. Now mounts will only try to connect once they see the
listening flag and don't see these racy refused connections.
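
A small sketch of the connect gate this adds, assuming hypothetical
names for the quorum block's flags field and the listening bit; the
on-disk definitions are in the format header.

    #include <stdbool.h>
    #include <stdint.h>

    /* simplified view of the fields a mount reads from a quorum block */
    struct quorum_block_view {
            uint64_t elected_nr;
            uint64_t flags;
    };

    #define QUORUM_FLAG_LISTENING (1ULL << 0)       /* assumed bit */

    /* only try to connect once the elected leader says it's listening */
    static bool should_try_connect(const struct quorum_block_view *blk)
    {
            return (blk->flags & QUORUM_FLAG_LISTENING) != 0;
    }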
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
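
The shape of that lock, check, unlock, and wait cycle looks roughly like
the following sketch; the structures and helpers are placeholders, not
the kernel's real API.

    #include <stdint.h>

    struct inode_ref;       /* stands in for the in-memory inode */

    /* placeholder helpers for the real cluster lock and waiting calls */
    void cluster_lock(struct inode_ref *inode);
    void cluster_unlock(struct inode_ref *inode);
    int extents_offline(struct inode_ref *inode, uint64_t start, uint64_t len);
    int wait_for_stage(struct inode_ref *inode, uint64_t start, uint64_t len);

    /*
     * Check for offline extents while holding the lock, but drop the
     * lock before blocking so staging can make progress, then check
     * again after waking.
     */
    static int wait_until_online(struct inode_ref *inode, uint64_t start,
                                 uint64_t len)
    {
            int ret;

            for (;;) {
                    cluster_lock(inode);
                    ret = extents_offline(inode, start, len);
                    cluster_unlock(inode);

                    if (ret <= 0)           /* online, or an error */
                            return ret;

                    ret = wait_for_stage(inode, start, len);
                    if (ret)                /* e.g. interrupted */
                            return ret;
            }
    }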
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl, it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
Signed-off-by: Zach Brown <zab@versity.com>
It was a mistake to use a non-zero elected_nr as the indication that a
slot is considered actively elected. Zeroing it as the server shuts
down wipes the elected_nr and means that it doesn't advance as each
server is elected. This then causes a client connecting to a new server
to be confused for a client reconnecting to a server after the server
has timed it out and destroyed its state. This caused reconnection
after shutting down a server to fail and clients to loop reconnecting
indefinitely.
This instead adds flags to the quorum block and assigns a flag to
indicate that the slot should be considered active. It's cleared by
fencing and by the client as the server shuts down.
Signed-off-by: Zach Brown <zab@versity.com>
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount. We can't
let unmounting clients leave the remaining mounted clients without
quorum.
The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests. It only sends responses to voting
mounts while quorum remains or once all the voting clients are trying
to unmount.
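
The server's decision about whether a farewell can be answered reduces
to roughly the following check; the counts and their names are
illustrative, and the treatment of non-voting mounts is an assumption.

    #include <stdbool.h>

    /*
     * Can we answer this farewell?  Non-voting mounts are assumed to
     * always get an answer; voting mounts are only answered while a
     * majority of voters would still remain, or once every voter is
     * trying to unmount.
     */
    static bool can_answer_farewell(bool sender_votes, int nr_voters,
                                    int voters_remaining_after,
                                    int voters_unmounting)
    {
            if (!sender_votes)
                    return true;

            if (voters_unmounting == nr_voters)
                    return true;

            return voters_remaining_after > nr_voters / 2;
    }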
We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to reestablish quorum.
The commit introduces and maintains the unmount_barrier field in the
quorum blocks. It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.
The commit then has the clients send their unique name to the server
who stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.
Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shutdown and re-established. This also makes it easier to
make global decisions based on the count of pending farewell requests.
Signed-off-by: Zach Brown <zab@versity.com>
Currently the server tracks the outstanding transaction sequence numbers
that clients have open in a simple list in memory. It's not properly
cleaned up if a client unmounts, and a new server that takes over
after a crash won't know about open transaction sequence numbers.
This stores open transaction sequence numbers in a shared persistent
btree instead of in memory. It removes tracking for clients as they
send their farewell during unmount. A new server that starts up will
see existing entries for clients that were created by old servers.
This fixes a bug where a client who unmounts could leave behind a
pending sequence number that would never be cleaned up and would
indefinitely limit the visibility of index items that came after it.
Signed-off-by: Zach Brown <zab@versity.com>
Generate unique trace events on the send and recv side of each message
sent between nodes. This can be used to reasonably efficiently
synchronize the monotonic clock in trace events between nodes given only
their captured trace events.
Signed-off-by: Zach Brown <zab@versity.com>
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO. As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.
This implements lock recovery by having the lock service recover locks
from clients as it starts up.
First the lock service stores records of connected clients in a btree
off the super block. Records are added as the server receives their
greeting and are removed as the server receives their farewell.
Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.
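
Sketch of the startup check described above, using placeholder names for
the persistent record lookup and the recovery machinery.

    #include <stdbool.h>

    struct lock_server;     /* placeholder for the server's lock state */

    /* placeholders for the real btree and recovery calls */
    bool have_client_records(struct lock_server *srv);
    void enter_recovery(struct lock_server *srv);
    int wait_for_clients_to_reconnect(struct lock_server *srv);
    void resume_normal_processing(struct lock_server *srv);

    /* if old clients left records behind, recover their locks first */
    static int lock_server_startup(struct lock_server *srv)
    {
            int ret = 0;

            if (have_client_records(srv)) {
                    enter_recovery(srv);
                    ret = wait_for_clients_to_reconnect(srv);
            }
            if (!ret)
                    resume_normal_processing(srv);

            return ret;
    }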
We add lock recovery request and response messages that are used to
communicate locks from the clients to the server.
Signed-off-by: Zach Brown <zab@versity.com>
The current networking code has loose reliability guarantees. If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection. The client resends
requests but no responses are resent. A client's requests could be
processed twice on the same server. The server throws away disconnected
client state.
This was fine, sort of, for the simple requests we had implemented so
far. It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.
This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.
The server keeps track of disconnected clients and restores state if the
same client reconnects. This required some work around the greetings so
that clients and servers can recognize each other. Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.
Now that connections between the client and server are preserved we can
resend responses across reconnection. We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
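
A simplified sketch of how sequence numbers can drop duplicates and free
acknowledged responses; the field names are assumptions and the real
bookkeeping is per connection in the net layer.

    #include <stdbool.h>
    #include <stdint.h>

    /* simplified per-peer sequence state */
    struct msg_seq_state {
            uint64_t next_send_seq;     /* stamped on each outgoing message */
            uint64_t last_recv_seq;     /* greatest sequence processed      */
    };

    /* drop messages that were already processed before the reconnect */
    static bool should_process(struct msg_seq_state *st, uint64_t seq)
    {
            if (seq <= st->last_recv_seq)
                    return false;

            st->last_recv_seq = seq;
            return true;
    }

    /* the receiver's last_recv_seq is echoed back so the sender can
     * free saved responses up to and including that sequence */
    static bool can_free_response(uint64_t response_seq, uint64_t acked_seq)
    {
            return response_seq <= acked_seq;
    }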
When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.
This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.
Signed-off-by: Zach Brown <zab@versity.com>
The super block had a magic value that was used to identify that the
block should contain our data structure. But it was called an 'id'
which was confused with the header fsid in the past. Also, the btree
blocks aren't using a similar magic value at all.
This moves the magic value into the header and creates values for the
super block and btree blocks. Both are written but the btree block
reads don't check the value.
Signed-off-by: Zach Brown <zab@versity.com>
Currently all mounts try to get a dlm lock which gives them exclusive
access to become the server for the filesystem. That isn't going to
work if we're moving to locking provided by the server.
This uses quorum election to determine who should run the server. We
switch from long running server work blocked trying to get a lock to
calls which start and stop the server.
Signed-off-by: Zach Brown <zab@versity.com>
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.
The client code gets some shims to send and receive lock messages to and
from the server. Callers use our lock mode constants instead of the
DLM's.
Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.
The biggest change is in the client lock state machine. Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing. We don't have everything
come through a per-lock work queue. Instead we send requests either
from the blocking lock caller or from a shrink work queue. Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.
The different processing contexts lead to a slightly different lock
life cycle. We refactor and separate allocation and freeing from
tracking and removing locks in data structures. We add a _get and _put
to track active use of locks, and then async references to locks by
holders and requests are tracked separately.
Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time. We do have to do a bit of work to make sure we process back to
back grant responses and invalidation requests from the server.
As of this change the lock setup and destruction paths are a little
wobbly. They'll be shored up as we add lock recovery between the client
and server.
Signed-off-by: Zach Brown <zab@versity.com>
Add the core lock server code for providing a lock service from our
server. The lock messages are wired up but nothing calls them.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quorum election implementation. The mounts that can participate
in the election are specified in a quorum config array in the super
block. Each configured participant is assigned a preallocated block
that it can write to.
All mounts read the quorum blocks to find the member who was elected the
leader and should be running the server. The voting mounts loop reading
voting blocks and writing their vote block until someone is elected with
a majority.
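
The majority test at the heart of the voting loop is sketched below; the
block fields and counts are simplified stand-ins for the real quorum
format.

    #include <stdbool.h>
    #include <stdint.h>

    /* simplified view of one participant's vote */
    struct vote_view {
            uint64_t term;          /* voting round */
            int      voted_for;     /* quorum slot the vote is for */
    };

    /* has 'slot' collected votes from more than half the voters? */
    static bool elected_by_majority(const struct vote_view *votes,
                                    int nr_voters, uint64_t term, int slot)
    {
            int count = 0;
            int i;

            for (i = 0; i < nr_voters; i++) {
                    if (votes[i].term == term && votes[i].voted_for == slot)
                            count++;
            }

            return count > nr_voters / 2;
    }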
Nothing calls this code yet, this adds the initial implementation and
format.
Signed-off-by: Zach Brown <zab@versity.com>
We had scattered some base types throughout the format file which made
them annoying to reference in higher level structs. Let's put them at
the top so we can use them without declarations or moving things around
in unrelated commits.
Signed-off-by: Zach Brown <zab@versity.com>
Each mount is given a specified unique name. It can be used to
identify a reconnecting mount, which indicates that an old instance with
the same unique name can no longer exist and doesn't need to be fenced.
Signed-off-by: Zach Brown <zab@versity.com>
Currently compaction is only performed by one thread running in the
server. Total metadata throughput of the system is limited by only
having one compaction operation in flight at a time.
This refactors the compaction code to have the server send compaction
requests to clients who then perform the compaction and send responses
to the server. This spreads compaction load out amongst all the clients
and greatly increases total compaction throughput.
The manifest keeps track of compactions that are in flight at a given
level so that we maintain segment count invariants with multiple
compactions in flight. It also uses the sparse bitmap to lock down
segments that are being used as inputs to avoid duplicating items across
two concurrent compactions.
A server thread still coordinates which segments are compacted. The
search for a candidate compaction operation is largely unchanged. It
now has to deal with being unable to process a compaction because its
segments are busy. We add some logic to keep searching in a level until
we find a compaction that doesn't intersect with current compaction
requests. If there are none at the level we move up to the next level.
The server will only issue a given number of compaction requests to a
client at a time. When it needs to send a compaction request it rotates
through the current clients until it finds one that doesn't have the max
in flight.
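
The rotation through clients can be pictured with the following sketch;
the per-client limit and structure are placeholders for the server's
real accounting.

    /* assumed per-client limit on outstanding compaction requests */
    #define MAX_COMPACTIONS_IN_FLIGHT 2

    struct compact_client {
            int in_flight;          /* requests sent, responses pending */
    };

    /*
     * Rotate through the connected clients, starting after the last
     * one used, and return the index of the first with room for
     * another request, or -1 if they're all at the limit.
     */
    static int next_compaction_client(struct compact_client *clients,
                                      int nr_clients, int last_used)
    {
            int i;

            for (i = 1; i <= nr_clients; i++) {
                    int idx = (last_used + i) % nr_clients;

                    if (clients[idx].in_flight < MAX_COMPACTIONS_IN_FLIGHT)
                            return idx;
            }

            return -1;
    }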
If a client disconnects the server forgets the compactions it had sent
to that client. If those compactions still need to be processed they'll
be sent to the next client.
The segnos that are allocated for compaction are not reclaimed if a
client disconnects or the server crashes. This is a known deficiency
that will be addressed with the broader work to add crash recovery to
the multiple points in the protocol where the server and client trade
ownership of persistent state.
The server needs to block as it does work for compaction in the
notify_up and response callbacks. We move them out from under spin
locks.
The server needs to clean up allocated segnos for a compaction request
that fails. We let the client send a data payload along with an error
response so that it can give the server the id of the compaction that
failed.
Signed-off-by: Zach Brown <zab@versity.com>
It was a bit of an overreach to try and limit duplicate request
processing in the network layer. It introduced acks and the necessity
to resync last_processed_id on reconnect.
In testing compaction requests we saw that request processing stopped if
a client reconnected to a new server. The new server sent low request
ids which the client dropped because they were lower than the ids it got
from the last server. To fix this we'd need to add smarts to reset
ids when connecting to new servers but not existing servers.
In thinking about this, though, there's a bigger problem. Duplicate
request processing protection only works up in memory in the networking
connections. If the server makes persistent changes, then crashes, the
client will resend the request to the new server. It will need to
discover that the persistent changes have already been made.
So while we protected duplicate network request processing between nodes
that reconnected, we didn't protect duplicate persistent side-effects
of request processing when reconnecting to a new server. Once you see
that the request implementations have to take this into account then
duplicate request delivery becomes a simpler instance of this same case
and will be taken care of already. There's no need to implement the
complexity of protecting duplicate delivery between running nodes.
This removes the last_processed_id on the server. It removes resending
of responses and acks. Now that ids can be processed out of order we
remove the special known ID of greeting commands. They can be processed
as usual. When there's only request and response packets we can
differentiate them with a flag instead of a u8 message type.
Signed-off-by: Zach Brown <zab@versity.com>
Keys used to be variable length so the manifest struct on the wire ended
in key payloads. The keys are now fixed size so that field is no longer
necessary or used. It's an artifact that should have been removed when
the keys were made fixed length.
Signed-off-by: Zach Brown <zab@versity.com>
Today node_ids are randomly assigned. This adds the risk of failure
from random number generation and still allows for the risk of
collisions.
Switch to assigning strictly advancing node_ids on the server during the
initial connection greeting message exchange. This simplifies the
system and allows us to derive information from the relative values of
node_ids in the system.
To do this we refactor the greeting code from being internal to the net
layer into proper client and server request and response processing.
This lets
the server manage persistent node_id storage and allows the client to
wait for a node_id during mount.
Now that net_connect is sync in the client we don't need the notify_up
callback anymore. The client can perform those duties when the connect
returns.
The net code still has to snoop on request and response processing to
see when the greetings have been exchanged and allow messages to flow.
Signed-off-by: Zach Brown <zab@versity.com>
We had a crc field in the segment header but weren't using it.
This calculates the crc on write and verifies it on read. The crc
covers the used bytes in the segment as indicated by the total_bytes
field.
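
Roughly, the checksum covers the used bytes after the crc field itself,
as sketched here with placeholder field names and a declared-only
crc32c helper.

    #include <stddef.h>
    #include <stdint.h>

    /* simplified segment header, the real field layout is in the format */
    struct seg_header {
            uint32_t crc;
            uint32_t _pad;
            uint64_t total_bytes;   /* used bytes in the segment */
    };

    uint32_t crc32c(uint32_t seed, const void *data, size_t len);   /* placeholder */

    /* crc everything after the crc field, up to total_bytes */
    static uint32_t calc_segment_crc(const struct seg_header *hdr)
    {
            size_t skip = offsetof(struct seg_header, _pad);

            return crc32c(~0U, (const char *)hdr + skip,
                          hdr->total_bytes - skip);
    }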
Signed-off-by: Zach Brown <zab@versity.com>
The client and server networking code was a bit too rudimentary.
The existing code only had support for the client synchronously and
actively sending requests that the server could only passively respond
to. We're going to need the server to be able to send requests to
connected clients and it can't block waiting for responses from each
one.
This refactors sending and receiving in both the client and server code
into shared networking code. It's built around a connection struct that
then holds the message state. Both peers on the connection can send
requests and send responses.
The existing code only retransmitted requests down newly established
connections. Requests could be processed twice.
This adds robust reliability guarantees. Requests are resent until
their response is received. Requests are only processed once by a given
peer, regardless of the connection's transport socket. Responses are
reliably resent until acknowledged.
This only adds the new refactored code and disables the old unused code
to keep the diff footprint minimal. A following commit will remove all
the unused code.
Signed-off-by: Zach Brown <zab@versity.com>
Freed file data extents are tracked in free extent items in each node.
They could only be re-used in the future for file data extent allocation
on that node. Allocations on other nodes or, critically, segment
allocation on the server could never see those free extents. With the
right allocation patterns, particularly allocating on node X and freeing
on node Y, all the free extents can build up on a node and starve other
allocations.
This adds a simple high water mark after which nodes start returning
free extents to the server. From there they can satisfy segment
allocations or be sent to other nodes for file data extent allocation.
Signed-off-by: Zach Brown <zab@versity.com>
The code that works with the super block had drifted a bit. We still
had two super blocks from an old design and we weren't doing anything
with the crc.
Move to only using one super block at a fixed blkno and store and verify
its crc field by sharing code with the btree block checksumming.
Signed-off-by: Zach Brown <zab@versity.com>
The previous fallocate and get_block allocators only looked for free
extents larger than the requested allocation size. This prematurely
returns -ENOSPC if a very large allocation is attempted. Some xfstests
stress low free space situations by fallocating almost all the free
space in the volume.
This adds an allocation helper function that finds the biggest free
extent to satisfy an allocation, possibly after trying to get more free
extents from the server. It looks for previous extents in the index of
extents by length. This builds on the previously added item and extent
_prev operations.
Allocators need to then know the size of the allocation they got instead
of assuming they got what they asked for. The server can also return a
smaller extent so it needs to communicate the extent length, not just
its start.
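
The helper's fallback behaviour amounts to something like this sketch;
the extent index call and the server request are placeholders for the
real item and net code.

    #include <errno.h>
    #include <stdint.h>

    struct alloc_ctx;       /* placeholder allocation context */

    /* placeholders: return 0 and fill start/got, -ENOENT if no free
     * extent was found, or another negative errno on failure */
    int find_biggest_extent(struct alloc_ctx *ctx, uint64_t *start,
                            uint64_t *got);
    int request_extents_from_server(struct alloc_ctx *ctx);

    /*
     * Find the biggest free extent for an allocation of len blocks,
     * possibly after asking the server for more free extents, rather
     * than returning -ENOSPC just because no single extent is large
     * enough.  Callers must use the returned *got, which may be
     * smaller than len.
     */
    static int alloc_data_extent(struct alloc_ctx *ctx, uint64_t len,
                                 uint64_t *start, uint64_t *got)
    {
            int ret;

            ret = find_biggest_extent(ctx, start, got);
            if (ret == 0 && *got >= len)
                    return 0;
            if (ret < 0 && ret != -ENOENT)
                    return ret;

            ret = request_extents_from_server(ctx);
            if (ret < 0)
                    return ret;

            return find_biggest_extent(ctx, start, got);
    }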
Signed-off-by: Zach Brown <zab@versity.com>
Add an fallocate operation.
This changes the possible combinations of flags in extents and makes it
possible to create extents beyond i_size. This will confuse the rest of
the code in a few places and that will be fixed up next.
Signed-off-by: Zach Brown <zab@versity.com>
The extent code was originally written to panic if it hit errors during
cleanup that resulted in inconsistent metadata. The more reasonable
strategy is to warn about the corruption, act accordingly, and leave
it to corrective measures to resolve the corruption. In this case we
continue returning the error that caused us to try and clean up.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have extents we can address the fragmentation of concurrent
writes with large preallocated unwritten extents instead of trying to
allocate from disjoint free space with cursors.
First we add support for unwritten extents. Truncate needs to make sure
it doesn't treat truncated unwritten blocks as online just because
they're not offline. If we try to write into them we convert them to
written extents. And fiemap needs to flag them as unwritten and be sure
to check for extents past i_size.
Then we allocate unwritten extents only if we're extending a contiguous
file. We try to preallocate the size of the file and cap it to a meg.
This ends up with a power of two progression of preallocation sizes,
which nicely balances extent sizes and wasted allocation as file sizes
increase.
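
The preallocation size choice can be sketched as follows; the one
megabyte cap comes from the text above, while the names, block size, and
exact rounding are illustrative only.

    #include <stdint.h>

    #define BLOCK_SHIFT 12                                      /* assumed 4KB blocks */
    #define PREALLOC_CAP_BLOCKS ((1024 * 1024) >> BLOCK_SHIFT)  /* cap at a meg */

    static inline uint64_t min_u64(uint64_t a, uint64_t b)
    {
            return a < b ? a : b;
    }

    /*
     * When extending a contiguous file at logical block 'iblock',
     * preallocate roughly the file's current size, capped at a meg.
     * Doubling the file each time gives the power of two progression
     * of preallocation sizes.
     */
    static uint64_t prealloc_blocks(uint64_t iblock)
    {
            return min_u64(iblock ? iblock : 1, PREALLOC_CAP_BLOCKS);
    }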
We need to be careful to truncate the preallocated regions if the entire
file is released. We take that as an indication that the user doesn't
want the file consuming any more space.
This removes most of the use of the cursor code. It will be completely
removed in a further patch.
Signed-off-by: Zach Brown <zab@versity.com>
Have the server use the extent core to maintain free extent items in the
allocation btree instead of the bitmap items.
We add a client request to allocate an extent of a given length. The
existing segment alloc and free now work with a segment's worth of
blocks.
The server maintains counters in the super block of free blocks instead
of free segments. We maintain an allocation cursor so that allocation
results tend to cycle through the device. It's stored in the super so
that it is maintained across server instances.
This doesn't remove unused dead code to keep the commit from getting too
noisy. It'll be removed in a future commit.
Signed-off-by: Zach Brown <zab@versity.com>
Store file data mappings and free block ranges in extents instead of in
block mapping items and bitmaps.
This adds the new functionality and refactors the functions that use it.
The old functions are no longer called, but we only ifdef them out
to keep the change small. We'll remove all the dead code in a future
change.
Signed-off-by: Zach Brown <zab@versity.com>
Add a file of extent functions that callers will use to manipulate and
store extents in different persistent formats.
Signed-off-by: Zach Brown <zab@versity.com>
Add functions that atomically change and query the online and offline
block counts as a pair. They're semantically linked and we shouldn't
present counts that don't match if they're in the process of being
updated.
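
A minimal sketch of keeping the pair coherent, using a plain mutex as a
stand-in for whatever lock the inode actually uses.

    #include <pthread.h>
    #include <stdint.h>

    /* simplified per-inode counts, protected by a single lock so the
     * pair can never be seen half-updated */
    struct block_counts {
            pthread_mutex_t lock;
            int64_t online;
            int64_t offline;
    };

    static void add_onoff(struct block_counts *c, int64_t on_delta,
                          int64_t off_delta)
    {
            pthread_mutex_lock(&c->lock);
            c->online += on_delta;
            c->offline += off_delta;
            pthread_mutex_unlock(&c->lock);
    }

    static void get_onoff(struct block_counts *c, int64_t *online,
                          int64_t *offline)
    {
            pthread_mutex_lock(&c->lock);
            *online = c->online;
            *offline = c->offline;
            pthread_mutex_unlock(&c->lock);
    }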
Signed-off-by: Zach Brown <zab@versity.com>
Add the max possible logical block / physical blkno number given u64
bytes recorded at block size granularity.
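
In other words, the largest block number that a u64 byte count can still
describe, as a quick sketch; the constant name and 4KB block size are
assumptions.

    #include <stdint.h>

    #define BLOCK_SHIFT 12      /* assumed 4KB blocks */

    /* the max possible logical block / physical blkno given u64 bytes
     * recorded at block size granularity */
    #define BLOCK_MAX (UINT64_MAX >> BLOCK_SHIFT)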
Signed-off-by: Zach Brown <zab@versity.com>
Add a tunable option to force using tiny btree blocks on an active
mount. This lets us quickly exercise large btrees.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we're using small file system keys we can dramatically shrink
the maximum allowed btree keys and values. This more accurately matches
the current users and lets us fit more possible items in each block,
which lets us turn the block size way down and still have multiple worst
case largest items per block.
Signed-off-by: Zach Brown <zab@versity.com>
Variable length keys lead to having a key struct point to the buffer
that contains the key. With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.
We no longer have a separate generic key buf struct that points to
specific per-type key storage. All items use the key struct and fill
out the appropriate fields. All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.
Each key user now has an init function that fills out its fields. It
looks a lot like the old pattern but we no longer have separate key
storage that
the buf points to.
A bunch of code now takes the address of static key storage instead of
managing allocated keys. Conversely, swapping now uses the full keys
instead of pointers to the keys.
We don't need all the functions that worked on the generic key buf
struct because they had different lengths. Copy, clone, length init,
memcpy, all of that goes away.
The item API had some functions that tested the length of keys and
values. The key length tests vanish, and that gets rid of the _same()
call. The _same_min() call only had one user who didn't also test for
the value length being too large. Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.
We no longer have to track the number of key bytes when calculating if
an item population will fit in segments. This removes the key length
from reservations, transactions, and segment writing.
The item cache key querying ioctls no longer have to deal with variable
length keys. They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.
The segment no longer has to store the key length. It stores the key
struct in the item header.
The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct. The SK_ wrappers
that bracketed calls to use preempt safe per cpu buffers can turn back
into their normal calls.
Manifest entries are now a fixed size. We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq. They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap. This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.
Signed-off-by: Zach Brown <zab@versity.com>
Directory entries were the last items that had large variable length
keys because they stored the entry name in the key. We'd like to have
small fixed size keys so let's store dirents with small keys.
Entries for lookup are stored at the hash of the name instead of the
full name. The key also contains the unique readdir pos so that we
don't have to deal with collision on creation. The lookup procedure now
does need to iterate over all the readdir positions for the hash value
and compare the names.
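
The lookup walk described above has roughly this shape; the key fields
and item calls are placeholders for the real dirent item code.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* simplified dirent value: header fields then the full name */
    struct dirent_val {
            uint64_t ino;
            uint16_t name_len;
            char     name[];
    };

    struct dir_ref;     /* placeholder for the directory inode */

    /* placeholder: return the next dirent item at (hash, >= *pos), or
     * false when there are no more entries for that hash */
    bool next_dirent_at_hash(struct dir_ref *dir, uint64_t hash,
                             uint64_t *pos, struct dirent_val **dent);

    /*
     * Entries are keyed by the hash of the name plus a unique readdir
     * pos, so lookup iterates the positions for the hash and compares
     * the names stored in the values.
     */
    static uint64_t lookup_ino(struct dir_ref *dir, uint64_t hash,
                               const char *name, size_t name_len)
    {
            struct dirent_val *dent;
            uint64_t pos = 0;

            while (next_dirent_at_hash(dir, hash, &pos, &dent)) {
                    if (dent->name_len == name_len &&
                        memcmp(dent->name, name, name_len) == 0)
                            return dent->ino;
                    pos++;
            }

            return 0;   /* not found */
    }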
Entries for link backref walking are stored with the entry's position in
the parent dir instead of the entry's name. The name is then stored in
the value. Inode to path conversion can still walk the backref items
without having to lookup dirent items.
These changes mean that all directory entry items are now stored at a
small key with some u64s (hash, pos, parent dir, etc) and have a value
with the dirent struct and full entry name. This lets us use the same
key and value format for the three entry key types. We no longer have
to allocate keys, we can store them on the stack.
We store the entry's hash and pos in the dirent struct in the item value
so that any item has all the fields to reference all the other item
keys. We store the same values in the dentry_info so that deletion
(unlink and rename) can find all the entries.
The ino_path ioctl can now much more clearly iterate over parent
directories and entry positions instead of oh so cleverly iterating over
null terminated names in the parent directories. The ioctl interface
structs and implementation become simpler.
Signed-off-by: Zach Brown <zab@versity.com>
Honoring the XATTR_REMOVE flag in xattr deletion exposed an interesting
bug in getxattr(). We were unconditionally returning the max xattr value
size when someone tried to probe an existing xattrs' value size by
calling getxattr with size == 0. Some kernel paths did this to probe
the existence of xattrs. They expected to get an error if the xattr
didn't exist, but we were giving them the max possible size. This
kernel path then tried to remove the xattrs with XATTR_REMOVE and that
now failed and caused a bunch of errors in xfstests.
The fix is to return the real xattr value size when getxattr is called
with size == 0. To do that with the old format we'd have to iterate
over all the items which happened to be pretty awkward in the current
code paths.
So we're taking this opportunity to land a change that had been brewing
for a while. We now form the xattr keys from the hash of the name and
the item values now store a logically contiguous header, the name, and the
value. This makes it very easy for us to have the full xattr value
length in the header and return it from getxattr when size == 0.
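
The new item value layout amounts to something like this, with
hypothetical field names; the point is that the full value length is
right there in the header so a getxattr with size == 0 can return it
directly.

    #include <stdint.h>

    /*
     * Sketch of an xattr item value: a small header followed by the
     * name and then the value, all logically contiguous.
     */
    struct xattr_val_header {
            uint16_t name_len;
            uint16_t val_len;       /* full value length for size == 0 probes */
            uint8_t  payload[];     /* name bytes then value bytes */
    };

    /* what getxattr(name, NULL, 0) returns once the item is found */
    static inline uint16_t xattr_probe_size(const struct xattr_val_header *hdr)
    {
            return hdr->val_len;
    }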
Now all tests pass while honoring the XATTR_CREATE and XATTR_REMOVE
flags.
And the code is a whole lot easier to follow. And we've removed another
barrier for moving to small fixed size keys.
Signed-off-by: Zach Brown <zab@versity.com>
We weren't doing anything with the inode blocks field. We weren't even
initializing it which explains why we'd sometimes see garbage i_blocks
values in scoutfs inodes in segments.
The logical blocks field reflects the contents of the file regardless of
whether it's online or not. It's the sum of our online and offline block
tracking.
So we can initialize it to our persistent online and offline counts and
then keep it in sync as blocks are allocated and freed.
Signed-off-by: Zach Brown <zab@versity.com>
We aren't using the size index. It has runtime and code maintenance
costs that aren't worth paying. Let's remove it.
Removing it from the format and no longer maintaining it are
straightforward.
The bulk of this patch is actually the act of removing it from the index
locking functions. We no longer have to predict the size that will be
stored during the transaction to lock the index items that will be
created during the transaction. A bunch of code to predict the size and
then pass it into locking and transactions goes away. Like other inode
fields we now update the size as it changes.
Signed-off-by: Zach Brown <zab@versity.com>
This is implemented by filling in our export ops functions.
When we get those right, the VFS handles most of the details for us.
Internally, scoutfs handles are two u64's (ino and parent ino) and a
type which indicates whether the handle contains the parent ino or not.
Surprisingly enough, no existing type matches this pattern so we use our
own types to identify the handle.
Most of the export ops are self-explanatory. scoutfs_encode_fh() takes
an inode and an optional parent and encodes those into the smallest
handle that would fit. scoutfs_fh_to_[dentry|parent] turn an existing
file handle into a dentry.
scoutfs_get_parent() is a bit different and would be called on
directory inodes to connect a disconnected dentry path. For
scoutfs_get_parent(), we can export add_next_linkref() and use the backref
mechanism to quickly find a parent directory.
scoutfs_get_name() is almost identical to scoutfs_get_parent(). Here we're
linking an inode to a name which exists in the parent directory. We can also
use add_next_linkref, and simply copy the name from the backref.
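
Wiring the ops up looks like the usual export_operations table; a sketch
assuming the handlers keep the names used above and a kernel recent
enough to use the inode-based encode_fh() prototype, with the handler
bodies omitted.

    #include <linux/exportfs.h>

    /* the handlers described above; bodies omitted in this sketch */
    int scoutfs_encode_fh(struct inode *inode, __u32 *fh, int *max_len,
                          struct inode *parent);
    struct dentry *scoutfs_fh_to_dentry(struct super_block *sb,
                                        struct fid *fid, int fh_len,
                                        int fh_type);
    struct dentry *scoutfs_fh_to_parent(struct super_block *sb,
                                        struct fid *fid, int fh_len,
                                        int fh_type);
    struct dentry *scoutfs_get_parent(struct dentry *child);
    int scoutfs_get_name(struct dentry *parent, char *name,
                         struct dentry *child);

    const struct export_operations scoutfs_export_ops = {
            .encode_fh      = scoutfs_encode_fh,
            .fh_to_dentry   = scoutfs_fh_to_dentry,
            .fh_to_parent   = scoutfs_fh_to_parent,
            .get_parent     = scoutfs_get_parent,
            .get_name       = scoutfs_get_name,
    };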
As a result of this patch we can also now export scoutfs file systems
via NFS, however testing NFS thoroughly is outside the scope of this
work so export support should be considered experimental at best.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
[zab edited <= NAME_MAX]
The augmenting of the btree to track items with bits set was too fiddly
for its own good. We were able to migrate old btree blocks with a
simple stored key instead, which also fixed livelocks that occurred as
the parent and item bits got out of sync. This is now unused buggy code
that can be
removed.
Signed-off-by: Zach Brown <zab@versity.com>