Commit Graph

722 Commits

Zach Brown
d8bc962fc5 scoutfs: unpriv listxattr_hidden only shows .hide.
Our hidden attributes are hidden so that they don't leak out of the
system when archiving tools transfer xattrs from listxattr along
with the file.  They're not intended to be secret; in fact, users want
to see their contents just as they want to see other fs metadata that
describes the system but that they can't update.

Make our listxattr ioctl only return hidden xattrs and allow anyone to
see the results if they can read the file.  Rename it to more
accurately describe its intended use.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-28 10:23:55 -07:00
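A sketch of the access rule this sets up, with hypothetical helper
names and the permission check as kernels of this era spelled it:

    #include <linux/fs.h>

    /* anyone who can read the file may list its hidden xattrs */
    static long ex_ioc_listxattr_hidden(struct file *file,
                                        void __user *arg)
    {
            struct inode *inode = file_inode(file);
            int ret;

            ret = inode_permission(inode, MAY_READ);
            if (ret)
                    return ret;

            /* ... walk the inode's xattrs, copying out only the
             * names that carry the hidden tag ... */
            return 0;
    }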
Zach Brown
663ce53109 scoutfs: clean up _IO ioctl macro usage
Accurately set the direction bits, pack down the used numbers, and
remove stale old ioctl definitions.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-28 10:23:55 -07:00
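For reference, a sketch of accurate direction-bit usage; the magic
letter, numbers, and struct below are illustrative, not scoutfs's
actual definitions:

    #include <linux/ioctl.h>
    #include <linux/types.h>

    #define EX_IOCTL_MAGIC 'e'

    struct ex_ioctl_args {
            __u64 pos;
            __u64 count;
    };

    /* _IOR: the kernel only copies data out to userspace */
    #define EX_IOC_GET  _IOR(EX_IOCTL_MAGIC, 0, struct ex_ioctl_args)
    /* _IOW: the kernel only copies data in from userspace */
    #define EX_IOC_SET  _IOW(EX_IOCTL_MAGIC, 1, struct ex_ioctl_args)
    /* _IOWR: data flows both ways, as with an iteration cursor */
    #define EX_IOC_WALK _IOWR(EX_IOCTL_MAGIC, 2, struct ex_ioctl_args)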
Zach Brown
4a29cb5888 scoutfs: naturally align ioctl structs
Order the ioctl struct field definitions and add padding so that
runtimes with different word sizes don't add different padding.
Userspace is spared having to deal with packing and we don't
have to worry about compat translation in the kernel.

We had two persistent structures that crossed the ioctl, a key and a
timespec, so we explicitly translate to and from their persistent types
in the ioctl.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-27 11:39:11 -07:00
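A minimal sketch of the layout rule, with hypothetical fields: order
members from largest to smallest and pad explicitly so 32- and 64-bit
runtimes compute identical offsets and sizes without __packed:

    #include <linux/types.h>

    struct ex_ioctl_entry {
            __u64 ino;      /* 64-bit fields first, naturally aligned */
            __u64 offset;
            __u32 flags;    /* then 32-bit */
            __u8 type;      /* then smaller */
            __u8 _pad[3];   /* explicit pad to an 8-byte multiple (24) */
    };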
Zach Brown
7dfbd3950f scoutfs: add index of inodes by xattr names
Add a .indx. xattr tag which adds the inode to an index of inodes keyed
by the hash of xattr names.  An ioctl is added which then returns all
the inodes which may contain an xattr of the given name.  Dropping all
xattrs now has to parse the name to find out if it also has to delete an
index item.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-24 09:58:22 -07:00
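One plausible shape for such an index item, purely as a sketch; the
commit doesn't spell out the layout, so these fields are assumptions:

    #include <linux/types.h>

    /* map hash(xattr name) -> inode so a lookup by name hash finds
     * every inode that may carry that xattr */
    struct ex_xattr_index_key {
            __be64 name_hash;
            __be64 ino;
    };

Since different names can collide on a hash, the ioctl can only
promise inodes that may contain the xattr, as the message says.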
Zach Brown
aee017903b scoutfs: add hash helper
Add a quick header which builds 64bit hashes from the crcs of the two
halves of a byte region.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-24 09:58:22 -07:00
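A sketch of that construction, assuming crc32c as the crc; the name
and seeds are illustrative:

    #include <linux/crc32c.h>
    #include <linux/types.h>

    /* build a 64bit hash from the crcs of each half of the region */
    static inline u64 ex_hash64(const void *data, unsigned int len)
    {
            unsigned int half = len / 2;
            u32 lo = crc32c(~0, data, half);
            u32 hi = crc32c(~0, (const char *)data + half, len - half);

            return ((u64)hi << 32) | lo;
    }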
Zach Brown
a7fef3d7dd scoutfs: add listxattr_raw ioctl
Add an ioctl which can be used to iterate over the keys for all the
xattrs on an inode.  It is privileged, can see hidden xattrs, and has an
iteration cursor so that it can make its way through very large numbers
of xattrs.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-24 09:58:22 -07:00
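A hedged sketch of what a cursor-driven argument struct for this might
look like; the field names are assumptions, not the real ABI:

    #include <linux/types.h>

    /* userspace repeats the call, feeding back the cursor from the
     * previous call, until no more names are returned */
    struct ex_listxattr_args {
            __u64 id_cursor;   /* opaque position to resume after */
            __u64 buf_ptr;     /* user buffer for concatenated names */
            __u32 buf_bytes;
            __u32 _pad;
    };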
Zach Brown
019b5f6d6b scoutfs: add scoutfs xattr prefix and name tags
Add a scoutfs. xattr prefix which then defines a series of following
tags which can change the behaviour of the xattr.  We start with .hide.
which stops the xattr from showing up in listxattr.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-24 09:58:22 -07:00
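A small sketch of the parsing this implies, with illustrative names:
check for the scoutfs. prefix and then match the tag that follows:

    #include <linux/string.h>
    #include <linux/types.h>

    #define EX_XATTR_PREFIX "scoutfs."
    #define EX_TAG_HIDE     "hide."

    /* does this name carry the hidden tag after the prefix? */
    static bool ex_xattr_is_hidden(const char *name)
    {
            if (strncmp(name, EX_XATTR_PREFIX, strlen(EX_XATTR_PREFIX)))
                    return false;
            return !strncmp(name + strlen(EX_XATTR_PREFIX), EX_TAG_HIDE,
                            strlen(EX_TAG_HIDE));
    }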
Zach Brown
a239f6093d scoutfs: add mount_options/ sysfs dir
Add a directory per mount that shows the values of all the mount
options.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-05 14:30:11 -07:00
Zach Brown
7d56d8f34f scoutfs: add .show_options
Add the vfs callback that prints mount options in /proc files.

Signed-off-by: Zach Brown <zab@versity.com>
2019-06-05 14:30:11 -07:00
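A minimal sketch of such a callback, assuming a hypothetical sb_info
with a single option; the real scoutfs options differ:

    #include <linux/fs.h>
    #include <linux/seq_file.h>

    struct ex_sb_info {
            char uniq_name[32];
    };

    /* called for /proc/mounts and friends; emits ",opt=value" pairs */
    static int ex_show_options(struct seq_file *seq, struct dentry *root)
    {
            struct ex_sb_info *sbi = root->d_sb->s_fs_info;

            seq_printf(seq, ",uniq_name=%s", sbi->uniq_name);
            return 0;
    }

It's wired up as .show_options in super_operations.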
Zach Brown
c061ada671 scoutfs: mounts connect once server is listening
An elected leader writes a quorum block showing that it's elected before
it assumes exclusive access to the device and starts bringing up the
server.  This lets another later elected leader find and fence it if
something happens.

Other mounts were trying to connect to the server once this elected
quorum block was written and before the server was listening.  They'd
get connection refused, decide to elect a new leader, and try to fence
the server that's still running.

Now, they should have tried much harder to connect to the elected leader
instead of taking a single failed attempt as fatal.  But that's a
problem for another day that involves more work in balancing timeouts
and retries.

But mounts should not have tried to connect to the server until it's
listening.  That's easy to signal by adding a simple listening flag to
the quorum block.  Now mounts will only try to connect once they see the
listening flag and don't see these racy refused connections.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-30 15:01:00 -07:00
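A sketch of the handshake this adds, with assumed field and flag
names: the leader sets the flag only after its listen socket is up,
and mounts poll for it before connecting:

    #include <asm/byteorder.h>
    #include <linux/types.h>

    #define EX_QUORUM_FLAG_LISTENING (1ULL << 0)

    struct ex_quorum_block {
            __le64 flags;
            /* ... elected_nr, votes, etc ... */
    };

    /* mounts only attempt a connection once this returns true */
    static bool ex_leader_listening(const struct ex_quorum_block *blk)
    {
            return !!(le64_to_cpu(blk->flags) & EX_QUORUM_FLAG_LISTENING);
    }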
Zach Brown
abd7ffc247 scoutfs: only trace read quorum blocks after io
We have trace points as blocks are read, but the reads are cached as
buffer heads.  The iteration helpers are used to reference cached
blocks a few times in each voting cycle and we end up tracing cached
read blocks multiple times.  This uses a bit on the buffer_head to only
trace a cached block the first time it's read.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-30 15:01:00 -07:00
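A sketch of the technique, claiming a private buffer_head state bit;
the bit and trace names are illustrative:

    #include <linux/buffer_head.h>

    /* fs-private bh state bits start at BH_PrivateStart */
    enum {
            EX_BH_TracedRead = BH_PrivateStart,
    };

    static void ex_trace_read_block(struct buffer_head *bh)
    {
            /* only trace the first read of a cached block */
            if (test_and_set_bit(EX_BH_TracedRead, &bh->b_state))
                    return;
            /* trace_ex_block_read(bh); -- hypothetical trace point */
    }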
Zach Brown
4df35efbc0 scoutfs: show quorum state in sysfs
Add some sysfs files which show quorum state.  We store the state in
quorum_info off the super, which is updated as we participate in
elections.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-30 13:51:02 -07:00
Zach Brown
2cc4f89ad5 scoutfs: add sysfs attrs wrappers
Add some helpers to manage the lifetime of groups of attributes in
sysfs.  We can wait until the sysfs files are no longer in use
before tearing down the data that they rely on.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-30 13:51:02 -07:00
Zach Brown
c010afa8ff scoutfs: add setattr_more ioctl
Add an ioctl that can be used by userspace to restore a file to its
offline state.  To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-30 13:45:52 -07:00
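A hedged guess at the shape of such an argument struct; the message
doesn't list the actual fields, so these are assumptions:

    #include <linux/types.h>

    /* restore a file to its offline state in one call */
    struct ex_setattr_more {
            __u64 data_version; /* otherwise-unsettable inode field */
            __u64 i_size;
            __u64 flags;        /* e.g. create an offline extent */
            __u64 ctime_sec;
            __u32 ctime_nsec;
            __u32 _pad;
    };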
Zach Brown
0b6bc8789c scoutfs: don't leak btree block refs
Somewhere in the mists of time (around when we removed path tracking
which held refs to blocks?) walking blocks to migrate started leaking
btree block references.  It was providing a pointer so the walk gave it
the block it found but the caller was never dropping that ref.

It wasn't doing anything with the result of the walk so we just don't
provide a block pointer and the walk will drop the ref for us.  This
stops the leak, which was effectively pinning the ring in memory.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:41:07 -07:00
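The shape of the fix, sketched with an assumed walk signature: the
walk only hands back a referenced block when the caller passes an out
pointer, so a caller that ignores the block passes NULL:

    struct super_block;
    struct ex_key;
    struct ex_block;

    /* assumed: fills *bl_ret with a referenced block when non-NULL,
     * otherwise drops its own ref before returning */
    int ex_btree_walk(struct super_block *sb, struct ex_key *key,
                      struct ex_block **bl_ret);

    /* before: ex_btree_walk(sb, key, &bl) leaked bl's ref */
    /* after:  ex_btree_walk(sb, key, NULL) keeps no ref at all */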
Zach Brown
0988cbe1e9 scoutfs: track old and cur dirty btree blocks
To avoid overwriting live btree blocks we have to migrate them between
halves of the ring.  Each time we cross into a new half of the ring we
start migration all over again.

The intent was to slowly migrate the blocks over time.  We'd track dirty
blocks that came from the old and current halves and keep them in
balance.  This would keep the overhead of migration low and spread it
out through all the transactions at the start of the half that include
migration.

But the calculation of current blocks was completely wrong.  It checked
the newly allocated block which is always in the current half.  It never
thought it was dirtying old blocks so it'd constantly migrate trying to
find them.  We'd effectively migrate every btree block during the first
transaction in each half.

This determines whether we're dirtying old or new blocks from the
source of the cow operation.  We now recognize when we dirty old blocks
and will stop
migrating once we've migrated at least as many old blocks as we've
written new blocks.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:41:07 -07:00
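A sketch of the corrected accounting: classify by the cow source block
and migrate only while old-half dirtying lags new writes; the names
and the half test are assumptions:

    #include <linux/types.h>

    struct ex_ring_counts {
            u64 dirtied_old;  /* cow sources from the old half */
            u64 dirtied_new;  /* cow sources from the current half */
    };

    /* the bug tested the newly allocated block, which is always in
     * the current half; test the cow source block instead */
    static void ex_account_cow(struct ex_ring_counts *rc, u64 src_blkno,
                               u64 cur_start, u64 cur_end)
    {
            if (src_blkno >= cur_start && src_blkno < cur_end)
                    rc->dirtied_new++;
            else
                    rc->dirtied_old++;
    }

    /* migrate until old blocks moved match new blocks written */
    static bool ex_keep_migrating(const struct ex_ring_counts *rc)
    {
            return rc->dirtied_old < rc->dirtied_new;
    }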
Zach Brown
e10033b34d scoutfs: migrate dirty btree blocks during wrap
We were seeing ring btree corruption that manifested as the server seeing
stale btree blocks as it tried to read all the btrees to migrate blocks
during a write.  A block it tried to read didn't match its reference.

It turned out that block wasn't being migrated.  It would get stuck
at a position in the ring.  Eventually new block writes would overwrite
it and then the next read would see corruption.

It wasn't being migrated because the block reading function didn't
realize that it had to migrate a dirty block.  The block was written in
a transaction at the end of the ring.  The ring wrapped during the
transaction and then migration tried to migrate the dirty block.  It
wouldn't be dirtied, and thus migrated, because it was already dirty in
the transaction.

The fix is to add more cases to the dirtying decision which takes
migration specifically into account.  We'll no longer short circuit
dirtying blocks for migration when they're in the old half of the ring
even though they're dirty.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:41:07 -07:00
Zach Brown
e150ebc8d2 scoutfs: trace btree dirty blocks
Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:41:07 -07:00
Zach Brown
806ac0d8e6 scoutfs: fix mkfs option in README
Fix a quick option typo in the mkfs invocations in the readme.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:33:26 -07:00
Zach Brown
a6782fc03f scoutfs: add data waiting
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier.  For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.

This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.

We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline.  We add these checks and waiting to data io
operations that could encounter offline extents.

This has to be done carefully so that we don't wait while holding locks
that would prevent staging.  We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.

And while we're waiting our operation is tracked and reported to
userspace through an ioctl.  This is a non-blocking ioctl, it's up to
userspace to decide how often to check and how large a region to stage.

Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online.  This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again.  It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes.  It may result in some spurious
wakeups and extra work, but hopefully it won't, and it's a very simple
and functional first pass.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:33:26 -07:00
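A sketch of the lock, check, then unlock-and-wait pattern described
above; every helper here is hypothetical:

    #include <linux/fs.h>

    int ex_lock_inode(struct inode *inode);
    void ex_unlock_inode(struct inode *inode);
    bool ex_extent_offline(struct inode *inode, u64 start, u64 len);
    void ex_record_data_wait(struct inode *inode, u64 start, u64 len);
    int ex_wait_data_changed(struct inode *inode);

    static int ex_wait_until_online(struct inode *inode, u64 start,
                                    u64 len)
    {
            int ret;

            for (;;) {
                    ret = ex_lock_inode(inode);
                    if (ret)
                            return ret;
                    if (!ex_extent_offline(inode, start, len))
                            return 0; /* still locked; caller does io */

                    /* publish the waited-on region for userspace,
                     * then drop the lock so staging can proceed */
                    ex_record_data_wait(inode, start, len);
                    ex_unlock_inode(inode);

                    /* woken whenever contents could have changed, not
                     * only when the extent is known to be online */
                    ret = ex_wait_data_changed(inode);
                    if (ret)
                            return ret;
            }
    }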
Zach Brown
cfa563a4a4 scoutfs: expand the per_task API
This adds some minor functionality to the per_task API for use by the
upcoming offline waiting work.

Add scoutfs_per_task_add_excl() so that a caller can tell if their task
was already put on a per-task list by their caller.

Make scoutfs_per_task_del() return a bool to indicate if the entry was
found on a list and was in fact deleted, or not.

Add scoutfs_per_task_init_entry() for initializing entries that aren't
declared on the stack.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:33:26 -07:00
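A hedged sketch of the three additions, assuming per_task is a
spinlock-protected list of task-tagged entries:

    #include <linux/list.h>
    #include <linux/sched.h>
    #include <linux/spinlock.h>

    struct ex_per_task {
            spinlock_t lock;
            struct list_head list;
    };

    struct ex_per_task_entry {
            struct list_head head;
            struct task_struct *task;
    };

    /* for entries that aren't declared on the stack */
    static void ex_per_task_init_entry(struct ex_per_task_entry *ent)
    {
            INIT_LIST_HEAD(&ent->head);
            ent->task = NULL;
    }

    /* returns false if current was already listed by a caller */
    static bool ex_per_task_add_excl(struct ex_per_task *pt,
                                     struct ex_per_task_entry *ent)
    {
            struct ex_per_task_entry *pos;
            bool added = true;

            spin_lock(&pt->lock);
            list_for_each_entry(pos, &pt->list, head) {
                    if (pos->task == current) {
                            added = false;
                            break;
                    }
            }
            if (added) {
                    ent->task = current;
                    list_add(&ent->head, &pt->list);
            }
            spin_unlock(&pt->lock);

            return added;
    }

    /* returns true if the entry was found and in fact deleted */
    static bool ex_per_task_del(struct ex_per_task *pt,
                                struct ex_per_task_entry *ent)
    {
            bool deleted = false;

            spin_lock(&pt->lock);
            if (!list_empty(&ent->head)) {
                    list_del_init(&ent->head);
                    deleted = true;
            }
            spin_unlock(&pt->lock);

            return deleted;
    }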
Zach Brown
3a6392aee6 scoutfs: remove scoutfs_unlock_flags() prototype
There was an old prototype for an unlock variant that hasn't been around
for a while.  Remove it.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:33:26 -07:00
Zach Brown
7097d545cf scoutfs: make sure to set the sb blocksize
Since fill_super was originally written we've added use of buffer_head
IO by the btree and quorum voting.  We forgot to set the block size so
devices that didn't have the common 4k default, matching our block size,
would see errors.  Explicitly set it.

Signed-off-by: Zach Brown <zab@versity.com>
2019-05-21 11:33:26 -07:00
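The fix is small; a sketch assuming a 4k filesystem block size:

    #include <linux/fs.h>

    #define EX_BLOCK_SIZE 4096

    static int ex_fill_super(struct super_block *sb, void *data,
                             int silent)
    {
            /* must precede buffer_head io; returns 0 on failure */
            if (!sb_set_blocksize(sb, EX_BLOCK_SIZE))
                    return -EINVAL;

            /* ... read the super, start btree and quorum io ... */
            return 0;
    }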
Zach Brown
6342bd5679 scoutfs: update README.md for quorum
Update the github README to describe the recent addition of integrated
quorum voting and locking.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
b5133bfc98 scoutfs: add elected flag to quorum block
It was a mistake to use a non-zero elected_nr as the indication that a
slot is considered actively elected.  Zeroing it as the server shuts
down wipes the elected_nr and means that it doesn't advance as each
server is elected.  This then causes a client connecting to a new server
to be confused for a client reconnecting to a server after the server
has timed it out and destroyed its state.  This caused reconnection
after shutting down a server to fail and clients to loop reconnecting
indefinitely.

This instead adds flags to the quorum block and assigns a flag to
indicate that the slot should be considered active.  It's cleared by
fencing and by the client as the server shuts down.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
36b0df336b scoutfs: add unmount barrier
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount.  We can't
let unmounting clients leave the remaining mounted clients without
quorum.

The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests.  It only sends responses to voting
mounts while quorum remains or once all the voting clients are trying
to unmount.

We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to reestablish quorum.

The commit introduces and maintains the unmount_barrier field in the
quorum blocks.  It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.

The commit then has the clients send their unique name to the server,
which stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.

Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shut down and re-established.  This also makes it easier to
make global decisions based on the count of pending farewell requests.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
fe63b566c9 scoutfs: use _unaligned instead of __packed
We were relying on a cute (and probably broken) trick of defining
pointers to unaligned base types with __packed.  Modern versions of gcc
warn about this.

Instead we either directly access unaligned types with get_ and
put_unaligned, or we copy unaligned data into aligned copies before
working with it.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
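Sketches of the two replacement patterns the message names, with
illustrative types:

    #include <asm/unaligned.h>
    #include <linux/string.h>
    #include <linux/types.h>

    struct ex_key {
            u64 ino;
            u64 offset;
    };

    /* pattern 1: access an unaligned little-endian field directly */
    static u64 ex_read_seq(const void *field)
    {
            return get_unaligned_le64(field);
    }

    /* pattern 2: copy unaligned data into an aligned struct first */
    static void ex_load_key(struct ex_key *dst, const void *src)
    {
            memcpy(dst, src, sizeof(*dst));
    }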
Zach Brown
e88b5732ad scoutfs: track trans seq in btree
Currently the server tracks the outstanding transaction sequence numbers
that clients have open in a simple list in memory.  It's not properly
cleaned up if a client unmounts, and a new server that takes over
after a crash won't know about open transaction sequence numbers.

This stores open transaction sequence numbers in a shared persistent
btree instead of in memory.  It removes tracking for clients as they
send their farewell during unmount.  A new server that starts up will
see existing entries for clients that were created by old servers.

This fixes a bug where a client who unmounts could leave behind a
pending sequence number that would never be cleaned up and would
indefinitely limit the visibility of index items that came after it.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
3d82dd3a46 scoutfs: fix bad octet in tracing ipv4 address
The macro for producing trace args for an ipv4 address had a typo when
shifting the third octet down before masking.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
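For reference, the correct per-octet shifts for a host-order ipv4
address; the macro name is illustrative:

    /* expands to four "%u" args, most significant octet first */
    #define EX_IPV4_ARGS(a)                 \
            (((a) >> 24) & 0xff),           \
            (((a) >> 16) & 0xff),           \
            (((a) >> 8) & 0xff),            \
            ((a) & 0xff)

    /* usage: pr_info("peer %u.%u.%u.%u\n", EX_IPV4_ARGS(addr)); */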
Zach Brown
fa3e0a31c7 scoutfs: use SO_REUSEADDR for server socket
The server's listening address is fixed by the raft config in the super
block.  If it shuts down and rapidly starts back up it needs to bind to
the currently lingering address.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
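A sketch of setting the option on a kernel socket, via the
kernel_setsockopt() helper that kernels of this era provided:

    #include <linux/net.h>
    #include <linux/socket.h>

    /* let bind() reuse the fixed address while the old socket
     * lingers in TIME_WAIT after a fast restart */
    static int ex_set_reuseaddr(struct socket *sock)
    {
            int one = 1;

            return kernel_setsockopt(sock, SOL_SOCKET, SO_REUSEADDR,
                                     (char *)&one, sizeof(one));
    }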
Zach Brown
0bc0ff9300 scoutfs: add clock sync trace events
Generate unique trace events on the send and recv side of each message
sent between nodes.  This can be used to reasonably efficiently
synchronize the monotonic clock in trace events between nodes given only
their captured trace events.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
a546bd0aab scoutfs: check for newlines in msg.h wrappers
The message formatter adds a newline so callers don't have to.  But
sometimes they do and we get double newlines.  Add a build check that
the format string doesn't end in a newline so that we stop adding these.
And fix up all the current offenders.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
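One way such a check can work for string-literal formats, sketched
with assumed macro names: index the literal's final character through
sizeof:

    #include <linux/bug.h>
    #include <linux/printk.h>

    /* sizeof(fmt) - 2 is the last char before the NUL; this is only
     * valid when fmt is a string literal */
    #define ex_check_no_newline(fmt)                          \
            BUILD_BUG_ON(sizeof(fmt) > 1 &&                   \
                         (fmt)[sizeof(fmt) - 2] == '\n')

    #define ex_info(fmt, ...)                                 \
    do {                                                      \
            ex_check_no_newline(fmt);                         \
            printk(KERN_INFO "ex: " fmt "\n", ##__VA_ARGS__); \
    } while (0)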
Zach Brown
ec0fb5380a scoutfs: implement lock recovery
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO.  As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.

This implements lock recovery by having the lock service recover locks
from clients as it starts up.

First the lock service stores records of connected clients in a btree
off the super block.  Records are added as the server receives their
greeting and are removed as the server receives their farewell.

Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.

We add lock recover request and response messages that are used to
communicate locks from the clients to the server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
801f6ad9be scoutfs: add scoutfs_spbm_empty()
Add a quick function that determines if a sparse bitmap has no bits set.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
74366f0df1 scoutfs: make networking more reliable
The current networking code has loose reliability guarantees.  If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection.  The client resends
requests but no responses are resent.  A client's requests could be
processed twice on the same server.  The server throws away disconnected
client state.

This was fine, sort of, for the simple requests we had implemented so
far.  It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.

This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.

The server keeps track of disconnected clients and restores state if the
same client reconnects.  This required some work around the greetings so
that clients and servers can recognize each other.  Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.

Now that connections between the client and server are preserved we can
resend responses across reconnection.  We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.

When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.

This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
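A sketch of the duplicate-drop half of this, with assumed names: each
side keeps the greatest sequence it has processed and ignores resends
at or below it:

    #include <linux/types.h>

    struct ex_recv_state {
            u64 max_seen_seq; /* greatest sequence processed so far */
    };

    /* resent messages keep their original sequence numbers, so
     * anything at or below the mark was already processed once */
    static bool ex_should_process(struct ex_recv_state *rs, u64 seq)
    {
            if (seq <= rs->max_seen_seq)
                    return false;
            rs->max_seen_seq = seq;
            return true;
    }

The same high-water mark flows back to the sender so it can free
responses that are known to have been received.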
Zach Brown
20f4e1c338 scoutfs: put magic value in block header
The super block had a magic value that was used to identify that the
block should contain our data structure.  But it was called an 'id'
which was confused with the header fsid in the past.  Also, the btree
blocks aren't using a similar magic value at all.

This moves the magic value into the header and creates values for the
super block and btree blocks.  Both are written but the btree block
reads don't check the value.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
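A sketch of the resulting layout; the field order and values are
invented:

    #include <linux/types.h>

    #define EX_BLOCK_MAGIC_SUPER 0x73757065 /* illustrative values */
    #define EX_BLOCK_MAGIC_BTREE 0x62747265

    /* the shared header now carries the magic, so each block type
     * can write (and eventually check) its own expected value */
    struct ex_block_header {
            __le32 crc;
            __le32 magic;
            __le64 fsid;
            __le64 seq;
            __le64 blkno;
    };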
Zach Brown
675275fbf1 scoutfs: use hdr.fsid in greeting instead of id
The network greeting exchange was mistakenly using the global super
block magic number instead of the per-volume fsid to identify the
volumes that the endpoints are working with.  This prevented the check
from doing its only job: to fail when clients in one volume try to
connect to a server in another.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
288d781645 scoutfs: start and stop server with quorum
Currently all mounts try to get a dlm lock which gives them exclusive
access to become the server for the filesystem.  That isn't going to
work if we're moving to locking provided by the server.

This uses quorum election to determine who should run the server.  We
switch from long running server work blocked trying to get a lock to
calls which start and stop the server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
08a140c8b0 scoutfs: use our locking service
Convert client locking to call the server's lock service instead of
using a fs/dlm lockspace.

The client code gets some shims to send and receive lock messages to and
from the server.  Callers use our lock mode constants instead of the
DLM's.

Locks are now identified by their starting key instead of an additional
scoped lock name so that we don't have more mapping structures to track.
The global rename lock uses keys that are defined by the format as only
used for locking.

The biggest change is in the client lock state machine.  Instead of
calling the dlm and getting callbacks we send messages to our server and
get called from incoming message processing.  We don't have everything
come through a per-lock work queue.  Instead we send requests either
from the blocking lock caller or from a shrink work queue.  Incoming
messages are called in the net layer's blocking work contexts so we
don't need to do any more work to defer to other contexts.

The different processing contexts lead to a slightly different lock
life cycle.  We refactor and separate allocation and freeing from
tracking and removing locks in data structures.  We add a _get and _put
to track active use of locks, and then async references to locks by
holders and requests are tracked separately.

Our lock service's rules are a bit simpler in that we'll only ever send
one request at a time and the server will only ever send one request at
a time.  We do have to do a bit of work to make sure we process
back-to-back grant responses and invalidation requests from the server.

As of this change the lock setup and destruction paths are a little
wobbly.  They'll be shored up as we add lock recovery between the client
and server.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
7c8383eddd scoutfs: add scoutfs_lock_rename()
Add a specific lock method for locking the global rename lock instead of
having the caller specify it as a global lock.  We're getting rid of the
notion of lock scopes and requiring all locks to be related to keys.
The rename lock will use magic keys at the end of the volume.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
34b8950bca scoutfs: initial lock server core
Add the core lock server code for providing a lock service from our
server.  The lock messages are wired up but nothing calls them.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
f472c0bc87 scoutfs: add scoutfs_net_response_node()
Today all responses can only be sent down the connection that carried
the request, while the request is being processed.  We'll be adding
subsystems that need to send responses asynchronously after initial
request processing.  Give them a call to send a response to a node id
instead of to a node's connection.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
c34dd452a7 scoutfs: add quorum voting
Add a quorum election implementation.  The mounts that can participate
in the election are specified in a quorum config array in the super
block.  Each configured participant is assigned a preallocated block
that it can write to.

All mounts read the quorum blocks to find the member who was elected the
leader and should be running the server.  The voting mounts loop reading
voting blocks and writing their vote block until someone is elected with
a majority.

Nothing calls this code yet, this adds the initial implementation and
format.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
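The winning condition is a strict majority of the configured slots;
as a sketch:

    #include <linux/types.h>

    /* elected once votes reach a majority of the configured quorum
     * members, e.g. 2 of 3 or 3 of 5 */
    static bool ex_won_election(unsigned int votes, unsigned int members)
    {
            return votes >= members / 2 + 1;
    }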
Zach Brown
d57b8232ee scoutfs: move base types in format.h
We had scattered some base types throughout the format file which made
them annoying to reference in higher level structs.  Let's put them at
the top so we can use them without declarations or moving things around
in unrelated commits.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
f75e1e1322 scoutfs: reformat Makefile to one object per line
Reformat the scoutfs-y object list so that there's one object per line.
Diffs now clearly demonstrate what is changing instead of having word
wrapping constantly obscuring changes in the built objects.

(Did everyone spot the scoutfs_trace sorting mistake?  Another reason
not to mash everything into wrapped lines :)).

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
6caa87458b scoutfs: add scoutfs_net_client_node_id()
Some upcoming network request processing paths need access to the
connected client's node_id.  We could add it to the arguments but that'd
be a lot of churn so we'll add an accessor function for now.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
e9f6e79d67 scoutfs: add uniq_name mount option
Each mount is getting a specified unique name.  This can be used to
identify a reconnecting mount, which indicates that an old instance of the
same unique name can no longer exist and doesn't need to be fenced.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
8fedfef1cc scoutfs: remove stale net response data comment
There was a time when responding with an error wouldn't include the
caller's data payload.  That hasn't been the case since we added
compaction network requests which include a reference to the compaction
operation with the error response.

Signed-off-by: Zach Brown <zab@versity.com>
2019-04-12 10:54:07 -07:00
Zach Brown
91d190622d scoutfs: remove scoutfs.md file
The current plan is to maintain a nice paper describing the system in
the scoutfs-utils repository.

Signed-off-by: Zach Brown <zab@versity.com>
2018-09-25 13:02:11 -07:00
Brandon Philips
9bb0c60c63 README: add whitepaper link
The white paper is helpful but not linked from the GitHub README, which will be a primary landing spot for folks discovering the project.
2018-09-19 11:03:11 -07:00