Use the client's rid in networking instead of the node_id.
The node_id no longer has to be allocated by the server and sent in the
greeting. Instead the client sends it to the server in its greeting.
The server then uses the client's announced rid just like it used to use
its node_id. It's used to record clients in the btree and to identify
clients in send and receive processing.
The use of the rid in networking calls makes its way to locking and
compaction, which now use the rid to identify clients instead of the
node_id.
Signed-off-by: Zach Brown <zab@versity.com>
The current quorum voting implementation had some rough edges that
increased the complexity of the system and introduced undesirable
failure modes. We can keep the same basic pattern but move
functionality between a few places and rethink the quorum voting to end
up with a meaningfully simpler system.
The motivation for this work was to remove the need to provide a
uniq_name option for every mount instance.
The first big change is to remove the idea of static configuration slots
for mounts. This removes the use of uniq_name. Mounts now simply have
a server_addr mount option instead of using their uniq_name to find
their address in the configuration.
The server can no longer check the configuration to see if a given
connected client's name is found in the quorum config. Instead, clients
set a flag in their sent greeting which indicates that they're a voter.
This removes the uniq_name from the greeting and from mounted client
records.
Without a static configuration, mounts no longer have dedicated block
locations to write to. We increase the size of the region of quorum
blocks and have voters simply write to a random block. Overwriting vote
blocks is OK because we move from heartbeating design patterns to a
protocol strongly based on raft's election. We're using quorum blocks
to communicate votes instead of network messages, and overwriting blocks
is analogous to lossy networks dropping vote messages in the raft
election protocol.
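The vote-by-overwrite scheme above can be sketched as a small userspace simulation. All names here (quorum_vote, cast_vote, count_votes, QUORUM_NR_BLOCKS) are illustrative assumptions, not the actual scoutfs structures:

```c
/* Sketch of voting via a shared region of quorum blocks.  A voter
 * overwrites a random block; lost overwrites are analogous to raft's
 * dropped vote messages and only delay, never corrupt, the election. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define QUORUM_NR_BLOCKS 128

struct quorum_vote {
	uint64_t term;		/* raft-style election term */
	uint64_t vote_for_rid;	/* candidate this voter supports */
};

/* a voter casts its vote by overwriting a random block in the region */
static void cast_vote(struct quorum_vote *blocks, uint64_t term, uint64_t rid)
{
	blocks[rand() % QUORUM_NR_BLOCKS] = (struct quorum_vote){ term, rid };
}

/* a candidate reads every block and counts votes for itself in its term */
static int count_votes(const struct quorum_vote *blocks, uint64_t term,
		       uint64_t rid)
{
	int votes = 0;

	for (int i = 0; i < QUORUM_NR_BLOCKS; i++)
		if (blocks[i].term == term && blocks[i].vote_for_rid == rid)
			votes++;
	return votes;
}
```

A candidate would be elected once count_votes() reaches the configured number of votes needed; collisions where two voters overwrite the same block just look like a dropped vote.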
We were using the dedicated per-mount quorum blocks to track mounts that
had been elected and needed to be fenced. We no longer have that
storage so instead we add the idea of an election log that is stored in
every voting block. Readers merge the logs from all the blocks they
read and write the resulting merged log in their block.
With no static quorum configuration we no longer have to worry about the
complexity of changing the slot configurations while they're in use.
The only persistent configuration is the number of votes a candidate
needs to be elected by a quorum.
It was a mistake to use quorum voting blocks to communicate state
between the server and the quorum voters. We can easily move the
unmount_barrier, server address, and fencing state from the quorum
blocks into the super block. The server no longer needs the quorum
election info struct to be able to later write its quorum block. It
instead writes a few fields in the super. There's only one place where
clients need to look to find out who they should connect to or if they
can finish unmount.
Signed-off-by: Zach Brown <zab@versity.com>
Add a server_addr mount option that takes an ipv4 address. This will be
used by the upcoming changes to quorum voting to indicate that a mount
should participate in voting and to specify the address that its server
should listen on.
Signed-off-by: Zach Brown <zab@versity.com>
The pattern of advancing and writing a "dirty super" comes from the time
when the format had two persistent super blocks. One was kept in memory
and modified as changes were made. Advancing it changed which of the
two supers would be eventually written.
This no longer makes sense now that we only have one super block.
Remove the idea of advancing and writing an implicit dirty super block
that's stored in the super block info. Instead use a single
scoutfs_write_super() which takes the super block struct to write.
We still store and increment the hdr.gen in the super block. It used to
tell which of the two super blocks was more recent; now it's just some
information that can tell us something about the life of the super
block.
Signed-off-by: Zach Brown <zab@versity.com>
Add some quick functions that let us convert between our persistent
packed inet addr struct and native sockaddr_in structs.
Signed-off-by: Zach Brown <zab@versity.com>
Use the mount's generated random id in persistent items and the lock
that protects them instead of the assigned node_id.
Signed-off-by: Zach Brown <zab@versity.com>
Add the mount rid to traces which included the fsid by converting them
to use the super block message format and args.
Signed-off-by: Zach Brown <zab@versity.com>
Change the console message output to show the fsid:rid mount identity
instead of the block device name and device major and minor numbers.
Signed-off-by: Zach Brown <zab@versity.com>
Add macros which provide printk format and args for a little string
which identifies a specific mount. This will be used in kernel logs and
trace messages.
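The macros could look something like the following sketch; the macro names, field widths, and the mount_ids struct are illustrative assumptions, not the actual scoutfs definitions:

```c
/* Sketch of format/arg macros that identify a mount by fsid and rid,
 * used as: printk("scoutfs " SCSB_FMT ": ...", SCSB_ARGS(ids), ...) */
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct mount_ids {
	uint64_t fsid;
	uint64_t rid;
};

#define SCSB_FMT "%016" PRIx64 ".%016" PRIx64
#define SCSB_ARGS(ids) (ids)->fsid, (ids)->rid
```

Keeping the format and args in paired macros means every log and trace site prints the same fsid:rid identity string.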
Signed-off-by: Zach Brown <zab@versity.com>
Calculate a random id which identifies the life of a particular mount.
This will be visible in messages and tracing and will replace the
server-assigned node_id in persistent structures and protocols.
Signed-off-by: Zach Brown <zab@versity.com>
Our hidden attributes are hidden so that they don't leak out of
the system when archiving tools transfer xattrs from listxattr along
with the file. They're not intended to be secret; in fact, users want
to see their contents just as they want to see other fs metadata that
describes the system but that they can't update.
Make our listxattr ioctl only return hidden xattrs and allow anyone to
see the results if they can read the file. Rename it to more
accurately describe its intended use.
Signed-off-by: Zach Brown <zab@versity.com>
Order the ioctl struct field definitions and add padding so that
runtimes with different word sizes don't add different padding.
Userspace is spared having to deal with packing and we don't
have to worry about compat translation in the kernel.
We had two persistent structures that crossed the ioctl, a key and a
timespec, so we explicitly translate to and from their persistent types
in the ioctl.
Signed-off-by: Zach Brown <zab@versity.com>
Add a .indx. xattr tag which adds the inode to an index of inodes keyed
by the hash of xattr names. An ioctl is added which then returns all
the inodes which may contain an xattr of the given name. Dropping all
of an inode's xattrs now has to parse each name to find out whether it
also has to delete an index item.
Signed-off-by: Zach Brown <zab@versity.com>
Add a quick header which calculates 64bit hashes by calculating the crc
of the two halves of a byte region.
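A runnable sketch of the two-half construction; a plain bitwise crc32 stands in here for the kernel's crc implementation, and the function names are illustrative:

```c
/* Sketch of building a 64-bit hash from the crc of each half of a
 * byte region. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* simple bitwise crc32 (standard reflected 0xEDB88320 polynomial) */
static uint32_t crc32_sw(const uint8_t *buf, size_t len)
{
	uint32_t crc = ~0u;

	while (len--) {
		crc ^= *buf++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
	return ~crc;
}

/* crc each half of the region and pack the two results into 64 bits */
static uint64_t hash64(const void *data, size_t len)
{
	const uint8_t *bytes = data;
	size_t half = len / 2;

	return ((uint64_t)crc32_sw(bytes, half) << 32) |
	       crc32_sw(bytes + half, len - half);
}
```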
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl which can be used to iterate over the keys for all the
xattrs on an inode. It is privileged, can see hidden inodes, and has an
iteration cursor so that it can make its way through very large numbers
of xattrs.
Signed-off-by: Zach Brown <zab@versity.com>
Add a scoutfs. xattr prefix which is followed by a series of tags that
can change the behaviour of the xattr. We start with .hide., which
stops the xattr from showing up in listxattr.
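Parsing the prefix and tag could look like this sketch; the prefix and tag strings follow the commit text, while the function name and flag values are illustrative assumptions:

```c
/* Sketch of recognizing the scoutfs. prefix and a following tag in an
 * xattr name. */
#include <assert.h>
#include <string.h>

#define TAG_HIDE 0x1	/* xattr doesn't show up in listxattr */

/* returns a mask of recognized tags after the "scoutfs." prefix, or 0
 * if the name isn't in our prefix or has no known tag */
static int parse_xattr_tags(const char *name)
{
	static const char prefix[] = "scoutfs.";
	int flags = 0;

	if (strncmp(name, prefix, sizeof(prefix) - 1) != 0)
		return 0;
	name += sizeof(prefix) - 1;

	if (strncmp(name, "hide.", 5) == 0)
		flags |= TAG_HIDE;

	return flags;
}
```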
Signed-off-by: Zach Brown <zab@versity.com>
An elected leader writes a quorum block showing that it's elected before
it assumes exclusive access to the device and starts bringing up the
server. This lets another later elected leader find and fence it if
something happens.
Other mounts were trying to connect to the server once this elected
quorum block was written and before the server was listening. They'd
get connection refused, decide to elect a new leader, and try to fence
the server that's still running.
Now, they should have tried much harder to connect to the elected leader
instead of taking a single failed attempt as fatal. But that's a
problem for another day that involves more work in balancing timeouts
and retries.
But mounts shouldn't try to connect to the server until it's
listening. That's easy to signal by adding a simple listening flag to
the quorum block. Now mounts will only try to connect once they see the
listening flag and don't see these racy refused connections.
Signed-off-by: Zach Brown <zab@versity.com>
We have trace points as blocks are read, but the reads are cached as
buffer heads. The iteration helpers reference cached blocks a few
times in each voting cycle, so we end up tracing cached read blocks
multiple times. This uses a bit on the buffer_head to only trace a
cached block the first time it's read.
Signed-off-by: Zach Brown <zab@versity.com>
Add some sysfs files which show quorum state. We store the state in
quorum_info off the super, which is updated as we participate in
elections.
Signed-off-by: Zach Brown <zab@versity.com>
Add some helpers to manage the lifetime of groups of attributes in
sysfs. We can wait until the sysfs files are no longer in use
before tearing down the data that they rely on.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can be used by userspace to restore a file to its
offline state. To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.
Signed-off-by: Zach Brown <zab@versity.com>
Somewhere in the mists of time (around when we removed path tracking
which held refs to blocks?) walking blocks to migrate started leaking
btree block references. The caller provided a pointer so the walk gave
it the block it found, but the caller was never dropping that ref.
The caller wasn't doing anything with the result of the walk, so we
just don't provide a block pointer and the walk drops the ref for us.
This stops the ref leak, which was effectively pinning the ring in
memory.
Signed-off-by: Zach Brown <zab@versity.com>
To avoid overwriting live btree blocks we have to migrate them between
halves of the ring. Each time we cross into a new half of the ring we
start migration all over again.
The intent was to slowly migrate the blocks over time. We'd track
dirty blocks that came from the old and current halves and keep them in
balance. This would keep the overhead of migration low and spread it
out through the half instead of concentrating it all at the start of
the half.
But the calculation of current blocks was completely wrong. It checked
the newly allocated block which is always in the current half. It never
thought it was dirtying old blocks so it'd constantly migrate trying to
find them. We'd effectively migrate every btree block during the first
transaction in each half.
This calculates if we're dirtying old or new blocks by the source of the
cow operation. We now recognize when we dirty old blocks and will stop
migrating once we've migrated at least as many old blocks as we've
written new blocks.
Signed-off-by: Zach Brown <zab@versity.com>
We were seeing ring btree corruption that manifested as the server seeing
stale btree blocks as it tried to read all the btrees to migrate blocks
during a write. A block it tried to read didn't match its reference.
It turned out that block wasn't being migrated. It would get stuck
at a position in the ring. Eventually new block writes would overwrite
it and then the next read would see corruption.
It wasn't being migrated because the block reading function didn't
realize that it had to migrate a dirty block. The block was written in
a transaction at the end of the ring. The ring wrapped during
the transaction and then migration tried to migrate the dirty block.
It wouldn't be dirtied, and thus migrated, because it was already
dirty in the transaction.
The fix is to add more cases to the dirtying decision that take
migration specifically into account. We no longer short circuit
dirtying blocks for migration when they're in the old half of the ring,
even when they're already dirty.
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl, it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
Signed-off-by: Zach Brown <zab@versity.com>
This adds some minor functionality to the per_task API for use by the
upcoming offline waiting work.
Add scoutfs_per_task_add_excl() so that a caller can tell if their task
was already put on a per-task list by their caller.
Make scoutfs_per_task_del() return a bool to indicate if the entry was
found on a list and was in fact deleted, or not.
Add scoutfs_per_task_init_entry() for initializing entries that aren't
declared on the stack.
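The three additions could be sketched as follows in userspace, with a plain linked list keyed by a task pointer standing in for the kernel implementation; the struct layouts are illustrative even though the helper names mirror the commit:

```c
/* Sketch of the per_task API additions described above. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct per_task_entry {
	struct per_task_entry *next;
	void *task;	/* stands in for struct task_struct * */
};

struct per_task_list {
	struct per_task_entry *head;
};

/* initialize an entry that wasn't declared on the stack */
static void per_task_init_entry(struct per_task_entry *ent)
{
	ent->next = NULL;
	ent->task = NULL;
}

/* add only if the task isn't already on the list; returns true if the
 * entry was added, false if a caller already added one for this task */
static bool per_task_add_excl(struct per_task_list *list,
			      struct per_task_entry *ent, void *task)
{
	for (struct per_task_entry *i = list->head; i; i = i->next)
		if (i->task == task)
			return false;

	ent->task = task;
	ent->next = list->head;
	list->head = ent;
	return true;
}

/* returns true if the entry was found on the list and deleted */
static bool per_task_del(struct per_task_list *list,
			 struct per_task_entry *ent)
{
	for (struct per_task_entry **p = &list->head; *p; p = &(*p)->next) {
		if (*p == ent) {
			*p = ent->next;
			return true;
		}
	}
	return false;
}
```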
Signed-off-by: Zach Brown <zab@versity.com>
Since fill_super was originally written we've added use of buffer_head
IO by the btree and quorum voting. We forgot to set the block size so
devices that didn't have the common 4k default, matching our block size,
would see errors. Explicitly set it.
Signed-off-by: Zach Brown <zab@versity.com>
It was a mistake to use a non-zero elected_nr as the indication that a
slot is considered actively elected. Zeroing it as the server shuts
down wipes the elected_nr and means that it doesn't advance as each
server is elected. This then causes a client connecting to a new server
to be confused for a client reconnecting to a server after the server
has timed it out and destroyed its state. This caused reconnection
after shutting down a server to fail and clients to loop reconnecting
indefinitely.
This instead adds flags to the quorum block and assigns a flag to
indicate that the slot should be considered active. It's cleared by
fencing and by the client as the server shuts down.
Signed-off-by: Zach Brown <zab@versity.com>
Now that a mount's client is responsible for electing and starting a
server we need to be careful about coordinating unmount. We can't
let unmounting clients leave the remaining mounted clients without
quorum.
The server carefully tracks who is mounted and who is unmounting while
it is processing farewell requests. It only sends responses to voting
mounts while quorum remains or once all the voting clients are trying
to unmount.
We use a field in the quorum blocks to communicate to the final set of
unmounting voters that their farewells have been processed and that they
can finish unmounting without trying to re-establish quorum.
The commit introduces and maintains the unmount_barrier field in the
quorum blocks. It is passed to the server from the election, the
server sends it to the client and writes new versions, and the client
compares what it received with what it sees in quorum blocks.
The commit then has the clients send their unique name to the server,
which stores it in persistent mounted client records and compares the
names to the quorum config when deciding which farewell requests
can be responded to.
Now that farewell response processing can block for a very long time it
is moved off into async work so that it doesn't prevent net connections
from being shut down and re-established. This also makes it easier to
make global decisions based on the count of pending farewell requests.
Signed-off-by: Zach Brown <zab@versity.com>
We were relying on a cute (and probably broken) trick of defining
pointers to unaligned base types with __packed. Modern versions of gcc
warn about this.
Instead we either directly access unaligned types with get_ and
put_unaligned, or we copy unaligned data into aligned copies before
working with it.
Signed-off-by: Zach Brown <zab@versity.com>
Currently the server tracks the outstanding transaction sequence numbers
that clients have open in a simple list in memory. The list isn't
properly cleaned up if a client unmounts, and a new server that takes
over after a crash won't know about open transaction sequence numbers.
This stores open transaction sequence numbers in a shared persistent
btree instead of in memory. It removes tracking for clients as they
send their farewell during unmount. A new server that starts up will
see existing entries for clients that were created by old servers.
This fixes a bug where a client who unmounts could leave behind a
pending sequence number that would never be cleaned up and would
indefinitely limit the visibility of index items that came after it.
Signed-off-by: Zach Brown <zab@versity.com>
The macro for producing trace args for an ipv4 address had a typo when
shifting the third octet down before masking.
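A corrected version of such a macro could look like this sketch, where each octet is shifted down before masking; the macro name is an illustrative assumption:

```c
/* Sketch of a trace arg macro that expands an ipv4 address into four
 * octets for a "%u.%u.%u.%u" format; every octet must be shifted down
 * before the 0xff mask is applied. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define IPV4_OCTETS(a)                  \
	(unsigned)(((a) >> 24) & 0xff), \
	(unsigned)(((a) >> 16) & 0xff), \
	(unsigned)(((a) >> 8) & 0xff),  \
	(unsigned)((a) & 0xff)
```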
Signed-off-by: Zach Brown <zab@versity.com>
The server's listening address is fixed by the raft config in the super
block. If it shuts down and rapidly starts back up, it needs to bind
to the still-lingering address.
Signed-off-by: Zach Brown <zab@versity.com>
Generate unique trace events on the send and recv side of each message
sent between nodes. This can be used to reasonably efficiently
synchronize the monotonic clock in trace events between nodes given only
their captured trace events.
Signed-off-by: Zach Brown <zab@versity.com>
The message formatter adds a newline so callers don't have to. But
sometimes they do and we get double newlines. Add a build check that
the format string doesn't end in a newline so that we stop adding these.
And fix up all the current offenders.
Signed-off-by: Zach Brown <zab@versity.com>
When a server crashes all the connected clients still have operational
locks and can be using them to protect IO. As a new server starts up
its lock service needs to account for those outstanding locks before
granting new locks to clients.
This implements lock recovery by having the lock service recover locks
from clients as it starts up.
First the lock service stores records of connected clients in a btree
off the super block. Records are added as the server receives their
greeting and are removed as the server receives their farewell.
Then the server checks for existing persistent records as it starts up.
If it finds any it enters recovery and waits for all the old clients to
reconnect before resuming normal processing.
We add lock recovery request and response messages that are used to
communicate locks from the clients to the server.
Signed-off-by: Zach Brown <zab@versity.com>
The current networking code has loose reliability guarantees. If a
connection between the client and server is broken then the client
reconnects as though it's an entirely new connection. The client resends
requests but no responses are resent. A client's requests could be
processed twice on the same server. The server throws away disconnected
client state.
This was fine, sort of, for the simple requests we had implemented so
far. It's not good enough for the locking service which would prefer to
let networking worry about reliable message delivery so it doesn't have
to track and replay partial state across reconnection between the same
client and server.
This adds the infrastructure to ensure that requests and responses
between a given client and server will be delivered across reconnected
sockets and will only be processed once.
The server keeps track of disconnected clients and restores state if the
same client reconnects. This required some work around the greetings so
that clients and servers can recognize each other. Now that the server
remembers disconnected clients we add a farewell request so that servers
can forget about clients that are shutting down and won't be
reconnecting.
Now that connections between the client and server are preserved we can
resend responses across reconnection. We add outgoing message sequence
numbers which are used to drop duplicates and communicate the received
sequence back to the sender to free responses once they're received.
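The receive side of the duplicate-dropping described above can be sketched as follows; it assumes in-order delivery within a socket (as TCP provides), and the names are illustrative:

```c
/* Sketch of dropping duplicate messages by outgoing sequence number,
 * as resent messages can arrive again after reconnection. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct recv_state {
	uint64_t last_seq;	/* highest sequence processed so far */
};

/* returns true if the message should be processed, false if it's a
 * duplicate of one already seen; last_seq is also communicated back to
 * the sender so it can free responses once they're received */
static bool recv_should_process(struct recv_state *rs, uint64_t seq)
{
	if (seq <= rs->last_seq)
		return false;
	rs->last_seq = seq;
	return true;
}
```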
When the client is reconnecting to a new server it resets its receive
state that was dependent on the old server and it drops responses which
were being sent to a server instance which no longer exists.
This stronger reliable messaging guarantee will make it much easier
to implement lock recovery which can now rewind state relative to
requests that are in flight and replay existing state on a new server
instance.
Signed-off-by: Zach Brown <zab@versity.com>
The super block had a magic value that was used to identify that the
block should contain our data structure. But it was called an 'id'
which was confused with the header fsid in the past. Also, the btree
blocks aren't using a similar magic value at all.
This moves the magic value into the header and creates values for the
super block and btree blocks. Both are written but the btree block
reads don't check the value.
Signed-off-by: Zach Brown <zab@versity.com>
The network greeting exchange was mistakenly using the global super
block magic number instead of the per-volume fsid to identify the
volumes that the endpoints are working with. This prevented the check
from doing its only job: to fail when clients in one volume try to
connect to a server in another.
Signed-off-by: Zach Brown <zab@versity.com>