The xattr code had a static definition of the largest part item that it
would create. Change it to be a function of the largest fs item
value that can be created and clean up the code a bit in the process.
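A minimal sketch of the resulting shape (the names, the header struct, and the size here are hypothetical, not the real scoutfs definitions):

    #include <linux/types.h>

    struct xattr_part_header {              /* hypothetical part header */
            __u8 part;                      /* index of this part */
            __u8 last;                      /* set on the final part */
    };

    /* the largest part payload is derived from the largest item value
     * rather than being its own magic number */
    #define MAX_ITEM_VAL_SIZE       1024    /* assumed largest item value */
    #define XATTR_MAX_PART_SIZE     (MAX_ITEM_VAL_SIZE - \
                                     sizeof(struct xattr_part_header))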
Signed-off-by: Zach Brown <zab@versity.com>
We are no longer storing individual extents in items from multiple
places and indexed in multiple ways. We can remove this extent support
code.
Signed-off-by: Zach Brown <zab@versity.com>
The btree forest item storage doesn't have as much item granular state
as the item cache did. The item cache could tell if a cached item was
populated from persistent storage or was created in memory. It could
simply remove created items rather than leaving behind a deletion item.
The cached btree blocks in the btree forest item storage mechanism can't
do this. The forest has to create deletion items when deleting newly
created items because it doesn't know whether the item already exists in
the persistent record or not.
This created a problem with the extent storage we were using. The
individual extent items were stored with a key set to the last logical
block of their extent. As extents grew or shrank they often were
deleted and created at different key values during a transaction. In
the btree forest log trees this left a huge stream of deletion items
behind, one for every previous version of the extent. Searches for an
extent covering a block would then have to skip over all of these
deletion items before hitting the current stored extent.
Streaming writes ended up doing O(n) work for every extent operation,
which quickly got out of hand. This large change solves the problem by
using coarser and more stable item storage to track free blocks and
blocks
mapped into file data.
For file data we now have large packed extent items which store packed
representations of all the logical mappings of a fixed region of a file.
The data code has loading and storage functions which transfer that
persistent version to and from the version that is modified in memory.
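A rough sketch of what such an item's value could look like (the struct names, fields, and packing here are assumptions for illustration, not the real scoutfs format):

    #include <linux/types.h>

    /* one extent within the item's fixed logical region */
    struct packed_extent {
            __le64 blkno;                   /* first device block, 0 for a hole */
            __le64 count;                   /* blocks in this extent */
            __u8 flags;
    } __attribute__((packed));

    /* the item value: every mapping in the region, packed together */
    struct packed_extent_list {
            __le32 nr;                      /* extents stored */
            struct packed_extent ext[];     /* sorted by logical offset */
    } __attribute__((packed));

Because the item's key only depends on which fixed region it covers, growing or shrinking an extent rewrites the item in place instead of deleting and recreating items at shifting key values.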
Free blocks are stored in bitmaps that are similarly efficiently packed
into fixed-size items. The client no longer works with free extent
items managed by the forest; it works with free block bitmap btrees
directly. This requires access to the client's metadata block allocator
and block write contexts, so we move those two out of the forest code
and up into the transaction.
Previously the client and server would exchange extents with network
messages. Now the roots of the btrees that store the free block bitmap
items are communicated along with the roots of the other trees involved
in a transaction. The client doesn't need to send free extents back to
the server, so we can remove those tasks and RPCs.
The server no longer has to manage free extents. It transfers block
bitmap items between trees around commits. All of its extent
manipulation can be removed.
The item size portion of transaction item counts is removed because
we're not using that level of granularity now that metadata transactions
are dirty btree blocks instead of dirty items packed into fixed-size
segments.
Signed-off-by: Zach Brown <zab@versity.com>
We need a way to compare two items in different log btrees and learn
which is the most recent. Each time we grant a new write lock we give
it a larger write version. Items store the version of the lock they're
written under. Readers can now easily see which item is newer.
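A minimal sketch of the mechanics (the counter and helper names are placeholders):

    #include <linux/atomic.h>
    #include <linux/types.h>

    /* each newly granted write lock gets a strictly larger version */
    static u64 grant_write_version(atomic64_t *next_write_version)
    {
            return atomic64_inc_return(next_write_version);
    }

    /* readers keep whichever copy of an item was written under the
     * greater lock write version */
    static bool item_is_newer(u64 vers, u64 other_vers)
    {
            return vers > other_vers;
    }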
This is a trivial initial implementation which is not consistent across
unmount or server failover. We'll need to recover the greatest
write_version from locks during recovery and from log trees as the
server starts up.
Signed-off-by: Zach Brown <zab@versity.com>
Add a btree_update variant which will insert the item if a previous
item wasn't found instead of returning -ENOENT. This saves callers from
having to perform a lookup before updating to discover whether they
should call _create or _update.
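The caller pattern this avoids looks roughly like the following; the function names and arguments are illustrative, and "_force" stands in for whatever the new variant ends up being called:

    /* before: probe so we know whether to create or update */
    ret = scoutfs_btree_lookup(sb, root, key, &val_ref);
    if (ret == -ENOENT)
            ret = scoutfs_btree_insert(sb, root, key, val, val_len);
    else if (ret == 0)
            ret = scoutfs_btree_update(sb, root, key, val, val_len);

    /* after: one call updates the item, inserting it if it's missing */
    ret = scoutfs_btree_force(sb, root, key, val, val_len);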
Signed-off-by: Zach Brown <zab@versity.com>
Use simple buffer_heads to read and write the super. After getting rid
of the lsm code this would be the last user of our bio helpers. With
this converted we can remove the bio helpers along with the rest of the
lsm code.
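A minimal sketch of the read side (the block number constant is a placeholder, struct scoutfs_super_block comes from the format header, and all verification is omitted); the write side is the mirror image using mark_buffer_dirty() and sync_dirty_buffer():

    #include <linux/buffer_head.h>
    #include <linux/errno.h>
    #include <linux/fs.h>
    #include <linux/string.h>

    #define SUPER_BLKNO 1   /* placeholder for the super's on-disk location */

    static int read_super(struct super_block *sb,
                          struct scoutfs_super_block *super)
    {
            struct buffer_head *bh;

            bh = sb_bread(sb, SUPER_BLKNO);
            if (!bh)
                    return -EIO;

            memcpy(super, bh->b_data, sizeof(*super));
            brelse(bh);
            return 0;
    }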
Signed-off-by: Zach Brown <zab@versity.com>
Transaction commit now has to ask the forest to write the dirty btrees
instead of writing dirty items in segments. It also determines whether
holds fit in the dirty transaction by looking at dirty btree blocks
instead of item counts.
Locking no longer has to invalidate a private item cache because the
forest paths use the btree block cache, where inconsistent blocks are
discovered and invalidated as they are read.
Signed-off-by: Zach Brown <zab@versity.com>
Convert fs callers to work with the btree forest calls instead of the
lsm item cache calls. This is mostly a mechanical syntax conversion.
The one behavioral change is that the inode dirtying path now updates
the item rather than simply dirtying it.
Signed-off-by: Zach Brown <zab@versity.com>
The forest code presents a consistent item interface that's implemented
on top of a forest of persistent btrees.
Signed-off-by: Zach Brown <zab@versity.com>
Add a simple start of a command that the client will use to commit its
dirty trees. This'll be expanded in the future to include more trees
and block allocation.
Signed-off-by: Zach Brown <zab@versity.com>
Teach the server to maintain and use its block allocator and writer
contexts when operating on its btrees.
The manifest tree operations aren't updated because they're about to be
removed.
Signed-off-by: Zach Brown <zab@versity.com>
Convert the btree to use our block cache, block allocation, and the
caller's explicit dirty block tracking writer context instead of the
ring. This is in preparation for the btree forest format, where there
are multiple concurrent writers of independent, dynamically sized btrees
instead of only the server writing one btree with a fixed maximum size.
All the machinery for tracking dirty blocks in the ring and migrating
them is no longer needed.
Signed-off-by: Zach Brown <zab@versity.com>
Previous versions of the system had a simple block cache. This brings
it back with support for blocks that are larger than page size, a more
efficient LRU, and an explicit writer context.
Signed-off-by: Zach Brown <zab@versity.com>
The btree block header had some aggressively small fields that limited
the largest block size that could be supported. Use larger 32-bit
values so that we can support larger block sizes.
Signed-off-by: Zach Brown <zab@versity.com>
It turns out that the sorting performed by btree block item compaction
was pretty expensive. It's cheaper to keep the items packed at the end
of the block by moving earlier items towards the back of the block as
interior items are deleted. When the items are always packed at the end
of the block we no longer need to track fragmented free space and can
remove the 'free_reclaim' btree block field.
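A sketch of deletion under that scheme (names and layout are illustrative): items occupy [free_end, block_size), and removing one slides everything in front of it toward the tail, so the items stay packed without a sort. The item offsets recorded in the block header are adjusted to match.

    #include <linux/string.h>
    #include <linux/types.h>

    /* delete the item stored at [off, off + len) from a block whose
     * items are packed at the tail beginning at *free_end */
    static void delete_tail_packed(char *blk, unsigned int *free_end,
                                   unsigned int off, unsigned int len)
    {
            memmove(blk + *free_end + len, blk + *free_end,
                    off - *free_end);
            *free_end += len;
    }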
This brought the bulk empty file create rate up by about 20%.
Signed-off-by: Zach Brown <zab@versity.com>
The previous _btree_update implementation required that the new value be
the same length as the old value. This change allows a new updated item
to be a different length. It performs the btree walk assuming that the
item will be larger so that there's room for the difference. It
doesn't search for the size of the existing item so it doesn't know if
the new item is smaller. It might leave the dirty leaf under the low
water mark, which is fine.
Signed-off-by: Zach Brown <zab@versity.com>
The item code had a manual comparison of lock modes when testing if a
given access was protected by a held lock. Let's offer a proper
interface from the lock code.
Signed-off-by: Zach Brown <zab@versity.com>
The modern upstream kernel has a ->iterate() readdir file_operations
method which takes a context and calls dir_emit(). We add some
kernelcompat helpers to juggle the various function definitions, types,
and arguments to support both the old ->readdir(filldir) and the new
->iterate(ctx) interfaces.
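A sketch of the shape of the shim (the KC_ names are hypothetical; dir_emit() and filldir_t are the real kernel interfaces being wrapped):

    /* new kernels: ->iterate(file, ctx) and dir_emit(ctx, ...) */
    #ifdef KC_HAVE_ITERATE
    #define KC_DECLARE_READDIR(name, file, dirent, ctx)                 \
            int name(struct file *file, struct dir_context *ctx)
    #define kc_dir_emit(ctx, dirent, name, len, pos, ino, type)         \
            dir_emit(ctx, name, len, ino, type)
    #else
    /* old kernels: ->readdir(file, dirent, filldir) */
    #define KC_DECLARE_READDIR(name, file, dirent, ctx)                 \
            int name(struct file *file, void *dirent, filldir_t ctx)
    #define kc_dir_emit(ctx, dirent, name, len, pos, ino, type)         \
            (ctx(dirent, name, len, pos, ino, type) == 0)
    #endif

In both cases kc_dir_emit() returns true when the caller should keep emitting entries.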
Signed-off-by: Zach Brown <zab@versity.com>
Usually lock_free() is called as users finish with a lock and its state
shows that it is idle, so it can't be freed out from under another
user.
During shutdown we manually call lock_free() on all locks because
shutdown promises that there will be no more lock users, including
networking callbacks. But there is a case where network requests can be
pending and we shut down before waiting for their reply. This trips the
BUG_ON assertions in lock_free() that would otherwise catch unsafe
calls.
This is easiest to reproduce by interrupting a mount (which is waiting
on a lock to read the root inode).
The fix is to update each lock's state during shutdown to reflect the
promise made by shutdown. Requests aren't actually pending because
we've shut down networking before getting here.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs f.000000.r.200d94 error: Unknown or malformed option, "server_address=192.168.31.220"
Should be server_addr, fix it.
Cc: Zach Brown <zab@versity.com>
Fixes: 10c32 ("scoutfs: update README.md for server_addr")
Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
Update the instructions for starting up a system with the quorum count
mkfs option and server_addr mount option.
Signed-off-by: Zach Brown <zab@versity.com>
In a previous commit ("1bd094f scoutfs: migrate dirty btree blocks
during wrap") we fixed a bug where we wouldn't migrate blocks from the
old half of the ring because they were already dirty in memory. The fix
accidentally introduced the case where we wouldn't dirty blocks when
migrating if they were already in the current half.
We always have to dirty parent blocks when migrating because we might
need to modify them to reference the new location of child blocks that
are migrated. This bug meant that we'd modify clean blocks in memory
which would never make it to the persistent copy. The system could
survive as long as it never read that block back from its persistent
location. To see the corruption you'd either need tall btrees to be
shared between mounts or you'd need one mount to evict its clean
(actually modified) cached btree block under memory pressure and then
try to read it back.
Signed-off-by: Zach Brown <zab@versity.com>
It's possible to trigger stale segment reads during compaction. This
shouldn't be possible during regular operation because the server
protects the input segments while the compaction is pending. Stale
segment reads can only happen to client reads which aren't serialized
with segment allocation and writes.
Warn if we see a stale segment read during compaction. It means that we
either have a bug in the server or someone armed a stale segment read
trigger that hit compaction.
Signed-off-by: Zach Brown <zab@versity.com>
Lockdep gets angry when we try to destroy an accepted conn workqueue
from within work in a listening conn's workqueue. It doesn't recognize
that they have a hierarchical relationship that maintains a consistent
order and we can't get at the workqueue lockdep_map to set subclasses.
We add a destroy workqueue which will have its own class.
Signed-off-by: Zach Brown <zab@versity.com>
Lock recovery is perfectly normal if a server is unmounted and another
is elected to take its place. Turn the lock recovery message into an
info message instead of a warning and add another info message when lock
recovery is complete.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can fill a user struct with file system info. We're
going to use this to find the fsid and rid of a mount.
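The argument struct could be as simple as this sketch (the name and fields are assumptions and the real ioctl may carry more):

    #include <linux/types.h>

    /* filled in by the ioctl so tools can identify the mount */
    struct scoutfs_ioctl_fs_info {
            __u64 fsid;     /* filesystem identity from the super */
            __u64 rid;      /* random id of this mount instance */
    };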
Signed-off-by: Zach Brown <zab@versity.com>
Use the client's rid in networking instead of the node_id.
The node_id no longer has to be allocated by the server and sent in the
greeting. Instead the client sends its rid to the server in its
greeting. The server then uses the client's announced rid just like it
used to use the node_id: to record clients in the btree and to identify
clients in send and receive processing.
The use of the rid in networking calls makes its way to locking and
compaction, which now use the rid to identify clients instead of the
node_id.
Signed-off-by: Zach Brown <zab@versity.com>
The current quorum voting implementation had some rough edges that
increased the complexity of the system and introduced undesirable
failure modes. We can keep the same basic pattern but move some
functionality around and rethink the quorum voting to end up with a
meaningfully simpler system.
The motivation for this work was to remove the need to provide a
uniq_name option for every mount instance.
The first big change is to remove the idea of static configuration slots
for mounts. This removes the use of uniq_name. Mounts now simply have
a server_addr mount option instead of using their uniq_name to find
their address in the configuration.
The server can no longer check a configuration to see if a given
connected client's name is found in the quorum config. Instead, clients
set a flag in their sent greeting which indicates that they're a voter.
This removes the uniq_name from the greeting and mounted client
records.
Without a static configuration mounts no longer have dedicated block
locations to write to. We increase the size of the region of quorum
blocks and have voters simply write to a random block. Overwriting vote
blocks is OK because we move from heartbeating design patterns to a
protocol strongly based on Raft's election. We're using quorum blocks
to communicate votes instead of network messages, and overwriting blocks
is analogous to a lossy network dropping vote messages in the Raft
election protocol.
We were using the dedicated per-mount quorum blocks to track mounts that
had been elected and needed to be fenced. We no longer have that
storage, so instead we add the idea of an election log that is stored
in every voting block. Readers merge the logs from all the blocks they
read and write the resulting merged log into their own block.
With no static quorum configuration we no longer have to worry about the
complexity of changing the slot configurations while they're in use.
The only persistent configuration is the number of votes a candidate
needs to be elected by a quorum.
It was a mistake to use quorum voting blocks to communicate state
between the server and the quorum voters. We can easily move the
unmount_barrier, server address, and fencing state from the quorum
blocks into the super block. The server no longer needs the quorum
election info struct to be able to later write its quorum block. It
instead writes a few fields in the super. There's only one place where
clients need to look to find out who they should connect to or if they
can finish unmount.
Signed-off-by: Zach Brown <zab@versity.com>
Add a server_addr mount option that takes an ipv4 address. This will be
used by the upcoming changes to quorum voting to indicate that a mount
should participate in voting and to specify the address that its server
should listen on.
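Parsing would follow the usual kernel option-table pattern; a sketch (the token names are ours, while match_table_t and in4_pton() are stock kernel helpers):

    #include <linux/errno.h>
    #include <linux/in.h>
    #include <linux/inet.h>
    #include <linux/parser.h>

    enum { Opt_server_addr, Opt_err };

    static const match_table_t tokens = {
            { Opt_server_addr, "server_addr=%s" },
            { Opt_err, NULL },
    };

    /* turn the option string into the address the server listens on */
    static int parse_server_addr(const char *str, struct sockaddr_in *sin)
    {
            __be32 addr;

            if (!in4_pton(str, -1, (u8 *)&addr, -1, NULL))
                    return -EINVAL;

            sin->sin_family = AF_INET;
            sin->sin_addr.s_addr = addr;
            return 0;
    }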
Signed-off-by: Zach Brown <zab@versity.com>
The pattern of advancing and writing a "dirty super" comes from the time
when the format had two persistent super blocks. One was kept in memory
and modified as changes were made. Advancing it changed which of the
two supers would be eventually written.
This no longer makes sense now that we only have one super block.
Remove the idea of advancing and writing an implicit dirty super block
that's stored in the super block info. Instead use a single
scoutfs_write_super() which takes the super block struct to write.
We still store and increment the hdr.gen in the super block. It used
to tell which of the two super blocks was more recent; now it just gives
us a little information about the life of the super block.
Signed-off-by: Zach Brown <zab@versity.com>
Add some quick functions that let us convert between our persistent
packed inet addr struct and native sockaddr_in structs.
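A sketch of one direction (the packed struct's name and field layout are assumptions); the sockaddr_in to packed direction is the mirror image:

    #include <asm/byteorder.h>
    #include <linux/in.h>
    #include <linux/types.h>

    struct scoutfs_inet_addr {              /* assumed layout */
            __le32 addr;
            __le16 port;
    } __attribute__((packed));

    static void addr_to_sin(struct sockaddr_in *sin,
                            struct scoutfs_inet_addr *ia)
    {
            sin->sin_family = AF_INET;
            sin->sin_addr.s_addr = cpu_to_be32(le32_to_cpu(ia->addr));
            sin->sin_port = cpu_to_be16(le16_to_cpu(ia->port));
    }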
Signed-off-by: Zach Brown <zab@versity.com>
Use the mount's generated random id in persistent items and the lock
that protects them instead of the assigned node_id.
Signed-off-by: Zach Brown <zab@versity.com>
Add the mount rid to traces which included the fsid by converting them
to use the super block message format and args.
Signed-off-by: Zach Brown <zab@versity.com>
Change the console message output to show the fsid:rid mount identity
instead of the block device name and device major and minor numbers.
Signed-off-by: Zach Brown <zab@versity.com>
Add macros which provide printk format and args for a little string
which identifies a specific mount. This will be used in kernel logs and
trace messages.
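A sketch of the macro pattern (the macro and field names are assumptions):

    /* one format/args pair that log and trace callers share, giving
     * an identity string like "f.<fsid>.r.<rid>" */
    #define SCSB_FMT        "f.%llx.r.%llx"
    #define SCSB_ARGS(sbi)  le64_to_cpu((sbi)->super.hdr.fsid), (sbi)->rid

    /* e.g. printk(KERN_INFO "scoutfs " SCSB_FMT ": ...\n", SCSB_ARGS(sbi)); */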
Signed-off-by: Zach Brown <zab@versity.com>
Calculate a random id which identifies the life of a particular mount.
This will be visible in messages and tracing and will replace the
server-assigned node_id in persistent structures and protocols.
Signed-off-by: Zach Brown <zab@versity.com>
Our hidden attributes are hidden so that they don't leak out of
the system when archiving tools transfer xattrs from listxattr along
with the file. They're not intended to be secret; in fact, users want
to see their contents just as they want to see other fs metadata that
describes the system but that they can't update.
Make our listxattr ioctl only return hidden xattrs and allow anyone to
see the results if they can read the file. Rename it to more
accurately describe its intended use.
Signed-off-by: Zach Brown <zab@versity.com>