scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-01-05 03:44:05 +00:00

Author	SHA1	Message	Date
Chris Kirby	91638191de	Add finer grained options to scoutfs print The default output from scoutfs print can be very large, even when using the -S option. Add three new command line options to allow more targeted selection of btrees and their items. --allocs prints the metadata and data allocators --roots allows the selection of btree roots to walk (logs, srch, fs) --items allows the selection of items to print from the selected btrees Signed-off-by: Chris Kirby <ckirby@versity.com>	2025-11-21 10:39:58 -06:00
Zach Brown	14b65c6360	Fix printing alloc list block extents The list alloc blocks have an array of blknos that are offset by a start field in the block header. The print code wasn't using that and was always referencing the beginning of the array, which could miss blocks. Signed-off-by: Zach Brown <zab@versity.com>	2025-01-22 09:57:21 -08:00
Zach Brown	1bc83e9e2d	Add indx xattr tag support to utils Signed-off-by: Zach Brown <zab@versity.com>	2024-06-28 15:09:05 -07:00
Zach Brown	e0bb6ca481	Add quota support to utils Add scoutfs cli commands for managing quotas and add its persistent structures to the print command. Signed-off-by: Zach Brown <zab@versity.com>	2024-06-28 15:09:05 -07:00
Zach Brown	4a8240748e	Add project ID support Add support for project IDs. They're managed through the _attr_x interfaces and are inherited from the parent directory during creation. Signed-off-by: Zach Brown <zab@versity.com>	2024-06-28 15:09:05 -07:00
Zach Brown	3363b4fb79	Flush device caches in buffered util cmds Add calls to our new device cache flushing helper in commands that use buffered reads. Signed-off-by: Zach Brown <zab@versity.com>	2023-01-18 10:52:02 -08:00
Zach Brown	49df98f5a8	Add skip-likely-huge print option Add an option to skip printing structures that are likely to be so huge that the print output becomes completely unwieldy on large systems. Signed-off-by: Zach Brown <zab@versity.com>	2022-07-06 15:07:57 -07:00
Zach Brown	89ca903c41	Print log trees get/commit seqs Back when we added the get/commit transaction sequence numbers to the log_trees we forgot to add them to the scoutfs print output. Signed-off-by: Zach Brown <zab@versity.com>	2022-01-19 09:21:02 -08:00
Bryant G. Duffy-Ly	95f2a87864	Fix scoutfs print <data_dev> hang If a user tries to print a data device, exit early once we see that it is a data device. Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>	2021-11-08 16:16:13 -06:00
Zach Brown	80ee2c6d57	Harden client transaction processing There are a few bad corner cases in the state machine that governs how client transactions are opened, modified, and committed. The worst problem is on the server side. All server request handlers need to cope with resent requests without causing bad side effects. Both get_log_trees and commit_log_trees would try to fully process resent requests. _get_log_trees() looks safe because it works with the log_trees that was stored previously. _commit_log_trees() is not safe because it can rotate out the srch log file referenced by the sent log_trees every time it's processed. This could create extra srch entries which would delete the first instance of entries. Worse still, by injecting the same block structure into the system multiple times it ends up causing multiple frees of the blocks that make up the srch file. The client side problems are slightly different, but related. There aren't strong constraints which guarantee that we'll only send a commit request after a get request succeeds. In crazy circumstances the commit request in the write worker could come before the first get in mount succeeds. Far worse is that we can send multiple commit requests for one transaction if it changes as we get errors during multiple queued write attempts, particularly if we get errors from get_log_trees after having successfully committed. This hardens all these paths to ensure a strict sequence of get_log_trees, transaction modification, and commit_log_trees. On the server we add _trans_seq fields to the log_trees struct so that both get_ and commit_ can see that they've already prepared a commit to send or have already committed the incoming commit, respectively. We can use the get_trans_seq field as the trans_seq of the open transaction and get rid of the entire separate mechanism we used to have for tracking open trans seqs in the clients. We can get the same info by walking the log_trees and looking at their _trans_seq fields. In the client we have the write worker immediately return success if mount hasn't opened the first transaction. Then we don't have the worker return to allow further modification until it has gotten success from get_log_trees. Signed-off-by: Zach Brown <zab@versity.com>	2021-10-28 12:30:47 -07:00
Zach Brown	95ed36f9d3	Maintain inode count in super and log trees Add a count of used inodes to the super block and a change in the inode count to the log_trees struct. Client transactions track the change in inode count as they create and delete inodes. The log_trees delta is added to the count in the super as finalized log_trees are deleted. Signed-off-by: Zach Brown <zab@versity.com>	2021-10-28 12:30:47 -07:00
Zach Brown	366f615c9f	Add support for our format version We had previously started on a relatively simple notion of an interoperability version which wasn't quite right. This fleshes out support for a more functional format version. The super blocks have a single version that defines behaviour of the running system. The code supports a range of versions and we add some initial interfaces for updating the version while the system is offline. All of this together should let us safely change the underlying format over time. Signed-off-by: Zach Brown <zab@versity.com>	2021-10-28 12:30:46 -07:00
Zach Brown	ac2587017e	Add write_nr to quorum blocks Add a write_nr field to the quorum block header which is incremented with every write. Each event also gets a write_nr field that is set to the incremented value from the header. This gives us a history of the order of event updates that isn't sensitive to misconfigured time. Signed-off-by: Zach Brown <zab@versity.com>	2021-10-28 12:30:46 -07:00
Zach Brown	ea2b01434e	Add support for i_version This adds i_version to our inode and maintains it as we allocate, load, modify, and store inodes. We set the flag in the superblock so in-kernel users can use i_version to see changes in our inodes. Signed-off-by: Zach Brown <zab@versity.com>	2021-09-13 14:41:07 -07:00
Zach Brown	b9a0f1709f	Add xattr .totl. tag Add the .totl. xattr tag. When the tag is set the end of the name specifies a total name with 3 encoded u64s separated by dots. The value of the xattr is a u64 that is added to the named total. An ioctl is added to read the totals. Signed-off-by: Zach Brown <zab@versity.com>	2021-09-13 14:41:07 -07:00
Zach Brown	a59fd5865d	Add seq and flags to btree items The fs log btrees have values that start with a header that stores the item's seq and flags. There's a lot of sketchy code that manipulates the value header as items are passed around. This adds the seq and flags as core item fields in the btree. They're only set by the interfaces that are used to store fs items: _insert_list and _merge. The rest of the btree items that use the main interface don't work with the fields. This was done to help delta items discover when logged items have been merged before the finalized log btrees are deleted and the code ends up being quite a bit cleaner. Signed-off-by: Zach Brown <zab@versity.com>	2021-09-09 14:44:55 -07:00
Zach Brown	bb571377dc	Don't merge newer items past older We have a problem where items can appear to go backwards in time because of the way we chose which log btrees to finalize and merge. Because we don't have versions in items in the fs_root, and even might not have items at all if they were deleted, we always assume items in log btrees are newer than items in the fs root. This creates the requirement that we can't merge a log btree if it has items that are also present in older versions in other log btrees which are not being merged. The unmerged old item in the log btree would take precedence over the newer merged item in the fs root. We weren't enforcing this requirement at all. We used the max_item_seq to ensure that all items were older than the current stable seq but that says nothing about the relationship between older items in the finalized and active log btrees. Nothing at all stops an active btree from having an old version of a newer item that is present in another mount's finalized log btree. To reliably fix this we create a strict item seq discontinuity between all the finalized merge inputs and all the active log btrees. Once any log btree is naturally finalized the server forces all the clients to group up and finalize all their open log btrees. A merge operation can then safely operate on all the finalized trees before any new trees are given to clients who would start using increasing item seqs. Signed-off-by: Zach Brown <zab@versity.com>	2021-08-25 10:14:38 -07:00
Zach Brown	4c1181c055	Remove first_ and last_ super blkno fields There are fields in the super block that specify the range of blocks that would be used for metadata or data. They are from the time when a single block device was carved up into regions for metadata and data. They don't make sense now that we have separate metadata and data block devices. The starting blkno is static and we go to the end of the device. This removes the fields now that they serve no purpose. The only use of them to check that freed extents fell within the correct bounds can still be performed by using the static starting number or roughly using the size of the devices. It's not perfect, but this is already only a check to see that the blknos aren't utter nonsense. We're removing the fields now to avoid having to update them while worrying about users when resizing devices. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-30 13:22:42 -07:00
Zach Brown	73bf916182	Return ENOSPC as space gets low Returning ENOSPC is challenging because we have clients working on allocators which are a fraction of the whole and we use COW transactions so we need to be able to allocate to free. This adds support for returning ENOSPC to client posix allocators as free space gets low. For metadata, we reserve a number of free blocks for making progress with client and server transactions which can free space. The server sets the low flag in a client's allocator if we start to dip into reserved blocks. In the client we add an argument to entering a transaction which indicates if we're allocating new space (as opposed to just modifying existing data or freeing). When an allocating transaction runs low and the server low flag is set then we return ENOSPC. Adding an argument to transaction holders and having it return ENOSPC gave us the opportunity to clean it up and make it a little clearer. More work is done outside the wait_event function and it now specifically waits for a transaction to cycle when it forces a commit rather than spinning until the transaction worker acquires the lock and stops it. For data the same pattern applies except there are no reserved blocks and we don't COW data so it's a simple case of returning the hard ENOSPC when the data allocator flag is set. The server needs to consider the reserved count when refilling the client's meta_avail allocator and when swapping between the two meta_avail and meta_free allocators. We add the reserved metadata block count to statfs_more so that df can subtract it from the free meta blocks and make it clear when ENOSPC is going to be returned for metadata allocations. We increase the minimum device size in mkfs so that small testing devices provide sufficient reserved blocks. And finally we add a little test that makes sure we can fill both metadata and data to ENOSPC and then recover by deleting what we filled. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-07 14:13:14 -07:00
Zach Brown	07210b5734	Reliably delete orphaned inodes Orphaned items haven't been deleted for quite a while -- the call to the orphan inode scanner has been commented out for ages. The deletion of the orphan item didn't take rid zone locking into account as we moved deletion from being strictly local to being performed by whoever last used the inode. This reworks orphan item management and brings back orphan inode scanning to correctly delete orphaned inodes. We get rid of the rid zone that was always _WRITE locked by each mount. That made it impossible for other mounts to get a _WRITE lock to delete orphan items. Instead we rename it to the orphan zone and have orphan item callers get _WRITE_ONLY locks inside their inode locks. Now all nodes can create and delete orphan items as they have _WRITE locks on the associated inodes. Then we refresh the orphan inode scanning function. It now runs regularly in the background of all mounts. It avoids creating cluster lock contention by finding candidates with unlocked forest hint reads and by testing inode caches locally and via the open map before properly locking and trying to delete the inode's items. Signed-off-by: Zach Brown <zab@versity.com>	2021-07-02 10:52:46 -07:00
Zach Brown	3488b4e6e0	Add scoutfs print support for log merge items Add support for printing all the items in the log_merge tree that the server uses to track log merging. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	c482204fcf	Clean up btree root printing in superblock Over time the printing of the btree roots embedded in the super block has gotten a little out of hand. Add a helper macro for the printf format and args and re-order them to match their order in the superblock. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	9711fef122	Update for core, trans, and item seq use We now have a core seq number in the super that is advanced for multiple users. The client transaction seq comes from the core seq so we remove the trans_seq from the super. The item version is also converted to use a seq that's derived from the core seq. Signed-off-by: Zach Brown <zab@versity.com>	2021-06-17 09:36:00 -07:00
Zach Brown	38a4a56741	Stop writing to other quorum slot blocks The core quorum work loop assumes that it has exclusive access to its slot's quorum block. It uniquely marks blocks it writes and verifies the marks on read to discover if another mount has written to its slot under the assumption that this must be a configuration error that put two mounts in the same slot. But the design of the leader bit in the block violates the invariant that only a slot will write to its block. As the server comes up and fences previous leaders it writes to their block to clear their leader bit. The final hole in the design is that because we're fencing mounts, not slots, each slot can have two mounts in play. An active mount can be using the slot and there can still be a persistent record of a previous mount in the slot that crashed that needs to be fenced. All this comes together to have the server fence an old mount in a slot while a new mount is coming up. The new mount sees the mark change and freaks out and stops participating in quorum. The fix is to rework the quorum blocks so that each slot only writes to its own block. Instead of the server writing to each fenced mount's slot, it writes a fence event to its block once all previous mounts have been fenced. We add a bit of bookkeeping so that the server can discover when all block leader fence operations have completed. Each event gets its own term so we can compare events to discover live servers. We get rid of the write marks and instead have an event that is written as a quorum agent starts up and is then checked on every read to make sure it still matches. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-31 13:10:45 -07:00
Zach Brown	877e30d60f	Add client address to mounted_client item Add the peername of the client's connected socket to its mounted_client item as it mounts. If the client doesn't recover then fencing can use the IP to find the host to fence. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-26 14:18:39 -07:00
Zach Brown	54644a5074	Add data_alloc_zone_blocks volume option Add the data_alloc_zone_blocks volume option. This changes the behaviour of the server to try and give mounts free data extents which fall in exclusive fixed-size zones. We add the field to the scoutfs_volume_options struct and add it to the set_volopt server handler which enforces constraints on the size of the zones. We then add fields to the log_trees struct which records the size of the zones and sets bits for the zones that contain free extents in the data_avail allocator root. The get_log_trees handler is changed to read all the zone bitmaps from all the items, pass those bitmaps in to _alloc_move to direct data allocations, and finally update the bitmaps in the log_trees items to cover the newly allocated extents. The log_trees data_alloc_zone fields are cleared as the mount's logs are reclaimed to indicate that the mount is no longer writing to the zone. The policy mechanism of finding free extents based on the bitmaps is implemented down in _data_alloc_move(). Signed-off-by: Zach Brown <zab@versity.com>	2021-05-21 15:31:02 -07:00
Zach Brown	9de3ae6dcb	Index free extents by order of length Allocators store free extents in two items, one sorted by their blkno position and the other by their precise length. The length index makes it easy to search for precise extent lengths, but it makes it hard to search for a large extent within a given blkno region. Skipping in the blkno dimension has to be done for every precise length value. We don't need that level of precision. If we index the extents by a coarser order of the length then we have a fixed number of orders in which we have to skip in the blkno dimension when searching within a specific region. This changes the length item to be stored at the log(8) order of the length of the extents. This groups extents into orders that are close to the human-friendly base 10 orders of magnitude. With this change the order field in the key no longer stores the precise extent length. To preserve the length of the extent we need to use another field. The only 64bit field remaining is the first which is a higher comparison priority than the type. So we use the highest comparison priority zone field to differentiate the position and order indexes and can now use all three 64bit fields in the key. Finally, we have to be careful when constructing a key to use _next when searching for a large extent. Previously keys were relying on the magic property that building a key from an extent length of 0 ended up at the key value -0 = 0. That only worked because we never stored zero length extents. We now store zero length orders so we can't use the negative trick anymore. We explicitly treat 0 length extents carefully when building keys and we subtract the order from U64_MAX to store the orders from largest to smallest. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-21 15:25:56 -07:00
Zach Brown	0aa6005c99	Add volume options super, server, and sysfs Introduce global volume options. They're stored in the superblock and can be seen in sysfs files that use network commands to get and set the options on the server. Signed-off-by: Zach Brown <zab@versity.com>	2021-05-19 14:15:06 -07:00
Zach Brown	c6fd807638	Use recov to manage lock recovery Now that we have the recov layer we can have the lock server use it to track lock recovery. The lock server no longer needs its own recovery tracking structures and can instead call recov. We add a call for the server to call to kick lock processing once lock recovery finishes. We can get rid of the persistent lock_client items now that the server is driving recovery from the mounted_client items. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 12:10:35 -07:00
Andy Grover	efe5d92458	Reserve space in superblock for IPv6 addresses Define a family field, and add a union for IPv4 and v6 variants, although v6 is not supported yet. Family field is now used to determine presence of address in a quorum slot, instead of checking if addr is zero. Signed-off-by: Andy Grover <agrover@versity.com>	2021-03-12 14:10:42 -08:00
Zach Brown	f18fa0e97a	Update scoutfs print for centralized block_ref Update scoutfs print to use the new block_ref struct instead of the handful of per-block type ref structs that we had accumulated. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:49:17 -08:00
Zach Brown	57f34e90e9	Use mounted_client item as sign of farewell As clients unmount they send a farewell request that cleans up persistent state associated with the mount. The client needs to be sure that it gets processed, and we must maintain a majority of quorum members mounted to be able to elect a server to process farewell requests. We had a mechanism using the unmount_barrier fields in the greeting and super_block to let the final unmounting quorum majority know that their farewells have been processed and that they didn't need to keep trying to reconnect. But we missed that we also need this out of band farewell handling signal for non-quorum member clients as well. The server can send farewells to a non-member client as well as the final majority and then tear down all the connections before the non-quorum client can see its farewell response. It also needs to be able to know that its farewell has been processed before the server lets the final majority unmount. We can remove the custom unmount_barrier method and instead have all unmounting clients check for their mounted_client item in the server's btree. This item is removed as the last step of farewell processing so if the client sees that it has been removed it knows that it doesn't need to resend the farewell and can finish unmounting. This fixes a bug where a non-quorum unmount could hang if it raced with the final majority unmounting. I was able to trigger this hang in our tests with 5 mounts and 3 quorum members. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00
Zach Brown	87fcad5428	Update scoutfs mkfs and print for quorum slots Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00
Andy Grover	d731c1577e	Filesystem version instead of format hash check Instead of hashing headers, define an interop version. Do not mount superblocks that have a different version, either higher or lower. Since this is pretty much the same as the format hash except it's a constant, minimal code changes are needed. Initial dev version is 0, with the intent that version will be bumped to 1 immediately prior to tagging initial release version. Update README. Fix comments. Add interop version to notes and modinfo. Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-15 10:53:00 -08:00
Andy Grover	d48b447e75	Do not set -Wpadded except for checking kmod-shared headers Remove now-unneeded manual padding in arg structs. Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-12 16:29:42 -08:00
Andy Grover	e0a2175c2e	Use argp info instead of duplicating for cmd_register() Make it static and then use it both for argp_parse as well as cmd_register_argp. Split commands into five groups, to help understanding of their usefulness. Mention that each command has its own help text, and that we are being fancy to keep the user from having to give fs path. Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-12 16:29:42 -08:00
Andy Grover	7befc61482	Implement argp support for mkfs and add --force Support max-meta-size and max-data-size using KMGTP units with rounding. Detect other fs signatures using blkid library. Detect ScoutFS super using magic value. Move read_block() from print.c into util.c since blkid also needs it. Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-12 16:29:42 -08:00
Andy Grover	6b5ddf2b3a	Implement argp support for print Print warning if printing a data dev, you probably wanted the meta dev. Change read_block to return err value. Otherwise there are confusing ENOMEM messages when pread() fails. e.g. try to print /dev/null. Signed-off-by: Andy Grover <agrover@versity.com>	2021-01-12 10:47:47 -08:00
Andy Grover	8f72d16609	scoutfs-utils: Use separate block devices for metadata and data mkfs: Take two block devices as arguments. Write everything to metadata dev, and the superblock to the data dev. UUIDs match. Differentiate by checking a bit in a new "flags" field in the superblock. Refactor device_size() a little. Convert spaces to tabs. Move code to pretty-print sizes to dev.c so we can use it in error messages there, as well as in mkfs.c. print: Include flags in output. Add -D and -M options for setting max dev sizes Allow sizes to be specified using units like "K", "G" etc. Note: -D option replaces -S option, and uses above units rather than the number of 4k data blocks. Update man pages for cmdline changes. Signed-off-by: Andy Grover <agrover@versity.com>	2020-11-19 11:41:54 -08:00
Zach Brown	66c6331131	scoutfs-utils: add max item vers to log trees Add a field to the log_trees struct which records the greatest item version seen in items in the tree. Signed-off-by: Zach Brown <zab@versity.com>	2020-10-30 11:12:52 -07:00
Zach Brown	42bf0980b6	scoutfs-utils: remove scoutfs_log_trees_val We're just using the one log_trees struct for both network messages and persistent btree item values. Signed-off-by: Zach Brown <zab@versity.com>	2020-10-30 11:12:52 -07:00
Andy Grover	5701182665	scoutfs-utils: Enable -Wpadded The compiler will complain if it sees any padding. Fix a spot in print.c for this. Signed-off-by: Andy Grover <agrover@versity.com>	2020-10-29 14:15:22 -07:00
Zach Brown	6b1dd980f0	scoutfs-utils: remove btree item owner We no longer have an owner offset trailing btree item values. Signed-off-by: Zach Brown <zab@versity.com>	2020-10-29 14:15:22 -07:00
Zach Brown	ea7c41d876	scoutfs-utils: remove free_*_blocks super fields The kernel is no longer storing the total free space in all allocators in super block fields. Signed-off-by: Zach Brown <zab@versity.com>	2020-10-26 15:19:41 -07:00
Zach Brown	838e293413	scoutfs-utils: update compaction item printing We now only use one srch file compaction struct and we store it in PENDING and BUSY key types. Signed-off-by: Zach Brown <zab@versity.com>	2020-10-26 15:19:41 -07:00
Zach Brown	23711f05f6	scoutfs-utils: alloc and data uses full extents Signed-off-by: Zach Brown <zab@versity.com>	2020-10-26 15:19:41 -07:00
Zach Brown	e2a919492d	scoutfs-utils: remove unused xattr index items We're now using the .srch. xattr tags. Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:28 -07:00
Zach Brown	f04a636229	scoutfs-utils: add support for srch Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:28 -07:00
Zach Brown	5f0dbc5f85	scoutfs-utils: remove radix _first fields The recent cleanup of the radix allocator included removing tracking of the first set bits or references in blocks. Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:28 -07:00
Zach Brown	39993d8b5f	scoutfs-utils: use larger metadata blocks Signed-off-by: Zach Brown <zab@versity.com>	2020-08-26 14:39:28 -07:00


146 Commits