scoutfs

mirror of https://github.com/versity/scoutfs.git synced 2026-01-03 10:55:20 +00:00

Author	SHA1	Message	Date
Zach Brown	a27c54568c	Handle back to back invalidation requests The client's incoming lock invalidation request handler triggers a BUG_ON if it gets a request for a lock that is already processing a previous invalidation request. The server is supposed to only send one request at a time. The problem is that the batched invalidation request handling will send responses outside of spinlock coverage before reacquirin the lock and finishing processing once the response send has been successful. This gives a window for another invalidation request to arrive after the response was sent but before the invalidation finished processing. This triggers the bug. The fix is to mark the lock such that we can recognize a valid second request arriving after we send the response but before we finish processing. If it arrives we'll continue invalidation processing with the arguments from the new request. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-22 17:00:50 -07:00
Andy Grover	cbb031bb5d	Merge pull request #32 from versity/zab/block_rhashtable_insert_fixes Zab/block rhashtable insert fixes	2021-04-13 10:42:17 -07:00
Zach Brown	c3290771a0	Block cache use rht _lookup_ insert for EEXIST The sneaky rhashtable_insert_fast() can't return -EEXIST despite the last line of the function REALLY making it look like it can. It just inserts new objects at the head of the bucket lists without comparing the insertion with existing objects. The block cache was relying on insertion to resolve duplicate racing allocated blocks. Because it couldn't return -EEXIST we could get duplicate cached blocks present in the hash table. rhashtable_lookup_insert_fast() fixes this by actually comparing the inserted objects key with the objects found in the insertion bucket. A racing allocator trying to insert a duplicate cached block will get an error, drop their allocated block, and retry their lookup. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 09:24:23 -07:00
Zach Brown	cf3cb3f197	Wait for rhashtable to rehash on insert EBUSY The rhashtable can return EBUSY if you insert fast enough to trigger an expansion of the next table size that is waiting to be rehashed in an rcu callback. If we get EBUSY from rhasthable_insert we call synchronize_rcu to wait for the rehash to complete before trying again. This was hit in testing restores of a very large namespace and took a few hours to hit. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-13 09:24:23 -07:00
Andy Grover	cb4ed98b3c	Merge pull request #31 from versity/zab/block_shrink_wait_for_rebalance Block cache shrink restart waits for rcu callbacks	2021-04-08 09:03:12 -07:00
Zach Brown	9ee7f7b9dc	Block cache shrink restart waits for rcu callbacks We're seeing cpu livelocks in block shrinking where counters show that a single block cache shrink call is only getting EAGAIN from repeated rhashtable walk attempts. It occurred to me that the running task might be preventing an RCU grace period from ending by never blocking. The hope of this commit is that by waiting for rcu callbacks to run we'll ensure that any pending rebalance callback runs before we retry the rhashtable walk again. I haven't been able to reproduce this easily so this is a stab in the dark. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-07 12:50:50 -07:00
Zach Brown	300791ecfa	Merge pull request #29 from agrover/cleanup Cleanup	2021-04-07 12:27:00 -07:00
Andy Grover	4630b77b45	cleanup: Use flexible array members instead of 0-length arrays See Documentation/process/deprecated.rst:217, items[] now preferred over items[0]. Signed-off-by: Andy Grover <agrover@versity.com>	2021-04-07 10:14:47 -07:00
Andy Grover	bdc43ca634	cleanup: Fix ESTALE handling in forest_read_items Kinda weird to goto back to the out label and then out the bottom. Just return -EIO, like forest_next_hint() does. Don't call client_get_roots() right before retry, since is the first thing retry does. Signed-off-by: Andy Grover <agrover@versity.com>	2021-04-07 10:14:04 -07:00
Andy Grover	6406f05350	cleanup: Remove struct net_lock_grant_response We're not using the roots member of this struct, so we can just use struct scoutfs_net_lock directly. Signed-off-by: Andy Grover <agrover@versity.com>	2021-04-07 10:13:56 -07:00
Andy Grover	820b7295f0	cleanup: Unused LIST_HEADs Signed-off-by: Andy Grover <agrover@versity.com>	2021-04-05 16:23:41 -07:00
Zach Brown	b3611103ee	Merge pull request #26 from agrover/tmpfile Support O_TMPFILE and allow MOVE_BLOCKS into released extents	2021-04-05 15:23:41 -07:00
Andy Grover	0deb232d3f	Support O_TMPFILE and allow MOVE_BLOCKS into released extents Support O_TMPFILE: Create an unlinked file and put it on the orphan list. If it ever gains a link, take it off the orphan list. Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges. Ioctl callers must set a new flag to enable this operation mode. RH-compat: tmpfile support it actually backported by RH into 3.10 kernel. We need to use some of their kabi-maintaining wrappers to use it: use a struct inode_operations_wrapper instead of base struct inode_operations, set S_IOPS_WRAPPER flag in i_flags. This lets RH's modified vfs_tmpfile() find our tmpfile fn pointer. Add a test that tests both creating tmpfiles as well as moving their contents into a destination file via MOVE_BLOCKS. xfstests common/004 now runs because tmpfile is supported. Signed-off-by: Andy Grover <agrover@versity.com>	2021-04-05 14:23:44 -07:00
Andy Grover	1366e254f9	Merge pull request #30 from versity/zab/srch_block_ref_leak Zab/srch block ref leak	2021-04-01 16:50:34 -07:00
Zach Brown	1259f899a3	srch compaction needs to prepare alloc for commit The srch client compaction work initializes allocators, dirties blocks, and writes them out as its transaction. It forgot to call the pre-commit allocator prepare function. The prepare function drops block references used by the meta allocator during the transaction. This leaked block references which kept blocks from being freed by the shrinker under memory pressure. Eventually memory was full of leaked blocks and the shrinker walked all of them looking blocks to free, resulting in an effective livelock that ground the system to a crawl. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-01 13:04:40 -07:00
Zach Brown	2d393f435b	Warn on leaked block refs on unmount By the time we get to destroying the block cache we should have put all our block references. Warn as we tear down the blocks if we see any blocks that still have references, implying a ref leak. This caught a leak caused by srch compaction forgetting to put allocator list block refs. Signed-off-by: Zach Brown <zab@versity.com>	2021-04-01 13:04:06 -07:00
Andy Grover	09c879bcf1	Merge pull request #25 from versity/zab/client_greeting_items_exist Zab/client greeting items exist	2021-03-16 15:57:55 -07:00
Zach Brown	3de703757f	Fix weird comment editing error That comment looked very weird indeed until I recognized that I must have forgotten to delete the first two attempts at starting the sentence. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-16 12:02:05 -07:00
Zach Brown	7d67489b0c	Handle resent initial client greetings The very first greeting a client sends is unique becuase it doesn't yet have a server_term field set and tells the server to create items to track the client. A server processing this request can create the items and then shut down before the client is able to receive the reply. They'll resend the greeting without server_term but then the next server will get -EEXIST errors as it tries to create items for the client. This causes the connection to break, which the client tries to reestablish, and the pattern repeats indefinitely. The fix is to simply recognize that -EEXIST is acceptable during item creation. Server message handlers always have to address the case where a resent message was already processed by a previous server but it's response didn't make it to the client. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-16 11:56:26 -07:00
Zach Brown	73084462e9	Remove unused client greeting_umb Remove an old client info field from the unmount barrier mechanism which was removed a while ago. It used to be compared to a super field to decide to finish unmount without reconnecting but now we check for our mounted_client item in the server's btree. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-16 10:04:42 -07:00
Zach Brown	8c81af2b9b	Merge pull request #22 from agrover/ipv6 Reserve space in superblock for IPv6 addresses	2021-03-15 16:04:26 -07:00
Andy Grover	efe5d92458	Reserve space in superblock for IPv6 addresses Define a family field, and add a union for IPv4 and v6 variants, although v6 is not supported yet. Family field is now used to determine presence of address in a quorum slot, instead of checking if addr is zero. Signed-off-by: Andy Grover <agrover@versity.com>	2021-03-12 14:10:42 -08:00
Andy Grover	d39e56d953	Merge pull request #24 from versity/zab/fix-block-stale-reads Zab/fix block stale reads	2021-03-11 09:33:03 -08:00
Zach Brown	5661a1fb02	Fix block-stale-reads test The block-stale-reads test was built from the ashes of a test that used counters and triggers to work with the btree when it was only used on the server. The initial quick translation to try and trigger block cache retries while the forest called the btree got so much wrong. It was still trying to use some 'cl' variable that didn't refer to the client any more, the trigger helpers now call statfs to find paths and can end up triggering themselves. and many more counters stale reads can happen throughout the system while we're working -- not just one from our trigger. This fixes it up to consistently use fs numbers instead of the silly stale cl variable and be less sensitive to triggers firing and counter differences. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-10 12:36:41 -08:00
Zach Brown	12fa289399	Add t_trigger_arm_silent t_trigger_arm always output the value of the trigger after arming on the premise that tests required the trigger being armed. In the process of showing the trigger it calls a bunch of t_ helpers that build the path to the trigger file using statfs_more to get the rid of mounts. If the trigger being armed is in the server's mount and the specific trigger test is fired by the server's statfs_more request processing then the trigger can be fired before read its value. Tests can inconsistently fail as the golden output shows the trigger being armed or not depending on if it was in the server's mount or not. t_trigger_arm_silent doesn't output the value of the armed trigger. It can be used for low level triggers that don't rely on reading the trigger's value to discover that their effect has happened. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-10 12:36:34 -08:00
Zach Brown	75e8fab57c	Add t_counter_diff_changed Tests can use t_counter_diff to put a message in their golden output when a specific change in counters is expected. This adds t_counter_diff_changed to output a message that indicates change or not, for tests that want to see counters change but the amount of change doesn't need to be precisely known. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-10 12:32:04 -08:00
Zach Brown	513d6b2734	Merge pull request #20 from versity/zab/remove_trans_spinlock Zab/remove trans spinlock	2021-03-04 13:59:07 -08:00
Zach Brown	f8d39610a2	Only get inode writeback_lock when adding inodes Each transaction maintains a global list of inodes to sync. It checks the inode and adds it in each write_end call per OS page. Locking and unlocking the global spinlock was showing up in profiles. At the very least, we can only get the lock once per large file that's written during a transaction. This will reduce spinlock traffic on the lock by the number of pages written per file. We'll want a better solution in the long run, but this helps for now. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-04 11:39:30 -08:00
Zach Brown	c470c1c9f6	Allow read-mostly _alloc_meta_low Each transaction hold makes multiple calls to _alloc_meta_low to see if the transaction should be committed to refill allocators before the caller's hold is acquired and they can dirty blocks in the transaction. _alloc_meta_low was using a spinlock to sample the allocator list_head blocks to determine if there was space available. The lock and unlock stores were creating significant cacheline contention. The _alloc_meta_low calls are higher frequency than allocations. We can use a seqlock to have exclusive writers and allow concurrent _alloc_meta_low readers who retry if a writer intervenes. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-04 11:39:30 -08:00
Andy Grover	cad902b9cd	Merge pull request #19 from versity/zab/block_crash_and_consistency Zab/block crash and consistency	2021-03-04 10:57:27 -08:00
Zach Brown	e163f3b099	Use atomic holders instead of trans info lock We saw the transaction info lock showing up in profiles. We were doing quite a lot of work with that lock held. We can remove it entirely and use an atomic. Instead of a locked holders count and writer boolean we can use an atomic holders and have a high bit indicate that the write_func is pending. This turns the lock/unlock pairs in hold and release into atomic inc/cmpxchg/dec operations. Then we were checking allocators under the trans lock. Now that we have an atomic holders count we can increment it to prevent the writer from commiting and release it after the checks if we need another commit before the hold. And finally, we were freeing our allocated reservation struct under the lock. We weren't actually doing anything with the reservation struct so we can use journal_info as the nested hold counter instead of having it point to an allocated and freed struct. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 14:18:04 -08:00
Zach Brown	a508baae76	Remove unused triggers As the implementation shifted away from the ring of btree blocks and LSM segments we lost callers to all these triggers. They're unused and can be removed. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:50:00 -08:00
Zach Brown	208c51d1d2	Update stale block reading test The previous test that triggered re-reading blocks, as though they were stale, was written in the era where it only hit btree blocks and everything else was stored in LSM segments. This reworks the test to make it clear that it affects all our block readers today. The test only exercise the core read retry path, but it could be expanded to test callers retrying with newer references after they get -ESTALE errors. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:50:00 -08:00
Zach Brown	9450959ca4	Protect stale block readers from local dirtying Our block cache consistency mechanism allows readers to try and read stale block references. They check block headers of the block they read to discover if it has been modified and they should retry the read with newer block references. For this to be correct the block contents can't change under the readers. That's obviously true in the simple imagined case of one node writing and another node reading. But we also have the case where the stale reader and dirtying writer can be concurrent tasks in the same mount which share a block cache. There were a two failure cases that derive from the order of readers and writers working with blocks. If the reader goes first, the writer could find the existing block in the cache and modify it while the reader assumes that it is read only. The fix is to have the writer always remove any existing cached block and insert a newly allocated block into the cache with the header fields already changed. Any existing readers will still have their cached block references and any new readers will see the modified headers and return -ESTALE. The next failure comes from readers trying to invalidate dirty blocks when they see modified headers. They assumed that the existing cached block was old and could be dropped so that a new current version could be read. But in this case a local writer has clobbered the reader's stale block and the reader should immediately return -ESTALE. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:49:59 -08:00
Zach Brown	6237f0adc5	Add _block_dirty_ref to dirty blocks in one place To create dirty blocks in memory each block type caller currently gets a reference on a created block and then dirties it. The reference it gets could be an existing cached block that stale readers are currently using. This creates a problem with our block consistency protocol where writers can dirty and modify cached blocks that readers are currently reading in memory, leading to read corruption. This commit is the first step in addressing that problem. We add a scoutfs_block_dirty_ref() call which returns a reference to a dirtied block from the block core in one call. We're only changing the callers in this patch but we'll be reworking the dirtying mechanism in an upcoming patch to avoid corrupting readers. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:49:17 -08:00
Zach Brown	f18fa0e97a	Update scoutfs print for centralized block_ref Update scoutfs print to use the new block_ref struct instead of the handful of per-block type ref structs that we had accumulated. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:49:17 -08:00
Zach Brown	0969a94bfc	Check one block_ref struct in block core Each of the different block types had a reading function that read a block and then checked their reference struct for their block type. This gets rid of each block reference type and has a single block_ref type which is then checked by a single ref reading function in the block core. By putting ref checking in the core we no longer have to export checking the block header crc, verifying headers, invalidating blocks, or even reading raw blocks themseves. Everyone reads refs and leaves the checking up to the core. The changes don't have a significant functional effect. This is mostly just changing types and moving code around. (There are some changes to visible counters.) This shares code, which is nice, but this is putting the block reference checking in one place in the block core so that in a few patches we can fix problems with writers dirtying blocks that are being read. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:49:17 -08:00
Zach Brown	b1b75cbe9f	Fix block cache shrink and read racing crash The block cache wasn't safely racing readers walking the rcu radix_tree and the shrinker walking the LRU list. A reader could get a reference to a block that had been removed from the radix and was queued for freeing. It'd clobber the free's llist_head union member by putting the block back on the lru and both the read and free would crash as they each corrupted each other's memory. We rarely saw this in heavy load testing. The fix is to clean up the use of rcu, refcounting, and freeing. First, we get rid of the LRU list. Now we don't have to worry about resolving racing accesses of blocks between two independent structures. Instead of shrinking walking the LRU list, we can mark blocks on access such that shrinking can walk all blocks randomly and expect to quickly find candidates to shrink. To make it easier to concurrently walk all the blocks we switch to the rhashtable instead of the radix tree. It also has nice per-bucket locking so we can get rid of the global lock that protected the LRU list and radix insertion. (And it isn't limited to 'long' keys so we can get rid of the check for max meta blknos that couldn't be cached.) Now we need to tighten up when read can get a reference and when shrink can remove blocks. We have presence in the hash table hold a refcount but we make it a magic high bit in the refcount so that it can be differentiated from other references. Now lookup can atomically get a reference to blocks that are in the hash table, and shrinking can atomically remove blocks when it is the only other reference. We also clean up freeing a bit. It has to wait for the rcu grace period to ensure that no other rcu readers can reference the blocks its freeing. It has to iterate over the list with _safe because it's freeing as it goes. Interestingly, when reworking the shrinker I noticed that we weren't scaling the nr_to_scan from the pages we returned in previous shrink calls back to blocks. We now divide the input from pages back into blocks. Signed-off-by: Zach Brown <zab@versity.com>	2021-03-01 09:49:15 -08:00
Zach Brown	0f14826ff8	Merge pull request #18 from versity/zab/quorum_slots_unmount Zab/quorum slots unmount	2021-02-22 13:34:25 -08:00
Zach Brown	336d521e44	Use spinlock to protect server farewell list We had a mutex protecting the list of farewell requests. The critical sections are all very short so we can use a spinlock and be a bit clearer and more efficient. While we're at it, refactor freeing to free outside of the criticial section. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00
Zach Brown	4fab75b862	Account for non-quorum in server farewell The server has to be careful to only send farewell responses to quorum clients once it knows that it won't need their vote to elect a leader to server remaining clients. The logic for doing this forgot to take non-quorum clients into account. It would send farewell requests to all the final majority of quorum members once they all tried to unmount. This could leave non-quorum clients hung in unmount trying to send their farewell requests. The fix is to count mouted_clients items for non-quorum clients and hold off on sending farewell requests to the final majority until those non-quorum clients have unmounted. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00
Zach Brown	f6f72e7eae	Resume running the mount-unmount-race test The recent quorum and unmount fixes should have addressed the failures we were seeing in the mount-unmount-race test. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00
Zach Brown	9878312b4d	Update man pages for quorum slot changes Update the man pages with descriptions of the new mkfs -Q quorum slot configuration and quorum_slot_nr mount option. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00
Zach Brown	7421bd1861	Filter all test device digits to 0 We mask device numbers in command output to 0:0 so that we can have consistent golden test output. The device number matching regex responsible for this missed a few digits. It didn't show up until we both tested enough mounts to get larger device minor numbers and fixed multi-mount consistency so that the affected tests didn't fail for other reasons. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00
Zach Brown	1db6f8194d	Update xfstests to use quorum slot options Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00
Zach Brown	2de7692336	Unmount mount point, not device Our test unmount function unmounted the device instead of the mount point. It was written this way back in an old version of the harness which didn't track mount points. Now that we have mount points, we can just unmount that. This stops the umount command from having to search through all the current mounts looking for the mountpoint for the device it was asked to unmount. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00
Zach Brown	8c1d96898a	Log wait failure in mount-unmount-race test I got a test failure where waiting returned an error, but it wasn't clear what the error was or where it might have come from. Add more logging so that we learn more about what might have gone wrong. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00
Zach Brown	090646aaeb	Update repo README.md for quorum slots Update the example configuration in the README to specify the quorum slots in mkfs arguments and mount options. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00
Zach Brown	d53350f9f1	Consistently lock server mounted_clients btree The mounted_clients btree stores items to track mounted clients. It's modified by multiple greeting workers and the farewell work. The greeting work was serialized by the farewell_mutex, but the modifications in the farewell thread weren't protected. This could result in modifications between the threads being lost if the dirty block reference updates raced in just the right way. I saw this in testing with deletions in farewell being lost and then that lingering item preventing unmount because the server thought it had to wait for a remaining quorum member to unmount. We fix this by adding a mutex specifically to protect the mounted_clients btree in the server. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00
Zach Brown	57f34e90e9	Use mounted_client item as sign of farewell As clients unmount they send a farewell request that cleans up persistent state associated with the mount. The client needs to be sure that it gets processed, and we must maintain a majority of quorum members mounted to be able to elect a server to process farewell requests. We had a mechanism using the unmount_barrier fields in the greeting and super_block to let the final unmounting quorum majority know that their farewells have been processed and that they didn't need to keep trying to reconnect. But we missed that we also need this out of band farewell handling signal for non-quorum member clients as well. The server can send farewells to a non-member client as well as the final majority and then tear down all the connections before the non-quorum client can see its farewell response. It also needs to be able to know that its farewell has been processed before the server let the final majority unmount. We can remove the custom unmount_barrier method and instead have all unmounting clients check for their mounted_client item in the server's btree. This item is removed as the last step of farewell processing so if the client sees that it has been removed it knows that it doesn't need to resend the farewell and can finish unmounting. This fixes a bug where a non-quorum unmount could hang if it raced with the final majority unmounting. I was able to trigger this hang in our tests with 5 mounts and 3 quorum members. Signed-off-by: Zach Brown <zab@versity.com>	2021-02-22 13:28:38 -08:00

1 2 3 4 5 ...

1329 Commits