That comment looked very weird indeed until I recognized that I must
have forgotten to delete the first two attempts at starting the
sentence.
Signed-off-by: Zach Brown <zab@versity.com>
The very first greeting a client sends is unique because it doesn't yet
have a server_term field set and tells the server to create items to
track the client.
A server processing this request can create the items and then shut down
before the client is able to receive the reply. The client will resend
the greeting without server_term, but then the next server will get -EEXIST
errors as it tries to create items for the client. This causes the
connection to break, which the client tries to reestablish, and the
pattern repeats indefinitely.
The fix is to simply recognize that -EEXIST is acceptable during item
creation. Server message handlers always have to address the case where
a resent message was already processed by a previous server but its
response didn't make it to the client.
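A minimal sketch of the tolerant creation path (the helper name and
arguments are illustrative, not the actual scoutfs calls):

    /* a resent greeting may find the tracking items already created */
    ret = create_client_tracking_item(sb, rid);
    if (ret == -EEXIST)
            ret = 0;        /* a previous server already processed this greeting */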
Signed-off-by: Zach Brown <zab@versity.com>
Remove an old client info field left over from the unmount barrier
mechanism, which was removed a while ago. It used to be compared to a
super field to decide to finish unmount without reconnecting, but now we
check for our mounted_client item in the server's btree.
Signed-off-by: Zach Brown <zab@versity.com>
Define a family field and add a union for IPv4 and IPv6 variants,
although IPv6 is not supported yet.
The family field is now used to determine whether an address is present
in a quorum slot, instead of checking if the addr is zero.
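Roughly the shape of the slot address, with illustrative struct and
field names:

    /* sketch: presence is decided by the family, not by a zero address */
    struct quorum_slot_addr {
            __u8 family;            /* 0 == empty slot, else AF_INET/AF_INET6 */
            union {
                    struct {
                            __be32 addr;
                            __be16 port;
                    } v4;
                    struct {
                            __u8 addr[16];  /* IPv6, not supported yet */
                            __be16 port;
                    } v6;
            };
    };

    static inline bool quorum_slot_present(struct quorum_slot_addr *sa)
    {
            return sa->family != 0;
    }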
Signed-off-by: Andy Grover <agrover@versity.com>
Each transaction maintains a global list of inodes to sync. It checks
the inode and adds it in each write_end call, once per OS page. Locking
and unlocking the global spinlock was showing up in profiles. At the
very least, we can take the lock only once per file that's written
during a transaction. This will reduce spinlock traffic on the lock by
the number of pages written per file. We'll want a better solution in
the long run, but this helps for now.
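One way to get the once-per-file behavior, sketched with hypothetical
field names (assuming entries are removed with list_del_init so the
unlocked list_empty check is meaningful):

    /* only take the global lock when the inode isn't already tracked */
    if (list_empty(&si->sync_entry)) {
            spin_lock(&sbi->sync_lock);
            if (list_empty(&si->sync_entry))
                    list_add_tail(&si->sync_entry, &sbi->sync_list);
            spin_unlock(&sbi->sync_lock);
    }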
Signed-off-by: Zach Brown <zab@versity.com>
Each transaction hold makes multiple calls to _alloc_meta_low to see if
the transaction should be committed to refill allocators before the
caller's hold is acquired and they can dirty blocks in the transaction.
_alloc_meta_low was using a spinlock to sample the allocator list_head
blocks to determine if there was space available. The lock and unlock
stores were creating significant cacheline contention.
The _alloc_meta_low calls are higher frequency than allocations. We can
use a seqlock to have exclusive writers and allow concurrent
_alloc_meta_low readers who retry if a writer intervenes.
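The shape of the change, sketched with a simple counter standing in for
the sampled allocator list_heads:

    #include <linux/seqlock.h>

    struct meta_alloc {
            seqlock_t seqlock;
            u64 avail_blocks;       /* stands in for the sampled lists */
    };

    /* writers update the available space under the write lock */
    static void meta_alloc_consume(struct meta_alloc *ma, u64 nr)
    {
            write_seqlock(&ma->seqlock);
            ma->avail_blocks -= nr;
            write_sequnlock(&ma->seqlock);
    }

    /* _alloc_meta_low readers sample without stores, retrying only if a
     * writer intervened */
    static bool meta_alloc_low(struct meta_alloc *ma, u64 needed)
    {
            unsigned int seq;
            bool low;

            do {
                    seq = read_seqbegin(&ma->seqlock);
                    low = ma->avail_blocks < needed;
            } while (read_seqretry(&ma->seqlock, seq));

            return low;
    }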
Signed-off-by: Zach Brown <zab@versity.com>
We saw the transaction info lock showing up in profiles. We were doing
quite a lot of work with that lock held. We can remove it entirely and
use an atomic.
Instead of a locked holders count and writer boolean we can use an
atomic holders count and have a high bit indicate that the write_func is
pending. This turns the lock/unlock pairs in hold and release into
atomic inc/cmpxchg/dec operations.
Then we were checking allocators under the trans lock. Now that we have
an atomic holders count we can increment it to prevent the writer from
committing and release it after the checks if we need another commit
before the hold.
And finally, we were freeing our allocated reservation struct under the
lock. We weren't actually doing anything with the reservation struct so
we can use journal_info as the nested hold counter instead of having it
point to an allocated and freed struct.
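A rough sketch of the holders encoding, assuming an illustrative high
bit and helper names:

    #define TRANS_WRITER_PENDING    (1 << 30)

    /* hold: atomically bump the count unless the writer bit is set */
    static bool trans_try_hold(atomic_t *holders)
    {
            int cur = atomic_read(holders);
            int old;

            while (!(cur & TRANS_WRITER_PENDING)) {
                    old = atomic_cmpxchg(holders, cur, cur + 1);
                    if (old == cur)
                            return true;
                    cur = old;
            }

            return false;   /* write_func pending, wait for the commit */
    }

    /* release: the final dec lets the pending write_func proceed */
    static void trans_release(atomic_t *holders)
    {
            atomic_dec(holders);
    }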
Signed-off-by: Zach Brown <zab@versity.com>
As the implementation shifted away from the ring of btree blocks and LSM
segments we lost callers to all these triggers. They're unused and can
be removed.
Signed-off-by: Zach Brown <zab@versity.com>
The previous test that triggered re-reading blocks, as though they were
stale, was written in the era where it only hit btree blocks and
everything else was stored in LSM segments.
This reworks the test to make it clear that it affects all our block
readers today. The test only exercises the core read retry path, but it
could be expanded to test callers retrying with newer references after
they get -ESTALE errors.
Signed-off-by: Zach Brown <zab@versity.com>
Our block cache consistency mechanism allows readers to try and read
stale block references. They check the header of the block they read
to discover whether it has been modified, in which case they should
retry the read with newer block references.
For this to be correct the block contents can't change under the
readers. That's obviously true in the simple imagined case of one node
writing and another node reading. But we also have the case where the
stale reader and dirtying writer can be concurrent tasks in the same
mount which share a block cache.
There were two failure cases that derive from the order in which
readers and writers work with blocks.
If the reader goes first, the writer could find the existing block in
the cache and modify it while the reader assumes that it is read only.
The fix is to have the writer always remove any existing cached block
and insert a newly allocated block into the cache with the header fields
already changed. Any existing readers will still have their cached
block references and any new readers will see the modified headers and
return -ESTALE.
The next failure comes from readers trying to invalidate dirty blocks
when they see modified headers. They assumed that the existing cached
block was old and could be dropped so that a new current version could
be read. But in this case a local writer has clobbered the reader's
stale block and the reader should immediately return -ESTALE.
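A sketch of the two sides with hypothetical cache helpers:

    /* writer: never modify a block that cached readers might hold */
    new = block_alloc(cache, blkno);
    old = block_lookup(cache, blkno);
    if (old) {
            memcpy(new->data, old->data, BLOCK_SIZE);
            block_remove(cache, old);       /* existing readers keep their refs */
            block_put(old);
    }
    new->hdr->seq = cpu_to_le64(new_seq);   /* stale refs now see a mismatch */
    block_insert(cache, new);

    /* reader with a stale ref: a dirty block means a local writer has
     * moved on, return -ESTALE instead of invalidating and re-reading */
    if (block_header_mismatch(bl, ref)) {
            if (block_is_dirty(bl))
                    return -ESTALE;
            block_invalidate(cache, bl);    /* old clean copy, re-read it */
    }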
Signed-off-by: Zach Brown <zab@versity.com>
To create dirty blocks in memory each block type caller currently gets a
reference on a created block and then dirties it. The reference it gets
could be an existing cached block that stale readers are currently
using. This creates a problem with our block consistency protocol where
writers can dirty and modify cached blocks that readers are currently
reading in memory, leading to read corruption.
This commit is the first step in addressing that problem. We add a
scoutfs_block_dirty_ref() call which returns a reference to a dirtied
block from the block core in one call. We're only changing the callers
in this patch but we'll be reworking the dirtying mechanism in an
upcoming patch to avoid corrupting readers.
Signed-off-by: Zach Brown <zab@versity.com>
Each of the different block types had a reading function that read a
block and then checked their reference struct for their block type.
This gets rid of each block reference type and has a single block_ref
type which is then checked by a single ref reading function in the block
core. By putting ref checking in the core we no longer have to export
checking the block header crc, verifying headers, invalidating blocks,
or even reading raw blocks themselves. Everyone reads refs and leaves
the checking up to the core.
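Roughly the shared shapes, with illustrative names:

    /* one ref type for every block user */
    struct block_ref {
            __le64 blkno;
            __le64 seq;
    };

    /* one reader in the block core verifies the crc and header fields
     * against the ref; callers get a verified block or an error */
    int block_read_ref(struct super_block *sb, struct block_ref *ref,
                       u32 magic, struct block **bl_ret);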
The changes don't have a significant functional effect. This is mostly
just changing types and moving code around. (There are some changes to
visible counters.)
This shares code, which is nice, but the real point is to put the block
reference checking in one place in the block core so that in a few
patches we can fix problems with writers dirtying blocks that are being
read.
Signed-off-by: Zach Brown <zab@versity.com>
The block cache wasn't safely handling races between readers walking the
rcu radix_tree and the shrinker walking the LRU list. A reader could get a reference
to a block that had been removed from the radix and was queued for
freeing. It'd clobber the free's llist_head union member by putting the
block back on the lru and both the read and free would crash as they
each corrupted each other's memory. We rarely saw this in heavy load
testing.
The fix is to clean up the use of rcu, refcounting, and freeing.
First, we get rid of the LRU list. Now we don't have to worry about
resolving racing accesses of blocks between two independent structures.
Instead of the shrinker walking the LRU list, we mark blocks on access
so that the shrinker can walk all blocks randomly and expect to quickly
find candidates to shrink.
To make it easier to concurrently walk all the blocks we switch to the
rhashtable instead of the radix tree. It also has nice per-bucket
locking so we can get rid of the global lock that protected the LRU list
and radix insertion. (And it isn't limited to 'long' keys so we can get
rid of the check for max meta blknos that couldn't be cached.)
Now we need to tighten up when a reader can get a reference and when the
shrinker can remove blocks. We have presence in the hash table hold a
refcount, but we make it a magic high bit so that it can be
differentiated from other references. Now lookup can atomically get a
reference to blocks that are in the hash table, and shrinking can
atomically remove blocks when it is the only other reference.
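The refcount encoding, sketched with an assumed bit and helper names:

    #define BLOCK_HASHED    (1 << 30)   /* held by presence in the hash table */

    /* rcu lookup: only get a reference if the block is still hashed */
    static bool block_get_if_hashed(atomic_t *refcount)
    {
            int cur = atomic_read(refcount);
            int old;

            while (cur & BLOCK_HASHED) {
                    old = atomic_cmpxchg(refcount, cur, cur + 1);
                    if (old == cur)
                            return true;
                    cur = old;
            }

            return false;   /* being removed and freed, treat as a miss */
    }

    /* shrink: drop the hash reference only if it's the sole reference */
    static bool block_remove_if_unused(atomic_t *refcount)
    {
            return atomic_cmpxchg(refcount, BLOCK_HASHED, 0) == BLOCK_HASHED;
    }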
We also clean up freeing a bit. It has to wait for the rcu grace period
to ensure that no other rcu readers can reference the blocks it's
freeing. It has to iterate over the list with _safe because it's
freeing as it goes.
Interestingly, when reworking the shrinker I noticed that we weren't
scaling the nr_to_scan from the pages we returned in previous shrink
calls back to blocks. We now divide the input from pages back into
blocks.
Signed-off-by: Zach Brown <zab@versity.com>
We had a mutex protecting the list of farewell requests. The critical
sections are all very short so we can use a spinlock and be a bit
clearer and more efficient. While we're at it, refactor freeing to free
outside of the critical section.
Signed-off-by: Zach Brown <zab@versity.com>
The server has to be careful to only send farewell responses to quorum
clients once it knows that it won't need their vote to elect a leader to
serve remaining clients.
The logic for doing this forgot to take non-quorum clients into account.
It would send farewell responses to the final majority of quorum members
once they had all tried to unmount. This could leave non-quorum clients
hung in unmount trying to send their farewell requests.
The fix is to count mounted_clients items for non-quorum clients and
hold off on sending farewell responses to the final majority until those
non-quorum clients have unmounted.
Signed-off-by: Zach Brown <zab@versity.com>
The mounted_clients btree stores items to track mounted clients. It's
modified by multiple greeting workers and the farewell work.
The greeting work was serialized by the farewell_mutex, but the
modifications in the farewell thread weren't protected. This could
result in modifications between the threads being lost if the dirty
block reference updates raced in just the right way. I saw this in
testing with deletions in farewell being lost and then that lingering
item preventing unmount because the server thought it had to wait for a
remaining quorum member to unmount.
We fix this by adding a mutex specifically to protect the
mounted_clients btree in the server.
Signed-off-by: Zach Brown <zab@versity.com>
As clients unmount they send a farewell request that cleans up
persistent state associated with the mount. The client needs to be sure
that it gets processed, and we must maintain a majority of quorum
members mounted to be able to elect a server to process farewell
requests.
We had a mechanism using the unmount_barrier fields in the greeting and
super_block to let the final unmounting quorum majority know that their
farewells have been processed and that they didn't need to keep trying
to reconnect.
But we missed that non-quorum member clients also need this out of band
farewell handling signal. The server can send farewells to a non-member
client as well as to the final majority and then tear down all the
connections before the non-quorum client can see its farewell response.
The non-quorum client also needs to be able to know that its farewell
has been processed before the server lets the final majority unmount.
We can remove the custom unmount_barrier method and instead have all
unmounting clients check for their mounted_client item in the server's
btree. This item is removed as the last step of farewell processing so
if the client sees that it has been removed it knows that it doesn't
need to resend the farewell and can finish unmounting.
This fixes a bug where a non-quorum unmount could hang if it raced with
the final majority unmounting. I was able to trigger this hang in our
tests with 5 mounts and 3 quorum members.
Signed-off-by: Zach Brown <zab@versity.com>
Previously quorum configuration specified the number of votes needed to
elect the leader. This was an excessive amount of freedom in the
configuration of the cluster which created all sorts of problems which
had to be designed around.
Most acutely, though, it required a probabilistic mechanism for mounts
to persistently record that they're starting a server so that future
servers could find and possibly fence them. They would write to a lot
of quorum blocks and trust that it was unlikely that future servers
would overwrite all of their written blocks. Overwriting was always
possible, which would be bad enough, but it also required so much IO
that we had to use long election timeouts to avoid spurious fencing.
These longer timeouts had already gone wrong on some storage
configurations, leading to hung mounts.
To fix this and other problems we see coming, like live membership
changes, we now specifically configure the number and identity of mounts
which will be participating in quorum voting. With specific identities,
mounts now have a corresponding specific block they can write to and
which future servers can read from to see if they're still running.
We change the quorum config in the super block from a single
quorum_count to an array of quorum slots which specify the address of
the mount that is assigned to that slot. The mount argument to specify
a quorum voter changes from "server_addr=$addr" to "quorum_slot_nr=$nr"
which specifies the mount's slot. The slot's address is used for UDP
election messages and TCP server connections.
Now that we specifically have configured unique IP addresses for all the
quorum members, we can use UDP messages to send and receive the vote
messages in the raft protocol to elect a leader. The quorum code doesn't
have to read and write disk block votes and is a more reasonable core
loop that either waits for received network messages or timeouts to
advance the raft election state machine.
The quorum blocks are now used for slots to store their persistent raft
term and to set their leader state. We have event fields in the block
to record the timestamp of the most recent interesting events that
happened to the slot.
Now that raft doesn't use IO, we can leave the quorum election work
running in the background. The raft work in the quorum members is
always running so we can use a much more typical raft implementation
with heartbeats. Critically, this decouples the client and election
life cycles. Quorum is always running and is responsible for starting
and stopping the server. The client repeatedly tries to connect to a
server; it has nothing to do with deciding to participate in quorum.
Finally, we add a quorum/status sysfs file which shows the state of the
quorum raft protocol in a member mount and has the last messages that
were sent to or received from the other members.
Signed-off-by: Zach Brown <zab@versity.com>
As a client unmounts it sends a farewell request to the server. We have
to carefully manage unmounting the final quorum members so that there is
always a remaining quorum to elect a leader to start a server to process
all their farewell requests.
The mechanism for doing this described these clients as "voters".
That's not really right: in our terminology, voters and candidates are
temporary roles taken on by members during a specific election term in
the raft protocol. It's more accurate to describe the final set of
clients as quorum members. They can be voters or candidates depending
on how the raft protocol timeouts work out in any given election.
So we rename the greeting flag, mounted client flag, and the code and
comments on either side of the client and server to be a little clearer.
This only changes symbols and comments, there should be no functional
change.
Signed-off-by: Zach Brown <zab@versity.com>
As we read the super we check the first and last meta and data blkno
fields. The tests weren't updated as we moved from one device to
separate metadata and data devices.
Add a helper that tests the range for the device and test both meta and
data ranges fully, instead of only testing the endpoints of each and
assuming they're related because they're living on one device.
Signed-off-by: Zach Brown <zab@versity.com>
As a core principle, all server message processing needs to be safe to
replay as servers shut down and requests are resent to new servers.
The advance_seq handler got this wrong. It would only try to remove a
trans_seq item for the seq sent by the client before inserting a new
item for the next seq. This change could be committed and then the
reply lost as the server shut down. The next server would process the
resent request but wouldn't find the old item for the seq that the
client sent, and would ignore the new item that the previous server
inserted. It would then insert another greater seq for the same client.
This would leave behind a stale old trans_seq that would be returned as
the last_seq which would forever limit the results that could be
returned from the seq index walks.
The fix is to always remove all previous seq items for the client
before inserting a new one. This creates O(clients) server work, but
it's minimal.
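A sketch of the replay-safe handler shape, with hypothetical item
helpers:

    static int handle_advance_seq(struct super_block *sb, u64 rid, u64 next_seq)
    {
            int ret;

            /* remove every old trans_seq item for this client, not just
             * the seq the client sent; a resent request can arrive after
             * a previous server already inserted a newer item */
            do {
                    ret = delete_next_trans_seq_item(sb, rid);
            } while (ret == 0);
            if (ret != -ENOENT)
                    return ret;

            return insert_trans_seq_item(sb, rid, next_seq);
    }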
This manifested as occasional simple-inode-index test failures (say 1 in
5?) which would trigger if the unmounts during previous tests would
happen to have advance_seq resent across server shutdowns. With this
change the test now reliably passes.
Signed-off-by: Zach Brown <zab@versity.com>
Farewell work is queued by farewell message processing. Server shutdown
didn't properly wait for pending farewell work to finish before tearing
down. As the server work destroyed the server's connection, the farewell
work could still be running and try to send responses down the socket.
We make the server more carefully avoid queueing farewell work if it's
in the process of shutting down and wait for farewell work to finish
before destroying the server's resources.
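The ordering, sketched with assumed fields and names:

    /* farewell processing only queues work while the server is up */
    spin_lock(&server->lock);
    if (!server->shutting_down)
            queue_work(server->wq, &server->farewell_work);
    spin_unlock(&server->lock);

    /* shutdown stops new queueing, then waits before tearing down */
    spin_lock(&server->lock);
    server->shutting_down = true;
    spin_unlock(&server->lock);
    flush_work(&server->farewell_work);
    /* now it's safe to destroy the connection and server resources */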
This fixed all manner of crashes that were seen in testing when a bunch
of nodes unmounted, creating farewell work on the server as it itself
unmounted and destroyed the server.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_srch_get_compact() is building up a compaction request which has
a list of srch files to read and sort and write into a new srch file.
It finds input files by searching for a sufficient number of similar
files: first any unsorted log files and then sorted log files that are
around the same size.
It finds the files by using btree next on the srch zone, which has key
types for unsorted srch log files and sorted srch files, but also for
pending and busy compaction items.
It was being far too cute about iterating over different key types. It
was trying to adapt to finding the next key and was making assumptions
about the order of key types. It didn't notice that the pending and
busy key types followed log and sorted and would generate EIO when it
ran into them and found their value length didn't match what it was
expecting.
Rework the next item ref parsing so that it returns -ENOENT if it gets
an unexpected key type, then have the caller look for the next key type
when it sees -ENOENT.
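The reworked parsing, roughly (names illustrative):

    /* return -ENOENT for an unexpected key type instead of misparsing
     * its value; the caller then searches for the next key type */
    ret = btree_next(iter, &key, &val);
    if (ret == 0 && key.type != want_type)
            ret = -ENOENT;          /* e.g. pending/busy compaction items */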
Signed-off-by: Zach Brown <zab@versity.com>
The grace period is intended to let lock holders squeeze in more bulk
work before another node pulls the lock out from under them. The length
of the delay is a balance between getting more work done per lock hold
and adding latency to ping-ponging workloads.
The current grace period was too short. To do work in the conflicting
case you often have to read the result that the other mount wrote as you
invalidated their lock. The test was written in the LSM world where
we'd effectively read a single level 0 1MB segment. In the btree world
we're checking bloom blocks and reading the other mount's btree. It has
more dependent read latency.
So we turn up the grace period to let conflicting readers squeeze in
more work before pulling the lock out from under them. This value was
chosen to make lock-conflicting-batch-commit pass in guests sharing nvme
metadata devices in debugging kernels.
Signed-off-by: Zach Brown <zab@versity.com>
When we're splicing in dentries in lookup we can be splicing the result
of changes on other nodes into a stale dcache. The stale dcache might
contain dir entries and the dcache does not allow aliased directories.
Use d_materialise_unique() to splice in dir inodes so that we remove all
aliased dentries which must be stale.
We can still use d_splice_alias() for all other inode types. Any
existing stale dentries will fail revalidation before they're used.
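In lookup the splice ends up looking something like:

    /* inode may be NULL for a negative entry, both helpers handle it */
    if (inode && S_ISDIR(inode->i_mode))
            return d_materialise_unique(dentry, inode);     /* drops stale aliases */

    return d_splice_alias(inode, dentry);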
Signed-off-by: Zach Brown <zab@versity.com>
The lock invalidation work function needs to be careful not to requeue
itself while we're shutting down or we can be left with invalidation
functions racing with shutdown. Invalidation calls igrab so we can end
up with an unmount warning that there are still inodes in use.
Signed-off-by: Zach Brown <zab@versity.com>
Add a new distinguishable return value (ENOBUFS) from the allocator for
when the transaction cannot allocate space. This doesn't mean the
filesystem is full -- opening a new transaction may result in forward
progress.
Alter fallocate and get_blocks code to check for this err val and retry
with a new transaction. Actual ENOSPC can still happen and is handled
as before, of course.
Add counter called "alloc_trans_retry" and increment it from both spots.
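The retry in the callers is roughly (helper names are scoutfs-ish but
illustrative):

    retry:
            ret = trans_hold(sb);
            if (ret)
                    goto out;
            ret = alloc_data_extent(sb, inode, iblock, &blkno, &count);
            trans_release(sb);
            if (ret == -ENOBUFS) {
                    scoutfs_inc_counter(sb, alloc_trans_retry);
                    /* re-acquiring the hold gives the transaction a
                     * chance to commit and refill allocators */
                    goto retry;
            }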
Signed-off-by: Andy Grover <agrover@versity.com>
[zab@versity.com: fixed up write_begin error paths]
The item cache page life cycle is tricky. There are no proper page
reference counts, everything is done by nesting the page rwlock inside
item_cache_info rwlock. The intent is that you can only reference pages
while you hold the rwlocks appropriately. The per-cpu page references
are outside that locking regime so they add a reference count. Now
there are reference counts for the main cache index reference and for
each per-cpu reference.
The end result of all this is that you can only reference pages outside
of locks if you're protected by references.
Lock invalidation messed this up by trying to add its right split page
to the lru after it was unlocked. Its page reference wasn't protected
at this point. Shrinking could be freeing that page, and so it could be
putting a freed page's memory back on the lru.
Shrinking had a little bug: it was using list_move to move an
initialized lru_head list_head. It turns out to be harmless (list_del
will just follow pointers to itself and set itself as next and prev all
over again), but boy does it catch one's eye. Let's remove all
confusion and drop the reference while holding the cinf->rwlock instead
of trying to optimize freeing outside locks.
Finally, the big one: inserting a read item after compacting the page to
make room was inserting through stale parent pointers into the old
pre-compacted page, rather than the new page that was swapped in by
compaction. This left references to a freed page in the page rbtree and
hilarity ensued.
Signed-off-by: Zach Brown <zab@versity.com>
Instead of hashing headers, define an interop version. Do not mount
superblocks that have a different version, either higher or lower.
Since this is pretty much the same as the format hash except it's a
constant, minimal code changes are needed.
Initial dev version is 0, with the intent that version will be bumped to
1 immediately prior to tagging initial release version.
Update README. Fix comments.
Add interop version to notes and modinfo.
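The mount-time check amounts to something like (constant and field
names illustrative):

    /* refuse to mount a super with a different interop version */
    if (le64_to_cpu(super->version) != SCOUTFS_INTEROP_VERSION) {
            printk(KERN_ERR "scoutfs: super version %llu, expected %llu\n",
                   (unsigned long long)le64_to_cpu(super->version),
                   (unsigned long long)SCOUTFS_INTEROP_VERSION);
            return -EINVAL;
    }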
Signed-off-by: Andy Grover <agrover@versity.com>
Add a relatively constrained ioctl that moves extents between regular
files. This is intended to be used by tasks which combine many existing
files into a much larger file without reading and writing all the file
contents.
Signed-off-by: Zach Brown <zab@versity.com>
By convention we have the _IO* ioctl definition after the argument
structs, and ALLOC_DETAIL got it a bit wrong, so move it down.
Signed-off-by: Zach Brown <zab@versity.com>
This more closely matches the stage ioctl and other conventions.
Also change release code to use offset/length nomenclature for consistency.
Signed-off-by: Andy Grover <agrover@versity.com>
With many concurrent writers we were seeing excessive commits forced
because it thought the data allocator was running low. The transaction
was checking the raw total_len value in the data_avail alloc_root for
the number of free data blocks. But this read wasn't locked, and
allocators could completely remove a large free extent and then
re-insert a slightly smaller free extent as they perform their
allocation. The transaction could see a temporary very small total_len
and trigger a commit.
Data allocations are serialized by a heavy mutex so we don't want to
have the reader try and use that to see a consistent total_len. Instead
we create a data allocator run-time struct that has a consistent
total_len that is updated after all the extent items are manipulated.
This also gives us a place to put the caller's cached extent so that it
can be included in the total_len; previously it wasn't included in the
free total that the transaction saw.
The file data allocator can then initialize and use this struct instead
of its raw use of the root and cached extent. Then the transaction can
sample its consistent total_len that reflects the root and cached
extent.
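The run-time struct is roughly (names illustrative):

    /* a stable free count that includes the cached extent, updated only
     * after all the extent items have been manipulated */
    struct data_alloc {
            struct mutex mutex;             /* serializes data allocations */
            struct alloc_root *root;
            struct extent cached;
            atomic64_t total_len;
    };

    static void data_alloc_update_total(struct data_alloc *da)
    {
            atomic64_set(&da->total_len,
                         le64_to_cpu(da->root->total_len) + da->cached.len);
    }

    /* the transaction samples this instead of the raw root total_len */
    static u64 data_alloc_free_blocks(struct data_alloc *da)
    {
            return atomic64_read(&da->total_len);
    }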
A subtle detail is that fallocate can't use _free_data to return an
allocated extent on error to the avail pool. It instead frees into the
data_free pool like normal frees. It doesn't really matter that this
could prematurely drain the avail pool because it's in an error path.
Signed-off-by: Zach Brown <zab@versity.com>
Finally get rid of the last silly vestige of the ancient 'ci' name and
update the scoutfs_inode_info pointers to si. This is just a global
search and replace; there is no functional change.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have full precision extents a writer with i_mutex and a page
lock can be modifying large extent items which cover much of the
surrounding pages in the file. Readers can be in a different page with
only the page lock and try to work with extent items as the writer is
deleting and creating them.
We add a per-inode rwsem which just protects file extent item
manipulation. We try to acquire it as close to the item use as possible
in data.c which is the only place we work with file extent items.
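The locking looks roughly like this (the rwsem field name is
illustrative):

    /* readers resolving extents, e.g. get_block */
    down_read(&si->extent_sem);
    ret = lookup_file_extent(inode, iblock, &ext);
    up_read(&si->extent_sem);

    /* writers deleting and re-creating extent items */
    down_write(&si->extent_sem);
    ret = set_file_extent(inode, &ext);
    up_write(&si->extent_sem);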
This stops rare read corruption we were seeing where get_block in a
reader was racing with extent item deletion in a stager at a further
offset in the file.
Signed-off-by: Zach Brown <zab@versity.com>
Move the main scoutfs README.md from the old kmod/ location into the top
of the new single repository. We update the language and instructions
just a bit to reflect that we can checkout and build the module and
utilities from the single repo.
Signed-off-by: Zach Brown <zab@versity.com>
For some reason, the make dist rule in kmod/ put the spec file in a
scoutfs-$ver/ directory, instead of scoutfs-kmod-$ver/ like the rest of
the files and instead of scoutfs-utils-$ver/ that the spec file for
utils is put in the utils dist tarball.
This adds -kmod to the path for the spec file so that it matches the
rest of the kmod dist tarball.
Signed-off-by: Zach Brown <zab@versity.com>
The search_xattrs ioctl is only going to find entries for xattrs with
the .srch. tag, which create srch entries as the xattrs are created and
destroyed. Export the xattr tag parsing so that the ioctl can return
-EINVAL for xattrs which don't have the scoutfs prefix and the .srch.
tag.
Signed-off-by: Zach Brown <zab@versity.com>
Hash collisions can lead to multiple xattr ids in an inode being found
for a given name hash value. If this happens we only want to return the
inode number once.
Signed-off-by: Zach Brown <zab@versity.com>
Compacting very large srch files can use all of a given operation's
metadata allocator. When this happens we record the position in the
srch files of the compaction in the pending item.
We could lose entries when this happens because the kway_next callback
would advance the srch file position as it read entries and put them in
the tournament tree leaves, not as it put them in the output file. On
resume we'd continue from the entries that were next to go into the
tournament leaves, not from the entries still sitting in the leaves that
hadn't yet been written out.
This refactors the kway merge callbacks to differentiate between getting
entries at the position and advancing the positions. We initialize the
tournament leaves by getting entries at the positions and only advance
the position as entries leave the tournament tree and are either stored
in the output srch files or are dropped.
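The callback split is roughly (names illustrative):

    struct kway_ops {
            /* return the entry at the current position without moving */
            int (*get)(void *pos, struct scoutfs_srch_entry *ent);
            /* consume the current entry; only called once it has left
             * the tournament tree and been stored or dropped */
            int (*advance)(void *pos);
    };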
Signed-off-by: Zach Brown <zab@versity.com>
In the rare case that searching for xattrs only finds deletions within
its window it retries the search past the window. The end entry is
inclusive and is the last entry that can be returned. When retrying the
search we need to start from the entry after that to ensure forward
progress.
Signed-off-by: Zach Brown <zab@versity.com>
We have to limit the number of srch entries that we'll track while
performing a search for all the inodes that contain xattrs that match
the search hash value.
As we hit the limit on the number of entries to track we have to drop
entries. As we drop entries we can't return any inodes for entries
past the dropped entries. We were updating the end point of the search
as we dropped entries past the tracked set, but we weren't updating the
search end point if we dropped the last currently tracked entry.
And we were setting the end point to the dropped entry, not to the entry
before it. This could lead us to spuriously returning deleted entries
if we drop the creation entry and then allow tracking its deletion
later.
This fixes both those problems. We now properly set the end point to
just before the dropped entry for all entries that we drop.
Signed-off-by: Zach Brown <zab@versity.com>