It's nice to have a helper that sets the max possible key instead of
messing around with memset or ~0 manually.
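Roughly the shape of such a helper, sketched here with a hypothetical
example key struct rather than the real scoutfs key:

  #include <stdint.h>
  #include <string.h>

  /* hypothetical key layout, only for illustration */
  struct example_key {
          uint64_t inode;
          uint8_t type;
          uint64_t offset;
  };

  /* all-ones bytes give the largest possible key in byte-compare order */
  static void set_max_key(struct example_key *key)
  {
          memset(key, 0xff, sizeof(*key));
  }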
Signed-off-by: Zach Brown <zab@versity.com>
The sb counter field isn't necessary; allow a null sb pointer arg, which
then results in a counter output of 0.
Signed-off-by: Zach Brown <zab@versity.com>
Add basic support for extended attributes. The next steps are
to add support for more prefixes, including ACLs, and to properly
delete them on unlink.
Signed-off-by: Zach Brown <zab@versity.com>
Directory entries and extended attributes similarly hash and compare
strings so we'll give them some shared functions for doing so.
Signed-off-by: Zach Brown <zab@versity.com>
The directory entry code finds a hole in the key range between the
first and last possible hash values for a new entry's key. The xattrs
want to do the same thing so let's extract this into a proper function.
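The extracted helper boils down to something like this sketch, written
against a sorted array of used values instead of the real item walk and
with made-up names:

  #include <stdint.h>

  /*
   * Given the sorted hash values already in use within [first, last],
   * store the first free value in *hole.  The real helper walks the
   * existing items instead of an array.
   */
  static int find_hole(const uint64_t *used, int nr, uint64_t first,
                       uint64_t last, uint64_t *hole)
  {
          uint64_t next = first;
          int i;

          for (i = 0; i < nr; i++) {
                  if (used[i] > next)
                          break;          /* found a gap before this value */
                  if (used[i] == next)
                          next++;         /* taken, try the next value */
          }

          if (next > last)
                  return -1;              /* the whole range is in use */
          *hole = next;
          return 0;
  }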
Signed-off-by: Zach Brown <zab@versity.com>
These were interesting experiments in how to manage locks across the
cluster but we'll be going in a more flexible direction.
Signed-off-by: Zach Brown <zab@versity.com>
The entry free routine only frees entries that don't have any references
from its context. Callers are supposed to try to free entries after
removing references to them.
Callers that were removing entries from a shard's granted pointer were
trying to free the entry before removing the pointer to the entry.
Entries that were last removed from shard granted pointers were never
freed.
Signed-off-by: Zach Brown <zab@versity.com>
The held lock struct had an unused 'held_trans' field from a previous
version of the code that specifically tried to track if a held lock had
the trans open.
Signed-off-by: Zach Brown <zab@versity.com>
The first trace format was pretty noisy. Now the time is printed in a
gettimeofday timeval so that it can be correlated with other time
stamps. The super block gets a counter instead of a pointer. The pid
and cpu are printed without a label and we add the line number so that
it's easy to grep the source to find a trace caller.
Signed-off-by: Zach Brown <zab@versity.com>
This adds tracing functionality that's cheap and easy to
use. By constantly gathering traces we'll always have rich
history to analyze when something goes wrong.
Signed-off-by: Zach Brown <zab@versity.com>
Introduce the concept of acquiring write locks around write operations.
The core idea is that reads are unlocked and that write lock contention
between nodes should be rare. This first pass simply broadcasts write
lock requests to all the mounts in the volume. It achieves a reasonable
degree of fairness and doesn't require centralizing state in a lock
server.
We have to flesh out a bit of initial infrastructure to support the
write locking protocol. The roster manages cluster membership and
messaging and only understands mounts in the same kernel for now.
Creation needs to know which inodes to try and lock so we see the start
of per-mount free inode reservations.
The transformation of users is straightforward: they acquire the write
lock on the inodes they're working with instead of holding a
transaction. The write lock machinery now manages transactions.
This passes single mount testing but that isn't saying much. The next
step is to run multi-mount tests.
Signed-off-by: Zach Brown <zab@versity.com>
The current mechanism for dealing with dirent name hash collisions is to
use multiple hash functions. This won't work great with the btree where
it's expensive to search multiple distant items for a given entry.
Instead of having multiple full precision functions we linearly probe a
given number of hash values after the initial name hash. Now the slow
colliding path walks adjacent items in the tree instead of bouncing
around the tree.
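A sketch of the probing, with a hypothetical lookup helper and probe
window standing in for whatever the dirent code actually uses:

  #include <stdbool.h>
  #include <stdint.h>

  #define DIRENT_HASH_PROBES 8                    /* hypothetical window */

  bool dirent_key_exists(uint64_t hash);          /* stand-in for a btree lookup */

  /* probe adjacent hash values until we find a free key for the new entry */
  static bool find_dirent_hash(uint64_t name_hash, uint64_t *hash_ret)
  {
          int i;

          for (i = 0; i < DIRENT_HASH_PROBES; i++) {
                  if (!dirent_key_exists(name_hash + i)) {
                          *hash_ret = name_hash + i;
                          return true;
                  }
          }
          return false;   /* window exhausted, creation fails with a collision */
  }

Because the probed keys are adjacent, the colliding path stays within
neighboring btree items instead of jumping to distant ones.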
Signed-off-by: Zach Brown <zab@versity.com>
The next inode number to be allocated has been stored only in the
in-memory super block and hasn't survived across mounts. This sometimes
accidentally worked if the tests removed the initial inodes but often
would cause failures when inode allocation returned existing inodes.
This tracks the next inode to allocate in the super block and maintains
it across mounts. Tests now consistently pass as inode allocations
consistently return free inode numbers.
Signed-off-by: Zach Brown <zab@versity.com>
Add the block allocator.
Logically we use a buddy allocator that's built from bitmaps for
allocators of each order up to the largest allocator that fits in the
device. This ends up using two bits per block.
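As a rough sanity check on that two-bits-per-block figure, here's a
small sketch that sums the per-order bitmap sizes; MAX_ORDER is a
made-up placeholder for the largest order that fits the device:

  #include <stdint.h>

  #define MAX_ORDER 20    /* placeholder for the largest order on the device */

  /* order 0 gets one bit per block, order 1 one bit per pair, and so on,
   * so the total converges on a bit less than two bits per block */
  static uint64_t buddy_bitmap_bits(uint64_t blocks)
  {
          uint64_t total = 0;
          int order;

          for (order = 0; order <= MAX_ORDER; order++)
                  total += (blocks + (1ULL << order) - 1) >> order;

          return total;
  }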
On disk we log modified regions of these bitmaps in chunks in blocks in
a preallocated ring. We carefully coordinate logging the chunks and the
ring size so that we can always write to the tail of the ring.
There's one allocator and it's only read on mount today. We'll
eventually have multiple of these allocators covering the device and
nodes will coordinate exclusive access to them.
Signed-off-by: Zach Brown <zab@versity.com>
Add the transaction machinery that writes out dirty metadata blocks as
atomic transactions.
The block radix tracks dirty blocks with a dirty radix tag.
Blocks are written with bios whose completion marks them clean and
propagates errors through the super info. The blocks are left tagged
during writeout so that they won't be (someday) mistaken for clean by
eviction. Since we're modifying the radix from io completion we change
all block lock acquisitions to be interrupt safe.
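The tagging pattern looks roughly like this; the super info and block
structs here are illustrative stand-ins, but radix_tree_tag_set() and
radix_tree_gang_lookup_tag() are the stock kernel radix tree calls:

  #include <linux/radix-tree.h>
  #include <linux/spinlock.h>

  #define DIRTY_RADIX_TAG 0               /* illustrative tag number */

  struct example_sb_info {
          spinlock_t lock;                /* irq-safe: io completion also takes it */
          struct radix_tree_root blocks;
  };

  static void mark_block_dirty(struct example_sb_info *sbi, unsigned long blkno)
  {
          unsigned long flags;

          spin_lock_irqsave(&sbi->lock, flags);
          radix_tree_tag_set(&sbi->blocks, blkno, DIRTY_RADIX_TAG);
          spin_unlock_irqrestore(&sbi->lock, flags);
  }

  /* writeout walks dirty-tagged blocks in batches and submits their bios */
  static unsigned int next_dirty_blocks(struct example_sb_info *sbi, void **blocks,
                                        unsigned long first, unsigned int nr)
  {
          unsigned long flags;
          unsigned int found;

          spin_lock_irqsave(&sbi->lock, flags);
          found = radix_tree_gang_lookup_tag(&sbi->blocks, blocks, first, nr,
                                             DIRTY_RADIX_TAG);
          spin_unlock_irqrestore(&sbi->lock, flags);
          return found;
  }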
All the operations that modify blocks hold and release the transaction
while they're doing their work.
sync kicks off work that waits for the transaction to be released so
that it can write out all the dirty blocks and then the new supers that
reference them.
Signed-off-by: Zach Brown <zab@versity.com>
The btree walk was storing a pointer to the current btree block that it
was working on. It later used this pointer when the walk continued and
the block became a parent. But it didn't update the pointer if splitting
changed the block to traverse. By removing this pointer and using the
block data pointers directly we remove the risk of the pointer going
stale.
Signed-off-by: Zach Brown <zab@versity.com>
The wild casting in the treap code can cause memory corruption if it's
fed bad offsets. Add some assertions so that we can see when this is
happening.
Signed-off-by: Zach Brown <zab@versity.com>
The format doesn't yet record free blocks. We've been relying on the
scary initialization of the block allocator past the blocks that are
written by mkfs. And it was wrong. This garbage will be replaced with
an allocator in a few commits.
Signed-off-by: Zach Brown <zab@versity.com>
Not surprisingly, testing the btree code shook out a few bugs:
- the treap root wasn't initialized
- existing split source block wasn't compacted
- item movement used item treap fields after deletion
All of these had the consequence of feeding the treap code bad offsets
so its node/u16 casts could lead it to scribble over memory.
Signed-off-by: Zach Brown <zab@versity.com>
The conversion of the filerw item callers to the btree cursor API
didn't consistently release the cursors. This was causing block
refcounting problems that could scribble on freed and realloced memory.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we had stubbed out the btree item API with static inlines.
Those are replaced with real functions in a reasonably functional btree
implementation.
The btree implementation itself is pretty straightforward. Operations
are performed top-down and we dirty, lock, and split/merge blocks as we
go. Callers are given a cursor to give them full access to the item.
Items in the btree blocks are stored in a treap. There are a lot of
comments in the code to help make things clear.
We add the notion of block references and some block functions for
reading and dirtying blocks by reference.
This passes tests up to the point where unmount tries to write out data
and the world catches fire. That's far enough to commit what we have
and iterate from there.
Signed-off-by: Zach Brown <zab@versity.com>
Starting to implement LSM merging made me really question if it is the
right approach. I'd like to try an experiment to see if we can get our
concurrent writes done with much simpler btrees.
This commit removes all the functionality that derives from the large
LSM segments and distributing the manifest.
What's left is a multi-page block layer and the husk of the btree
implementation which will give people access to items. Callers that
work with items get translated to the btree interface.
This gets as far as reading the super block but the format changes and
large block size mean that the crc check fails and the mount returns an
error.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have the interval tree we can use it to store the manifest.
Instead of having different indexes for each level we store all the
levels in one index.
This simplifies the code quite a bit. In particular, we won't have to
special case merging between level 0 and 1 quite as much because level 0
is no longer a special list.
We have a strong motivation to keep the manifest small. So we get
rid of the blkno radix. It wasn't wise to trade off more manifest
storage to make the ring a bit smaller. We can store full manifests
in the ring instead of just the block numbers.
We rework the new_manifest interface that adds a final manifest entry
and logs it. The ring entry addition and manifest update are atomic.
We're about to implement merging which will permute the manifest. Read
methods won't be able to iterate over levels while racing with merging.
We change the manifest key search interface to return a full set of all
the segments that intersect the key.
The next item interface now knows how to restart the search if it hits
the end of a segment on one level and the next least key is in another
segment and greater than the end of that completed segment.
There was also a very crazy cut+paste bug where next item was testing
that the item is past the last search key with a while instead of an if.
It'd spin throwing list_del_init() and brelse() debugging warnings.
Signed-off-by: Zach Brown <zab@versity.com>
The next interval interface didn't set the returned ival to null when
it finds a null next node. The caller would continuously get the same
interval.
This is what I get for programming late at night, I guess.
Signed-off-by: Zach Brown <zab@versity.com>
Add an interval tree that lets us efficiently discover intervals that
overlap a given search region. We're going to need this now to sanely
implement merging and in the future to implement granting access
ranges.
It's easy to implement an interval tree by using the kernel's augmented
rbtree to track the max end value of the subtree of intervals. The
tricky bit is that the augmented interface assumes that it can directly
compare the augmented value.
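The shape of the augmentation, sketched in plain C with illustrative
names rather than the kernel's rbtree types:

  #include <stdint.h>

  struct example_ival {
          uint64_t start;
          uint64_t end;
          uint64_t subtree_last;                  /* max end in this subtree */
          struct example_ival *left, *right;
  };

  /* the augmented value: the greatest end in a node's subtree, which
   * lets searches skip subtrees that can't overlap the query */
  static uint64_t compute_subtree_last(struct example_ival *node)
  {
          uint64_t last = node->end;

          if (node->left && node->left->subtree_last > last)
                  last = node->left->subtree_last;
          if (node->right && node->right->subtree_last > last)
                  last = node->right->subtree_last;
          return last;
  }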
If we were developing against mainline we'd just patch the interface.
But we're developing against distro kernels that development partners
deploy so the kernel is frozen in amber.
We deploy a giant stinky hack to import a private tweaked version of the
interface. It's isolated so we can trivially drop it once we merge with
the fixed upstream interface. We also add some build time checks to
make sure that we don't accidentally combine rb structures between the
private import and the main kernel interface.
Signed-off-by: Zach Brown <zab@versity.com>
Add percpu counters that will let us track all manner of things.
To report them we add a counters directory full of attribute files in
the sysfs dir for each mount:
# (cd /sys/fs/scoutfs/loop0/counters && grep . *)
skip_delete:0
skip_insert:3218
skip_lookup:8439
skip_next:1190
skip_search:156
The implementation is careful to define each counter in only one place.
We don't have to make sure that a bunch of definitions and arrays are in
sync.
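The single-definition trick is roughly an x-macro list; the names below
are illustrative, not the real macros:

  /* each counter is named exactly once, in this list */
  #define EXAMPLE_COUNTERS(x)     \
          x(skip_delete)          \
          x(skip_insert)          \
          x(skip_lookup)          \
          x(skip_next)            \
          x(skip_search)

  /* one expansion declares the counter fields ... */
  struct example_counters {
  #define DECLARE_COUNTER(name) long name;
          EXAMPLE_COUNTERS(DECLARE_COUNTER)
  #undef DECLARE_COUNTER
  };

  /* ... another builds the name table that backs the sysfs files */
  static const char *example_counter_names[] = {
  #define COUNTER_NAME(name) #name,
          EXAMPLE_COUNTERS(COUNTER_NAME)
  #undef COUNTER_NAME
  };

The real counters are percpu rather than plain longs; the sketch only
shows how one list drives both the struct and the attribute names.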
This builds off of Ben's initial patches that added sysfs dirs.
Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Ben McClelland <ben.mcclelland@versity.com>
Manifests for newly written segments can be inserted at the highest
level that doesn't have segments they intersect. This avoids ring and
merging churn.
The change cleans up the code a little bit, which is nice, and adds
tracepoints for manifests entering and leaving the in memory structures.
Signed-off-by: Zach Brown <zab@versity.com>
The manifests for level 0 blocks always claimed that they could contain
all keys. That causes a lot of extra bloom filter lookups when in fact
the blocks contain a very small range of keys.
It's true that we don't know what items a dirty segment is going to
contain, don't want to update the manifest at every insertion, and have
to find the items in the segments in regular searching.
But when they're finalized we know the items they'll contain and can
update the manifest. We do that by initializing the item block range to
nonsense and extending it as items are added. When it's finalized we
update the manifest in memory and in the ring.
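The range maintenance amounts to something like this, with illustrative
names and a plain u64 standing in for the real key type:

  #include <stdint.h>

  struct example_range {
          uint64_t first;
          uint64_t last;
  };

  /* start inverted ("nonsense") so the first item always sets both ends */
  static void init_range(struct example_range *r)
  {
          r->first = UINT64_MAX;
          r->last = 0;
  }

  static void extend_range(struct example_range *r, uint64_t key)
  {
          if (key < r->first)
                  r->first = key;
          if (key > r->last)
                  r->last = key;
  }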
Signed-off-by: Zach Brown <zab@versity.com>
Item headers are written from the front of the block to the tail.
Item values are written from the tail of the block towards the head.
The math to detect their overlapping in the center forgot to take the
length of the item header into account. We could have final item
headers and values overwriting each other, which causes file data to
appear as an item header.
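The fixed test is essentially this, with illustrative names:

  #include <stdbool.h>
  #include <stddef.h>

  /*
   * Headers grow forward from head_off, values grow backward from
   * val_off.  A new item only fits if the header's end stays at or
   * before the value's start; the bug was leaving hdr_len out of the
   * left-hand side.
   */
  static bool item_fits(size_t head_off, size_t val_off,
                        size_t hdr_len, size_t val_len)
  {
          if (val_len > val_off)
                  return false;
          return head_off + hdr_len <= val_off - val_len;
  }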
Signed-off-by: Zach Brown <zab@versity.com>
The bloom filter was much too large for the current typical limit on the
number of items that fit in a segment. Having them too large decreases
storage efficiency, has us read more data from a cold cache, and makes
bloom tests pin too much data.
We can cut it down to 25% for our current segment and item sizes.
Signed-off-by: Zach Brown <zab@versity.com>
The segment code wasn't always locking around concurrent accesses to the
dirty segment. This is mostly a problem for updating all the next
elements in skip list modification. But we also want to serialize dirty
block writing.
Add a little helper function to acquire the dirty mutex when we're
reading from the current dirty segment.
Bring sync in to segment.c so it's clear that it's intimately related to
the dirty segment.
The item deletion hack was totally unlocked.
Signed-off-by: Zach Brown <zab@versity.com>
Extended file data wasn't persistent because we weren't writing out the
inode with the i_size update that covered the newly written data.
Signed-off-by: Zach Brown <zab@versity.com>
I added these tracepoints to verify that file data isn't reachable after
mount because we're not writing out the inode with the current i_size.
Signed-off-by: Zach Brown <zab@versity.com>
Add the infrastructure for tracepoints. We include an example user that
traces bloom filter hits and misses.
Signed-off-by: Zach Brown <zab@versity.com>
Add basic file data support by implementing the address space file and
page read and write methods. This passes basic read/write tests but is
only the seed of a final implementation.
Signed-off-by: Zach Brown <zab@versity.com>
The value offset allocation knew to skip block headers at the start of
each segment block but, weirdly, the item offset allocation didn't.
We make item offset calculation skip the header and we add some tracing
to help see the problem.
Signed-off-by: Zach Brown <zab@versity.com>
Stop leaking dentry_info allocations by adding a dentry_op with a
d_release that frees our dentry info allocation. rmmod tests no longer
fail when dmesg screams that we have slab caches that still have
allocated objects.
Signed-off-by: Zach Brown <zab@versity.com>
Inode updates weren't persistent because they were being stored in clean
segments in memory. This was triggered by the new hashed dirent
mechanism returning -ENOENT when the inode still had a 0 max dirent hash
nr.
We make sure that there is a dirty item in the dirty segment at the
start of inode modification so that later updates will store in the
dirty segment. Nothing ensures that the dirty segment won't be written
out today but that will be added soon.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we know that it's easy to fix sparse build failures against
RHEL kernel headers we can require sparse builds when developing.
Signed-off-by: Zach Brown <zab@versity.com>
I was building against a RHEL tree that broke sparse builds. With that
fixed I can now see and fix sparse errors.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we dealt with colliding dirent hash values by storing all the
dirents that share a hash value in a big item with multiple dirents.
This complicated the code and strongly encouraged resizing items as
dirents come and go. Resizing items isn't very easy with our simple log
segment item creation mechanism.
Instead let's deal with collisions by allowing a dirent to be stored at
multiple hash values. The code is much simpler.
Lookup has to iterate over all possible hash values. We can track the
greatest hash iteration stored in the directory inode to limit the
overhead of negative lookups in small directories.
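The bounded lookup is roughly this loop, with a hypothetical item
lookup standing in for the real one:

  #include <stdbool.h>
  #include <stdint.h>

  /* stand-in: does a dirent item at this hash value match the name? */
  bool dirent_matches(uint64_t hash, const char *name, unsigned int len);

  static bool lookup_dirent(uint64_t name_hash, uint64_t max_coll_nr,
                            const char *name, unsigned int len)
  {
          uint64_t i;

          /* max_coll_nr is the greatest collision offset the directory
           * has ever stored, so small directories stop after one probe */
          for (i = 0; i <= max_coll_nr; i++) {
                  if (dirent_matches(name_hash + i, name, len))
                          return true;
          }
          return false;
  }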
Signed-off-by: Zach Brown <zab@versity.com>
Initially items were stored in memory with an rbtree. That let us build
up the API above items without worrying about their storage. That gave
us dirty items in memory and we could start working on writing them to
and reading them from the log segment blocks.
Now that we have the code on either side we can get rid of the item
cache in between. It had some nice properties but it's fundamentally
duplicating the item storage in cached log segment blocks. We'd also
have to teach it to differentiate between negative cache entries and
missing entries that need to be filled from blocks. And the giant item
index becomes a bottleneck.
We have to index items in log segments anyway so we rewrite the item
APIs to read and write the items in the log segments directly. Creation
writes to dirty blocks in memory and reading and iteration walk through
the cached blocks in the buffer cache.
I've tried to comment the files and functions appropriately so most of
the commentary for the new methods is in the body of the commit.
The overall theme is making it relatively efficient to operate on
individual items in log segments. Previously we could only walk all the
items in an existing segment or write all the dirty items to a new
segment. Now we have bloom filters and sorted item headers to let us
test for the presence of an item's key with progressively more expensive
methods. We hold on to a dirty segment and fill it as we create new
items.
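The per-segment lookup flow is roughly this, with hypothetical helpers
standing in for the real bloom and header searches:

  #include <errno.h>
  #include <stdbool.h>
  #include <stdint.h>

  bool bloom_may_contain(const void *seg, uint64_t key_hash);          /* cheap test */
  int sorted_headers_search(const void *seg, const void *key,
                            unsigned int key_len, void **item_ret);    /* binary search */

  static int segment_lookup(const void *seg, const void *key, unsigned int key_len,
                            uint64_t key_hash, void **item_ret)
  {
          /* a bloom miss means the key is definitely absent, so we skip
           * the more expensive sorted header search entirely */
          if (!bloom_may_contain(seg, key_hash))
                  return -ENOENT;

          return sorted_headers_search(seg, key, key_len, item_ret);
  }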
This needs more fleshing out and testing but this is a solid first pass
and it passes our existing tests.
Signed-off-by: Zach Brown <zab@versity.com>
Add code to walk all the block segments that intersect a key range to
find the next item after that key value.
It is easier to just return failure from the next item reader and have
the caller retry the searches, so we change the specific item reading
path to use the same convention to keep the caller consistent.
This still warns as it falls off the last block but that's fine for now.
We're going to be changing all this in the next few commits.
Signed-off-by: Zach Brown <zab@versity.com>
In mknod the newly created inode's times are set down in the new inode
creation path instead of up in the mknod path to match the parent dir's
ctime and mtime.
This is strictly legal but it's easier to test that all the times have
been set in the mknod by having them equal. This stops mkdir-interface
test failures when enough time passes between inode creation and parent
dir timestamp updates to have them differ.
Signed-off-by: Zach Brown <zab@versity.com>
Wire up the code to update dirty inode items as inodes are modified in
memory. We had a bit of the code but it wasn't being called.
Signed-off-by: Zach Brown <zab@versity.com>