The value offset allocation knew to skip block headers at the start of
each segment block but, weirdly, the item offset allocation didn't.
We make the item offset calculation skip the header as well and add
some tracing to help diagnose the problem.
Signed-off-by: Zach Brown <zab@versity.com>
Stop leaking dentry_info allocations by adding a dentry_op with a
d_release that frees our dentry info allocation. rmmod tests no longer
fail when dmesg screams that we have slab caches that still have
allocated objects.
Signed-off-by: Zach Brown <zab@versity.com>
Inode updates weren't persistent because they were being stored in clean
segments in memory. This was triggered by the new hashed dirent
mechanism returning -ENOENT when the inode still had a 0 max dirent hash
nr.
We make sure that there is a dirty item in the dirty segment at the
start of inode modification so that later updates land in the dirty
segment. Nothing yet ensures that the dirty segment won't be written
out from under us, but that will be added soon.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we know it's easy to fix sparse build failures against RHEL
kernel headers, we can require sparse builds when developing.
Signed-off-by: Zach Brown <zab@versity.com>
I was building against a RHEL tree that broke sparse builds. With that
fixed I can now see and fix sparse errors.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for printing dirent items to scoutfs print. We're careful
to change non-printable characters to ".".
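A sketch of the escaping, with an illustrative helper name; it needs
ctype.h's isprint():

    /* mask non-printable bytes with '.' while printing a name */
    static void print_dirent_name(const char *name, unsigned int len)
    {
            unsigned int i;

            for (i = 0; i < len; i++)
                    putchar(isprint((unsigned char)name[i]) ?
                            name[i] : '.');
    }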
Signed-off-by: Zach Brown <zab@versity.com>
Update print to show the inode fields from the newer dirent hashing
scheme. mkfs doesn't create directory entries, so there are no dirent
items to print yet.
Signed-off-by: Zach Brown <zab@versity.com>
Previously we dealt with colliding dirent hash values by storing all the
dirents that share a hash value in a big item with multiple dirents.
This complicated the code and strongly encouraged resizing items as
dirents come and go. Resizing items isn't very easy with our simple log
segment item creation mechanism.
Instead let's deal with collisions by allowing a dirent to be stored at
multiple hash values. The code is much simpler.
Lookup has to iterate over all possible hash values. We can track the
greatest hash iteration stored in the directory inode to limit the
overhead of negative lookups in small directories.
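A sketch of what lookup's iteration amounts to; dirent_hash_key() and
lookup_dirent_item() are illustrative names:

    /* probe each hash value up to the greatest nr stored in the dir */
    for (nr = 0; nr <= max_dirent_hash_nr; nr++) {
            key = dirent_hash_key(dir, name, name_len, nr);
            ret = lookup_dirent_item(sb, &key, &dent);
            if (ret != -ENOENT)
                    break;  /* found the entry or hit a real error */
    }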
Signed-off-by: Zach Brown <zab@versity.com>
Initially items were stored in memory with an rbtree. That let us build
up the API above items without worrying about their storage. That gave
us dirty items in memory and we could start working on writing them to
and reading them from the log segment blocks.
Now that we have the code on either side we can get rid of the item
cache in between. It had some nice properties but it's fundamentally
duplicating the item storage in cached log segment blocks. We'd also
have to teach it to differentiate between negative cache entries and
missing entries that need to be filled from blocks. And the giant item
index becomes a bottleneck.
We have to index items in log segments anyway so we rewrite the item
APIs to read and write the items in the log segments directly. Creation
writes to dirty blocks in memory and reading and iteration walk through
the cached blocks in the buffer cache.
I've tried to comment the files and functions appropriately so most of
the commentary for the new methods is in the body of the commit.
The overall theme is making it relatively efficient to operate on
individual items in log segments. Previously we could only walk all the
items in an existing segment or write all the dirty items to a new
segment. Now we have bloom filters and sorted item headers to let us
test for the presence of an item's key with progressively more expensive
methods. We hold on to a dirty segment and fill it as we create new
items.
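The progression looks roughly like the following; the function names
are illustrative:

    /* cheapest test first, only then touch the item headers */
    if (!bloom_may_contain(seg, key))
            return -ENOENT;                  /* definitely absent */
    hdr = bsearch_item_headers(seg, key);    /* sorted header array */
    if (!hdr)
            return -ENOENT;                  /* bloom false positive */
    return copy_item(seg, hdr, item);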
This needs more fleshing out and testing but this is a solid first pass
and it passes our existing tests.
Signed-off-by: Zach Brown <zab@versity.com>
The bloom filter had two bad bugs.
First, the calculation was adding the bit width of newly hashed data
to the hash value itself instead of to the count of hashed bits
available.
And the block offset calculation for each bit wasn't truncated to the
number of bloom blocks. While fixing this we can clean up the code and
make it faster by recording the bits in terms of their block and bit
offset instead of their large bit value.
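After the cleanup each hashed value reduces directly to a block and a
bit offset, something like this with illustrative constant names:

    /* record a block nr and bit offset instead of one huge bit nr */
    block = (hash >> BLOOM_BLOCK_BIT_SHIFT) % nr_bloom_blocks;
    bit = hash & (BLOOM_BLOCK_BITS - 1);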
Signed-off-by: Zach Brown <zab@versity.com>
The swizzle value was defined in terms of longs but the code used u64s.
And the bare shifted value was an int so it'd get truncated. Switch it
all to using longs.
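The truncation is the classic C literal pitfall:

    unsigned long bad  = 1 << shift;   /* int literal: truncated once shift >= 31 */
    unsigned long good = 1UL << shift; /* long literal: keeps the high bits */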
The ratio of bugs to lines of code in that first attempt was through the
roof!
Signed-off-by: Zach Brown <zab@versity.com>
Update to the format rev in which large log segments start with bloom
filter blocks, link their items in a skip list, and store item values
at offsets in the block.
Signed-off-by: Zach Brown <zab@versity.com>
pseudo_random_bytes() was accidentally copying the last partial long to
the beginning of the buffer instead of the end. The final partial long
bytes weren't being filled.
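The fixed loop copies the trailing partial long to the tail, roughly;
pseudo_random_long() is an illustrative name:

    /* fill whole longs, then the partial long at the end */
    while (bytes >= sizeof(long)) {
            *(long *)buf = pseudo_random_long(&state);
            buf += sizeof(long);
            bytes -= sizeof(long);
    }
    if (bytes) {
            long last = pseudo_random_long(&state);
            memcpy(buf, &last, bytes);  /* the end, not the beginning */
    }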
Signed-off-by: Zach Brown <zab@versity.com>
mkfs just needs to initialize bloom filter blocks with the bits for the
single root inode key. We can get away with these skeletal functions
for now.
Signed-off-by: Zach Brown <zab@versity.com>
We're going to need to start setting bloom filter bits in mkfs so we'll
add this trivial inline. It might grow later.
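Something like the following, with illustrative struct and field names:

    static inline void scoutfs_set_bloom_bit(struct scoutfs_bloom_block *blm,
                                             unsigned int nr)
    {
            blm->bits[nr >> 3] |= 1 << (nr & 7);
    }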
Signed-off-by: Zach Brown <zab@versity.com>
Add code to walk all the segment blocks that intersect a key range to
find the next item after a given key value.
It is easier to just return failure from the next-item reader and have
the caller retry the search, so we change the specific item reading
path to use the same convention and keep the callers consistent.
This still warns as it falls off the last block but that's fine for now.
We're going to be changing all this in the next few commits.
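From the caller's side the convention reads roughly like this; the
names and the errno are illustrative:

    /* readers return failure, the caller retries the search */
    do {
            ret = search_manifest(sb, key, &ref);
            if (ret)
                    break;
            ret = read_next_item(sb, &ref, key, item);
    } while (ret == -EAGAIN);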
Signed-off-by: Zach Brown <zab@versity.com>
In mknod the newly created inode's times were set down in the new inode
creation path instead of up in the mknod path, so they didn't match the
parent dir's ctime and mtime.
This is strictly legal, but it's easier to test that all the times have
been set in mknod by having them be equal. This stops mkdir-interface
test failures when enough time passes between inode creation and the
parent dir timestamp update for them to differ.
Signed-off-by: Zach Brown <zab@versity.com>
Wire up the code to update dirty inode items as inodes are modified in
memory. We had a bit of the code but it wasn't being called.
Signed-off-by: Zach Brown <zab@versity.com>
Add a sync_fs method that writes dirty items into level 0 item blocks.
Add chunk allocator code to allocate new item blocks in free chunks. As
the allocator bitmap is modified it adds bitmap entries to the ring.
As new item blocks are allocated we create manifest entries that
describe their block location and keys. The entry is added to the
in-memory manifest and to entries in the ring.
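Conceptually a manifest entry only has to locate the block and bound
its keys; this is a sketch, not the real struct:

    struct manifest_entry {
            u64 blkno;                 /* where the item block landed */
            struct scoutfs_key first;  /* smallest key in the block */
            struct scoutfs_key last;   /* largest key in the block */
            u8 level;                  /* 0 for freshly synced blocks */
    };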
This isn't complete and there are still bugs, but it's enough to start
building on.
Signed-off-by: Zach Brown <zab@versity.com>
The initial bitmap entry written in the ring by mkfs was off by one.
Three chunks were written, but the 0th chunk is also used by the
supers. It has to mark the first four chunks as allocated.
Signed-off-by: Zach Brown <zab@versity.com>
In the first pass we'd only printed the first map and ring blocks.
This reads the number of used map blocks into an allocation large enough
for the maximum number of map blocks.
Then we use the block numbers from the map blocks to print the active
ring blocks which are described by the super.
Signed-off-by: Zach Brown <zab@versity.com>
I had forgotten to translate from the scoutfs types in items to the vfs
types for filldir() so userspace was seeing garbage d_type values.
Signed-off-by: Zach Brown <zab@versity.com>
The migration from the new iterator interface in upstream to the old
readdir interface in rhel7 got the sense of the filldir return code
wrong. Any readdir would livelock as the dot entry was
returned at offset 0 without advancing f_pos.
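The old convention is inverted from upstream's dir_emit(): filldir()
returns non-zero when it can't take any more entries, and f_pos only
advances after a successful fill. Roughly:

    if (filldir(dirent, name, name_len, file->f_pos, ino, dtype))
            return 0;            /* buffer full, come back later */
    file->f_pos = next_pos;      /* advance only after a fill */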
Signed-off-by: Zach Brown <zab@versity.com>
We had forgotten to actually insert manifest nodes into the blkno
radix. This hasn't mattered yet because there's only been one manifest
in the level 0 list.
Signed-off-by: Zach Brown <zab@versity.com>
Add the most basic ability to read items from log segment blocks. If
an item isn't in the cache then we walk segments in the manifest and
check for the item in each one.
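The walk itself stays simple for now; the list and helper names here
are illustrative:

    /* try each segment in the manifest that could hold the key */
    list_for_each_entry(ment, &manifest->level0_list, entry) {
            if (scoutfs_key_cmp(key, &ment->first) < 0 ||
                scoutfs_key_cmp(key, &ment->last) > 0)
                    continue;
            ret = read_item_from_segment(sb, ment->blkno, key, item);
            if (ret != -ENOENT)
                    return ret;  /* found it or hit a real error */
    }
    return -ENOENT;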
This is just the core fundamental code. There's still a lot to do:
basic corruption validation, multi-block segments, bloom filters and
arrays to optimize segment misses, and some day the ability to read file
data items directly into page cache pages. The manifest locking is also
super broken.
But this is enough to let us mount and stat the root inode!
Signed-off-by: Zach Brown <zab@versity.com>
Read the ring described by the super block and replay its entries to
rebuild the in-memory state of the chunk allocator and log segment
manifest.
We add just enough of the chunk allocator to set the free bits to the
contents of the ring bitmap entries.
We start to build out the basic manifest data structure. It'll
certainly evolve when we later add code to actually query it.
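Replay just dispatches on the entry type, roughly; the constant and
helper names are illustrative:

    /* apply each ring entry to the in-memory state */
    switch (le16_to_cpu(ent->type)) {
    case SCOUTFS_RING_BITMAP:
            apply_bitmap_entry(sbi, ent);  /* set allocator free bits */
            break;
    case SCOUTFS_RING_MANIFEST:
            add_manifest_entry(sbi, ent);  /* grow the manifest */
            break;
    }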
Signed-off-by: Zach Brown <zab@versity.com>
Update to the format.h from the recent -utils changes that moved from
the clumsy 'brick' terminology to the more reasonable
'block/chunk/segment' terminology.
Signed-off-by: Zach Brown <zab@versity.com>
The sync implementation was a quick demonstration of packing items into
large log blocks. We'll be doing things very differently in the actual
system. So tear this code out so we can build up more functional
structures. It'll still be in revision control so we'll be able
to reuse the parts that make sense in the new code.
Signed-off-by: Zach Brown <zab@versity.com>
The use of 'log' for all the large sizes was pretty confusing. Let's
use 'chunk' to describe the large alloc size. Other things live in them
as well as logs. Then use 'log segment' to describe the larger log
structure stored in a chunk that's made up of all the little blocks.
Get rid of the explicit distinction between brick and block numbers.
The format is now defined in terms of fixed 4k blocks. Logs become a
logical structure that's made up of a fixed number of blocks. The
allocator still manages large log sized regions.
Now that we have a working userspace mkfs we can read the supers on
mount instead of always initializing a new file system. We still don't
know how to read items from blocks so mount fails when it can't find the
root dir inode.
Signed-off-by: Zach Brown <zab@versity.com>
The format was updated while implementing mkfs and print in
scoutfs-utils. Bring the kernel code up to speed.
For some reason I changed the name of the item length in the item header
struct. Who knows.
Signed-off-by: Zach Brown <zab@versity.com>
Add a message printing function whose output includes the device and
major:minor and which handles the kernel level string prefix.
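This follows the usual kernel filesystem pattern, sketched here rather
than quoted exactly:

    void scoutfs_msg(struct super_block *sb, const char *level,
                     const char *fmt, ...)
    {
            struct va_format vaf;
            va_list args;

            va_start(args, fmt);
            vaf.fmt = fmt;
            vaf.va = &args;
            printk("%sscoutfs (%s %u:%u): %pV\n", level, sb->s_id,
                   MAJOR(sb->s_bdev->bd_dev), MINOR(sb->s_bdev->bd_dev),
                   &vaf);
            va_end(args);
    }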
Signed-off-by: Zach Brown <zab@versity.com>
This is the initial commit of the repo that will track development
against distro kernels.
This is an import of a prototype branch in the upstream kernel that only
had a few initial commits. It needed to move to the old readdir
interface and use find_or_create_page() instead of pagecache_get_page()
to build in older distro kernels.