Add an inode creation time field. It's set for all new inodes.
It's visible to stat_more, and setattr_more can set it during
restore.
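As a rough sketch of how a tool might read it (the struct layout and
ioctl name here are assumptions for illustration, not the exact
interface; the usual ioctl and stdio headers are assumed):

    /* hypothetical layout; only the creation time fields matter here */
    struct stat_more_sketch {
            __u64 valid_bytes;      /* set to sizeof() so versions interoperate */
            __u64 crtime_sec;       /* new inode creation time */
            __u32 crtime_nsec;
            __u32 _pad;
    };

    static void print_crtime(int fd)
    {
            struct stat_more_sketch stm = { .valid_bytes = sizeof(stm) };

            /* SCOUTFS_IOC_STAT_MORE assumed to come from the fs ioctl header */
            if (ioctl(fd, SCOUTFS_IOC_STAT_MORE, &stm) == 0)
                    printf("created %llu.%09u\n",
                           (unsigned long long)stm.crtime_sec, stm.crtime_nsec);
    }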
Signed-off-by: Zach Brown <zab@versity.com>
Returning ENOSPC is challenging because we have clients working on
allocators which are a fraction of the whole, and because we use COW
transactions we need to be able to allocate in order to free. This adds
support for returning ENOSPC to client posix allocators as free space
gets low.
For metadata, we reserve a number of free blocks for making progress
with client and server transactions which can free space. The server
sets the low flag in a client's allocator if we start to dip into
reserved blocks. In the client we add an argument to entering a
transaction which indicates if we're allocating new space (as opposed to
just modifying existing data or freeing). When an allocating
transaction runs low and the server low flag is set then we return
ENOSPC.
Adding an argument to transaction holders and having it return ENOSPC
gave us the opportunity to clean it up and make it a little clearer.
More work is done outside the wait_event function and it now
specifically waits for a transaction to cycle when it forces a commit
rather than spinning until the transaction worker acquires the lock and
stops it.
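The client side check ends up looking roughly like this (a sketch of the
pattern; the function and flag names are illustrative, not the real
code):

    /* sketch: enter a transaction, failing allocating holders when space is low */
    static int hold_trans_sketch(struct super_block *sb, bool allocating)
    {
            for (;;) {
                    if (allocating && client_meta_alloc_low(sb)) {
                            /*
                             * The server set the low flag because we're dipping
                             * into reserved blocks.  New space is refused, but
                             * holders that only modify or free keep going so
                             * that freeing can make progress.
                             */
                            return -ENOSPC;
                    }

                    if (try_hold_trans(sb))
                            return 0;

                    /* wait for the running transaction to cycle, then retry */
                    wait_for_trans_cycle(sb);
            }
    }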
For data the same pattern applies except there are no reserved blocks
and we don't COW data so it's a simple case of returning the hard ENOSPC
when the data allocator flag is set.
The server needs to consider the reserved count when refilling the
client's meta_avail allocator and when swapping between the
meta_avail and meta_free allocators.
We add the reserved metadata block count to statfs_more so that df can
subtract it from the free meta blocks and make it clear when enospc is
going to be returned for metadata allocations.
We increase the minimum device size in mkfs so that small testing
devices provide sufficient reserved blocks.
And finally we add a little test that makes sure we can fill both
metadata and data to ENOSPC and then recover by deleting what we filled.
Signed-off-by: Zach Brown <zab@versity.com>
Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.
Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.
RH-compat: tmpfile support is actually backported by RH into the 3.10 kernel.
We need to use some of their kabi-maintaining wrappers to use it:
use a struct inode_operations_wrapper instead of base struct
inode_operations, set S_IOPS_WRAPPER flag in i_flags. This lets
RH's modified vfs_tmpfile() find our tmpfile fn pointer.
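The pattern looks roughly like this (a sketch from memory of the RH kabi
wrappers; our function names here are placeholders):

    /* sketch: placeholder scoutfs function names */
    static const struct inode_operations_wrapper scoutfs_dir_iops_wrapper = {
            .ops = {
                    .lookup = scoutfs_lookup,
                    .create = scoutfs_create,
                    /* ... the rest of the usual dir inode_operations ... */
            },
            .tmpfile = scoutfs_tmpfile,     /* lives in the wrapper, not .ops */
    };

    /* when setting up a directory inode: */
    inode->i_op = &scoutfs_dir_iops_wrapper.ops;
    inode->i_flags |= S_IOPS_WRAPPER;       /* vfs_tmpfile() checks this flag */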
Add a test that covers both creating tmpfiles and moving their
contents into a destination file via MOVE_BLOCKS.
xfstests common/004 now runs because tmpfile is supported.
Signed-off-by: Andy Grover <agrover@versity.com>
Add a relatively constrained ioctl that moves extents between regular
files. This is intended to be used by tasks which combine many existing
files into a much larger file without reading and writing all the file
contents.
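As an illustration of the intended use, a tool combining files might
loop over sources and move their extents to the tail of the
destination; the struct and ioctl names below are placeholders, not the
real interface:

    /* placeholder struct; the real one lives in the exported ioctl header */
    struct move_blocks_sketch {
            __u64 from_fd;          /* source file descriptor */
            __u64 from_off;         /* byte offset in the source */
            __u64 len;              /* bytes to move */
            __u64 to_off;           /* byte offset in the destination */
    };

    static int append_extents(int to_fd, int from_fd, __u64 to_off, __u64 len)
    {
            struct move_blocks_sketch mb = {
                    .from_fd = from_fd, .from_off = 0,
                    .len = len, .to_off = to_off,
            };

            /* extents are re-linked into the destination, no data is copied */
            return ioctl(to_fd, SCOUTFS_IOC_MOVE_BLOCKS, &mb);
    }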
Signed-off-by: Zach Brown <zab@versity.com>
By convention we have the _IO* ioctl definition after the argument
structs and ALLOC_DETAIL got it a bit wrong so move it down.
Signed-off-by: Zach Brown <zab@versity.com>
This more closely matches the stage ioctl and other conventions.
Also change release code to use offset/length nomenclature for consistency.
Signed-off-by: Andy Grover <agrover@versity.com>
Prefer named to anonymous enums. This helps readability a little.
Use enum as param type if possible (a couple spots).
Remove unused enum in lock_server.c.
Define enum spbm_flags using shift notation for consistency.
Rename get_file_block()'s "gfb" parameter to "flags" for consistency.
Signed-off-by: Andy Grover <agrover@versity.com>
Instead, explicitly add padding fields, and adjust member ordering to
eliminate compiler-added padding between members and at the end of the
struct (when possible: some structs end in a u8[0] array).
This should prevent unaligned accesses. Not a big deal on x86_64, but
other archs like aarch64 really want this.
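For example (an illustrative struct, not one of ours):

    /* before: the compiler inserts 4 hidden bytes after 'flags' to align 'count' */
    struct example_before {
            __u32 flags;
            __u64 count;
    };

    /* after: members ordered largest first, any hole made explicit */
    struct example_after {
            __u64 count;
            __u32 flags;
            __u32 _pad;     /* explicit padding, zeroed by callers */
    };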
Signed-off-by: Andy Grover <agrover@versity.com>
The total_{meta,data}_blocks scoutfs_super_block fields initialized by
mkfs aren't visible to userspace anywhere. Add them to statfs_more so
that tools can get the totals (and use them for df, in this particular
case).
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl which copies details of each persistent allocator to
userspace. This will be used by a scoutfs command to give information
about the allocators in the system.
Signed-off-by: Zach Brown <zab@versity.com>
Add the committed_seq to statfs_more which gives the greatest seq which
has been committed. This lets callers discover that a seq for a change
they made has been committed.
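A caller can poll for it to find out when its change is durable; this
sketch assumes the struct and ioctl names, only committed_seq comes
from this patch:

    /* sketch: returns > 0 once the change behind my_seq is durable */
    static int change_committed(int fd, __u64 my_seq)
    {
            struct statfs_more_sketch {
                    __u64 valid_bytes;
                    __u64 committed_seq;
                    /* ... fsid, block counts, etc ... */
            } sfm = { .valid_bytes = sizeof(sfm) };

            if (ioctl(fd, SCOUTFS_IOC_STATFS_MORE, &sfm) < 0)
                    return -1;

            /* committed_seq only moves forward, so >= means committed */
            return sfm.committed_seq >= my_seq;
    }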
Signed-off-by: Zach Brown <zab@versity.com>
Using strictly coherent btree items to map the hash of xattr names to
inode numbers proved the value of the functionality, but it was too
expensive. We now have the more efficient srch infrastructure to use.
We change from the .indx. to the .srch. tag, and change the ioctl from
find_xattr to search_xattrs. The idea is to communicate that these are
accelerated searches, not precise index lookups and are relatively
expensive.
Rather than maintaining btree items, xattr setting and deleting emit
srch entries which either track the xattr or combine with the previous
tracker and remove the entry. Because these are done under the lock
that protects the main xattr item, we can remove the separate locking
of the previous index items.
The semantics of the search ioctl need to change a bit. Because
searches are so expensive we now return a flag to indicate that the
search completed. While we're there, we also allow a last_ino parameter
so that searches can be divided up and run in parallel.
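Callers are expected to loop until the completion flag is set, roughly
like this sketch (sx is the ioctl argument struct, inos its output
buffer; field and flag names approximate the interface described
above):

    sx.first_ino = range_start;
    sx.last_ino = range_end;        /* slices let searches run in parallel */

    for (;;) {
            long nr = ioctl(fd, SCOUTFS_IOC_SEARCH_XATTRS, &sx);
            if (nr < 0)
                    break;

            /*
             * The returned inodes *may* have the xattr; the caller has to
             * check them before acting, this is an accelerated search and
             * not a precise index.
             */
            process_candidates(inos, nr);

            if (sx.output_flags & SEARCH_XATTRS_DONE)
                    break;                          /* the search completed */
            sx.first_ino = inos[nr - 1] + 1;        /* resume after the last hit */
    }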
Signed-off-by: Zach Brown <zab@versity.com>
Add support for reporting errors to data waiters via a new
SCOUTFS_IOC_DATA_WAIT_ERR ioctl. This allows waiters to return an error
to readers when staging fails.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
[zab: renamed to data_wait_err, took ino arg]
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can fill a user struct with file system info. We're
going to use this to find the fsid and rid of a mount.
Signed-off-by: Zach Brown <zab@versity.com>
Our hidden attributes are hidden so that they don't leak out of
the system when archiving tools transfer xattrs from listxattr along
with the file. They're not intended to be secret; in fact, users want
to see their contents just as they want to see other fs metadata that
describes the system but that they can't update.
Make our listxattr ioctl only return hidden xattrs and allow anyone to
see the results if they can read the file. Rename it to more
accurately describe its intended use.
Signed-off-by: Zach Brown <zab@versity.com>
Order the ioctl struct field definitions and add padding so that
runtimes with different word sizes don't add different padding.
Userspace is spared having to deal with packing and we don't
have to worry about compat translation in the kernel.
We had two persistent structures that crossed the ioctl, a key and a
timespec, so we explicitly translate to and from their persistent types
in the ioctl.
Signed-off-by: Zach Brown <zab@versity.com>
Add a .indx. xattr tag which adds the inode to an index of inodes keyed
by the hash of xattr names. An ioctl is added which then returns all
the inodes which may contain an xattr of the given name. Dropping all
xattrs now has to parse the name to find out if it also has to delete an
index item.
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl which can be used to iterate over the keys for all the
xattrs on an inode. It is privileged, can see hidden xattrs, and has an
iteration cursor so that it can make its way through very large numbers
of xattrs.
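The calling pattern is roughly the following sketch (struct fields and
the ioctl name are approximations, assuming an ino and a caller-owned
buf[]):

    /* sketch: iterate every xattr name on an inode with a resumable cursor */
    struct list_xattrs_sketch {
            __u64 ino;              /* inode to list, privilege required */
            __u64 id_pos;           /* iteration cursor, zero to start */
            __u32 hash_pos;         /* second half of the cursor */
            __u32 buf_bytes;
            __u64 buf_ptr;          /* receives null terminated names */
    } lx = { .ino = ino, .buf_ptr = (unsigned long)buf, .buf_bytes = sizeof(buf) };

    for (;;) {
            long bytes = ioctl(fd, SCOUTFS_IOC_LIST_XATTRS, &lx);
            if (bytes <= 0)
                    break;          /* error, or no more names */

            for (char *name = buf; name < buf + bytes; name += strlen(name) + 1)
                    puts(name);     /* the cursor in lx advances for the next call */
    }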
Signed-off-by: Zach Brown <zab@versity.com>
Add an ioctl that can be used by userspace to restore a file to its
offline state. To do that it needs to set inode fields that are
otherwise not exposed and create an offline extent.
Signed-off-by: Zach Brown <zab@versity.com>
One of the core features of scoutfs is the ability to transparently
migrate file contents to and from an archive tier. For this to be
transparent we need file system operations to trigger staging the file
contents back into the file system as needed.
This adds the infrastructure which operations use to wait for offline
extents to come online and which provides userspace with a list of
blocks that the operations are waiting for.
We add some waiting infrastructure that callers use to lock, check for
offline extents, and unlock and wait before checking again to see if
they're still offline. We add these checks and waiting to data io
operations that could encounter offline extents.
This has to be done carefully so that we don't wait while holding locks
that would prevent staging. We use per-task structures to discover when
we are the first user of a cluster lock on an inode, indicating that
it's safe for us to wait because we don't hold any locks.
And while we're waiting our operation is tracked and reported to
userspace through an ioctl. This is a non-blocking ioctl; it's up to
userspace to decide how often to check and how large a region to stage.
Waiters are woken up when the file contents could have changed, not
specifically when we know that the extent has come online. This lets us
wake waiters when their lock is revoked so that they can block waiting
to reacquire the lock and test the extents again. It lets us provide
coherent demand staging across the cluster without fine grained waiting
protocols sent between the nodes. It may result in some spurious wakeups
and work but hopefully it won't, and it's a very simple and functional
first pass.
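The per-operation flow is roughly the following (a simplified sketch of
the control flow, not the real helpers):

    /* sketch: read path waiting for an offline extent to come online */
    static int read_block_sketch(struct inode *inode, u64 blkno)
    {
            int ret;

            for (;;) {
                    lock_inode_cluster(inode);

                    if (!block_is_offline(inode, blkno)) {
                            ret = do_read(inode, blkno);
                            unlock_inode_cluster(inode);
                            return ret;
                    }

                    /*
                     * Record the waiter so the ioctl can report it, drop the
                     * lock so staging isn't blocked, and wait for any wakeup
                     * that means the extents could have changed.
                     */
                    ret = track_data_waiter(inode, blkno);
                    unlock_inode_cluster(inode);
                    if (ret)
                            return ret;
                    wait_for_possible_change(inode, blkno);
            }
    }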
Signed-off-by: Zach Brown <zab@versity.com>
Variable length keys lead to having a key struct point to the buffer
that contains the key. With dirents and xattrs now using small keys we
can convert everyone to using a single key struct and significantly
simplify the system.
We no longer have a separate generic key buf struct that points to
specific per-type key storage. All items use the key struct and fill
out the appropriate fields. All the code that paired a generic key buf
struct and a specific key type struct is collapsed down to a key struct.
There's no longer the difference between a key buf that shares a
read-only key, has its own precise allocation, or has a max size
allocation for incrementing and decrementing.
Each key user now has an init function that fills out its fields. It
looks a lot like the old pattern but we no longer have separate key storage that
the buf points to.
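So a key user now looks something like this (illustrative field and
constant names, not the exact struct):

    /* illustrative: one fixed size key struct shared by every item type */
    struct key_sketch {
            __u8    zone;
            __u8    type;
            __le64  first;
            __le64  second;
    };

    static void init_xattr_key(struct key_sketch *key, u64 ino, u64 name_hash)
    {
            *key = (struct key_sketch) {
                    .zone   = FS_ZONE,
                    .type   = XATTR_TYPE,
                    .first  = cpu_to_le64(ino),
                    .second = cpu_to_le64(name_hash),
            };
    }

    /* callers keep keys on the stack; no separate key storage to point at */
    struct key_sketch key;
    init_xattr_key(&key, ino, name_hash);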
A bunch of code now takes the address of static key storage instead of
managing allocated keys. Conversely, swapping now uses the full keys
instead of pointers to the keys.
We don't need all the functions that worked on the generic key buf
struct because they had different lengths. Copy, clone, length init,
memcpy, all of that goes away.
The item API had some functions that tested the length of keys and
values. The key length tests vanish, and that gets rid of the _same()
call. The _same_min() call only had one user who didn't also test for
the value length being too large. Let's leave caller key constraints in
callers instead of trying to hide them on the other side of a bunch of
item calls.
We no longer have to track the number of key bytes when calculating if
an item population will fit in segments. This removes the key length
from reservations, transactions, and segment writing.
The item cache key querying ioctls no longer have to deal with variable
length keys. They simply specify the start key, the ioctls return the
number of keys copied instead of bytes, and the caller is responsible
for incrementing the next search key.
The segment no longer has to store the key length. It stores the key
struct in the item header.
The fancy variable length key formatting and printing can be removed.
We have a single format for the universal key struct. The SK_ wrappers
that bracketed calls to use preempt-safe per-cpu buffers can turn back
into their normal calls.
Manifest entries are now a fixed size. We can simply split them between
btree keys and values and initialize them instead of allocating them.
This means that level 0 entries don't have their own format that sorts
by the seq. They're sorted by the key like all the other levels.
Compaction needs to sweep all of them looking for the oldest and read
can stop sweeping once it can no longer overlap. This makes rare
compaction more expensive and common reading less expensive, which is
the right tradeoff.
Signed-off-by: Zach Brown <zab@versity.com>
Directory entries were the last items that had large variable length
keys because they stored the entry name in the key. We'd like to have
small fixed size keys so let's store dirents with small keys.
Entries for lookup are stored at the hash of the name instead of the
full name. The key also contains the unique readdir pos so that we
don't have to deal with collision on creation. The lookup procedure now
does need to iterate over all the readdir positions for the hash value
and compare the names.
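The lookup loop amounts to something like this sketch (helper names are
illustrative):

    /* sketch: find a name under its hashed dirent key */
    static int lookup_dirent_sketch(u64 dir_ino, const char *name, unsigned len,
                                    u64 *ino_ret)
    {
            u64 hash = dirent_name_hash(name, len);
            u64 pos = 0;
            struct dirent_val dent;

            /* walk every readdir pos stored under this dir/hash key */
            while (!next_dirent_item(dir_ino, hash, pos, &dent)) {
                    if (dent.name_len == len && !memcmp(dent.name, name, len)) {
                            *ino_ret = dent.ino;    /* full name matched */
                            return 0;
                    }
                    pos = dent.pos + 1;
            }

            return -ENOENT;
    }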
Entries for link backref walking are stored with the entry's position in
the parent dir instead of the entry's name. The name is then stored in
the value. Inode to path conversion can still walk the backref items
without having to lookup dirent items.
These changes mean that all directory entry items are now stored at a
small key with some u64s (hash, pos, parent dir, etc) and have a value
with the dirent struct and full entry name. This lets us use the same
key and value format for the three entry key types. We no longer have
to allocate keys, we can store them on the stack.
We store the entry's hash and pos in the dirent struct in the item value
so that any item has all the fields to reference all the other item
keys. We store the same values in the dentry_info so that deletion
(unlink and rename) can find all the entries.
The ino_path ioctl can now much more clearly iterate over parent
directories and entry positions instead of oh so cleverly iterating over
null terminated names in the parent directories. The ioctl interface
structs and implementation become simpler.
Signed-off-by: Zach Brown <zab@versity.com>
We aren't using the size index. It has runtime and code maintenance
costs that aren't worth paying. Let's remove it.
Removing it from the format and no longer maintaining it are
straightforward.
The bulk of this patch is actually the act of removing it from the index
locking functions. We no longer have to predict the size that will be
stored during the transaction to lock the index items that will be
created during the transaction. A bunch of code to predict the size and
then pass it into locking and transactions goes away. Like other inode
fields we now update the size as it changes.
Signed-off-by: Zach Brown <zab@versity.com>
We're going to be strictly enforcing matching format.h and ioctl.h
between userspace and kernel space. Let's get the exported kernel
function definition out of ioctl.h.
Signed-off-by: Zach Brown <zab@versity.com>
Like the mtime index, this index is unused. Removing it is a near
identical task. Running the same createmany test from our last
patch gives us the following:
$ createmany -o '/scoutfs/file_%lu' 10000000
total: 10000000 creates in 598.28 seconds: 16714.59 creates/second
real 9m58.292s
user 0m7.420s
sys 5m44.632s
So after both indices are gone, we go from a 12m56s run time to 9m58s,
saving almost 3 minutes which translates into a total performance
increase of about 23%.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
This index is unused - we can gain some create performance by removing it.
To verify this, I ran createmany for 10 million files:
$ createmany -o '/scoutfs/file_%lu' 10000000
Before this patch:
total: 10000000 creates in 776.54 seconds: 12877.56 creates/second
real 12m56.557s
user 0m7.861s
sys 6m56.986s
After this patch:
total: 10000000 creates in 691.92 seconds: 14452.46 creates/second
real 11m31.936s
user 0m7.785s
sys 6m19.328s
So removing the index gained us about a minute and a half on the test or a
12% performance increase.
Signed-off-by: Mark Fasheh <mfasheh@versity.com>
The existing release interface specified byte regions to release but
that didn't match what the underlying file data mapping structure is
capable of. What happens if you specify a single byte to release? Does
it release the whole block? Does it release nothing? Does it return an
error?
By making the interface match the capability of the operation we make
the functioning of the system that much more predictable. Callers are
forced to think about implementing their desires in terms of block
granular releasing.
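So a release call now looks roughly like this (struct and ioctl names
are placeholders):

    /* placeholder struct: the interface only speaks in whole blocks */
    struct release_sketch {
            __u64 block;            /* first block to release */
            __u64 count;            /* number of blocks */
            __u64 data_version;     /* must still match the file's version */
    };

    struct release_sketch rel = {
            .block = start_block,           /* caller already rounded to blocks */
            .count = nr_blocks,
            .data_version = vers,
    };
    ret = ioctl(fd, SCOUTFS_IOC_RELEASE, &rel);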
Signed-off-by: Zach Brown <zab@versity.com>
Raw [su]{8,16,32,64} types keep leaking into our exported headers where
they break userspace builds. Make sure that we only use the exported __
types and add a check to break our build if we get it wrong.
Signed-off-by: Zach Brown <zab@versity.com>
For each transaction we send a message to the server asking for a
unique sequence number to associate with the transaction. When we
change metadata or data of an inode we store the current transaction seq
in the inode and we index it with index items like the other inode
fields.
The server remembers the sequences it gives out. When we go to walk the
inode sequence indexes we ask the server for the largest stable seq and
limit results to that seq. This ensures that we never return seqs that
could still have dirty items, and so inodes and seqs never appear in the past.
Nodes use the sync timer to regularly cycle through seqs and ensure that
inode seq index walks don't get stuck on their otherwise idle seq.
Signed-off-by: Zach Brown <zab@versity.com>
Add items for indexing inodes by their fields. When we update the inode
item we also delete the old index items and create the new items. We
rename and refactor the old inode since ioctl to now walk the inode
index items.
Signed-off-by: Zach Brown <zab@versity.com>
For consistency and to keep upstream users (scout-utils, etc) from
needing to include different type headers, we'll change the type to
match the rest of the header.
Signed-off-by: Nic Henke <nic.henke@versity.com>
The current plan for finding populations of inodes to search no longer
involves xattr backrefs. We're about to change the xattr storage format
so let's remove these interfaces so we don't have to update them.
Signed-off-by: Zach Brown <zab@versity.com>
Convert the link backref code from btree items to the item cache.
Now that the backref items have the full entry name we can traverse a
link with one item lookup. We don't need to lock the inode and verify
that the entry at the backref offset really points to our inode. The
link backref walk gets a lot simpler.
But we have to widen the ioctl cursor to store a full dir ino and path
name instead of just the dir's backref counter.
Signed-off-by: Zach Brown <zab@versity.com>
This adds the ioctl for writing archived file contents back into the
file if the data_version still matches.
Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
Add the _OFFLINE flag to indicate offline extents. The release ioctl
frees extents within the release range and sets their _OFFLINE flag if
the data_version still matches.
We tweak the existing truncate item function just a bit to support
making extents offline. We make it take an explicit range of blocks to
remove instead of just giving it the size and it learns to mark extents
offline and update them instead of always deleting them.
Reads from offline extents return zeros like reading from a sparse
region (later it will trigger demand staging) and writing to offline
extents clears the offline flag (later only staging can do that).
Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
A few xfstests tests were failing because they tried to create a decent
number of hard links to a file.
We had a small nlink limit because the inode-paths ioctl copied all the
paths for all the hard links to a userspace buffer which could be
enormous if there was a larger nlink limit.
The hard link backref disk format already has a natural counter that
could be used as a cursor to iterate over all the hard links that point
to a given inode.
This refactors the inode_paths ioctl into an ino_path ioctl that returns
a single path for the given counter and returns the counter for the next
path that links to the inode. Happily this lets us get rid of all the
weird path component lists and allocations. Now there's just the kernel
path buffer that gets null terminated path components and the userspace
buffer that those are copied to.
We don't fully relax the nlink limit. stat(2) returns the link count as
a u32. We go a step further and limit it to S32_MAX so that apps might
avoid sign bugs. That still gives us a more generous limit than ext4
and btrfs which are around U16_MAX.
Signed-off-by: Zach Brown <zab@versity.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
We don't overwrite existing data. Every file data write has to allocate
new blocks and update block mapping items.
We can search for inodes whose data has changed by filtering block
mapping item walks by the sequence number. We do this by using the
exact same code for finding changed inodes but using the block mapping
key type.
Signed-off-by: Zach Brown <zab@versity.com>
Add ioctls that return the inode numbers that probably contain the given
xattr name or value. To support these we add items that index inodes by
the presence of xattr items whose names or values hash to a given hash
value.
Signed-off-by: Zach Brown <zab@versity.com>
This adds the ioctl that returns all the paths from the root to a given
inode. The implementation only traverses btree items to keep it
isolated from the vfs object locking and life cycles, but that could be
a performance problem. This is another motivation to accelerate the
btree code.
Signed-off-by: Zach Brown <zab@versity.com>
Oh, thank goodness. It turns out that there's a crash extension for
working with tracepoints in crash dumps. Let's use standard tracepoints
and pretend this tracing hack never happened.
Signed-off-by: Zach Brown <zab@versity.com>
Add the ioctl that lets us find out about inodes that have changed
since a given sequence number.
A sequence number is added to the btree items so that we can track the
tree update that it last changed in. We update this as we modify
items and maintain it across item copying for splits and merges.
The big change is using the parent item ref and item sequence numbers
to guide iteration over items in the tree. The easier change is to have
the current iteration skip over items whose sequence number is too old.
The more subtle change has to do with how iteration is terminated. The
current termination could stop when it doesn't find an item because that
could only happen at the final leaf. When we're ignoring items with old
seqs this can happen at the end of any leaf. So we change iteration to
keep advancing through leaf blocks until it crosses the last key value.
We add an argument to btree walking which communicates the next key that
can be used to continue iterating from the next leaf block. This works
for the normal walk case as well as the seq walking case where walking
terminates prematurely in an interior node full of parent items with old
seqs.
Now that we're more robustly advancing iteration with btree walk calls
and the next key we can get rid of the 'next_leaf' hack which was trying
to do the same thing inside the btree walk code. It wasn't right for
the seq walking case and was pretty fiddly.
The next_key increment could wrap the maximal key at the right spine of
the tree so we have _inc saturate instead of wrap.
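The increment ends up along these lines (simplified here to a single
value; the real key has multiple fields):

    /* sketch: advance to the next possible key without wrapping past the max */
    static void key_inc_saturating(u64 *key)
    {
            if (*key != U64_MAX)
                    (*key)++;
            /* at the right spine the "next" key stays maximal so iteration
             * terminates instead of wrapping back to the start */
    }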
And finally, we want these inode scans to not have to skip over all the
other items associated with each inode as it walks looking for inodes
with the given sequence number. We change the item sort order to first
sort by type instead of by inode. We've wanted this more generally to
isolate item types that have different access patterns.
Signed-off-by: Zach Brown <zab@versity.com>
This adds tracing functionality that's cheap and easy to
use. By constantly gathering traces we'll always have rich
history to analyze when something goes wrong.
Signed-off-by: Zach Brown <zab@versity.com>