Greg KH tells us to do just this in v5.4-rc5-31-g9927c6fa3e1d:
No one checks the return value of debugfs_create_atomic_t(),
as it's not needed, so make the return value void, so that no
one tries to do so in the future.
Signed-off-by: Auke Kok <auke.kok@versity.com>
v5.12-rc6-9-g4f0f586bf0c8
All list_sort functions use the list_cmp_func_t type, which compares
list_head member types. These are now required to be `const` as the
compiler will now check them. This propagates into our callers.
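As a rough illustration of what the const requirement means for a caller, here is a userspace sketch. struct list_head and container_of are minimal stand-ins for the kernel versions, and struct demo_item and demo_cmp_items() are hypothetical names, not taken from the scoutfs source.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's list_head; illustrative only. */
struct list_head {
	struct list_head *next, *prev;
};

/* Simplified container_of that keeps the pointer const while adjusting it. */
#define container_of(ptr, type, member) \
	((type *)((const char *)(ptr) - offsetof(type, member)))

/* Hypothetical item type, not taken from the scoutfs source. */
struct demo_item {
	unsigned long long key;
	struct list_head head;
};

/*
 * Same shape as list_cmp_func_t: both list_head arguments are const, so
 * the containing structs are only reached through const pointers.
 */
static int demo_cmp_items(void *priv, const struct list_head *a,
			  const struct list_head *b)
{
	const struct demo_item *ia = container_of(a, const struct demo_item, head);
	const struct demo_item *ib = container_of(b, const struct demo_item, head);

	(void)priv;
	return ia->key < ib->key ? -1 : ia->key > ib->key ? 1 : 0;
}
```

Any helper the comparison calls has to accept const pointers as well, which is how the change propagates into callers.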
Signed-off-by: Auke Kok <auke.kok@versity.com>
v5.7-rc2-1174-gfd4f12bc38c3 significantly rewrites the bpf iterator
which hits this _next() function. It also adds a check that verifies
that the *pos is incremented after every call, even if it goes beyond
the last member (in which case it's not used).
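A minimal userspace sketch of a _next() that satisfies this check follows. It drops the struct seq_file argument of the real callback, and demo_entries and demo_seq_next() are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Hypothetical fixed table standing in for whatever the iterator walks. */
#define DEMO_NR_ENTRIES 3
static int demo_entries[DEMO_NR_ENTRIES] = { 10, 20, 30 };

/*
 * Shaped like a seq_file ->next() callback, minus the seq_file argument.
 * *pos is advanced on every call, including the call that runs off the
 * end and returns NULL.
 */
static void *demo_seq_next(void *v, long long *pos)
{
	(void)v;

	++*pos;
	if (*pos < 0 || *pos >= DEMO_NR_ENTRIES)
		return NULL;
	return &demo_entries[*pos];
}
```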
Signed-off-by: Auke Kok <auke.kok@versity.com>
v5.11-rc4-7-g2f221d6f7b88 changes the setattr_prepare declaration from
an extern to a plain int. The compat code needs no further changes to
keep working, apart from the detection regex.
Signed-off-by: Auke Kok <auke.kok@versity.com>
We could use sizeof_field as a direct replacement (which is the same)
except that this entire thing can directly use offsetofend().
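For reference, a userspace sketch of the two macros, modeled on the kernel's definitions. struct demo_args is a hypothetical struct used only to show the arithmetic.

```c
#include <stddef.h>

/* Userspace copies modeled on the kernel's sizeof_field and offsetofend. */
#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER))

/* Hypothetical struct, not taken from the scoutfs source. */
struct demo_args {
	unsigned long long first;
	unsigned int second;
	unsigned int third;
};

/* Offset of the first byte past 'second', with no open-coded addition. */
static size_t demo_end_of_second(void)
{
	return offsetofend(struct demo_args, second);
}
```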
Signed-off-by: Auke Kok <auke.kok@versity.com>
The wrapper in setattr_more that translates the operations to attr_x
needs to decide whether to ask attr_x to perform a change to any of
the fields passed to it or not. For the date and size fields this
is implicit - we always tell attr_x to change them. For any of the
other fields, it should be explicit.
The only field that is in the struct that this applies to is
data_version. Because the data version field by default is zero,
we use that as condition to decide whether to pass the data_version
down to attr_x.
Previously, the code would always pass a data_version=0 down to attr_x,
triggering one of the validity checks, making it return -EINVAL. We
add a simple test case to test for this issue.
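The decision can be sketched roughly as below. The struct layouts, mask bits, and demo_translate() are hypothetical and only mirror the logic described here; the real attr_x definitions differ.

```c
#include <stdint.h>

/* Hypothetical mask bits; the real attr_x bits are defined elsewhere. */
#define DEMO_AX_MTIME        (1ULL << 0)
#define DEMO_AX_SIZE         (1ULL << 1)
#define DEMO_AX_DATA_VERSION (1ULL << 2)

/* Hypothetical, trimmed-down argument structs. */
struct demo_setattr_more {
	uint64_t data_version;
	uint64_t i_size;
	uint64_t mtime_sec;
};

struct demo_attr_x {
	uint64_t x_mask;
	uint64_t data_version;
	uint64_t size;
	uint64_t mtime_sec;
};

/* Translate setattr_more arguments into an attr_x request. */
static void demo_translate(const struct demo_setattr_more *sm,
			   struct demo_attr_x *ax)
{
	/* date and size are implicit: always ask attr_x to change them */
	ax->x_mask = DEMO_AX_MTIME | DEMO_AX_SIZE;
	ax->size = sm->i_size;
	ax->mtime_sec = sm->mtime_sec;

	/* data_version is explicit: only pass it down when it is non-zero */
	ax->data_version = 0;
	if (sm->data_version != 0) {
		ax->x_mask |= DEMO_AX_DATA_VERSION;
		ax->data_version = sm->data_version;
	}
}
```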
Signed-off-by: Auke Kok <auke.kok@versity.com>
These new shrinkers were recently added. Because there are very few
ways to debug them, or even to see them function properly, we should at
least add counters for them.
Signed-off-by: Auke Kok <auke.kok@versity.com>
In 29160b0b I mistakenly disabled all caching of ACLs for el8
instead of only disabling cache lookups. The correct change
should have been to disable cache lookups only, and leave setting the
acl cache after storing or fetching, as the kernel needs this data
to resolve acls when doing permission checks.
Restoring the acl cache insertions fixes this.
Signed-off-by: Auke Kok <auke.kok@versity.com>
Add support for the indx xattr tag which lets xattrs determine the sort
order of their inode numbers in a global index.
Signed-off-by: Zach Brown <zab@versity.com>
Add support for project IDs. They're managed through the _attr_x
interfaces and are inherited from the parent directory during creation.
Signed-off-by: Zach Brown <zab@versity.com>
Change the read_xattr_totals ioctl to use the weak item cache instead of
manually reading and merging the fs items for the xattr totals on every
call.
Signed-off-by: Zach Brown <zab@versity.com>
The _READ_XATTR_TOTALS ioctl had manual code for merging the .totl.
total and value while reading fs items. We're going to want to do this
in another reader so let's put these in their own functions that clearly
isolate the logic of merging the fs items into a coherent result.
We can get rid of some of the totl_read_ counters that tracked which
items we were merging. They weren't adding much value and conflated the
reading ioctl interface with the merging logic.
Signed-off-by: Zach Brown <zab@versity.com>
Add a forest item reading interface that lets the caller specify the net
roots instead of always getting them from a network request.
Signed-off-by: Zach Brown <zab@versity.com>
Add the weak item cache that is used for reads that can handle results
being a little behind. This gives us a lot more freedom to implement
the cache that biases concurrent reads.
Signed-off-by: Zach Brown <zab@versity.com>
Add a bit to the private scoutfs inode flags which indicates that the
inode is in retention mode. The bit is visible through the _attr_x
interface. It can only be set on regular files and when set it prevents
modification to all but non-user xattrs. It can be cleared by root.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we have the attr_x calls we can implement stat_more with
get_attr_x and setattr_more with set_attr_x.
The conversion of stat_more fixes a surprising consistency bug.
stat_more wasn't acquiring a cluster lock for the inode nor refreshing
it so it could have returned stale data if modifications were made in
another mount.
Signed-off-by: Zach Brown <zab@versity.com>
The existing stat_more and setattr_more interfaces aren't extensible.
This solves that problem by adding attribute interfaces which specify
the specific fields to work with.
We're about to add a few more inode fields and it makes sense to add
them to this extensible structure rather than adding more ioctls or
relatively clumsy xattrs. This is modeled loosely on the upstream
kernel's statx support.
The ioctl entry points call core functions so that we can also implement
the existing stat_more and setattr_more interfaces in terms of these new
attr_x functions.
Signed-off-by: Zach Brown <zab@versity.com>
Initially setattr_more followed the general pattern where extent
manipulation might require multiple transactions if there are lots of
extent items to work with. The scoutfs_data_init_offline_extent()
function that creates an offline extent handled transactions itself.
But in this case the call only supports adding a single offline extent.
It will always use a small fixed amount of metadata and could be
combined with other metadata changes in one atomic transaction.
This changes scoutfs_data_init_offline_extent() to have the caller
handle transactions, inode updates, etc. This lets the caller perform
all the restore changes in one transaction. This interface change will
then be used as we add another caller that adds a single offline extent
in the same way.
Signed-off-by: Zach Brown <zab@versity.com>
Add a little inline helper to test whether the mounted format version
supports a feature or not, returning an errno that callers can use when
they can return a shared expected error.
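A helper along these lines might look like the sketch below. struct demo_sb_info, demo_fmt_vers_unsupported(), and the choice of -EOPNOTSUPP are all assumptions made for illustration, not the names or errno used in the tree.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-mount info holding the mounted format version. */
struct demo_sb_info {
	uint64_t fmt_vers;
};

/*
 * Return 0 if the mounted format version is at least 'vers', otherwise a
 * negative errno that callers can return directly.  -EOPNOTSUPP is an
 * assumed choice for this sketch.
 */
static inline int demo_fmt_vers_unsupported(const struct demo_sb_info *sbi,
					     uint64_t vers)
{
	return sbi->fmt_vers < vers ? -EOPNOTSUPP : 0;
}
```

A caller can then bail out early with something like "ret = demo_fmt_vers_unsupported(sbi, 2); if (ret) return ret;".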
Signed-off-by: Zach Brown <zab@versity.com>
We're about to add new format structures so increment the max version to
2. Future commits will add the features before we release version 2
into the wild.
Signed-off-by: Zach Brown <zab@zabbo.net>
We're about to increase the inode size and increment the format version.
Inode reading and writing has to handle different valid inode sizes as
allowed by the format version. This is the initial skeletal work that
later patches which really increase the inode size will further refine
to add the specific known sizes and format versions.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
[zab@versity.com: reworded description, reworked to use _within]
Signed-off-by: Zach Brown <zab@versity.com>
Add a lookup variant that returns an error if the item value is larger
than the caller's value buffer size and which zeros the rest of the
caller's buffer if the returned value is smaller.
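The copy semantics can be sketched in userspace as follows. demo_copy_value() is a hypothetical name and -EIO is an assumed error code; the sketch only shows the size check and the zeroing of the tail, not the item lookup itself.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy an item value into the caller's buffer.  Fails if the value is
 * larger than the buffer, and zeros whatever part of the buffer the
 * value did not fill.  Returns the value length on success.
 */
static int demo_copy_value(void *buf, size_t buf_len,
			   const void *val, size_t val_len)
{
	if (val_len > buf_len)
		return -EIO;	/* assumed errno for this sketch */

	if (val_len)
		memcpy(buf, val, val_len);
	memset((char *)buf + val_len, 0, buf_len - val_len);

	return (int)val_len;
}
```

With this shape a caller with a newer, larger struct can read an older, smaller item and see zeros in the fields the item did not store.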
Signed-off-by: Zach Brown <zab@versity.com>
We were using a seqcount to protect high frequency reads and writes to
some of our private inode fields. The writers were serialized by the
caller but that's a bit too easy to get wrong. We're already storing
the write seqcount update so the additional internal spinlock stores in
seqlocks aren't a significant additional overhead. The seqlocks also
handle preemption for us.
Signed-off-by: Zach Brown <zab@versity.com>
Definitions in forest.h use lock pointers. Pre-declare the struct so it
doesn't break inclusion without lock.h, following current practice in
the header.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_file_write_iter tried to track written bytes and return those
unless there was an error. But written was uninitialized if we got
errors in any of the calls leading up to performing the write. The
bytes written were also not being passed to the generic_write_sync
helper. This fixes up all those inconsistencies and makes it look like
the write_iter path in other filesystems.
Signed-off-by: Zach Brown <zab@versity.com>
When we write to file contents we change the data_version. To stage old
contents into an offline region the data_version of the file must match
the archived copy. When writing we have to make sure that there is no
offline data so that we don't increase the data_version which will
prevent staging of any other file regions because the data_versions no
longer match.
scoutfs_file_write_iter was only checking for offline data in its write
region, not the entire file. Fix it to match the _aio_write method and
check the whole file.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_data_wait_check_iter() was checking the contiguous region of the
file starting at its pos and extending for iter_iov_count() bytes. The
caller can do that with the previous _data_wait_check() method by
providing the same count that _check_iter() was using.
Signed-off-by: Zach Brown <zab@versity.com>
The item cache has a few safety checks that make sure that an
operation is performed while holding a lock that covers the item. It
dumped a stack trace via WARN when that wasn't true, but it didn't
include any details about the keys or lock modes involved.
This adds a message that's printed once which includes the keys and
modes when an operation is attempted that isn't protected.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_item_create() was checking that its lock had a read mode, when
it should have been checking for a write mode. This worked out because
callers with write mode locks are also protecting reads.
Signed-off-by: Zach Brown <zab@versity.com>
Unlink looks up the entry items for the name it is removing because we
no longer store the extra key material in dentries. If this lookup
fails it will use an error path which releases a transaction which wasn't
held. Thankfully this error path is unlikely (corruption or systemic
errors like EIO or ENOMEM) so we haven't hit this in practice.
Signed-off-by: Zach Brown <zab@versity.com>
Block reads can return ESTALE naturally as mounts read through old
cached blocks. We won't always log it as an error but we should add a
tracepoint that can be inspected.
Signed-off-by: Zach Brown <zab@versity.com>
This addresses some minor issues with how we drive the weak-modules
infrastructure, which handles running on kernels that the module was
not explicitly built for.
For one, we now drive weak-modules at install-time more explicitly (it
was adding symlinks for all modules into the right place for the running
kernel, whereas now it only handles that for scoutfs against all
installed kernels).
Also we no longer leave stale modules on the filesystem after an
uninstall/upgrade, similar to what's done for vsm's kmods right now.
RPM's pre/postinstall scriptlets are used to drive weak-modules to clean
things up.
Note that this (intentionally) does not (re)generate initrds of any
kind.
Finally, this was tested on both the native kernel version and on
updates that would need the migrated modules. As a result, installs are
a little quicker, the module still gets migrated successfully, and
uninstalls correctly remove (only) the packaged module.
server_log_merge_free_work() is responsible for freeing all the input
log trees for a log merge operation that has finished. It looks for the
next item to free, frees the log btree it references, and then deletes
the item. It was doing this with a full server commit for each item
which can take an agonizingly long time.
This changes it to perform multiple deletions in a commit as long as
there's plenty of alloc space. The moment the commit gets low it
applies the commit and opens a new one. This sped up the deletion of a
few hundred thousand log tree items from taking hours to seconds.
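The batching loop has roughly the following shape. struct demo_commit, the budget numbers, and the helper names are hypothetical; the real code works with server commits and allocator state rather than a simple counter.

```c
#include <stddef.h>

/* Made-up units: budget per commit, low-water mark, and cost per item. */
#define DEMO_COMMIT_BUDGET	10
#define DEMO_LOW_WATER		3
#define DEMO_COST_PER_ITEM	2

/* Hypothetical commit state. */
struct demo_commit {
	unsigned int avail;		/* alloc space left in the open commit */
	unsigned int nr_applied;	/* commits applied so far */
};

static void demo_open_commit(struct demo_commit *c)
{
	c->avail = DEMO_COMMIT_BUDGET;
}

static void demo_apply_commit(struct demo_commit *c)
{
	c->nr_applied++;
}

/* Free nr_items items, sharing each commit between as many as will fit. */
static unsigned int demo_free_items(struct demo_commit *c, unsigned int nr_items)
{
	unsigned int freed = 0;

	if (nr_items == 0)
		return 0;

	demo_open_commit(c);
	while (freed < nr_items) {
		/* the commit is getting low: apply it and open a new one */
		if (c->avail < DEMO_LOW_WATER) {
			demo_apply_commit(c);
			demo_open_commit(c);
		}
		c->avail -= DEMO_COST_PER_ITEM;	/* free the btree, delete the item */
		freed++;
	}
	demo_apply_commit(c);

	return freed;
}
```

With these made-up numbers ten items take three commits, where one commit per item would have taken ten.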
Signed-off-by: Zach Brown <zab@versity.com>
The btree_merge code was pinning leaf blocks for all input btrees as it
iterated over them. This doesn't work when there are a very large
number of input btrees. It can run out of memory trying to hold a
reference to a 64KiB leaf block for each input root.
This reworks the btree merging code. It reads a window of blocks from
all input trees to get a set of merged items. It can take multiple
passes to complete the merge but by setting the merge window large
enough this overhead is reduced. Merging now consumes a fixed amount of
memory rather than using memory proportional to the number of input
btrees.
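The idea of trading extra passes for bounded memory can be illustrated with sorted integer arrays in place of btrees. This is a simplified stand-in and not the algorithm in the btree code: each pass keeps only the DEMO_WINDOW smallest keys past the last emitted key, and duplicate keys simply collapse to one. All names here are hypothetical.

```c
#include <stddef.h>

#define DEMO_WINDOW 4	/* fixed number of keys held per pass */

struct demo_input {
	const int *keys;	/* sorted ascending */
	size_t nr;
};

/* Insert key into the sorted window, keeping the smallest DEMO_WINDOW keys. */
static void demo_window_insert(int *win, size_t *nr, int key)
{
	size_t i, j;

	for (i = 0; i < *nr && win[i] < key; i++)
		;
	if (i < *nr && win[i] == key)
		return;			/* duplicate key, first one wins */
	if (i == DEMO_WINDOW)
		return;			/* beyond a full window */
	if (*nr < DEMO_WINDOW)
		(*nr)++;
	/* shift larger keys up, dropping the largest if the window was full */
	for (j = *nr - 1; j > i; j--)
		win[j] = win[j - 1];
	win[i] = key;
}

/* Merge all inputs into out[] using one fixed-size window per pass. */
static size_t demo_merge(const struct demo_input *in, size_t nr_in,
			 int *out, size_t out_max)
{
	int win[DEMO_WINDOW];
	size_t nr_out = 0, nr_win, i, k;
	int have_last = 0, last = 0;

	for (;;) {
		nr_win = 0;
		for (i = 0; i < nr_in; i++) {
			for (k = 0; k < in[i].nr; k++) {
				int key = in[i].keys[k];

				if (have_last && key <= last)
					continue;	/* emitted by an earlier pass */
				if (nr_win == DEMO_WINDOW && key >= win[nr_win - 1])
					break;		/* rest is beyond the window */
				demo_window_insert(win, &nr_win, key);
			}
		}
		if (nr_win == 0)
			break;
		for (k = 0; k < nr_win && nr_out < out_max; k++)
			out[nr_out++] = win[k];
		if (nr_out == out_max)
			break;
		last = win[nr_win - 1];
		have_last = 1;
	}
	return nr_out;
}
```

The window size does not depend on the number of inputs, so memory stays fixed while a larger window means fewer passes.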
Signed-off-by: Zach Brown <zab@versity.com>