Commit Graph

1338 Commits

Author SHA1 Message Date
Zach Brown
8c1a45c9f5 Use bools instead of weird addition as or in net
When freeing acked reesponses in the net layer we sweep the send and
resend queues looking for queued responses up to the sequence number
we've had acked.  The code that did this used a weird pattern of
returning ints and adding them which gave me pause.  Clean it up to use
bools and or (not short-circuiting ||) to more obviously communicate
what's going on.

Signed-off-by: Zach Brown <zab@versity.com>
2024-10-30 13:38:12 -07:00
Zach Brown
5a6eb569f3 Add some lock debugging trace fields
Over time some fields have been added to the lock struct which haven't
been added to the lock tracing output.  Add some of the more relevant
lock fields to tracing.

Signed-off-by: Zach Brown <zab@versity.com>
2024-10-30 13:16:04 -07:00
Zach Brown
69d9040e68 Close lock server use-after-free race
Lock object lifetimes in the lock server are protected by reference
counts.  References are acquired while holding a lock on an rbtree.

Unfortunately, the decision to free lock objects wasn't tested while
also holding that lock on the rbtree.  A caller putting their object
would test the refcount, then wait to get the rbtree lock to remove it
from the tree.

There's a possible race where the decision is made to remove the object
but another reference is added before the object is removed.  This was
seen in testing and manifest as an incoming request handling path adding
a request message to the object before it is freed, losing the message.
Clients would then hang on a lock that never saw a response because
their request was freed with the lock object.

The fix is to hold the rbtree lock when testing the refcount and
deciding to free.  It adds a bit more contention but not significantly
so, given the wild existing contention on a per-fs spinlocked rbtree.

Signed-off-by: Zach Brown <zab@versity.com>
2024-10-30 13:04:13 -07:00
Greg Cymbalski
70c36ae394 Generate debug packages
We had previously explicitly disabled this; let's start generating them.
2024-10-24 14:56:09 -07:00
Auke Kok
235ab133a7 We must provide a_ops->dirty_folio and invalidate_folio.
In v5.17-rc4-53-g3a3bae50af5d, we can no longer omit having this
method unhooked as the mm caller blindly calls it now. In-kernel
filesystems all were fixed in this change.

aops->invalidatepage was the old aops method that would free pages
with private attached data. This method is replaced with the
new invalidate_folio method. If this method is NULL, the memory
will become orphaned. (v5.17-rc4-29-gf50015a596fa)

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 15:38:34 -07:00
Auke Kok
7d0e7e29f8 Avoid integer wrapping pitfalls for (off, len) pairs.
We use check_add_overflow(a, b, d) here to validate that (off, len)
pairs do not exceed the max value type. The kernel conveniently has
several macros to sort out the problems with signed or unsigned types.

However, we're not interested in purely seeing whether (a + b)
overflows, because we're using this for (off, len) overflow checks,
where the bytes we read are from 0 to len -1. We must therefore call
this check with (b) being "len - 1".

I've made sure that we don't accidentally fail when (len == 0)
in all cases by making sure we've already checked this condition
before, and moving code around as needed to ensure that (len > 0)
in all cases where we check.

The macro check_add_overflow requires a (d) argument in which
temporarily the result of the addition is stored and then checked to see
if an overflow occurred. We put a `tmp` variable on the stack of the
correct type as needed to make the checks function.

simple-release-extents test mistakenly relied on this buggy wrap code,
so it needs fixing. The move-blocks test also got it wrong.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
69de6d7a74 Check for zero len in scoutfs_data_wait_check
We consistently enter scoutfs_data_wait_check when len == 0 from
scoutfs_aio_write() which directly passes the i_size_read() value,
and for cases where we `echo >> $FILE` this is always reached.

This can cause the wrapping check to fail since `0 + (0 - 1) < 0` which
triggers the WARN_ON_ONCE wrap check that needs updating to allow
certain operations on huge files.

More importantly we can just omit all these checks if `len == 0` anyway,
since they should always succeed and never should require taking all the
locks.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
c298360a49 blkdev api changes - pass holder to replace FMODE_EXCL
Passing a holder ptr to these functions now replaces the FMODE_EXCL
flag. _put no longer needs flags for this reason, but the holder
instead.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
95f4e56546 Introduce blk_mode_t instead of abuse of fmode_t
v6.4-rc2-198-g05bdb9965305 adds a new type for passing flags instead
of abusing fmode_t flags. They are essentially the same flags just
in a new type.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
d5c2768f04 .tmpfile method now passed a struct file, which must be opened.
v6.0-rc6-9-g863f144f12ad changes the VFS method to pass in a struct
file and not a dentry in preperation for tmpfile support in fuse.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
676d429264 Assume el9 is the same as el8 for rpmbuild purposes.
The current spec template can't handle future major el releases
gracefully and fails to build entirely. We isolate all changes
so that they are either "el7 specific" or generic. This rids us
entirely of el8 specific conditionals.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
5b260e6b54 block_write_begin() no longer is being passed aop_flags.
The flag is now obsolete, we don't need to set flags here or
pass them.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
e2b06f2c92 mpage_readpage() is now replaced with mpage_read_folio.
Folios are the new data types used for passing pages. For now,
folios only appear to have a single page. Future kernels will
change that.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
546b437df7 Shrinkers are now registered with a name.
v5.19-rc4-52-ge33c267ab70d Adds shrinker names to the registration
call to aid with shrinker debugging, which is highly opaque.

To enable you'll have to recompile the kernel with
CONFIG_SHRINKER_DEBUG=y though, since it's disabled by default in
OSV kernels.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
381f4543b7 Use iter based read/write to support splice and thus sendfile().
The iter based read/write calls can support splice in el9 if we
hook up these calls, otherwise splice will stop working.

->write() similar to: v3.15-rc4-330-g8d0207652cbe. ->read() to
generic implementation.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
418a441604 kernel_setsockopt no longer available.
We instead opt to use sock_setsockopt which is generally exactly the
same and can be easily converted to map to kernel_setsockopt without
impacting the code significantly.

There are 3 methods we're calling with usec timeval's, and that is
significantly different now that this requires a bit more compat code
so we split these out to separate compat functions to handle them.

Some of the TCP sock functions also have a slightly different signature
that we want to split them out (struct socket vs. sock). Some further
no longer return success, either.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
f3abf9710b generic_perform_write signature changed
It now only needs the iocb and no longer the flip.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
8a45c2baff Deprecate struct timeval.
We switch to using 64bit usec structs and recommended replacement
functions from Documentation/core-api/timekeeping.rst.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
345ebd0876 fiemap_prep replaces fiemap_check_flags.
v5.7-rc4-53-gcddf8a2c4a82

The prep helper replaces the sanity checks.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
b718cf09de Handle idmapped mounts in xattr_handler
In v5.11-rc4-8-ge65ce2a50cf6 the *set handler is passed a
user_namespace struct pointing to the map from the mount.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
e4721366ff Added user_ns argument to posix_acl_update_mode, set_posix_acl
v5.11-rc4-8-ge65ce2a50cf6 adds idmap support to these calls.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
4ef64c6fcf Vfs methods become user namespace mount aware.
v5.11-rc4-24-g549c7297717c

All of these VFS methods are now passed a user_namespace.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
2d58ee2a37 Account for new bio_alloc() args.
Block device and opf are now passed through and set. We mimic compat
code to do the same.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
1f0dd7f025 __vmalloc defaults to PAGE_KERNEL everywhere, so the arg was removed.
v5.7-523-g88dca4ca5a93

__vmalloc no longer has the 3rd argument.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
077468ac1e debugfs_create_atomic_t now returns void, don't check result
Greg KH tells us to do just this in v5.4-rc5-31-g9927c6fa3e1d:

	No one checks the return value of debugfs_create_atomic_t(),
	as it's not needed, so make the return value void, so that no
	one tries to do so in the future.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
c951713ab2 list_cmp_func_t introduced, using const.
v5.12-rc6-9-g4f0f586bf0c8

All list_sort functions use the list_cmp_func_t type, which compares
list_head member types. These are now required to be `const` as the
compiler will now check them. This propagates into our callers.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
ad82a5e52a Squelch warning from bpf_iter.c.
v5.7-rc2-1174-gfd4f12bc38c3 significantly rewrites the bpf iterator
which hits this _next() function. It also adds a check that verifies
that the *pos is incremented after every call, even if it goes beyond
the last member (in which case it's not used).

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
d3c5328909 setattr_prepare no longer extern in fs.h
v5.11-rc4-7-g2f221d6f7b88 Changes setattr_prepare from an extern
to plain int. There's no impact further to the compat to keep it
working except for the detection regex.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
c30172210f Use blk_opf_t to pass bio op flags
Compat is back to unsigned int.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
19af6e28fb "unaligned/access_ok.h" is not needed, and removed.
Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
8885486bc8 Add several low level includes.
Newer kernels include less header dependencies by default, so we have
to add these.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
0204e092e4 FIELD_SIZEOF was deprecated.
We could use sizeof_field as a direct replacement (which is the same)
except that this entire thing can directly use offsetofend().

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-10-03 12:41:05 -07:00
Auke Kok
b45fbe0bbb Don't pass data version to attr_x unless the ioctl means to set it.
The wrapper in setattr_more that translates the operations to attr_x
needs to decide whether to ask attr_x to perform a change to any of
the fields passed to it or not. For the date and size fields this
is implicit - we always tell attr_x to change them. For any of the
other fields, it should be explicit.

The only field that is in the struct that this applies to is
data_version. Because the data version field by default is zero,
we use that as condition to decide whether to pass the data_version
down to attr_x.

Previously, the code would always pass a data_version=0 down to attr_x,
triggering one of the validity checks, making it return -EINVAL. We
add a simple test case to test for this issue.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-09-27 19:31:22 -04:00
Greg Cymbalski
4dde57dc27 Rely on $PATH for weak-modules
This avoids having to deal with EL-specific path differences for
the weak-modules script.
2024-09-27 12:30:25 -07:00
Auke Kok
fb93d82b1e Add shrinker counters for wkic and quota_info.
These new shrinkers were recently added. Because there's very little
ways to debug them, or even see them properly function, we should at
least add counters for them.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-09-18 13:40:54 -04:00
Auke Kok
ccd65b9a61 Fix POSIX ACL use in el8+.
In 29160b0b I mistakenly disabled all caching of ACLs for el8
instead of only disabling cache lookups. The correct change
should have been to disable cache lookups only, and leave setting the
acl cache after storing or fetching, as the kernel needs this data
to resolve acls when doing permission checks.

Restore the acl cache insertions fixes.

Signed-off-by: Auke Kok <auke.kok@versity.com>
2024-08-09 17:57:23 -04:00
Zach Brown
38c6d66ffc Add indx xattr tag support
Add support for the indx xattr tag which lets xattrs determine the sort
order of by their inode number in a global index.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
38e6f11ee4 Add quota support
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
4a8240748e Add project ID support
Add support for project IDs.  They're managed through the _attr_x
interfaces and are inherited from the parent directory during creation.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
9c45e8b7ef read_xattr_totls ioctl uses weak item cache
Change the read_xattr_totls ioctl to use the weak item cache instead of
manually reading and merging the fs items for the xattr totals on every
call.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
ee9e8c3e1a Extract .totl. item merging into own functions
The _READ_XATTR_TOTALS ioctl had manual code for merging the .totl.
total and value while reading fs items.  We're going to want to do this
in another reader so let's put these in their own funcions that clearly
isolate the logic of merging the fs items into a coherent result.

We can get rid of some of the totl_read_ counters that tracked which
items we were merging.  They weren't adding much value and conflated the
reading ioctl interface with the merging logic.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
5f156b7a36 Add scoutfs_forest_read_items_roots
Add a forest item reading interface that lets the caller specify the net
roots instead of always getting them from a network request.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
3a51ca369b Add the weak item cache
Add the weak item cache that is used for reads that can handle results
being a little behind.  This gives us a lot more freedom to implement
the cache that biases concurrent reads.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
fb5331a1d9 Add inode retention bit
Add a bit to the private scoutfs inode flags which indicates that the
inode is in retention mode.  The bit is visible through the _attr_x
interface.  It can only be set on regular files and when set it prevents
modification to all but non-user xattrs.  It can be cleared by root.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 15:09:05 -07:00
Zach Brown
270726a6ea Implement stat_more and setattr_more with attr_x
Now that we have the attr_x calls we can implement stat_more with
get_attr_x and setattr_more with set_attr_x.

The conversion of stat_more fixes a surprising consistency bug.
stat_more wasn't acquiring a cluster lock for the inode nore refreshing
it so it could have returned stale data if modifications were made in
another mount.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
6a99ca9ede Add attr_x core and ioctls
The existing stat_more and setattr_more interfaces aren't extensible.
This solves that problem by adding attribute interfaces which specify
the specific fields to work with.

We're about to add a few more inode fields and it makes sense to add
them to this extensible structure rather than adding more ioctls or
relatively clumsy xattrs.  This is modeled loosely on the upstream
kernel's statx support.

The ioctl entry points call core functions so that we can also implement
the existing stat_more and setattr_more interfaces in terms of these new
attr_x functions.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
0521bd0e6b Make offline extent creation use one transaction
Initially setattr_more followed the general pattern where extent
manipulation might require multiple transactions if there are lots of
extent items to work with.   The scoutfs_data_init_offline_extent()
function that creates an offline extent handled transactions itself.

But in this case the call only supports adding a single offline extent.
It will always use a small fixed amount of metadata and could be
combined with other metadata changes in one atomic transaction.

This changes scoutfs_data_init_offline_extent() to have the caller
handle transactions, inode updates, etc.  This lets the caller perform
all the restore changes in one transaction.  This interface change will
then be used as we add another caller that adds a single offline extent
in the same way.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Zach Brown
361491846d Add scoutfs_fmt_vers_unsupported()
Add a little inline helper to test whether the mounted format version
supports a feature or not, returning an errno that callers can use when
they can return a shared expected error.

Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00
Bryant G. Duffy-Ly
9ba4271c26 Add new max format version of 2
We're about to add new format structures so increment the max version to
2.  Future commits will add the features before we release version 2 in
the wild.

Signed-off-by: Zach Brown <zab@zabbo.net>
2024-06-28 14:53:49 -07:00
Bryant G. Duffy-Ly
90cfaf17d1 Initial support for different inode sizes
We're about to increase the inode size and increment the format version.
Inode reading and writing has to handle different valid inode sizes as
allowed by the format version.   This is the initial skeletal work that
later patches which really increase the inode size will further refine
to add the specific known sizes and format versions.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
[zab@versity.com: reworded description, reworked to use _within]
Signed-off-by: Zach Brown <zab@versity.com>
2024-06-28 14:53:49 -07:00