We're about to add new format structures, so increment the max version
to 2. Future commits will add the features before we release version 2
into the wild.
Signed-off-by: Zach Brown <zab@zabbo.net>
We're about to increase the inode size and increment the format version.
Inode reading and writing has to handle different valid inode sizes as
allowed by the format version. This is the initial skeletal work that
later patches which really increase the inode size will further refine
to add the specific known sizes and format versions.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
[zab@versity.com: reworded description, reworked to use _within]
Signed-off-by: Zach Brown <zab@versity.com>
Add a lookup variant that returns an error if the item value is larger
than the caller's value buffer size and which zeros the rest of the
caller's buffer if the returned value is smaller.
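
As a rough illustration of those semantics, here is a minimal userspace
sketch. The name copy_value_within(), the -EIO error code, and returning
the copied length are assumptions made for the example, not the scoutfs
interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy an item value into the caller's buffer only
 * if it fits, and zero whatever part of the buffer the value didn't fill.
 * Returns the value length, or a negative errno (assumed -EIO here) if
 * the value is larger than the buffer.
 */
int copy_value_within(void *buf, size_t buf_len,
		      const void *val, size_t val_len)
{
	assert(buf != NULL);
	assert(val != NULL || val_len == 0);

	if (val_len > buf_len)
		return -EIO;

	if (val_len > 0)
		memcpy(buf, val, val_len);
	/* Zero the tail so callers never see stale bytes past the value. */
	memset((char *)buf + val_len, 0, buf_len - val_len);

	return (int)val_len;
}
```

A caller with a struct sized for the newest format could then read an
older, smaller value and see zeros in the fields the old value lacked.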
Signed-off-by: Zach Brown <zab@versity.com>
We were using a seqcount to protect high frequency reads and writes to
some of our private inode fields. The writers were serialized by the
caller but that's a bit too easy to get wrong. We're already storing
the write seqcount update so the additional internal spinlock stores in
seqlocks aren't a significant additional overhead. The seqlocks also
handle preemption for us.
Signed-off-by: Zach Brown <zab@versity.com>
Don't let change-format-version decrease the format version. It doesn't
have the machinery to go back and migrate newer structures to older
structures that would be compatible with code expecting the older
version.
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
[zab@versity.com: split from initial patch with other changes]
Signed-off-by: Zach Brown <zab@versity.com>
Definitions in forest.h use lock pointers. Pre-declare the struct so it
doesn't break inclusion without lock.h, following current practice in
the header.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_file_write_iter tried to track written bytes and return those
unless there was an error. But written was uninitialized if we got
errors in any of the calls leading up to performing the write. The
bytes written were also not being passed to the generic_write_sync
helper. This fixes up all those inconsistencies and makes it look like
the write_iter path in other filesystems.
Signed-off-by: Zach Brown <zab@versity.com>
When we write to file contents we change the data_version. To stage old
contents into an offline region the data_version of the file must match
the archived copy. When writing we have to make sure that there is no
offline data so that we don't increase the data_version which will
prevent staging of any other file regions because the data_versions no
longer match.
scoutfs_file_write_iter was only checking for offline data in its write
region, not the entire file. Fix it to match the _aio_write method and
check the whole file.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_data_wait_check_iter() was checking the contiguous region of the
file starting at its pos and extending for iter_iov_count() bytes. The
caller can do that with the previous _data_wait_check() method by
providing the same count that _check_iter() was using.
Signed-off-by: Zach Brown <zab@versity.com>
The item cache has a few safety checks that make sure that an
operation is performed while holding a lock that covers the item. It
dumped a stack trace via WARN when that wasn't true, but it didn't
include any details about the keys or lock modes involved.
This adds a message that's printed once which includes the keys and
modes when an operation is attempted that isn't protected.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs_item_create() was checking that its lock had a read mode, when
it should have been checking for a write mode. This worked out because
callers with write mode locks are also protecting reads.
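
A minimal sketch of the kind of check involved, using hypothetical mode
names and a hypothetical helper rather than the real scoutfs lock
definitions. It shows why the wrong check went unnoticed: a write mode
lock satisfies a read mode check as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock modes for illustration only. */
enum lock_mode { MODE_READ, MODE_WRITE };

/*
 * Return true if a lock held in 'held' mode protects an operation that
 * needs 'needed' mode.  Write mode also protects reads.
 */
bool mode_protects(enum lock_mode held, enum lock_mode needed)
{
	assert(held == MODE_READ || held == MODE_WRITE);

	if (needed == MODE_WRITE)
		return held == MODE_WRITE;

	/* Reads are covered by either mode. */
	return true;
}
```

Under these assumptions an item create should ask for MODE_WRITE, which
a caller holding only a read lock would fail.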
Signed-off-by: Zach Brown <zab@versity.com>
Unlink looks up the entry items for the name it is removing because we
no longer store the extra key material in dentries. If this lookup
fails it will use an error path which releases a transaction which
wasn't held. Thankfully this error path is unlikely (corruption or
systemic errors like EIO or ENOMEM) so we haven't hit this in practice.
Signed-off-by: Zach Brown <zab@versity.com>
When we added the crtime creation timestamp to the inode we forgot to
update mkfs to set the crtime of the root inode.
Signed-off-by: Zach Brown <zab@versity.com>
Block reads can return ESTALE naturally as mounts read through old
cached blocks. We won't always log it as an error but we should add a
tracepoint that can be inspected.
Signed-off-by: Zach Brown <zab@versity.com>
This addresses some minor issues with how we drive the weak-modules
infrastructure, which handles running on kernels that the module wasn't
explicitly built for.
For one, we now drive weak-modules at install-time more explicitly (it
was adding symlinks for all modules into the right place for the running
kernel, whereas now it only handles that for scoutfs against all
installed kernels).
Also we no longer leave stale modules on the filesystem after an
uninstall/upgrade, similar to what's done for vsm's kmods right now.
RPM's pre/postinstall scriptlets are used to drive weak-modules to clean
things up.
Note that this (intentionally) does not (re)generate initrds of any
kind.
Finally, this was tested on both the native kernel version and on
updates that would need the migrated modules. As a result, installs are
a little quicker, the module still gets migrated successfully, and
uninstalls correctly remove (only) the packaged module.
server_log_merge_free_work() is responsible for freeing all the input
log trees for a log merge operation that has finished. It looks for the
next item to free, frees the log btree it references, and then deletes
the item. It was doing this with a full server commit for each item
which can take an agonizingly long time.
This changes it to perform multiple deletions in a commit as long as
there's plenty of alloc space. The moment the commit gets low it
applies the commit and opens a new one. This sped up the deletion of a
few hundred thousand log tree items from taking hours to seconds.
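
The batching can be sketched as a small userspace model. The function
name, the budget, cost, and low water parameters are all hypothetical
stand-ins for the server's alloc space accounting, not the actual code.

```c
#include <assert.h>

/*
 * Model deleting nr_items items where each deletion consumes 'cost'
 * units of a commit's 'budget'.  The commit is applied and a new one is
 * opened as soon as the remaining space drops below 'low_water'.
 * Returns the number of commits that were applied.
 */
unsigned int free_items_batched(unsigned int nr_items, unsigned int budget,
				unsigned int cost, unsigned int low_water)
{
	unsigned int avail = budget;	/* space left in the open commit */
	unsigned int dirty = 0;		/* deletions not yet committed */
	unsigned int commits = 0;
	unsigned int i;

	/* low_water >= cost guarantees a deletion always fits. */
	assert(cost > 0 && low_water >= cost && budget >= low_water);

	for (i = 0; i < nr_items; i++) {
		avail -= cost;
		dirty++;

		if (avail < low_water) {
			/* Commit got low: apply it and open a new one. */
			commits++;
			avail = budget;
			dirty = 0;
		}
	}

	/* Apply the final partial commit, if it did any work. */
	if (dirty > 0)
		commits++;

	return commits;
}
```

With a budget of 100, a cost of 10, and a low water mark of 20, this
model fits nine deletions in each commit instead of one.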
Signed-off-by: Zach Brown <zab@versity.com>
The btree_merge code was pinning leaf blocks for all input btrees as it
iterated over them. This doesn't work when there are a very large
number of input btrees. It can run out of memory trying to hold a
reference to a 64KiB leaf block for each input root.
This reworks the btree merging code. It reads a window of blocks from
all input trees to get a set of merged items. It can take multiple
passes to complete the merge but by setting the merge window large
enough this overhead is reduced. Merging now consumes a fixed amount of
memory rather than using memory proportional to the number of input
btrees.
Signed-off-by: Zach Brown <zab@versity.com>
Add a mount option for the amount of time that log merge creation can
wait before giving up. We add some counters so we can see how often
the timeout is being hit and what the average successful wait time is.
Signed-off-by: Zach Brown <zab@versity.com>
The server sends sync requests to clients when it sees that they have
open log trees that need to be committed for log merging to proceed.
These are currently sent in the context of each client's get_log_trees
request, resulting in sync requests queued for one client from all
clients. Depending on message delivery and commit latencies, this can
create a sync storm.
The server's sends are reliable and the open commits are marked with the
seq when they opened. It's easy for us to record having sent syncs to
all open commits so that future attempts can be avoided. Later open
commits will have higher seqs and will get a new round of syncs sent.
Signed-off-by: Zach Brown <zab@versity.com>
The server was checking all client log_trees items to search for the
lowest commit seq that was still open. This can be expensive when there
are a lot of finalized log_trees items that won't have open seqs. Only
the last log_trees item for each client rid can be open, and the items
are sorted by rid and nr, so we can easily only check the last item for
each client rid.
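
The idea can be sketched over a sorted array standing in for the btree.
The struct layout, the use of 0 to mean "no open seq", and the function
names are assumptions for the example. The binary search plays the role
of seeking directly to the last item for a rid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical item: sorted by (rid, nr); open_seq 0 means finalized. */
struct lt_item {
	uint64_t rid;
	uint64_t nr;
	uint64_t open_seq;
};

/* Find the index of the last item that shares items[start].rid. */
size_t last_for_rid(const struct lt_item *items, size_t start, size_t n)
{
	uint64_t rid = items[start].rid;
	size_t lo = start;	/* items[lo].rid == rid */
	size_t hi = n;		/* everything at or past hi has a larger rid */

	while (hi - lo > 1) {
		size_t mid = lo + (hi - lo) / 2;

		if (items[mid].rid == rid)
			lo = mid;
		else
			hi = mid;
	}
	return lo;
}

/* Return the lowest open seq, checking only each rid's last item. */
uint64_t lowest_open_seq(const struct lt_item *items, size_t n)
{
	uint64_t lowest = 0;
	size_t i = 0;

	assert(items != NULL || n == 0);

	while (i < n) {
		size_t last = last_for_rid(items, i, n);
		uint64_t seq = items[last].open_seq;

		if (seq != 0 && (lowest == 0 || seq < lowest))
			lowest = seq;
		i = last + 1;	/* skip straight to the next rid */
	}
	return lowest;
}
```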
Signed-off-by: Zach Brown <zab@versity.com>
During get_log_trees the server checks log_trees items to see if it
should start a log merge operation. It did this by iterating over all
log_trees items and there can be quite a lot of them.
It doesn't need to see all of the items. It only needs to see the most
recent log_trees item for each mount. That's enough to make the
decisions that start the log merging process.
Signed-off-by: Zach Brown <zab@versity.com>
KASAN could raise a spurious warning if the unwinder started in code
without ORC metadata and tried to access the KASAN stack frame
redzones. This was fixed upstream but we can rarely see it in older
kernels. We can ignore these messages.
Signed-off-by: Zach Brown <zab@versity.com>
This test is trying to make sure that concurrent work isn't much, much
slower than individual work. It does this by timing creating a bunch of
files in a dir on a mount and then timing doing the same in two mounts
concurrently. But it messed up the concurrency pretty badly.
It had the concurrent createmany tasks creating files with a full path.
That means that every create is trying to read all the parent
directories. The way inode number allocation works means that one of
the mounts is likely to be getting a write lock that includes a shared
parent. This created a ton of cluster lock contention between the two
tasks.
Then it didn't sync the creates between phases. It could be
accidentally recording the time it took to write out the dirty single
creates as time taken during the parallel creates.
By syncing between phases and having the createmany tasks create files
relative to their per-mount directories we actually perform concurrent
work and test that we're not creating contention outside of the task
load.
This became a problem as we switched from loopback devices to device
mapper devices. The loopback writers were using buffered writes so we
were masking the io cost of constantly invalidating and refilling the
item cache by turning the reads into memory copies out of the page
cache.
While we're in here we actually clean up the created files and then use
t_fail to fail the test while the files still exist so they can be
examined.
Signed-off-by: Zach Brown <zab@versity.com>
Now that we're not setting up per-mount loopback devices we can not have
the loop module loaded until tests are running.
Signed-off-by: Zach Brown <zab@versity.com>
We don't directly mount the underlying devices for each mount because
the kernel notices multiple mounts and doesn't set up a new super block
for each.
Previously the script used loopback devices to create the local shared
block construct because it was easy. This introduced corruption of
blocks that saw concurrent read and write IOs. The buffered kernel file
IO paths that loopback eventually degrades into by default (via splice)
could have buffered readers copying out of pages without the page lock
while writers modified the page. This manifested as occasional crc
failures of blocks that we knowingly issue concurrent reads and writes to
from multiple mounts (the quorum and super blocks).
This changes the script to use device-mapper linear passthrough devices.
Their IOs don't hit a caching layer and don't provide an opportunity to
corrupt blocks.
Signed-off-by: Zach Brown <zab@versity.com>
Our large fragmented free test creates pathologically fragmented file extents which
are as expensive as possible to free. We know that debugging kernels
can take a long time to do this so we can extend the hung task timeout.
Signed-off-by: Zach Brown <zab@versity.com>
One of the phases of this test wanted to delete files but got the glob
quoting wrong. This didn't matter for the original test but when we
changed the test to use its own xattr name then those existing undeleted
files got confused with other files in later phases of the test.
This changes the test to delete the files with a more reliable find
pattern instead of using shell glob expansion.
Signed-off-by: Zach Brown <zab@versity.com>
Previously the bulk_create_paths test tool used the same xattr name for
each category of xattrs it was creating.
This created a problem where two tests got their xattrs confused with
each other. The first test created a bunch of srch xattrs, failed, and
didn't clean up after itself. The second test saw these search xattrs
as its own and got very confused when there were far more srch xattrs
than it thought it had created.
This lets each test specify the srch xattr names that are created by
bulk_create_paths so that tests can work with their xattrs independent
of each other.
Signed-off-by: Zach Brown <zab@versity.com>
We just added a test to try and get srch compaction stuck by having an
input file continue at a specific offset. To exercise the bug the test
needs to perform 6 compactions. It needs to merge 4 sets of logs into 4
sorted files, it needs to make partial progress merging those 4 sorted
files into another file, and then finally attempt to continue compacting
from the partial progress offset.
The first version of the test didn't necessarily ensure that these
compactions happened. It created far too many log files then just
waited for time to pass. If the host was slow then the mounts may not
make it through the initial logs to try and compact the sorted files.
The triggers wouldn't fire and the test would fail.
These changes much more carefully orchestrate and watch the various
steps of compaction to make sure that we trigger the bug.
Signed-off-by: Zach Brown <zab@versity.com>