Commit Graph

1544 Commits

Author SHA1 Message Date
Zach Brown
285b68879a Set quorum config ver to 1 in mkfs and print
We're adding a command to change the quorum config which updates its
version number.  Let's make the version a little more visible and start
it at the more humane 1.

Signed-off-by: Zach Brown <zab@versity.com>
2021-11-24 15:41:04 -08:00
Zach Brown
1ac3efe701 Add meta_super_in_use utils helper
Move the code that checks that the super is in use from
change-format-version into its own function in util.c.   We'll use it in
an upcoming command to change the quorum config.

Signed-off-by: Zach Brown <zab@versity.com>
2021-11-24 15:40:25 -08:00
Zach Brown
ce76682db7 Make mkfs quorum helpers available
Move functions for printing and validating the quorum config from mkfs.c
to quorum.c so that they can be used in an upcoming command to change
the quorum config.

Signed-off-by: Zach Brown <zab@versity.com>
2021-11-24 13:44:51 -08:00
Zach Brown
686f8515bc Fix --quorum-count typo in mkfs error message
The change from --quorum-count to --quorum-slot forgot to update a
mention of the option in an error message in mkfs when it wasn't
provided.

Signed-off-by: Zach Brown <zab@versity.com>
2021-11-24 13:44:51 -08:00
Zach Brown
93bc52cc54 Merge pull request #60 from bgly/bduffyly/block_stale_reads
Fix block-stale-read test case
2021-11-24 10:25:26 -08:00
Zach Brown
1108d1288a Merge pull request #61 from bgly/bduffyly/rename2
Add basic renameat2 syscall support
2021-11-24 10:24:23 -08:00
Bryant G. Duffy-Ly
0abcd5a004 Take generic/025/078 off expunge list adding 23/24
We want to enable the test case for:
generic/023 - tests that renameat2 syscall exists
generic/024 - renameat2 with NOREPLACE flag

Move both generic/025 and 078 to the no run list so that
we can test the output produced when unsupported flags
are passed.

Example output:
generic/025      [not run] fs doesn't support RENAME_EXCHANGE
generic/078      [not run] fs doesn't support RENAME_WHITEOUT

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-19 17:54:19 -06:00
Bryant G. Duffy-Ly
888ad8ec5c Add renameat2 unit test case
The goal of the test case is to have two mount points
with two async calls made to do renameat2. This allows
for two calls to race to call renameat2 RENAME_NOREPLACE.
When this happens you expect one of them to fail with a
-EEXIST. This would validate that the new flag works.
Essentially one of the two calls to renameat should hit the
new RENAME_NOREPLACE code and exit early.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-19 17:54:13 -06:00
Bryant G. Duffy-Ly
16ea0ef671 Add syscall wrapper for renameat2
Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-19 17:54:08 -06:00
Bryant G. Duffy-Ly
1b8e3f7c05 Add basic renameat2 syscall support
Support generic renameat2 syscall then add support for the
RENAME_NOREPLACE flag. To support the flag we need to check
the existence of both entries and return -EEXIST if the destination exists.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-19 17:54:02 -06:00
Bryant G. Duffy-Ly
3ae0ebd0d8 Fix block-stale-read test case
The current test case attempts to create a state to read
by calling setattr and getattr in an attempt to force block
cache reads. It so happens that this does not always force
cache block reads, which in rare cases causes this test case
to fail.

The new test case removes all the extra bouncing around of mount
points and we just directly call scoutfs df which will walk
everyone's allocators to summarize the block counts, which are
guaranteed to exist. Therefore, we do not have to create any sort
of state prior to trying to force a read.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-19 15:41:54 -06:00
Zach Brown
714b7f2a84 Merge pull request #54 from bgly/bduffyly/abort_conn
Fix client/server abort conn on force unmount
2021-11-09 13:29:20 -08:00
Zach Brown
945f8b4828 Merge pull request #58 from bgly/bduffyly/print_data
Fix scoutfs print <data_dev> hang
2021-11-09 09:50:14 -08:00
Zach Brown
b5ccefeeb9 Merge pull request #59 from versity/zab/v1_release_notes
Add release notes with the 1.0 GA release
v1.0
2021-11-08 16:09:42 -08:00
Zach Brown
ea08942824 Add release notes with the 1.0 GA release
Let's try maintaining release notes in a file in the repo.  There are
lots of schemes for associating commits and release notes and this seems
like the simplest place to start.

Signed-off-by: Zach Brown <zab@versity.com>
2021-11-08 14:42:33 -08:00
Bryant G. Duffy-Ly
95f2a87864 Fix scoutfs print <data_dev> hang
If a user tries to print a data device, detect that it is a
data device and exit early.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-08 16:16:13 -06:00
Bryant G. Duffy-Ly
38ee2defd5 Add a filter for forced unmount error output
[85164.299902] scoutfs f.8c19e1.r.facf2e error: server error writing btree blocks: -5
[144308.589596] scoutfs f.c9397a.r.8ae97f error: server error -5 freeing merged btree blocks: looping commit del/upd freeing item
[174646.005596] scoutfs f.15f0b3.r.1862df error: server error -5 freeing merged btree blocks: final commit del/upd freeing item
[146653.893676] scoutfs f.c7f188.r.34e23c error: server error writing super block: -5
[273218.436675] scoutfs f.dd4157.r.f0da7e error: server failed to bind to 127.0.0.1:42002, err -98
[376832.542823] scoutfs f.049985.r.1a8987 error: error -5 reading quorum block 19 to update event 1 term 3

The above is example output that will be filtered out.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-08 07:36:02 -06:00
Bryant G. Duffy-Ly
0fc8ccb122 Fix exiting out of btree_walk early for force_umnt
We do not want to short-circuit btree_walk early, it is
better to handle the force unmount on the caller side.
Therefore, remove this from btree_walk.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-05 15:21:09 -05:00
Bryant G. Duffy-Ly
e4a3c2b95d Break client/server out of waiting network replies
If there is a forced unmount we call _net_shutdown from
umount_begin in order to tell the server and clients to
break out of pending network replies. We then add the call
to abort within the shutdown_worker since most of the mucking
with send and resend queues is done there.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-05 15:21:04 -05:00
Bryant G. Duffy-Ly
cf4e6611d3 Fix inconsistency assertions at commit_log_merge
Only BUG_ON for inconsistency, not for commit errors
or failure to delete the original request.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-05 15:18:57 -05:00
Bryant G. Duffy-Ly
65429a9cc4 Ensure that writer_init and alloc_init are cleaned
In scoutfs_server_worker we do not properly clean up after
_block_writer_init and alloc_init. On error paths, if either of
those has been initialized, we can call alloc_prepare_commit or
writer_forget_all to ensure we drop the block references and
clear the dirty status of all the blocks in the writer.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-05 15:05:42 -05:00
Zach Brown
d764ed7c43 Merge pull request #57 from versity/zab/update_readme
Update README.md
2021-11-05 11:34:44 -07:00
Zach Brown
465e5ee769 Update README.md
Remove a bunch of old language from the README.  We're no longer in the
early days of the open release so we can remove all the alpha quality
language.   And the system has grown sufficiently that the repo README
isn't a great place for a small getting started doc.  There just isn't
room to do the subject justice.   If we need such a thing for the
project we'll put it as a first order doc in the repo that'd be
distributed along with everything else.

Signed-off-by: Zach Brown <zab@versity.com>
2021-11-05 11:16:57 -07:00
Bryant G. Duffy-Ly
83a6bbb640 Fix inconsistency in server_log_merge_free_work
In order to safely free blocks we need to first dirty
the work. This allows for resume later on without a double
free.

Signed-off-by: Bryant G. Duffy-Ly <bduffyly@versity.com>
2021-11-03 17:09:51 -05:00
Zach Brown
f02d68f567 Merge pull request #55 from versity/zab/v1_format_version
Zab/v1 format version
2021-11-03 10:18:50 -07:00
Zach Brown
5d6a510e25 Merge pull request #56 from versity/zab/xattr_shrink_bad_items
Fix xattr update out of bounds access
2021-11-02 10:17:06 -07:00
Zach Brown
1b4d291bf7 Fix xattr update out of bounds access
As we update xattrs we need to update any existing old items with the
contents of the new xattr that uses those items.   The loop that updated
existing items only took the old xattr size into account and assumed
that the new xattr would use those items.   If the new xattr size used
fewer parts then the attempt to update all the old parts that weren't
covered by the new size would go very wrong.   The length of the region
in the new xattr would be negative so it'd try to use the max part
length.  Worse, it'd copy these max part length regions outside the
input new xattr buffer.  Typically this would land in addressable memory
and copy garbage into the unused old items before they were later
deleted.

However, it could access so far outside the input buffer that it could
cross a page boundary into inaccessible memory and fault.  We saw this in
the field while trying to repeatedly incrementally shrink a large xattr.

This fixes the loop that updates overlapping items between the new and
old xattr to start with the smaller of their two item counts.  Now it
will only update items that are actually used by both xattrs and will
only safely access the new xattr input buffer.

Signed-off-by: Zach Brown <zab@versity.com>
2021-11-01 11:33:17 -07:00
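The fix in 1b4d291bf7 can be sketched roughly as follows, assuming an xattr value is split into fixed-size parts stored in separate items. `DEMO_PART_MAX` and `update_overlapping_parts` are hypothetical names chosen for illustration, and the real item layout is not shown here. The point is only that the loop is bounded by the smaller of the two part counts, so it never computes a negative length or reads past the new value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PART_MAX 8  /* hypothetical maximum bytes per xattr part */

/* Number of parts needed to hold a value of the given size. */
size_t demo_nr_parts(size_t bytes)
{
	return (bytes + DEMO_PART_MAX - 1) / DEMO_PART_MAX;
}

/*
 * Copy the new value into the existing items that both the old and new
 * xattr use.  Returns the number of items updated.  Old items past the
 * overlap are left for the caller to delete.
 */
size_t update_overlapping_parts(unsigned char old_items[][DEMO_PART_MAX],
				size_t old_bytes,
				const unsigned char *new_val, size_t new_bytes)
{
	size_t old_parts = demo_nr_parts(old_bytes);
	size_t new_parts = demo_nr_parts(new_bytes);
	size_t overlap = old_parts < new_parts ? old_parts : new_parts;
	size_t i;

	for (i = 0; i < overlap; i++) {
		size_t off = i * DEMO_PART_MAX;
		size_t len = new_bytes - off;  /* off < new_bytes as i < new_parts */

		assert(off < new_bytes);
		if (len > DEMO_PART_MAX)
			len = DEMO_PART_MAX;
		memcpy(old_items[i], new_val + off, len);
	}

	return overlap;
}
```

Shrinking a 30 byte value to 10 bytes with these sizes updates only the first two items and leaves the remaining old items untouched.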
Zach Brown
223ee5deef Declare v1 of the stable persistent format
From now on if we make incompatible changes to structures or messages
then we update the format version and ensure that the code can deal with
all the versions in its supported range.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
8f60ac06c5 Clean up our ioctl numbers
We had arbitrarily chosen an ioctl code 's' to match scoutfs, but of
course that conflicts.  This chooses an arbitrary hole in the upstream
reservations from ioctl-number.rst.

Then we make sure to have our _IO[WR] usage reflect the direction of the
final type parameter.  For most of our ioctls userspace is writing an
argument parameter to perform an operation (that often has side
effects).   Most of our ioctls should be _IOW because userspace is
writing the parameter, not _IOR (though the operation tends to read
state).  A few ioctls copy output back to userspace in the parameter so
they're _IOWR.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
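The direction convention described in 8f60ac06c5 can be illustrated with the standard Linux ioctl encoding macros. The magic value, command numbers, and struct names below are hypothetical and are not the ones scoutfs uses. The sketch only shows that `_IOW` marks an argument userspace writes to the kernel, while `_IOWR` also marks it as copied back.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>  /* _IOW, _IOWR, _IOC_DIR, _IOC_READ, _IOC_WRITE */

#define DEMO_IOCTL_MAGIC 0xE0  /* hypothetical, not scoutfs's real code */

struct demo_set_args   { uint64_t ino; uint64_t flags; };
struct demo_query_args { uint64_t ino; uint64_t result; };

/* userspace writes the argument to perform an operation */
#define DEMO_IOC_SET   _IOW(DEMO_IOCTL_MAGIC, 1, struct demo_set_args)
/* userspace writes the argument and the kernel copies output back */
#define DEMO_IOC_QUERY _IOWR(DEMO_IOCTL_MAGIC, 2, struct demo_query_args)

int demo_ioctl_writes_arg(unsigned long nr)
{
	return (_IOC_DIR(nr) & _IOC_WRITE) != 0;
}

int demo_ioctl_reads_arg_back(unsigned long nr)
{
	return (_IOC_DIR(nr) & _IOC_READ) != 0;
}

/* Sanity check of the encoding for the two hypothetical commands. */
void demo_check_directions(void)
{
	assert(demo_ioctl_writes_arg(DEMO_IOC_SET));
	assert(!demo_ioctl_reads_arg_back(DEMO_IOC_SET));
	assert(demo_ioctl_writes_arg(DEMO_IOC_QUERY));
	assert(demo_ioctl_reads_arg_back(DEMO_IOC_QUERY));
}
```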
Zach Brown
932a842ae3 Remove valid_bytes from stat _more ioctls
The idea here was that we'd expand the size of the struct and
valid_bytes would tell the kernel which fields were present in
userspace's struct.  That doesn't combine well with the ioctl convention
of having the size of the type baked into the ioctl number.   We'll
remove this to make the world less surprising.  If we expand the
interface we'd add additional ioctls and types.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
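The convention that 932a842ae3 refers to, where the argument size is baked into the ioctl number, can be seen with `_IOC_SIZE`. The structs, fields, and numbers below are hypothetical. The sketch assumes that an expanded interface would get a new struct and a new command rather than a growing struct with a valid_bytes field.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>  /* _IOWR, _IOC_SIZE */

/* Hypothetical original argument struct. */
struct demo_stat_more {
	uint64_t meta_seq;
	uint64_t data_seq;
};

/* Hypothetical expanded interface: a new type with its own command. */
struct demo_stat_more_v2 {
	uint64_t meta_seq;
	uint64_t data_seq;
	uint64_t extra;
};

#define DEMO_IOC_STAT_MORE    _IOWR(0xE0, 3, struct demo_stat_more)
#define DEMO_IOC_STAT_MORE_V2 _IOWR(0xE0, 4, struct demo_stat_more_v2)

/* The size the kernel can expect is recoverable from the number itself. */
unsigned int demo_ioctl_arg_size(unsigned long nr)
{
	return _IOC_SIZE(nr);
}

void demo_check_sizes(void)
{
	assert(demo_ioctl_arg_size(DEMO_IOC_STAT_MORE) ==
	       sizeof(struct demo_stat_more));
	assert(demo_ioctl_arg_size(DEMO_IOC_STAT_MORE_V2) ==
	       sizeof(struct demo_stat_more_v2));
}
```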
Zach Brown
618a7a4c47 Remove unused lock server alloc and wri
While checking in on some other code I noticed that we have lingering
allocator and writer contexts over in the lock server.  The lock server
used to manage its own client state and recovery.  We've since moved
that into shared recov functionality in the server.  The lock server no
longer manipulates its own btrees and doesn't need these unused
references to the server's contexts.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
9ebf43db99 Spread out key zone and type values
Introduce some space between the current key zone and type values so
that we have room to insert new keys amongst the current keys if we need
to.   A spacing of 4 is arbitrarily chosen as small enough to still give
us intuitively small numbers while leaving enough room to grow, given
how long it's taken to come to the current number of keys.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
e38beee85a Stop using inode index key type as array index
The code that updates inode index items on behalf of indexed fields uses
an array to track changes in the fields.  Those array indexes were the
raw key type values.

We're about to introduce some sparse space between all the key values so
that we have some room to add keys in the future at arbitrary sort
positions amongst the previous keys.

We don't want the inode index item updating code to keep using raw types
as array indices when the type values are no longer small dense values.
We introduce indirection from type values to array indices to keep the
tracking array in the in-memory inode struct small.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
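A minimal sketch of the indirection described in e38beee85a, assuming two indexed fields whose key type values are spaced by 4. The type names, values, and struct below are hypothetical. The idea is only that a small mapping function keeps the in-memory tracking array dense even when the type values are sparse.

```c
#include <assert.h>

/* Hypothetical sparse key type values, spaced by 4. */
enum {
	DEMO_INODE_INDEX_META_SEQ_TYPE = 4,
	DEMO_INODE_INDEX_DATA_SEQ_TYPE = 8,
};

#define DEMO_INODE_INDEX_NR 2  /* number of indexed fields tracked */

/* Map a sparse type value to a dense array index, or -1 if unknown. */
int demo_index_arr_idx(int type)
{
	switch (type) {
	case DEMO_INODE_INDEX_META_SEQ_TYPE:
		return 0;
	case DEMO_INODE_INDEX_DATA_SEQ_TYPE:
		return 1;
	default:
		return -1;
	}
}

/* Hypothetical in-memory inode state: one slot per indexed field. */
struct demo_inode_info {
	unsigned long long index_major[DEMO_INODE_INDEX_NR];
};

void demo_track_index_change(struct demo_inode_info *si, int type,
			     unsigned long long major)
{
	int idx = demo_index_arr_idx(type);

	assert(idx >= 0 && idx < DEMO_INODE_INDEX_NR);
	si->index_major[idx] = major;
}
```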
Zach Brown
20ac2e35fa Remove clock_sync field from net message
As we freeze the format let's remove this old experiment to try and make
it easier to line up traces from different mounts.   It never worked
particularly well and I think it could be argued that trying to merge
trace logs on different machines isn't a particularly meaningful thing
to do.   You care about how they interact not what they were doing at
the same time with their independent resources.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
80ee2c6d57 Harden client transaction processing
There are a few bad corner cases in the state machine that governs how
client transactions are opened, modified, and committed.

The worst problem is on the server side.   All server request handlers
need to cope with resent requests without causing bad side effects.
Both get_log_trees and commit_log_trees would try to fully process
resent requests.  _get_log_trees() looks safe because it works with the
log_trees that was stored previously.  _commit_log_trees() is not safe
because it can rotate out the srch log file referenced by the sent
log_trees every time it's processed.  This could create extra srch
entries which would delete the first instance of entries.  Worse still,
by injecting the same block structure into the system multiple times it
ends up causing multiple frees of the blocks that make up the srch file.

The client side problems are slightly different, but related.   There
aren't strong constraints which guarantee that we'll only send a commit
request after a get request succeeds.   In crazy circumstances the
commit request in the write worker could come before the first get in
mount succeeds.   Far worse is that we can send multiple commit requests
for one transaction if it changes as we get errors during multiple
queued write attempts, particularly if we get errors from get_log_trees
after having successfully committed.

This hardens all these paths to ensure a strict sequence of
get_log_trees, transaction modification, and commit_log_trees.

On the server we add *_trans_seq fields to the log_trees struct so that
both get_ and commit_ can see that they've already prepared a commit to
send or have already committed the incoming commit, respectively.   We
can use the get_trans_seq field as the trans_seq of the open transaction
and get rid of the entire separate mechanism we used to have for
tracking open trans seqs in the clients.  We can get the same info by
walking the log_trees and looking at their *_trans_seq fields.

In the client we have the write worker immediately return success if
mount hasn't opened the first transaction.   Then we don't have the
worker return to allow further modification until it has gotten success
from get_log_trees.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
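The server-side idea in 80ee2c6d57 can be modeled in a few lines, under heavy simplification. Only the `get_trans_seq` and `commit_trans_seq` field names are suggested by the message's `*_trans_seq` wording; the struct, handlers, and the `srch_rotations` counter below are hypothetical stand-ins. The sketch assumes a resent request can be recognized by comparing sequence numbers, so that side effects run at most once per transaction.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of the per-client log_trees state on the server. */
struct demo_log_trees {
	uint64_t get_trans_seq;      /* seq handed out by the last get */
	uint64_t commit_trans_seq;   /* seq of the last processed commit */
	unsigned int srch_rotations; /* stands in for non-idempotent work */
};

/* Open a transaction, or return the already open one for a resent get. */
uint64_t demo_get_log_trees(struct demo_log_trees *lt, uint64_t *next_seq)
{
	assert(lt && next_seq);

	if (lt->get_trans_seq != lt->commit_trans_seq)
		return lt->get_trans_seq;  /* resent get: already prepared */

	lt->get_trans_seq = (*next_seq)++;
	return lt->get_trans_seq;
}

/* Commit the open transaction exactly once, tolerating resent commits. */
int demo_commit_log_trees(struct demo_log_trees *lt, uint64_t trans_seq)
{
	assert(lt);

	if (trans_seq == lt->commit_trans_seq)
		return 0;        /* resent commit: already processed */
	if (trans_seq != lt->get_trans_seq)
		return -EINVAL;  /* commit without a matching get */

	lt->srch_rotations++;    /* side effects happen only here */
	lt->commit_trans_seq = trans_seq;
	return 0;
}
```

In this model, sending the same commit twice returns success both times but performs the rotation once, and a commit whose sequence matches neither field is rejected.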
Zach Brown
42c4c6dd24 Move transaction sbi fields to trans_info
The transaction code was built a million years ago and put all of its
data in our core super block info.   This finally moves the rest of the
private transaction fields out of the core super block and into the
transaction info.   This makes it clear that it's private to trans.c and
brings it in line with the rest of the subsystems in the tree.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
7d71b610af Add server extent motion tracking
Add tracking in the alloc functions that the server uses to move extents
between allocator structures on behalf of client mounts.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
70ede28e39 Remove unused traced_extent leavings
Remove some lingering support helpers for the traced_extent struct that
we haven't used in a while.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
b477604339 Don't clobber srch compact errors
The srch compaction worker will wait a bit before attempting another
compaction as it finishes a compaction that failed.

Unfortunately, it clobbered the errors it got during compaction with the
result of sending the commit to the server with the error flag.  If the
commit is successful then it thinks there were no errors and immediately
re-queues itself to try the next compaction.

If the error is persistent, as it was with a bug in how we merged log
files with a single page's worth of entries, then we can spin
indefinitely getting an error, clobbering the error with the commit
result, and immediately queueing our work to do it all over again.

This fix preserves existing errors when getting the result of the commit
and will correctly back off.  If we get persistent merge errors at least
they won't consume significant resources.  We add a counter for
the errors so we can get some visibility if this happens.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
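The error handling pattern described in b477604339 amounts to not overwriting an earlier error with a later success. A minimal sketch, with hypothetical function names and a made-up delay value:

```c
#include <assert.h>
#include <errno.h>

#define DEMO_ERR_DELAY_MS 10000  /* hypothetical back-off after a failure */

/*
 * Combine the compaction result with the result of sending the commit.
 * An existing compaction error is preserved even if the commit succeeds.
 */
int demo_compact_result(int compact_err, int commit_err)
{
	assert(compact_err <= 0 && commit_err <= 0);

	return compact_err ? compact_err : commit_err;
}

/* Re-queue immediately on success, back off after any error. */
unsigned int demo_requeue_delay_ms(int err)
{
	return err ? DEMO_ERR_DELAY_MS : 0;
}
```

With this shape, a failed compaction followed by a successful commit still produces a delayed retry rather than an immediate one.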
Zach Brown
75f9aabe75 Allow compacting logs down to a single page
The k-way merge function at the core of the srch file entry merging had
some bookkeeping math (calculating number of parents) that couldn't
handle merging a single incoming entry stream, so it threw a warning and
returned an error.  When refusing to handle that case, it was assuming
that the caller was trying to merge down a single log file which doesn't
make any sense.

But in the case of multiple small unsorted logs we can absolutely end up
with their entries stored in one sorted page.   We have one sorted input
page that's merging multiple log files.  The merge function is also the
path that writes to the output file so we absolutely need to handle this
case.

We more carefully calculate the number of parents, clamping it to one
parent when we'd otherwise get "(roundup(1) -> 1) - 1 == 0" when
calculating the number of parents from the number of inputs.  We can
relax the warning and error to refuse to merge nothing.

The test triggers this case by putting single search entries in the log
files for mounts and unmounting them to force rotation of the mount log
files into mergeable rotated log files.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
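The parent calculation that 75f9aabe75 describes can be sketched as below. This assumes the roundup is to a power of two, as for a binary tournament tree over the inputs; the commit message only says "roundup", so that is an assumption. The function names are hypothetical.

```c
#include <assert.h>

/* Smallest power of two that is >= n, for n >= 1. */
unsigned int demo_roundup_pow_of_two(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Number of parent nodes for a k-way merge over nr_inputs streams.
 * A single input would otherwise give roundup(1) - 1 == 0 parents, so
 * the result is clamped to at least one.  Merging nothing is refused.
 */
unsigned int demo_nr_parents(unsigned int nr_inputs)
{
	unsigned int parents;

	assert(nr_inputs > 0 && nr_inputs <= (1u << 30));

	parents = demo_roundup_pow_of_two(nr_inputs) - 1;
	return parents ? parents : 1;
}
```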
Zach Brown
cf512c5fcf Use inode_count field for statfs file counts
Our statfs implementation had clients reading the super block and using
the next free inode number to guess how many inodes there might be.  We
are very aggressive with giving directories private pools of inode
numbers to allocate from.   They're often not used at all, creating huge
gaps in allocated inode numbers.   The ratio of the average number of
allocations per directory to the batch size given to each directory is
the factor that the used inode count can be off by.

Now that we have a precise count of active inodes we can use that to
return accurate counts of inodes in the files fields in the statfs
struct.  We still don't have static inode allocation so the fields don't
make a ton of sense.  We fake the total and free count to give a
reasonable estimate of the total files that doesn't change while the
free count is calculated from the correct count of used inodes.

While we're at it we add a request to get the summed fields that the
server can cheaply discover in cache rather than having the client
always perform read IOs.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
a53d6d1a8e Add scoutfs_alloc_foreach_super which takes super
Add an alloc_foreach variant which uses the caller's super to walk the
allocators rather than always reading it off the device.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
Zach Brown
95ed36f9d3 Maintain inode count in super and log trees
Add a count of used inodes to the super block and a change in the inode
count to the log_trees struct.   Client transactions track the change in
inode count as they create and delete inodes.   The log_trees delta is
added to the count in the super as finalized log_trees are deleted.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:47 -07:00
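The accounting in 95ed36f9d3 can be modeled as a signed per-transaction delta that is folded into an unsigned total. The struct and function names below are hypothetical simplifications of the super block and log_trees; only the overall flow follows the commit message.

```c
#include <assert.h>
#include <stdint.h>

struct demo_super     { uint64_t inode_count; };       /* total used inodes */
struct demo_log_trees { int64_t inode_count_delta; };  /* change in one log */

/* Client transactions track the change as they create and delete inodes. */
void demo_trans_create_inode(struct demo_log_trees *lt)
{
	lt->inode_count_delta++;
}

void demo_trans_delete_inode(struct demo_log_trees *lt)
{
	lt->inode_count_delta--;
}

/* When a finalized log_trees is deleted its delta is added to the super. */
void demo_apply_log_trees(struct demo_super *super, struct demo_log_trees *lt)
{
	/* a negative delta must not take the count below zero */
	assert(lt->inode_count_delta >= 0 ||
	       super->inode_count >= -(uint64_t)lt->inode_count_delta);

	/* unsigned addition of the converted delta wraps to the right sum */
	super->inode_count += (uint64_t)lt->inode_count_delta;
	lt->inode_count_delta = 0;
}
```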
Zach Brown
94e5bc1457 Remove unused scoutfs_last_ino()
Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:46 -07:00
Zach Brown
366f615c9f Add support for our format version
We had previously started on a relatively simple notion of an
interoperability version which wasn't quite right.  This fleshes out
support for a more functional format version.   The super blocks have a
single version that defines behaviour of the running system.   The code
supports a range of versions and we add some initial interfaces for
updating the version while the system is offline.   All of this together
should let us safely change the underlying format over time.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:46 -07:00
Zach Brown
ac2587017e Add write_nr to quorum blocks
Add a write_nr field to the quorum block header which is incremented
with every write.  Each event also gets a write_nr field that is set to
the incremented value from the header.   This gives us a history of the
order of event updates that isn't sensitive to misconfigured time.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:46 -07:00
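A small model of the write_nr scheme from ac2587017e, assuming a block holds a fixed array of event slots. The struct layout, slot count, and function names are hypothetical. The sketch shows how the per-event write_nr orders updates without consulting any timestamps.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_EVENTS 4  /* hypothetical number of event slots */

struct demo_quorum_block {
	uint64_t write_nr;               /* incremented with every write */
	struct {
		uint64_t write_nr;       /* header value when last updated */
	} events[DEMO_NR_EVENTS];
};

/* Update an event slot as part of writing the block. */
void demo_record_event(struct demo_quorum_block *blk, unsigned int ev)
{
	assert(ev < DEMO_NR_EVENTS);

	blk->write_nr++;
	blk->events[ev].write_nr = blk->write_nr;
}

/* Return the most recently updated event slot, or -1 if none were written. */
int demo_latest_event(const struct demo_quorum_block *blk)
{
	uint64_t best = 0;
	int latest = -1;
	int i;

	for (i = 0; i < DEMO_NR_EVENTS; i++) {
		if (blk->events[i].write_nr > best) {
			best = blk->events[i].write_nr;
			latest = i;
		}
	}
	return latest;
}
```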
Zach Brown
1cdcf41ac7 Move more block read/write functions to util
We're adding another command that does block IO so move some block
reading and writing functions out of mkfs.   We also grow a few function
variants and call the write_sync variant from mkfs instead of having it
manually sync.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:46 -07:00
Zach Brown
024426df28 Add a file for userspace quorum config helpers
Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:46 -07:00
Zach Brown
a0690070ae Don't null terminate our note strings
The code that shows the note sections as files uses the section size to
define the size of the notes payload.  We don't need to null terminate
the strings to define their lengths.  Doing so puts a null in the notes
file which isn't appreciated by many readers.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:46 -07:00
Zach Brown
4e00f95014 run-tests builds our targets with -j
The test harness might as well use all cpus when building.  It's
reasonably safe to assume both that the test systems are otherwise idle
and that the build is likely to succeed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-10-28 12:30:46 -07:00