Add a coding style document that tries to record the conventions that
the project uses. It seemed more appropriate to put it up in the -kmod
git repo context rather than in src/ which would end up in fs/scoutfs/
upstream.
Signed-off-by: Zach Brown <zab@versity.com>
That comment looked very weird indeed until I recognized that I must
have forgotten to delete the first two attempts at starting the
sentence.
Signed-off-by: Zach Brown <zab@versity.com>
The very first greeting a client sends is unique becuase it doesn't yet
have a server_term field set and tells the server to create items to
track the client.
A server processing this request can create the items and then shut down
before the client is able to receive the reply. They'll resend the
greeting without server_term but then the next server will get -EEXIST
errors as it tries to create items for the client. This causes the
connection to break, which the client tries to reestablish, and the
pattern repeats indefinitely.
The fix is to simply recognize that -EEXIST is acceptable during item
creation. Server message handlers always have to address the case where
a resent message was already processed by a previous server but it's
response didn't make it to the client.
Signed-off-by: Zach Brown <zab@versity.com>
Remove an old client info field from the unmount barrier mechanism which
was removed a while ago. It used to be compared to a super field to
decide to finish unmount without reconnecting but now we check for our
mounted_client item in the server's btree.
Signed-off-by: Zach Brown <zab@versity.com>
Define a family field, and add a union for IPv4 and v6 variants, although
v6 is not supported yet.
Family field is now used to determine presence of address in a quorum slot,
instead of checking if addr is zero.
Signed-off-by: Andy Grover <agrover@versity.com>
The block-stale-reads test was built from the ashes of a test that
used counters and triggers to work with the btree when it was
only used on the server.
The initial quick translation to try and trigger block cache retries
while the forest called the btree got so much wrong. It was still
trying to use some 'cl' variable that didn't refer to the client any
more, the trigger helpers now call statfs to find paths and can end up
triggering themselves. and many more counters stale reads can happen
throughout the system while we're working -- not just one from our
trigger.
This fixes it up to consistently use fs numbers instead of
the silly stale cl variable and be less sensitive to triggers firing and
counter differences.
Signed-off-by: Zach Brown <zab@versity.com>
t_trigger_arm always output the value of the trigger after arming on the
premise that tests required the trigger being armed. In the process of
showing the trigger it calls a bunch of t_ helpers that build the path
to the trigger file using statfs_more to get the rid of mounts.
If the trigger being armed is in the server's mount and the specific
trigger test is fired by the server's statfs_more request processing
then the trigger can be fired before read its value. Tests can
inconsistently fail as the golden output shows the trigger being armed
or not depending on if it was in the server's mount or not.
t_trigger_arm_silent doesn't output the value of the armed trigger. It
can be used for low level triggers that don't rely on reading the
trigger's value to discover that their effect has happened.
Signed-off-by: Zach Brown <zab@versity.com>
Tests can use t_counter_diff to put a message in their golden output
when a specific change in counters is expected. This adds
t_counter_diff_changed to output a message that indicates change or not,
for tests that want to see counters change but the amount of change
doesn't need to be precisely known.
Signed-off-by: Zach Brown <zab@versity.com>
Each transaction maintains a global list of inodes to sync. It checks
the inode and adds it in each write_end call per OS page. Locking and
unlocking the global spinlock was showing up in profiles. At the very
least, we can only get the lock once per large file that's written
during a transaction. This will reduce spinlock traffic on the lock by
the number of pages written per file. We'll want a better solution in
the long run, but this helps for now.
Signed-off-by: Zach Brown <zab@versity.com>
Each transaction hold makes multiple calls to _alloc_meta_low to see if
the transaction should be committed to refill allocators before the
caller's hold is acquired and they can dirty blocks in the transaction.
_alloc_meta_low was using a spinlock to sample the allocator list_head
blocks to determine if there was space available. The lock and unlock
stores were creating significant cacheline contention.
The _alloc_meta_low calls are higher frequency than allocations. We can
use a seqlock to have exclusive writers and allow concurrent
_alloc_meta_low readers who retry if a writer intervenes.
Signed-off-by: Zach Brown <zab@versity.com>
We saw the transaction info lock showing up in profiles. We were doing
quite a lot of work with that lock held. We can remove it entirely and
use an atomic.
Instead of a locked holders count and writer boolean we can use an
atomic holders and have a high bit indicate that the write_func is
pending. This turns the lock/unlock pairs in hold and release into
atomic inc/cmpxchg/dec operations.
Then we were checking allocators under the trans lock. Now that we have
an atomic holders count we can increment it to prevent the writer from
commiting and release it after the checks if we need another commit
before the hold.
And finally, we were freeing our allocated reservation struct under the
lock. We weren't actually doing anything with the reservation struct so
we can use journal_info as the nested hold counter instead of having it
point to an allocated and freed struct.
Signed-off-by: Zach Brown <zab@versity.com>
As the implementation shifted away from the ring of btree blocks and LSM
segments we lost callers to all these triggers. They're unused and can
be removed.
Signed-off-by: Zach Brown <zab@versity.com>
The previous test that triggered re-reading blocks, as though they were
stale, was written in the era where it only hit btree blocks and
everything else was stored in LSM segments.
This reworks the test to make it clear that it affects all our block
readers today. The test only exercise the core read retry path, but it
could be expanded to test callers retrying with newer references after
they get -ESTALE errors.
Signed-off-by: Zach Brown <zab@versity.com>
Our block cache consistency mechanism allows readers to try and read
stale block references. They check block headers of the block they read
to discover if it has been modified and they should retry the read with
newer block references.
For this to be correct the block contents can't change under the
readers. That's obviously true in the simple imagined case of one node
writing and another node reading. But we also have the case where the
stale reader and dirtying writer can be concurrent tasks in the same
mount which share a block cache.
There were a two failure cases that derive from the order of readers and
writers working with blocks.
If the reader goes first, the writer could find the existing block in
the cache and modify it while the reader assumes that it is read only.
The fix is to have the writer always remove any existing cached block
and insert a newly allocated block into the cache with the header fields
already changed. Any existing readers will still have their cached
block references and any new readers will see the modified headers and
return -ESTALE.
The next failure comes from readers trying to invalidate dirty blocks
when they see modified headers. They assumed that the existing cached
block was old and could be dropped so that a new current version could
be read. But in this case a local writer has clobbered the reader's
stale block and the reader should immediately return -ESTALE.
Signed-off-by: Zach Brown <zab@versity.com>
To create dirty blocks in memory each block type caller currently gets a
reference on a created block and then dirties it. The reference it gets
could be an existing cached block that stale readers are currently
using. This creates a problem with our block consistency protocol where
writers can dirty and modify cached blocks that readers are currently
reading in memory, leading to read corruption.
This commit is the first step in addressing that problem. We add a
scoutfs_block_dirty_ref() call which returns a reference to a dirtied
block from the block core in one call. We're only changing the callers
in this patch but we'll be reworking the dirtying mechanism in an
upcoming patch to avoid corrupting readers.
Signed-off-by: Zach Brown <zab@versity.com>
Update scoutfs print to use the new block_ref struct instead of the
handful of per-block type ref structs that we had accumulated.
Signed-off-by: Zach Brown <zab@versity.com>
Each of the different block types had a reading function that read a
block and then checked their reference struct for their block type.
This gets rid of each block reference type and has a single block_ref
type which is then checked by a single ref reading function in the block
core. By putting ref checking in the core we no longer have to export
checking the block header crc, verifying headers, invalidating blocks,
or even reading raw blocks themseves. Everyone reads refs and leaves
the checking up to the core.
The changes don't have a significant functional effect. This is mostly
just changing types and moving code around. (There are some changes to
visible counters.)
This shares code, which is nice, but this is putting the block reference
checking in one place in the block core so that in a few patches we can
fix problems with writers dirtying blocks that are being read.
Signed-off-by: Zach Brown <zab@versity.com>
The block cache wasn't safely racing readers walking the rcu radix_tree
and the shrinker walking the LRU list. A reader could get a reference
to a block that had been removed from the radix and was queued for
freeing. It'd clobber the free's llist_head union member by putting the
block back on the lru and both the read and free would crash as they
each corrupted each other's memory. We rarely saw this in heavy load
testing.
The fix is to clean up the use of rcu, refcounting, and freeing.
First, we get rid of the LRU list. Now we don't have to worry about
resolving racing accesses of blocks between two independent structures.
Instead of shrinking walking the LRU list, we can mark blocks on access
such that shrinking can walk all blocks randomly and expect to quickly
find candidates to shrink.
To make it easier to concurrently walk all the blocks we switch to the
rhashtable instead of the radix tree. It also has nice per-bucket
locking so we can get rid of the global lock that protected the LRU list
and radix insertion. (And it isn't limited to 'long' keys so we can get
rid of the check for max meta blknos that couldn't be cached.)
Now we need to tighten up when read can get a reference and when shrink
can remove blocks. We have presence in the hash table hold a refcount
but we make it a magic high bit in the refcount so that it can be
differentiated from other references. Now lookup can atomically get a
reference to blocks that are in the hash table, and shrinking can
atomically remove blocks when it is the only other reference.
We also clean up freeing a bit. It has to wait for the rcu grace period
to ensure that no other rcu readers can reference the blocks its
freeing. It has to iterate over the list with _safe because it's
freeing as it goes.
Interestingly, when reworking the shrinker I noticed that we weren't
scaling the nr_to_scan from the pages we returned in previous shrink
calls back to blocks. We now divide the input from pages back into
blocks.
Signed-off-by: Zach Brown <zab@versity.com>
We had a mutex protecting the list of farewell requests. The critical
sections are all very short so we can use a spinlock and be a bit
clearer and more efficient. While we're at it, refactor freeing to free
outside of the criticial section.
Signed-off-by: Zach Brown <zab@versity.com>
The server has to be careful to only send farewell responses to quorum
clients once it knows that it won't need their vote to elect a leader to
server remaining clients.
The logic for doing this forgot to take non-quorum clients into account.
It would send farewell requests to all the final majority of quorum
members once they all tried to unmount. This could leave non-quorum
clients hung in unmount trying to send their farewell requests.
The fix is to count mouted_clients items for non-quorum clients and hold
off on sending farewell requests to the final majority until those
non-quorum clients have unmounted.
Signed-off-by: Zach Brown <zab@versity.com>
The recent quorum and unmount fixes should have addressed the failures
we were seeing in the mount-unmount-race test.
Signed-off-by: Zach Brown <zab@versity.com>
Update the man pages with descriptions of the new mkfs -Q quorum slot
configuration and quorum_slot_nr mount option.
Signed-off-by: Zach Brown <zab@versity.com>
We mask device numbers in command output to 0:0 so that we can have
consistent golden test output. The device number matching regex
responsible for this missed a few digits.
It didn't show up until we both tested enough mounts to get larger
device minor numbers and fixed multi-mount consistency so that the
affected tests didn't fail for other reasons.
Signed-off-by: Zach Brown <zab@versity.com>
Our test unmount function unmounted the device instead of the mount
point. It was written this way back in an old version of the harness
which didn't track mount points.
Now that we have mount points, we can just unmount that. This stops the
umount command from having to search through all the current mounts
looking for the mountpoint for the device it was asked to unmount.
Signed-off-by: Zach Brown <zab@versity.com>
I got a test failure where waiting returned an error, but it wasn't
clear what the error was or where it might have come from. Add more
logging so that we learn more about what might have gone wrong.
Signed-off-by: Zach Brown <zab@versity.com>
Update the example configuration in the README to specify the quorum
slots in mkfs arguments and mount options.
Signed-off-by: Zach Brown <zab@versity.com>
The mounted_clients btree stores items to track mounted clients. It's
modified by multiple greeting workers and the farewell work.
The greeting work was serialized by the farewell_mutex, but the
modifications in the farewell thread weren't protected. This could
result in modifications between the threads being lost if the dirty
block reference updates raced in just the right way. I saw this in
testing with deletions in farewell being lost and then that lingering
item preventing unmount because the server thought it had to wait for a
remaining quorum member to unmount.
We fix this by adding a mutex specifically to protect the
mounted_clients btree in the server.
Signed-off-by: Zach Brown <zab@versity.com>
As clients unmount they send a farewell request that cleans up
persistent state associated with the mount. The client needs to be sure
that it gets processed, and we must maintain a majority of quorum
members mounted to be able to elect a server to process farewell
requests.
We had a mechanism using the unmount_barrier fields in the greeting and
super_block to let the final unmounting quorum majority know that their
farewells have been processed and that they didn't need to keep trying
to reconnect.
But we missed that we also need this out of band farewell handling
signal for non-quorum member clients as well. The server can send
farewells to a non-member client as well as the final majority and then
tear down all the connections before the non-quorum client can see its
farewell response. It also needs to be able to know that its farewell
has been processed before the server let the final majority unmount.
We can remove the custom unmount_barrier method and instead have all
unmounting clients check for their mounted_client item in the server's
btree. This item is removed as the last step of farewell processing so
if the client sees that it has been removed it knows that it doesn't
need to resend the farewell and can finish unmounting.
This fixes a bug where a non-quorum unmount could hang if it raced with
the final majority unmounting. I was able to trigger this hang in our
tests with 5 mounts and 3 quorum members.
Signed-off-by: Zach Brown <zab@versity.com>
scoutfs mkfs had two block writing functions: write_block to fill out
some block header fields including crc calculation, and then
write_block_raw to pwrite the raw buffer to the bytes in the device.
These were used inconsistenly as blocks came and went over time. Most
callers filled out all the header fields themselves and called the raw
writer. write_block was only used for super writing, which made sense
because it clobbered the block's header with the super header so the
caller's set header magic and seq fields would be lost.
This cleans up the mess. We only have one block writer and the caller
provides all the hdr fields. Everything uses it instead of filling out
the fields themselves and calling the raw writer.
Signed-off-by: Zach Brown <zab@versity.com>
Add macros for stringifying either the name of a macro or its value. In
keeping with making our utils/ sort of look like kernel code, we use the
kernel stringify names.
Signed-off-by: Zach Brown <zab@versity.com>
Previously quorum configuration specified the number of votes needed to
elected the leader. This was an excessive amount of freedom in the
configuration of the cluster which created all sorts of problems which
had to be designed around.
Most acutely, though, it required a probabilistic mechanism for mounts
to persistently record that they're starting a server so that future
servers could find and possibly fence them. They would write to a lot
of quorum blocks and trust that it was unlikely that future servers
would overwrite all of their written blocks. Overwriting was always
possible, which would be bad enough, but it also required so much IO
that we had to use long election timeouts to avoid spurious fencing.
These longer timeouts had already gone wrong on some storage
configurations, leading to hung mounts.
To fix this and other problems we see coming, like live membership
changes, we now specifically configure the number and identity of mounts
which will be participating in quorum voting. With specific identities,
mounts now have a corresponding specific block they can write to and
which future servers can read from to see if they're still running.
We change the quorum config in the super block from a single
quorum_count to an array of quorum slots which specify the address of
the mount that is assigned to that slot. The mount argument to specify
a quorum voter changes from "server_addr=$addr" to "quorum_slot_nr=$nr"
which specifies the mount's slot. The slot's address is used for udp
election messages and tcp server connections.
Now that we specifically have configured unique IP addresses for all the
quorum members, we can use UDP messages to send and receive the vote
mesages in the raft protocol to elect a leader. The quorum code doesn't
have to read and write disk block votes and is a more reasonable core
loop that either waits for received network messages or timeouts to
advance the raft election state machine.
The quorum blocks are now used for slots to store their persistent raft
term and to set their leader state. We have event fields in the block
to record the timestamp of the most recent interesting events that
happened to the slot.
Now that raft doesn't use IO, we can leave the quorum election work
running in the background. The raft work in the quorum members is
always running so we can use a much more typical raft implementation
with heartbeats. Critically, this decouples the client and election
life cycles. Quorum is always running and is responsible for starting
and stopping the server. The client repeatedly tries to connect to a
server, it has nothing to do with deciding to participate in quorum.
Finally, we add a quorum/status sysfs file which shows the state of the
quorum raft protocol in a member mount and has the last messages that
were sent to or received from the other members.
Signed-off-by: Zach Brown <zab@versity.com>
As a client unmounts it sends a farewell request to the server. We have
to carefully manage unmounting the final quorum members so that there is
always a remaining quorum to elect a leader to start a server to process
all their farewell requests.
The mechanism for doing this described these clients as "voters".
That's not really right, in our terminology voters and candidates are
temporary roles taken on by members during a specific election term in
the raft protocol. It's more accurate to describe the final set of
clients as quorum members. They can be voters or candidates depending
on how the raft protocol timeouts workout in any given election.
So we rename the greeting flag, mounted client flag, and the code and
comments on either side of the client and server to be a little clearer.
This only changes symbols and comments, there should be no functional
change.
Signed-off-by: Zach Brown <zab@versity.com>
As we read the super we check the first and last meta and data blkno
fields. The tests weren't updated as we moved from one device to two
metadata and data devices.
Add a helper that tests the range for the device and test both meta and
data ranges fully, instead of only testing the endpoints of each and
assuming they're related because they're living on one device.
Signed-off-by: Zach Brown <zab@versity.com>
The mount-unmount-race test is occasionally hanging, disable it while we
debug it and have test coverage for unrelated work.
Signed-off-by: Zach Brown <zab@versity.com>
This is checked for by the kernel ioctl code, so giving unaligned values
will return an error, instead of aborting with an assert.
Signed-off-by: Andy Grover <agrover@versity.com>