Commit Graph

1003 Commits

Author SHA1 Message Date
Zach Brown
54644a5074 Add data_alloc_zone_blocks volume option
Add the data_alloc_zone_blocks volume option.  This changes the
behaviour of the server to try and give mounts free data extents which
fall in exclusive fixed-size zones.

We add the field to the scoutfs_volume_options struct and add it to the
set_volopt server handler which enforces constrains on the size of the
zones.

We then add fields to the log_trees struct which records the size of the
zones and sets bits for the zones that contain free extents in the
data_avail allocator root.  The get_log_trees handler is changed to read
all the zone bitmaps from all the items, pass those bitmaps in to
_alloc_move to direct data allocations, and finally update the bitmaps
in the log_trees items to cover the newly allocated extents.  The
log_trees data_alloc_zone fields are cleared as the mount's logs are
reclaimed to indicate that the mount is no longer writing to the zone.

The policy mechanism of finding free extents based on the bitmaps is
ipmlemented down in _data_alloc_move().

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
52c2a465db Add zone awareness to scoutfs_alloc_move()
Add parameters so that scoutfs_alloc_move() can first search for source
extents in specified zones.  It uses relatively cheap searches through
the order items to find extents that intersect with the regions
described by the zone bitmaps.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
bc4975fad4 Add scoutfs_alloc_extents_cb()
Add an allocator call for getting a callback for all the extents in
btree items in an allocator root.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:31:02 -07:00
Zach Brown
9de3ae6dcb Index free extents by order of length
Allocators store free extents in two items, one sorted by their blkno
position and the other by their precise length.

The length index makes it easy to search for precise extent lengths, but
it makes it hard to search for a large extent within a given blkno
region.  Skipping in the blkno dimension has to be done for every
precise length value.

We don't need that level of precision.  If we index the extents by a
coarser order of the length then we have a fixed number of orders in
which we have to skip in the blkno dimension when searching within a
specific region.

This changes the length item to be stored at the log(8) order of the
length of the extents.  This groups extents into orders that are close
to the human-friendly base 10 orders of magnitude.

With this change the order field in the key no longer stores the precise
extent length.  To preserve the length of the extent we need to use
another field.  The only 64bit field remaining is the first which is a
higher comparision priority than the type.  So we use the highest
comparison priority zone field to differentiate the position and order
indexes and can now use all three 64bit fields in the key.

Finally, we have to be careful when constructing a key to use _next when
searching for a large extent.  Previously keys were relying on the magic
property that building a key from an extent length of 0 ended up at the
key value -0 = 0.  That only worked because we never stored zero length
extents.  We now store zero length orders so we can't use the negative
trick anymore.  We explicitly treat 0 length extents carefully when
building keys and we subtract the order from U64_MAX to store the orders
from largest to smallest.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-21 15:25:56 -07:00
Zach Brown
0aa6005c99 Add volume options super, server, and sysfs
Introduce global volume options.  They're stored in the superblock and
can be seen in sysfs files that use network commands to get and
set the options on the server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-05-19 14:15:06 -07:00
Zach Brown
a5ca5ee36d Put back-to-back invalidated locks back on list
A lock that is undergoing invalidation is put on a list of locks in the
super block.  Invalidation requests put locks on the list.  While locks
are invalidated they're temporarily put on a private list.

To support a request arriving while the lock is being processed we
carefully manage the invalidation fields in the lock between the
invalidation worker and the incoming request.  The worker correctly
noticed that a new invalidation request had arrived but it left the lock
on its private list instead of putting it back on the invalidation list
for further processing.  The lock was unreachable, wouldn't get
invalidated, and caused everyone trying to use the lock to block
indefinitely.

When the worker sees another request arrive for an invalidating lock it
needs to move the lock from the private list back to the invalidation
list.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-30 10:00:07 -07:00
Zach Brown
603af327ac Ignore I_FREEING in all inode hash lookups
Previously we added a ilookup variant that ignored I_FREEING inodes
to avoid a deadlock between lock invalidation (lock->I_FREEING) and
eviction (I_FREEING->lock);

Now we're seeing similar deadlocks between eviction (I_FREEING->lock)
and fh_to_dentry's iget (lock->I_FREEING).

I think it's reasonable to ignore all inodes with I_FREEING set when
we're using our _test callback in ilookup or iget.  We can remove the
_nofreeing ilookup variant and move its I_FREEING test into the
iget_test callback provided to both ilookup and iget.

Callers will get the same result, it will just happen without waiting
for a previously I_FREEING inode to leave.  They'll get NULL instead of
waiting from ilookup.  They'll allocate and start to initialize a newer
instance of the inode and insert it along side the previous instance.

We don't have inode number re-use so we don't have the problem where a
newly allocated inode number is relying on inode cache serialization to
not find a previously allocated inode that is being evicted.

This change does allow for concurrent iget of an inode number that is
being deleted on a local node.  This could happen in fh_to_dentry with a
raw inode number.  But this was already a problem between mounts because
they don't have a shared inode cache to serialize them.  Once we fix
that between nodes, we fix it on a single node as well.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-28 12:22:10 -07:00
Zach Brown
ca320d02cb Get i_mutex before cluster lock in file aio_read
The vfs often calls filesystem methods with i_mutex held.  This creates
a natural ordering of i_mutex outside of cluster locks.  The file
aio_read method acquired i_mutex after its cluster lock, creating a
deadlock with other vfs methods like setattr.

The acquisition of i_mutex after the cluster lock was due to using the
pattern where we use the per-task lock to discover if we're the first
user of the lock in a call chain.  Readpage has to do this, but file
aio_read doesn't.  It should never be called recursively.  So we can
acquire the i_mutex outside of the cluster lock and warn if we ever are
called recursively.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-28 12:11:06 -07:00
Andy Grover
6eeaab3322 Merge pull request #35 from versity/zab/invalidate_already_pending
Handle back to back invalidation requests
2021-04-23 16:40:45 -07:00
Andy Grover
ac68d14b8d Merge pull request #36 from versity/zab/move_blocks_next_einval
Fix accidental EINVAL in move_blocks
2021-04-23 14:39:29 -07:00
Zach Brown
63148d426e Fix accidental EINVAL in move_blocks
When move blocks is staging it requires an overlapping offline extent to
cover the entire region to move.

It performs the stage by modifying extents at a time.  If there are
fragmented source extents it will modify each of them at a time in the
region.

When looking for the extent to match the source extent it looked from
the iblock of the start of the whole operation, not the start of the
source extent it's matching.  This meant that it would find a the first
previous online extent it just modified, which wouldn't be online, and
would return -EINVAL.

The fix is to have it search from the logical start of the extent it's
trying to match, not the start of the region.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-23 10:39:34 -07:00
Zach Brown
a27c54568c Handle back to back invalidation requests
The client's incoming lock invalidation request handler triggers a
BUG_ON if it gets a request for a lock that is already processing a
previous invalidation request.  The server is supposed to only send
one request at a time.

The problem is that the batched invalidation request handling will send
responses outside of spinlock coverage before reacquirin the lock and
finishing processing once the response send has been successful.

This gives a window for another invalidation request to arrive after the
response was sent but before the invalidation finished processing.  This
triggers the bug.

The fix is to mark the lock such that we can recognize a valid second
request arriving after we send the response but before we finish
processing.  If it arrives we'll continue invalidation processing with
the arguments from the new request.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-22 17:00:50 -07:00
Zach Brown
dfc2f7a4e8 Remove unused scoutfs_free_unused_locks nr arg
The nr argument wasn't used.  It always tries to free as many as the
shrinker call will let it.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
94dd86f762 Process lock invalidation after shutdown
Lock teardown during unmount involves first calling shutdown and then
destroy.  The shutdown call is meant to ensure that it's safe to tear
down the client network connections.  Once shutdown returns locking is
promising that it won't call into the client to send new lock requests.

The current shutdown implementation is very heavy handed and shuts down
everything.  This creates a deadlock.  After calling lock shutdown, the
client will send its farewell and wait for a response.  The server might
not send the farewell response until other mounts have unmounted if our
client is in the server's mount.  In this case we stil have to be
processing lock invalidation requests to allow other unmounting clients
to make forward progress.

This is reasonably easy and safe to do.  We only use the shutdown flag
to stop lock calls that would change lock state and send requests.  We
don't have it stop incoming requests processing in the work queueing
functions.  It's safe to keep processing incoming requests between
_shutdown and _destroy because the requests already come in through the
client.  As the client shuts down it will stop calling us.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
841d22e26e Disable task reclaim flags for block cache vmalloc
Even though we can pass in gfp flags to vmalloc it eventually calls pte
alloc functions which ignore the caller's flags and use user gfp flags.
This risks reclaim re-entering fs paths during allocations in the block
cache.  These allocs that allowed reclaim deep in the fs was causing
lockdep to add RECLAIM dependencies between locks and holler about
deadlocks.

We apply the same pattern that xfs does for disabling reclaim while
allocating vmalloced block payloads.  Setting PF_MEMALLOC_NOIO causes
reclaim in that task to clear __GFP_IO and __GFP_FS, regardless of the
individual allocation flags in the task, preventing recursion.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
2949b6063f Clear lock invalidate_pending during destroy
Locks have a bunch of state that reflects concurrent processing.
Testing that state determines when it's safe to free a lock because
nothing is going on.

During unmount we abruptly stop processing locks.  Unmount will send a
farewell to the server which will remove all the state associated with
the client that's unmounting for all its locks, regardless of the state
the locks were in.

The client unmount path has to clean up the interupted lock state and
free it, carefully avoiding assertions that would otherwise indicate
that we're freeing used locks.  The move to async lock invalidation
forgot to clean up the invalidation state.  Previously a synchronous
work function would set and clear invalidate_pending while it was
running.  Once we finished waiting for it invalidate_pending would be
clear.  The move to async invalidation work meant that we can still have
invalidate_pending with no work executing.  Lock destruction removed
locks from the invalidation list but forgot to clear the
invalidate_pending flag.

This triggered assertions during unmount that were otherwise harmless.
There was other use of the lock, we just forgot to clean up the lock
state.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
1e88aa6c0f Shutdown data after trans
The data_info struct holds the data allocator that is filled by
transactions as they commit.  We have to free it after we've shutdown
transactions.  It's more like the forest in this regard so we move its
desctruction down by the forest to group similar behaviour.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
d9aea98220 Shutdown locking before transactions
Shutting down the lock client waits for invalidation work and prevents
future work from being queued.  We're currently shutting down the
subsystems that lock calls before lock itself, leading to crashes if we
happen to have invalidations executing as we unmount.

Shutting down locking before its dependencies fixes this.  This was hit
in testing during the inode deletion fixes because it created the
perfect race by acquiring locks during unmount so that the server was
very unlikely to send invalidations on behalf to one mount on behalf of
another as they both unmounted.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
04f4b8bcb3 Perform final transaction write before shutdown
Shutting down the transaction during unmount relied on the vfs unmount
path to perform a sync of any remaining dirty transaction.  There are
ways that we can dirty a transaction during unmount after it calls
the fs sync, so we try to write any remaining dirty transaction before
shutting down.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
fead263af3 Remove unused sb_info shutdown
We're no longer using the shutdown field in our sb info struct.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
4389c73c14 Fix deadlock between lock invalidate and evict
We've had a long-standing deadlock between lock invalidation and
eviction.  Invalidating a lock wants to lookup inodes and drop their
resources while blocking locks.  Eviction wants to get a lock to perform
final deletion while the inodes has I_FREEING set which blocks lookups.

We only saw this deadlock a handful of times in all of the time we've
run the code, but it's now much more common now that we're acquiring
locks in iput to test that nlink is zero instead of only when nlink is
zero.  I see unmount hang regularly when testing final inode deletion.

This adds a lookup variant for invalidation which will refuse to
return freeing inodes so they won't be waited on.  Once they're freeing
they can't be seen by future lock users so they don't need to be
invalidated.  This keeps the lock invalication promise and avoids
sleeping on freeing inodes which creates the deadlock.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
715c29aad3 Proactively drop dentry/inode caches outside locks
Previously we wouldn't try and remove cached dentries and inodes as
lock revocation removed cluster lock coverage.  The next time
we tried to use the cached dentries or inodes we'd acquire
a lock and refresh them.

But now cached inodes prevent final inode deletion.  If they linger
outside cluster locking then any final deletion will need to be deferred
until all its cached inodes are naturally dropped at some point in the
future across the cluster.  It might take refreshing the dentries or for
memory pressure to push out the old cached inodes.

This tries to proctively drop cached dentries and inodes as we lose
cluster lock coverage if they're not actively referenced.  We need to be
careful not to perform final inode deletion during lock invalidation
because it will deadlock, so we defer an iput which could delete during
evict out to async work.

Now deletion can be done synchronously in the task that is performing
the unlink because previous use of the inode on remote mounts hasn't
left unused cached inodes sitting around.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
22371fe5bd Fully destroy inodes after all mounts evict
Today an inode's items are deleted once its nlink reaches zero and the
final iput is called in a local mount.  This can delete inodes from
under other mounts which have opened the inode before it was unlinked on
another mount.

We fix this by adding cached inode tracking.  Each mount maintains
groups of cached inode bitmaps at the same granularity as inode locking.
As a mount performs its final iput it gets a bitmap from the server
which indicates if any other mount has inodes in the group open.

This makes the two fast paths of opening and closing linked files and of
deleting a file that was unlinked locally only pay a moderate cost of
either maintaining the bitmap locally and only getting the open map once
per lock group.  Removing many files in a group will only lock and get
the open map once per group.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-21 12:17:33 -07:00
Zach Brown
c6fd807638 Use recov to manage lock recovery
Now that we have the recov layer we can have the lock server use it to
track lock recovery.  The lock server no longer needs its own recovery
tracking structures and can instead call recov.  We add a call for the
server to call to kick lock processing once lock recovery finishes.  We
can get rid of the persistent lock_client items now that the server is
driving recovery from the mounted_client items.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Zach Brown
592f472a1c Use recov in server to recover client greetings
The server starts recovery when it finds mounted client items as it
starts up.  The clients are done recovering once they send their
greeting.  If they don't recover in time then they'll be fenced.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Zach Brown
a65775588f Add server recovery helpers
Add a little set of functions to help the server track which clients are
waiting to recover which state.  The open map messages need to wait for
recovery so we're moving recovery out of being only in the lock server.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Zach Brown
da1af9b841 Add scoutfs inode ino lock coverage
Add lock coverage which tracks if the inode has been refreshed and is
covered by the inode group cluster lock.  This will be used by
drop_inode and evict_inode to discover that the inode is current and
doesn't need to be refreshed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Zach Brown
accd680a7e Fix block setup always returning 0
Another case of returning 0 instead of ret.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 12:10:35 -07:00
Zach Brown
c3290771a0 Block cache use rht _lookup_ insert for EEXIST
The sneaky rhashtable_insert_fast() can't return -EEXIST despite the
last line of the function *REALLY* making it look like it can.  It just
inserts new objects at the head of the bucket lists without comparing
the insertion with existing objects.

The block cache was relying on insertion to resolve duplicate racing
allocated blocks.  Because it couldn't return -EEXIST we could get
duplicate cached blocks present in the hash table.

rhashtable_lookup_insert_fast() fixes this by actually comparing the
inserted objects key with the objects found in the insertion bucket.  A
racing allocator trying to insert a duplicate cached block will get an
error, drop their allocated block, and retry their lookup.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 09:24:23 -07:00
Zach Brown
cf3cb3f197 Wait for rhashtable to rehash on insert EBUSY
The rhashtable can return EBUSY if you insert fast enough to trigger an
expansion of the next table size that is waiting to be rehashed in an
rcu callback.  If we get EBUSY from rhasthable_insert we call
synchronize_rcu to wait for the rehash to complete before trying again.

This was hit in testing restores of a very large namespace and took a
few hours to hit.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-13 09:24:23 -07:00
Zach Brown
9ee7f7b9dc Block cache shrink restart waits for rcu callbacks
We're seeing cpu livelocks in block shrinking where counters show that a
single block cache shrink call is only getting EAGAIN from repeated
rhashtable walk attempts.  It occurred to me that the running task might
be preventing an RCU grace period from ending by never blocking.

The hope of this commit is that by waiting for rcu callbacks to run
we'll ensure that any pending rebalance callback runs before we retry
the rhashtable walk again.  I haven't been able to reproduce this easily
so this is a stab in the dark.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-07 12:50:50 -07:00
Andy Grover
4630b77b45 cleanup: Use flexible array members instead of 0-length arrays
See Documentation/process/deprecated.rst:217, items[] now preferred over
items[0].

Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-07 10:14:47 -07:00
Andy Grover
bdc43ca634 cleanup: Fix ESTALE handling in forest_read_items
Kinda weird to goto back to the out label and then out the bottom. Just
return -EIO, like forest_next_hint() does.

Don't call client_get_roots() right before retry, since is the first thing
retry does.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-07 10:14:04 -07:00
Andy Grover
6406f05350 cleanup: Remove struct net_lock_grant_response
We're not using the roots member of this struct, so we can just
use struct scoutfs_net_lock directly.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-07 10:13:56 -07:00
Andy Grover
820b7295f0 cleanup: Unused LIST_HEADs
Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-05 16:23:41 -07:00
Zach Brown
b3611103ee Merge pull request #26 from agrover/tmpfile
Support O_TMPFILE and allow MOVE_BLOCKS into released extents
2021-04-05 15:23:41 -07:00
Andy Grover
0deb232d3f Support O_TMPFILE and allow MOVE_BLOCKS into released extents
Support O_TMPFILE: Create an unlinked file and put it on the orphan list.
If it ever gains a link, take it off the orphan list.

Change MOVE_BLOCKS ioctl to allow moving blocks into offline extent ranges.
Ioctl callers must set a new flag to enable this operation mode.

RH-compat: tmpfile support it actually backported by RH into 3.10 kernel.
We need to use some of their kabi-maintaining wrappers to use it:
use a struct inode_operations_wrapper instead of base struct
inode_operations, set S_IOPS_WRAPPER flag in i_flags. This lets
RH's modified vfs_tmpfile() find our tmpfile fn pointer.

Add a test that tests both creating tmpfiles as well as moving their
contents into a destination file via MOVE_BLOCKS.

xfstests common/004 now runs because tmpfile is supported.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-04-05 14:23:44 -07:00
Zach Brown
1259f899a3 srch compaction needs to prepare alloc for commit
The srch client compaction work initializes allocators, dirties blocks,
and writes them out as its transaction.  It forgot to call the
pre-commit allocator prepare function.

The prepare function drops block references used by the meta allocator
during the transaction.  This leaked block references which kept blocks
from being freed by the shrinker under memory pressure.  Eventually
memory was full of leaked blocks and the shrinker walked all of them
looking blocks to free, resulting in an effective livelock that ground
the system to a crawl.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-01 13:04:40 -07:00
Zach Brown
2d393f435b Warn on leaked block refs on unmount
By the time we get to destroying the block cache we should have put all
our block references.  Warn as we tear down the blocks if we see any
blocks that still have references, implying a ref leak.  This caught a
leak caused by srch compaction forgetting to put allocator list block
refs.

Signed-off-by: Zach Brown <zab@versity.com>
2021-04-01 13:04:06 -07:00
Zach Brown
3de703757f Fix weird comment editing error
That comment looked very weird indeed until I recognized that I must
have forgotten to delete the first two attempts at starting the
sentence.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-16 12:02:05 -07:00
Zach Brown
7d67489b0c Handle resent initial client greetings
The very first greeting a client sends is unique becuase it doesn't yet
have a server_term field set and tells the server to create items to
track the client.

A server processing this request can create the items and then shut down
before the client is able to receive the reply.  They'll resend the
greeting without server_term but then the next server will get -EEXIST
errors as it tries to create items for the client.  This causes the
connection to break, which the client tries to reestablish, and the
pattern repeats indefinitely.

The fix is to simply recognize that -EEXIST is acceptable during item
creation.  Server message handlers always have to address the case where
a resent message was already processed by a previous server but it's
response didn't make it to the client.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-16 11:56:26 -07:00
Zach Brown
73084462e9 Remove unused client greeting_umb
Remove an old client info field from the unmount barrier mechanism which
was removed a while ago.  It used to be compared to a super field to
decide to finish unmount without reconnecting but now we check for our
mounted_client item in the server's btree.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-16 10:04:42 -07:00
Andy Grover
efe5d92458 Reserve space in superblock for IPv6 addresses
Define a family field, and add a union for IPv4 and v6 variants, although
v6 is not supported yet.

Family field is now used to determine presence of address in a quorum slot,
instead of checking if addr is zero.

Signed-off-by: Andy Grover <agrover@versity.com>
2021-03-12 14:10:42 -08:00
Zach Brown
513d6b2734 Merge pull request #20 from versity/zab/remove_trans_spinlock
Zab/remove trans spinlock
2021-03-04 13:59:07 -08:00
Zach Brown
f8d39610a2 Only get inode writeback_lock when adding inodes
Each transaction maintains a global list of inodes to sync.  It checks
the inode and adds it in each write_end call per OS page.  Locking and
unlocking the global spinlock was showing up in profiles.  At the very
least, we can only get the lock once per large file that's written
during a transaction.  This will reduce spinlock traffic on the lock by
the number of pages written per file.   We'll want a better solution in
the long run, but this helps for now.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-04 11:39:30 -08:00
Zach Brown
c470c1c9f6 Allow read-mostly _alloc_meta_low
Each transaction hold makes multiple calls to _alloc_meta_low to see if
the transaction should be committed to refill allocators before the
caller's hold is acquired and they can dirty blocks in the transaction.

_alloc_meta_low was using a spinlock to sample the allocator list_head
blocks to determine if there was space available.  The lock and unlock
stores were creating significant cacheline contention.

The _alloc_meta_low calls are higher frequency than allocations.  We can
use a seqlock to have exclusive writers and allow concurrent
_alloc_meta_low readers who retry if a writer intervenes.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-04 11:39:30 -08:00
Zach Brown
e163f3b099 Use atomic holders instead of trans info lock
We saw the transaction info lock showing up in profiles.  We were doing
quite a lot of work with that lock held.  We can remove it entirely and
use an atomic.

Instead of a locked holders count and writer boolean we can use an
atomic holders and have a high bit indicate that the write_func is
pending.  This turns the lock/unlock pairs in hold and release into
atomic inc/cmpxchg/dec operations.

Then we were checking allocators under the trans lock.  Now that we have
an atomic holders count we can increment it to prevent the writer from
commiting and release it after the checks if we need another commit
before the hold.

And finally, we were freeing our allocated reservation struct under the
lock.  We weren't actually doing anything with the reservation struct so
we can use journal_info as the nested hold counter instead of having it
point to an allocated and freed struct.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-01 14:18:04 -08:00
Zach Brown
a508baae76 Remove unused triggers
As the implementation shifted away from the ring of btree blocks and LSM
segments we lost callers to all these triggers.  They're unused and can
be removed.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-01 09:50:00 -08:00
Zach Brown
208c51d1d2 Update stale block reading test
The previous test that triggered re-reading blocks, as though they were
stale, was written in the era where it only hit btree blocks and
everything else was stored in LSM segments.

This reworks the test to make it clear that it affects all our block
readers today.  The test only exercise the core read retry path, but it
could be expanded to test callers retrying with newer references after
they get -ESTALE errors.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-01 09:50:00 -08:00
Zach Brown
9450959ca4 Protect stale block readers from local dirtying
Our block cache consistency mechanism allows readers to try and read
stale block references.  They check block headers of the block they read
to discover if it has been modified and they should retry the read with
newer block references.

For this to be correct the block contents can't change under the
readers.  That's obviously true in the simple imagined case of one node
writing and another node reading.  But we also have the case where the
stale reader and dirtying writer can be concurrent tasks in the same
mount which share a block cache.

There were a two failure cases that derive from the order of readers and
writers working with blocks.

If the reader goes first, the writer could find the existing block in
the cache and modify it while the reader assumes that it is read only.
The fix is to have the writer always remove any existing cached block
and insert a newly allocated block into the cache with the header fields
already changed.  Any existing readers will still have their cached
block references and any new readers will see the modified headers and
return -ESTALE.

The next failure comes from readers trying to invalidate dirty blocks
when they see modified headers.  They assumed that the existing cached
block was old and could be dropped so that a new current version could
be read.  But in this case a local writer has clobbered the reader's
stale block and the reader should immediately return -ESTALE.

Signed-off-by: Zach Brown <zab@versity.com>
2021-03-01 09:49:59 -08:00