Compare commits

...

7 Commits

Author SHA1 Message Date
Zach Brown
96049fe4f9 Update tracing with cluster lock changes
Signed-off-by: Zach Brown <zab@versity.com>
2025-10-31 15:38:31 -05:00
Zach Brown
6b67aee2e3 Directly queue cluster lock work
We had a little helper that scheduled work after testing the list, which
required holding the spinlock.  That was a little too crude: it forced
scoutfs_unlock() to acquire the invalidate work list spinlock even
though it already held the cluster lock spinlock and could see that
invalidate requests were pending and that the invalidation work should
be queued.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-31 15:38:31 -05:00
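
A minimal sketch of the shape this change describes, with made-up names
(demo_lock, demo_lock_info, inv_req_list and inv_work are stand-ins for
the real scoutfs structures): the unlock path already holds the per-lock
spinlock and can see the pending invalidation requests, so it queues the
invalidation work directly instead of calling a helper that took the
work list spinlock just to test the list.

#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/workqueue.h>

/* hypothetical stand-ins for the real scoutfs structures */
struct demo_lock {
	spinlock_t lock;			/* per-cluster-lock spinlock */
	struct list_head inv_req_list;		/* pending invalidation requests */
};

struct demo_lock_info {
	struct workqueue_struct *workq;
	struct work_struct inv_work;		/* processes pending invalidations */
};

/*
 * The unlock path drops the caller's use of the lock under lck->lock.
 * While it's there it can see whether invalidation requests are queued
 * and kick the work itself, rather than calling a helper that took a
 * separate work-list spinlock only to test list_empty().
 */
static void demo_unlock(struct demo_lock_info *linfo, struct demo_lock *lck)
{
	bool queue;

	spin_lock(&lck->lock);
	/* ... drop the caller's hold on the lock ... */
	queue = !list_empty(&lck->inv_req_list);
	spin_unlock(&lck->lock);

	if (queue)
		queue_work(linfo->workq, &linfo->inv_work);
}

The only cross-structure step left is queue_work() itself, which is safe
to call without any of these spinlocks held.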
Zach Brown
09fe4fddd4 Two cluster lock LRU lists with less precision
Currently we maintain a single LRU list of cluster locks and every time
we acquire a cluster lock we move it to the head of the LRU, creating
significant contention acquiring the spinlock that protects the LRU
list.

This moves to two LRU lists, a list of cluster locks ready to be
reclaimed and one for locks that are in active use.  We mark locks with
which list they're on and only move them to the active list if they're
on the reclaim list.  We track imbalance between the two lists so that
they're always roughly the same size.

This removes contention maintaining a precise LRU amongst a set of
active cluster locks.  It doesn't address contention creating or
removing locks, which are already very expensive operations.

It also loses strict ordering by access time.  Reclaim has to make it
through the oldest half of the locks before getting to the newer half,
and there is no guaranteed ordering amongst the newest half.

Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-31 15:38:31 -05:00
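
A rough sketch of the two-list idea under assumed names (the enum,
counters, and helpers here are illustrative rather than the actual
scoutfs fields): each lock records which list it sits on, an acquisition
only takes the LRU spinlock when the lock has to move off the reclaim
list, and the counts are used to keep the two lists roughly the same
size.

#include <linux/spinlock.h>
#include <linux/list.h>

enum { LRU_ACTIVE, LRU_RECLAIM };	/* which list a lock sits on */

struct demo_lock {
	struct list_head lru_head;
	int lru_on_list;
};

struct demo_lru {
	spinlock_t lock;
	struct list_head active;	/* recently used, no strict ordering */
	struct list_head reclaim;	/* reclaim works through these first */
	long nr_active;
	long nr_reclaim;
};

/* Called on every acquisition; the common case touches no shared state. */
static void demo_lru_accessed(struct demo_lru *lru, struct demo_lock *lck)
{
	if (READ_ONCE(lck->lru_on_list) != LRU_RECLAIM)
		return;

	spin_lock(&lru->lock);
	if (lck->lru_on_list == LRU_RECLAIM) {
		list_move_tail(&lck->lru_head, &lru->active);
		lck->lru_on_list = LRU_ACTIVE;
		lru->nr_reclaim--;
		lru->nr_active++;
	}
	/* keep the two lists roughly the same size */
	while (lru->nr_active > lru->nr_reclaim + 1) {
		struct demo_lock *old = list_first_entry(&lru->active,
						struct demo_lock, lru_head);

		list_move_tail(&old->lru_head, &lru->reclaim);
		old->lru_on_list = LRU_RECLAIM;
		lru->nr_active--;
		lru->nr_reclaim++;
	}
	spin_unlock(&lru->lock);
}

Reclaim then pulls from the head of the reclaim list, so it drains the
older half of the locks before the recently promoted ones, with no
strict access-time ordering within either list.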
Zach Brown
f2a11d7777 Lookup cluster locks with an RCU hash table
The previous work that introduced the per-lock spinlock and the
refcount now makes it easy to switch from an rbtree protected by a
spinlock to a hash table protected by RCU read critical sections.
The cluster lock lookup fast path now only dirties fields in the
scoutfs_lock struct itself.

We have to be a little careful when inserting so that users can't get
references to locks that made it into the hash table but which then had
to be removed because they were found to overlap.

Freeing is straightforward: we only have to make sure to free the
locks after an RCU grace period so that read sections can continue to
reference the memory and see the refcount that indicates that the locks
are being freed.

A few remaining places were using the lookup rbtree to walk all locks;
they're converted to use the range tree that we're keeping around to
resolve overlapping ranges, which is also handy for iteration that
isn't performance sensitive.

The LRU still creates contention on the linfo spinlock on every
lookup; fixing that is next.

Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-31 15:38:31 -05:00
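
A hedged sketch of the lookup fast path that this kind of change
enables (the key layout, hash table params, and names are assumptions,
not the scoutfs definitions): the hash table is walked under
rcu_read_lock(), atomic_inc_not_zero() refuses to hand out references
to locks whose refcount already hit zero because they're being freed,
and freeing waits out an RCU grace period so readers can still safely
inspect that refcount.

#include <linux/rhashtable.h>
#include <linux/rcupdate.h>
#include <linux/atomic.h>
#include <linux/slab.h>
#include <linux/types.h>

struct demo_key {
	u64 zone;
	u64 ino;
};

struct demo_lock {
	struct demo_key start;		/* hash key */
	struct rhash_head ht_head;
	atomic_t refcount;		/* 0 means the lock is being freed */
	struct rcu_head rcu_head;
};

/* table assumed initialized elsewhere with rhashtable_init(ht, &demo_ht_params) */
static const struct rhashtable_params demo_ht_params = {
	.key_len = sizeof(struct demo_key),
	.key_offset = offsetof(struct demo_lock, start),
	.head_offset = offsetof(struct demo_lock, ht_head),
};

/* Lookup fast path: no global spinlock, only the lock's own refcount is dirtied. */
static struct demo_lock *demo_lookup(struct rhashtable *ht, struct demo_key *key)
{
	struct demo_lock *lck;

	rcu_read_lock();
	lck = rhashtable_lookup(ht, key, demo_ht_params);
	/* a zero refcount marks a lock on its way out; treat it as not found */
	if (lck && !atomic_inc_not_zero(&lck->refcount))
		lck = NULL;
	rcu_read_unlock();

	return lck;
}

/* Freeing waits out readers that may still be looking at the struct. */
static void demo_free(struct rhashtable *ht, struct demo_lock *lck)
{
	rhashtable_remove_fast(ht, &lck->ht_head, demo_ht_params);
	kfree_rcu(lck, rcu_head);
}

Insertion needs the extra care the message describes: a lock that makes
it into the table but is then found to overlap has to be pulled back
out before lookups can take lasting references to it.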
Zach Brown
4c2a287474 Protect cluster locks with refcounts
The first pass at managing the cluster lock state machine used a simple
global spinlock.  It's time to break it up.

This adds refcounting to the cluster lock struct.  Rather than managing
global data structures and individual lock state all under a global
spinlock, we use per-structure locks, a lock spinlock, and a lock
refcount.

Active users of the cluster lock hold a reference.  This primarily
lets unlock avoid touching the global structures until the refcount
says it's time to remove the lock from them.  Carefully using the
refcount to skip locks that are being freed during lookup also paves
the way for using mostly read-only RCU lookup structures soon.

The global LRU is still modified on every lock use; that'll also be
removed in future work.

The linfo spinlock is now only used for the LRU and lookup structures.
Its other uses are removed, which requires more careful use of the
finer-grained locks that initially just mirrored the use of the linfo
spinlock to keep those introductory patches safe.

The move from a single global lock to finer-grained locks creates
nesting that needs to be managed.  Shrinking and recovery in particular
need to be careful as they transition from the spinlocks used to find
cluster locks to taking the cluster lock spinlock.

The presence of freeing locks in the lookup indexes means that some
callers need to retry if they hit freeing locks.  We have to add this
protection to recovery as it iterates over locks by their key value,
but it wouldn't have made sense to build that around the lookup rbtree
since it's going away.  It makes sense to use the range tree that we're
going to keep using to make sure we don't accidentally introduce locks
whose ranges overlap (which would lead to item cache inconsistency).

Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-31 15:38:31 -05:00
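
A sketch of how the put side could look under this scheme (made-up
names; this is not the actual scoutfs code): active users hold a
reference, only the final put takes the spinlock that still covers the
lookup and range structures, and the struct is freed through RCU so
lookups that lose the atomic_inc_not_zero() race simply see a dying
lock and behave as if it were already gone.

#include <linux/spinlock.h>
#include <linux/atomic.h>
#include <linux/rbtree.h>
#include <linux/rhashtable.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct demo_lock {
	spinlock_t lock;		/* protects this lock's own state */
	atomic_t refcount;
	struct rhash_head ht_head;	/* lookup hash table */
	struct rb_node range_node;	/* range tree kept for overlap checks */
	struct rcu_head rcu_head;
};

struct demo_lock_info {
	spinlock_t lock;		/* now only covers lookup and LRU structures */
	struct rhashtable ht;
	struct rb_root range_root;
};

static void demo_get(struct demo_lock *lck)
{
	atomic_inc(&lck->refcount);
}

/*
 * Only the final put touches the global structures.  Everyone else just
 * drops their count; per-lock state changes happen under lck->lock.
 * Once the count hits zero, lookups skip the lock via atomic_inc_not_zero().
 */
static void demo_put(struct demo_lock_info *linfo, struct demo_lock *lck,
		     const struct rhashtable_params params)
{
	if (!atomic_dec_and_test(&lck->refcount))
		return;

	spin_lock(&linfo->lock);
	rhashtable_remove_fast(&linfo->ht, &lck->ht_head, params);
	rb_erase(&lck->range_node, &linfo->range_root);
	spin_unlock(&linfo->lock);

	/* RCU readers may still hold pointers from lookups */
	kfree_rcu(lck, rcu_head);
}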
Zach Brown
363cc00519 Add per-cluster lock spinlock
Add a spinlock to the scoutfs_lock cluster lock which protects its
state.  This replaces the use of the mount-wide lock_info spinlock.

In practice, for now, this largely just mirrors the continued use of the
lock_info spinlock because it's still needed to protect the mount-wide
structures that are used during put_lock.  That'll be fixed in future
patches as the use of global structures is reduced.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-31 15:38:31 -05:00
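
A tiny sketch of the interim shape described here (field and function
names are hypothetical): the lock's own state is updated under the new
per-lock spinlock, but put_lock still nests inside the mount-wide
lock_info spinlock while the global structures it touches remain.

#include <linux/spinlock.h>

struct demo_lock {
	spinlock_t lock;	/* new: protects this lock's mode, users, flags */
	int mode;
	unsigned int users;
};

struct demo_lock_info {
	spinlock_t lock;	/* still protects the mount-wide LRU and lookup trees */
};

static void demo_put_lock(struct demo_lock_info *linfo, struct demo_lock *lck)
{
	/* the global lock is still taken for now; later patches shrink its scope */
	spin_lock(&linfo->lock);
	spin_lock(&lck->lock);

	if (--lck->users == 0)
		lck->mode = 0;		/* stand-in for the real state transition */

	spin_unlock(&lck->lock);
	spin_unlock(&linfo->lock);
}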
Zach Brown
baaba6ef03 Cluster lock invalidation and shrink spinlocks
Cluster lock invalidation and shrinking have very similar workflows:
they rarely modify the state of locks, and they put them on lists for
work to process.  Today the lists and state modification are protected
by the mount-wide lock_info spinlock, which we want to break up.

This creates a little work_list struct that has a work_queue, list, and
lock.  Invalidation and shrinking use this to track locks that are
being processed and protect the list with the new spinlock in the
struct.

This leaves some awkward nesting with the lock_info spinlock because it
still protects individual lock state.  That will be fixed as we move
towards individual lock refcounting and spinlocks.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-31 15:38:31 -05:00
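
A small sketch of the work_list idea as described above (names and
details are illustrative, not the scoutfs definitions): a list, the
spinlock that protects it, and the work that drains it travel together,
so invalidation and shrinking can each track their locks without the
mount-wide lock_info spinlock.

#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/workqueue.h>

struct demo_work_list {
	struct workqueue_struct *workq;
	struct work_struct work;	/* drains the list below */
	struct list_head list;
	spinlock_t lock;		/* protects only this list */
};

struct demo_lock {
	struct list_head shrink_head;	/* entry on a work_list's list,
					 * assumed INIT_LIST_HEAD()ed at creation */
};

static void demo_work_list_init(struct demo_work_list *wlist,
				struct workqueue_struct *workq,
				work_func_t func)
{
	wlist->workq = workq;
	INIT_WORK(&wlist->work, func);
	INIT_LIST_HEAD(&wlist->list);
	spin_lock_init(&wlist->lock);
}

/* Track a lock for processing and kick the work that will drain the list. */
static void demo_work_list_add(struct demo_work_list *wlist, struct demo_lock *lck)
{
	spin_lock(&wlist->lock);
	if (list_empty(&lck->shrink_head))
		list_add_tail(&lck->shrink_head, &wlist->list);
	spin_unlock(&wlist->lock);

	queue_work(wlist->workq, &wlist->work);
}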
4 changed files with 529 additions and 294 deletions

View File

@@ -482,7 +482,7 @@ int scoutfs_complete_truncate(struct inode *inode, struct scoutfs_lock *lock)
 }
 /*
- * If we're changing the file size than the contents of the file are
+ * If we're changing the file size then the contents of the file are
  * changing and we increment the data_version. This would prevent
  * staging because the data_version is per-inode today, not per-extent.
  * So if there are any offline extents within the new size then we need
File diff suppressed because it is too large

View File

@@ -1,6 +1,8 @@
 #ifndef _SCOUTFS_LOCK_H_
 #define _SCOUTFS_LOCK_H_
+#include <linux/rhashtable.h>
 #include "key.h"
 #include "tseq.h"
@@ -19,20 +21,24 @@ struct inode_deletion_lock_data;
  */
 struct scoutfs_lock {
 	struct super_block *sb;
+	atomic_t refcount;
+	spinlock_t lock;
+	struct rcu_head rcu_head;
 	struct scoutfs_key start;
 	struct scoutfs_key end;
-	struct rb_node node;
+	struct rhash_head ht_head;
 	struct rb_node range_node;
 	u64 refresh_gen;
 	u64 write_seq;
 	u64 dirty_trans_seq;
 	struct list_head lru_head;
+	int lru_on_list;
 	wait_queue_head_t waitq;
 	unsigned long request_pending:1,
 		      invalidate_pending:1;
 	struct list_head inv_head; /* entry in linfo's list of locks with invalidations */
-	struct list_head inv_list; /* list of lock's invalidation requests */
+	struct list_head inv_req_list; /* list of lock's invalidation requests */
 	struct list_head shrink_head;
 	spinlock_t cov_list_lock;

View File

@@ -1100,6 +1100,7 @@ DECLARE_EVENT_CLASS(scoutfs_lock_class,
 		__field(unsigned char, invalidate_pending)
 		__field(int, mode)
 		__field(int, invalidating_mode)
+		__field(unsigned int, refcount)
 		__field(unsigned int, waiters_cw)
 		__field(unsigned int, waiters_pr)
 		__field(unsigned int, waiters_ex)
@@ -1118,6 +1119,7 @@ DECLARE_EVENT_CLASS(scoutfs_lock_class,
 		__entry->invalidate_pending = lck->invalidate_pending;
 		__entry->mode = lck->mode;
 		__entry->invalidating_mode = lck->invalidating_mode;
+		__entry->refcount = atomic_read(&lck->refcount);
 		__entry->waiters_pr = lck->waiters[SCOUTFS_LOCK_READ];
 		__entry->waiters_ex = lck->waiters[SCOUTFS_LOCK_WRITE];
 		__entry->waiters_cw = lck->waiters[SCOUTFS_LOCK_WRITE_ONLY];
@@ -1125,11 +1127,11 @@ DECLARE_EVENT_CLASS(scoutfs_lock_class,
 		__entry->users_ex = lck->users[SCOUTFS_LOCK_WRITE];
 		__entry->users_cw = lck->users[SCOUTFS_LOCK_WRITE_ONLY];
 	),
-	TP_printk(SCSBF" start "SK_FMT" end "SK_FMT" mode %u invmd %u reqp %u invp %u refg %llu wris %llu dts %llu waiters: pr %u ex %u cw %u users: pr %u ex %u cw %u",
+	TP_printk(SCSBF" start "SK_FMT" end "SK_FMT" mode %u invmd %u reqp %u invp %u refg %llu rfcnt %d wris %llu dts %llu waiters: pr %u ex %u cw %u users: pr %u ex %u cw %u",
 		  SCSB_TRACE_ARGS, sk_trace_args(start), sk_trace_args(end),
 		  __entry->mode, __entry->invalidating_mode, __entry->request_pending,
-		  __entry->invalidate_pending, __entry->refresh_gen, __entry->write_seq,
-		  __entry->dirty_trans_seq,
+		  __entry->invalidate_pending, __entry->refresh_gen, __entry->refcount,
+		  __entry->write_seq, __entry->dirty_trans_seq,
 		  __entry->waiters_pr, __entry->waiters_ex, __entry->waiters_cw,
 		  __entry->users_pr, __entry->users_ex, __entry->users_cw)
 );