Compare commits

...

7 Commits

Author SHA1 Message Date
Zach Brown
96049fe4f9 Update tracing with cluster lock changes
Signed-off-by: Zach Brown <zab@versity.com>
2025-10-31 15:38:31 -05:00
Zach Brown
6b67aee2e3 Directly queue cluster lock work
We had a little helper that scheduled work after testing the list, which
required holding the spinlock.  That was a little too crude: it forced
scoutfs_unlock() to acquire the invalidate work list spinlock even
though it already held the cluster lock spinlock and could see that
invalidate requests were pending and that the invalidation work should
be queued.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-31 15:38:31 -05:00
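
A minimal sketch of the shape this change describes, with made-up names
(demo_lock, demo_lock_info, inv_req_list and inv_work are stand-ins for
the real scoutfs structures): the unlock path already holds the per-lock
spinlock and can see the pending invalidation requests, so it queues the
invalidation work directly instead of calling a helper that took the
work list spinlock just to test the list.

#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/workqueue.h>

/* hypothetical stand-ins for the real scoutfs structures */
struct demo_lock {
	spinlock_t lock;			/* per-cluster-lock spinlock */
	struct list_head inv_req_list;		/* pending invalidation requests */
};

struct demo_lock_info {
	struct workqueue_struct *workq;
	struct work_struct inv_work;		/* processes pending invalidations */
};

/*
 * The unlock path drops the caller's use of the lock under lck->lock.
 * While it's there it can see whether invalidation requests are queued
 * and kick the work itself, rather than calling a helper that took a
 * separate work-list spinlock only to test list_empty().
 */
static void demo_unlock(struct demo_lock_info *linfo, struct demo_lock *lck)
{
	bool queue;

	spin_lock(&lck->lock);
	/* ... drop the caller's hold on the lock ... */
	queue = !list_empty(&lck->inv_req_list);
	spin_unlock(&lck->lock);

	if (queue)
		queue_work(linfo->workq, &linfo->inv_work);
}

The only cross-structure step left is queue_work() itself, which is safe
to call without any of these spinlocks held.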
Zach Brown
09fe4fddd4 Two cluster lock LRU lists with less precision
Currently we maintain a single LRU list of cluster locks and every time
we acquire a cluster lock we move it to the head of the LRU, creating
significant contention acquiring the spinlock that protects the LRU
list.

This moves to two LRU lists, a list of cluster locks ready to be
reclaimed and one for locks that are in active use.  We mark locks with
which list they're on and only move them to the active list if they're
on the reclaim list.  We track imbalance between the two lists so that
they're always roughly the same size.

This removes contention maintaining a precise LRU amongst a set of
active cluster locks.  It doesn't address contention creating or
removing locks, which are already very expensive operations.

It also loses strict ordering by access time.  Reclaim has to make it
through the oldest half of the locks before getting to the newer half,
and there is no guaranteed ordering amongst the newest half.

Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-31 15:38:31 -05:00
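
A rough sketch of the two-list idea under assumed names (the enum,
counters, and helpers here are illustrative rather than the actual
scoutfs fields): each lock records which list it sits on, an acquisition
only takes the LRU spinlock when the lock has to move off the reclaim
list, and the counts are used to keep the two lists roughly the same
size.

#include <linux/spinlock.h>
#include <linux/list.h>

enum { LRU_ACTIVE, LRU_RECLAIM };	/* which list a lock sits on */

struct demo_lock {
	struct list_head lru_head;
	int lru_on_list;
};

struct demo_lru {
	spinlock_t lock;
	struct list_head active;	/* recently used, no strict ordering */
	struct list_head reclaim;	/* reclaim works through these first */
	long nr_active;
	long nr_reclaim;
};

/* Called on every acquisition; the common case touches no shared state. */
static void demo_lru_accessed(struct demo_lru *lru, struct demo_lock *lck)
{
	if (READ_ONCE(lck->lru_on_list) != LRU_RECLAIM)
		return;

	spin_lock(&lru->lock);
	if (lck->lru_on_list == LRU_RECLAIM) {
		list_move_tail(&lck->lru_head, &lru->active);
		lck->lru_on_list = LRU_ACTIVE;
		lru->nr_reclaim--;
		lru->nr_active++;
	}
	/* keep the two lists roughly the same size */
	while (lru->nr_active > lru->nr_reclaim + 1) {
		struct demo_lock *old = list_first_entry(&lru->active,
						struct demo_lock, lru_head);

		list_move_tail(&old->lru_head, &lru->reclaim);
		old->lru_on_list = LRU_RECLAIM;
		lru->nr_active--;
		lru->nr_reclaim++;
	}
	spin_unlock(&lru->lock);
}

Reclaim then pulls from the head of the reclaim list, so it drains the
older half of the locks before the recently promoted ones, with no
strict access-time ordering within either list.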
Zach Brown
f2a11d7777 Lookup cluster locks with an RCU hash table
The previous work that introduced the per-lock spinlock and the
refcount now makes it easy to switch from an rbtree protected by a
spinlock to a hash table protected by RCU read critical sections.
The cluster lock lookup fast path now only dirties fields in the
scoutfs_lock struct itself.

We have to be a little careful when inserting so that users can't get
references to locks that made it into the hash table but which then had
to be removed because they were found to overlap.

Freeing is straightforward: we only have to make sure to free the
locks after an RCU grace period so that read sections can continue to
reference the memory and see the refcount that indicates that the locks
are being freed.

A few remaining places were using the lookup rbtree to walk all locks;
they're converted to use the range tree that we're keeping around to
resolve overlapping ranges, which is also handy for iteration that
isn't performance sensitive.

The LRU still creates contention on the linfo spinlock on every
lookup; fixing that is next.

Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-31 15:38:31 -05:00
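
A hedged sketch of the lookup fast path that this kind of change
enables (the key layout, hash table params, and names are assumptions,
not the scoutfs definitions): the hash table is walked under
rcu_read_lock(), atomic_inc_not_zero() refuses to hand out references
to locks whose refcount already hit zero because they're being freed,
and freeing waits out an RCU grace period so readers can still safely
inspect that refcount.

#include <linux/rhashtable.h>
#include <linux/rcupdate.h>
#include <linux/atomic.h>
#include <linux/slab.h>
#include <linux/types.h>

struct demo_key {
	u64 zone;
	u64 ino;
};

struct demo_lock {
	struct demo_key start;		/* hash key */
	struct rhash_head ht_head;
	atomic_t refcount;		/* 0 means the lock is being freed */
	struct rcu_head rcu_head;
};

/* table assumed initialized elsewhere with rhashtable_init(ht, &demo_ht_params) */
static const struct rhashtable_params demo_ht_params = {
	.key_len = sizeof(struct demo_key),
	.key_offset = offsetof(struct demo_lock, start),
	.head_offset = offsetof(struct demo_lock, ht_head),
};

/* Lookup fast path: no global spinlock, only the lock's own refcount is dirtied. */
static struct demo_lock *demo_lookup(struct rhashtable *ht, struct demo_key *key)
{
	struct demo_lock *lck;

	rcu_read_lock();
	lck = rhashtable_lookup(ht, key, demo_ht_params);
	/* a zero refcount marks a lock on its way out; treat it as not found */
	if (lck && !atomic_inc_not_zero(&lck->refcount))
		lck = NULL;
	rcu_read_unlock();

	return lck;
}

/* Freeing waits out readers that may still be looking at the struct. */
static void demo_free(struct rhashtable *ht, struct demo_lock *lck)
{
	rhashtable_remove_fast(ht, &lck->ht_head, demo_ht_params);
	kfree_rcu(lck, rcu_head);
}

Insertion needs the extra care the message describes: a lock that makes
it into the table but is then found to overlap has to be pulled back
out before lookups can take lasting references to it.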
Zach Brown
4c2a287474 Protect cluster locks with refcounts
The first pass at managing the cluster lock state machine used a simple
global spinlock.  It's time to break it up.

This adds refcounting to the cluster lock struct.  Rather than managing
global data structures and individual lock state all under a global
spinlock, we use per-structure locks, a lock spinlock, and a lock
refcount.

Active users of the cluster lock hold a reference.  This primarily
lets unlock avoid touching the global structures until the refcount
says it's time to remove the lock from them.  Carefully using the
refcount to skip locks that are being freed during lookup also paves
the way for using mostly read-only RCU lookup structures soon.

The global LRU is still modified on every lock use; that'll also be
removed in future work.

The linfo spinlock is now only used for the LRU and lookup structures.
Its other uses are removed, which requires more careful use of the
finer-grained locks that initially just mirrored the use of the linfo
spinlock to keep those introductory patches safe.

The move from a single global lock to finer-grained locks creates
nesting that needs to be managed.  Shrinking and recovery in particular
need to be careful as they transition from the spinlocks used to find
cluster locks to taking the cluster lock spinlock.

The presence of freeing locks in the lookup indexes means that some
callers need to retry if they hit freeing locks.  We have to add this
protection to recovery as it iterates over locks by their key value,
but it wouldn't have made sense to build that around the lookup rbtree
since it's going away.  It makes sense to use the range tree that we're
going to keep using to make sure we don't accidentally introduce locks
whose ranges overlap (which would lead to item cache inconsistency).

Signed-off-by: Zach Brown <zab@versity.com>
Signed-off-by: Chris Kirby <ckirby@versity.com>
2025-10-31 15:38:31 -05:00
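
A sketch of how the put side could look under this scheme (made-up
names; this is not the actual scoutfs code): active users hold a
reference, only the final put takes the spinlock that still covers the
lookup and range structures, and the struct is freed through RCU so
lookups that lose the atomic_inc_not_zero() race simply see a dying
lock and behave as if it were already gone.

#include <linux/spinlock.h>
#include <linux/atomic.h>
#include <linux/rbtree.h>
#include <linux/rhashtable.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct demo_lock {
	spinlock_t lock;		/* protects this lock's own state */
	atomic_t refcount;
	struct rhash_head ht_head;	/* lookup hash table */
	struct rb_node range_node;	/* range tree kept for overlap checks */
	struct rcu_head rcu_head;
};

struct demo_lock_info {
	spinlock_t lock;		/* now only covers lookup and LRU structures */
	struct rhashtable ht;
	struct rb_root range_root;
};

static void demo_get(struct demo_lock *lck)
{
	atomic_inc(&lck->refcount);
}

/*
 * Only the final put touches the global structures.  Everyone else just
 * drops their count; per-lock state changes happen under lck->lock.
 * Once the count hits zero, lookups skip the lock via atomic_inc_not_zero().
 */
static void demo_put(struct demo_lock_info *linfo, struct demo_lock *lck,
		     const struct rhashtable_params params)
{
	if (!atomic_dec_and_test(&lck->refcount))
		return;

	spin_lock(&linfo->lock);
	rhashtable_remove_fast(&linfo->ht, &lck->ht_head, params);
	rb_erase(&lck->range_node, &linfo->range_root);
	spin_unlock(&linfo->lock);

	/* RCU readers may still hold pointers from lookups */
	kfree_rcu(lck, rcu_head);
}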
Zach Brown
363cc00519 Add per-cluster lock spinlock
Add a spinlock to the scoutfs_lock cluster lock which protects its
state.  This replaces the use of the mount-wide lock_info spinlock.

In practice, for now, this largely just mirrors the continued use of the
lock_info spinlock because it's still needed to protect the mount-wide
structures that are used during put_lock.  That'll be fixed in future
patches as the use of global structures is reduced.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-31 15:38:31 -05:00
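
A tiny sketch of the interim shape described here (field and function
names are hypothetical): the lock's own state is updated under the new
per-lock spinlock, but put_lock still nests inside the mount-wide
lock_info spinlock while the global structures it touches remain.

#include <linux/spinlock.h>

struct demo_lock {
	spinlock_t lock;	/* new: protects this lock's mode, users, flags */
	int mode;
	unsigned int users;
};

struct demo_lock_info {
	spinlock_t lock;	/* still protects the mount-wide LRU and lookup trees */
};

static void demo_put_lock(struct demo_lock_info *linfo, struct demo_lock *lck)
{
	/* the global lock is still taken for now; later patches shrink its scope */
	spin_lock(&linfo->lock);
	spin_lock(&lck->lock);

	if (--lck->users == 0)
		lck->mode = 0;		/* stand-in for the real state transition */

	spin_unlock(&lck->lock);
	spin_unlock(&linfo->lock);
}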
Zach Brown
baaba6ef03 Cluster lock invalidation and shrink spinlocks
Cluster lock invalidation and shrinking have very similar workflows:
they rarely modify the state of locks, and they put them on lists for
work to process.  Today the lists and state modification are protected
by the mount-wide lock_info spinlock, which we want to break up.

This creates a little work_list struct that has a work_queue, list, and
lock.  Invalidation and shrinking use this to track locks that are
being processed and protect the list with the new spinlock in the
struct.

This leaves some awkward nesting with the lock_info spinlock because it
still protects individual lock state.  That will be fixed as we move
towards individual lock refcounting and spinlocks.

Signed-off-by: Zach Brown <zab@versity.com>
2025-10-31 15:38:31 -05:00
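
A small sketch of the work_list idea as described above (names and
details are illustrative, not the scoutfs definitions): a list, the
spinlock that protects it, and the work that drains it travel together,
so invalidation and shrinking can each track their locks without the
mount-wide lock_info spinlock.

#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/workqueue.h>

struct demo_work_list {
	struct workqueue_struct *workq;
	struct work_struct work;	/* drains the list below */
	struct list_head list;
	spinlock_t lock;		/* protects only this list */
};

struct demo_lock {
	struct list_head shrink_head;	/* entry on a work_list's list,
					 * assumed INIT_LIST_HEAD()ed at creation */
};

static void demo_work_list_init(struct demo_work_list *wlist,
				struct workqueue_struct *workq,
				work_func_t func)
{
	wlist->workq = workq;
	INIT_WORK(&wlist->work, func);
	INIT_LIST_HEAD(&wlist->list);
	spin_lock_init(&wlist->lock);
}

/* Track a lock for processing and kick the work that will drain the list. */
static void demo_work_list_add(struct demo_work_list *wlist, struct demo_lock *lck)
{
	spin_lock(&wlist->lock);
	if (list_empty(&lck->shrink_head))
		list_add_tail(&lck->shrink_head, &wlist->list);
	spin_unlock(&wlist->lock);

	queue_work(wlist->workq, &wlist->work);
}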
4 changed files with 529 additions and 294 deletions

View File

@@ -482,7 +482,7 @@ int scoutfs_complete_truncate(struct inode *inode, struct scoutfs_lock *lock)
 }
 /*
- * If we're changing the file size than the contents of the file are
+ * If we're changing the file size then the contents of the file are
  * changing and we increment the data_version. This would prevent
  * staging because the data_version is per-inode today, not per-extent.
  * So if there are any offline extents within the new size then we need
File diff suppressed because it is too large

View File

@@ -1,6 +1,8 @@
 #ifndef _SCOUTFS_LOCK_H_
 #define _SCOUTFS_LOCK_H_
+#include <linux/rhashtable.h>
 #include "key.h"
 #include "tseq.h"
@@ -19,20 +21,24 @@ struct inode_deletion_lock_data;
  */
 struct scoutfs_lock {
 	struct super_block *sb;
+	atomic_t refcount;
+	spinlock_t lock;
+	struct rcu_head rcu_head;
 	struct scoutfs_key start;
 	struct scoutfs_key end;
-	struct rb_node node;
+	struct rhash_head ht_head;
 	struct rb_node range_node;
 	u64 refresh_gen;
 	u64 write_seq;
 	u64 dirty_trans_seq;
 	struct list_head lru_head;
+	int lru_on_list;
 	wait_queue_head_t waitq;
 	unsigned long request_pending:1,
 		      invalidate_pending:1;
 	struct list_head inv_head; /* entry in linfo's list of locks with invalidations */
-	struct list_head inv_list; /* list of lock's invalidation requests */
+	struct list_head inv_req_list; /* list of lock's invalidation requests */
 	struct list_head shrink_head;
 	spinlock_t cov_list_lock;

View File

@@ -1100,6 +1100,7 @@ DECLARE_EVENT_CLASS(scoutfs_lock_class,
 		__field(unsigned char, invalidate_pending)
 		__field(int, mode)
 		__field(int, invalidating_mode)
+		__field(unsigned int, refcount)
 		__field(unsigned int, waiters_cw)
 		__field(unsigned int, waiters_pr)
 		__field(unsigned int, waiters_ex)
@@ -1118,6 +1119,7 @@ DECLARE_EVENT_CLASS(scoutfs_lock_class,
 		__entry->invalidate_pending = lck->invalidate_pending;
 		__entry->mode = lck->mode;
 		__entry->invalidating_mode = lck->invalidating_mode;
+		__entry->refcount = atomic_read(&lck->refcount);
 		__entry->waiters_pr = lck->waiters[SCOUTFS_LOCK_READ];
 		__entry->waiters_ex = lck->waiters[SCOUTFS_LOCK_WRITE];
 		__entry->waiters_cw = lck->waiters[SCOUTFS_LOCK_WRITE_ONLY];
@@ -1125,11 +1127,11 @@ DECLARE_EVENT_CLASS(scoutfs_lock_class,
 		__entry->users_ex = lck->users[SCOUTFS_LOCK_WRITE];
 		__entry->users_cw = lck->users[SCOUTFS_LOCK_WRITE_ONLY];
 	),
-	TP_printk(SCSBF" start "SK_FMT" end "SK_FMT" mode %u invmd %u reqp %u invp %u refg %llu wris %llu dts %llu waiters: pr %u ex %u cw %u users: pr %u ex %u cw %u",
+	TP_printk(SCSBF" start "SK_FMT" end "SK_FMT" mode %u invmd %u reqp %u invp %u refg %llu rfcnt %d wris %llu dts %llu waiters: pr %u ex %u cw %u users: pr %u ex %u cw %u",
 		  SCSB_TRACE_ARGS, sk_trace_args(start), sk_trace_args(end),
 		  __entry->mode, __entry->invalidating_mode, __entry->request_pending,
-		  __entry->invalidate_pending, __entry->refresh_gen, __entry->write_seq,
-		  __entry->dirty_trans_seq,
+		  __entry->invalidate_pending, __entry->refresh_gen, __entry->refcount,
+		  __entry->write_seq, __entry->dirty_trans_seq,
 		  __entry->waiters_pr, __entry->waiters_ex, __entry->waiters_cw,
 		  __entry->users_pr, __entry->users_ex, __entry->users_cw)
 );