Commit Graph

197 Commits

Author SHA1 Message Date
William Banfield
70624e8d27 more lock instrumentation 2022-08-03 18:20:12 -04:00
William Banfield
6b16cf6d68 Revert "update stats queue to be smaller"
This reverts commit d176124aa0.
2022-08-03 18:15:46 -04:00
William Banfield
d392a07b99 no peer status 2022-08-03 18:02:11 -04:00
William Banfield
d176124aa0 update stats queue to be smaller 2022-08-03 17:27:04 -04:00
William Banfield
705316442a More metrics 2022-08-03 16:58:12 -04:00
William Banfield
83dea898fb add metrics 2022-08-03 16:45:44 -04:00
William Banfield
c764cebbe7 add unlock 2022-08-03 16:20:06 -04:00
William Banfield
f859f5ef6e add intermediate lock log 2022-08-03 13:51:39 -04:00
William Banfield
92a8e74fdf add more logs 2022-08-03 11:26:08 -04:00
William Banfield
5d2593c6ee add lock logs 2022-07-29 15:49:08 -04:00
mergify[bot]
0d2bf39c23 indexer: work around indexing problem for duplicate transactions (forward port: #8625) (#8950) 2022-07-21 19:33:08 +02:00
Callum Waters
3e96a376b0 spec: merge v0.35 spec into tendermint (#9018) 2022-07-20 12:37:46 +02:00
M. J. Fromberger
22ed610083 mempool: rework lock discipline to mitigate callback deadlocks (#9030)
The priority mempool has a stricter synchronization requirement than the legacy
mempool. Under sufficiently-heavy load, exclusive access can lead to deadlocks
when processing a large batch of transaction rechecks through an out-of-process
application using the socket client.

By design, a socket client stalls when its send buffer fills, during which time
it holds a lock shared with the receive thread.  While blocked in this state, a
response read by the receive thread waits for the shared lock so the callback
can be invoked.

If we're lucky, the server will then read the next request and make enough room
in the buffer for the sender to proceed. If not, however (e.g., if the next
request is bigger than the one just consumed), the receive thread is blocked:
It is waiting on the lock and cannot read a response.  Once the server's output
buffer fills, the system deadlocks.

This can happen with any sufficiently-busy workload, but is more likely during
a large recheck in the v1 mempool, where the callbacks need exclusive access to
mempool state.  As a workaround, process rechecks for the priority mempool in
their own goroutines outside the mempool mutex.  Responses still head-of-line
block, but will no longer get pushback due to contention on the mempool itself.
2022-07-19 13:28:46 -07:00
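The workaround described in 22ed610083 can be sketched roughly as follows. This is a minimal illustration, not the code from that change: the `pool` and `recheckResult` types and the `onRecheck` and `wait` methods are hypothetical names invented here. The point it shows is that the callback invoked by the response reader returns immediately, and the state update that needs the mutex runs in its own goroutine.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical stand-in for priority mempool state.
type pool struct {
	mu  sync.Mutex
	wg  sync.WaitGroup
	txs map[string]int64 // transaction key -> priority
}

// recheckResult is a hypothetical recheck response from the application.
type recheckResult struct {
	key      string
	ok       bool
	priority int64
}

// onRecheck is the callback run by the response reader. It returns at once:
// the state update happens in a separate goroutine that takes the lock, so
// the reader is never stalled by contention on the mempool mutex.
func (p *pool) onRecheck(r recheckResult) {
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		p.mu.Lock()
		defer p.mu.Unlock()
		if !r.ok {
			delete(p.txs, r.key) // transaction is no longer valid
			return
		}
		if _, found := p.txs[r.key]; found {
			p.txs[r.key] = r.priority // record the updated priority
		}
	}()
}

// wait blocks until every dispatched update has been applied.
func (p *pool) wait() { p.wg.Wait() }

func main() {
	p := &pool{txs: map[string]int64{"a": 1, "b": 2}}

	p.mu.Lock() // simulate a busy mempool while responses arrive
	p.onRecheck(recheckResult{key: "a"})
	p.onRecheck(recheckResult{key: "b", ok: true, priority: 9})
	p.mu.Unlock()

	p.wait()
	fmt.Println(len(p.txs), p.txs["b"])
}
```

In this sketch the callbacks are issued while the mutex is held, which would block a callback that tried to lock inline. Because each update runs in its own goroutine, the reader keeps going and the updates land once the lock is free.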
mergify[bot]
6b18dfcea1 Extract a library from the confix command-line tool. (backport #9012) (#9025)
(cherry picked from commit 18b5a500da)

Pull out the library functionality from scripts/confix and move it to
internal/libs/confix. Replace scripts/confix with a simple stub that has the
same command-line API, but uses the library instead.

Related:

- Move and update unit tests.
- Move scripts/confix/condiff to scripts/condiff.
- Update test data for v34, v35, and v36.
- Update reference diffs.
- Update testdata README.

Co-authored-by: M. J. Fromberger <fromberger@interchain.io>
2022-07-15 08:46:28 -07:00
M. J. Fromberger
b94470a6a4 mempool: ensure evicted transactions are removed from the cache (#9000)
In the original implementation transactions evicted for priority were also
removed from the cache. In addition, remove expired transactions from
the cache.

Related:

- Add Has method to cache implementations.
- Update tests to exercise this condition.
2022-07-14 06:51:54 -07:00
M. J. Fromberger
3790968156 mempool: release lock during app connection flush (#8984)
This case is symmetric to what we did for CheckTx calls, where we release the
mempool mutex to ensure callbacks can fire during call setup.  We also need
this behaviour for application flush, for the same reason: The caller holds the
lock by contract from the Mempool interface.
2022-07-12 10:28:51 -07:00
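The pattern in 3790968156, releasing a lock the caller holds by contract around a blocking call, might look like the sketch below. The `pool` type and the `flushAppConn` and `callback` names are assumptions made for illustration and do not come from the repository.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical mempool with a single mutex.
type pool struct {
	mu        sync.Mutex
	callbacks int
}

// callback stands in for a response callback that needs the mempool lock.
func (p *pool) callback() {
	p.mu.Lock()
	p.callbacks++
	p.mu.Unlock()
}

// flushAppConn must be called with p.mu held. It drops the lock for the
// duration of the flush so that callbacks can run, then reacquires it
// before returning so the caller's locking contract still holds.
func (p *pool) flushAppConn(flush func() error) error {
	p.mu.Unlock()
	defer p.mu.Lock()
	return flush()
}

func main() {
	p := &pool{}

	p.mu.Lock() // the caller holds the lock by contract
	err := p.flushAppConn(func() error {
		p.callback() // would deadlock if the lock were still held
		return nil
	})
	n := p.callbacks // safe: the lock is held again here
	p.mu.Unlock()

	fmt.Println(n, err)
}
```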
M. J. Fromberger
9e64c95e56 mempool: reduce lock contention during CheckTx (cleanup) (#8983)
The way this was originally structured, we reacquired the lock after issuing
the initial ABCI CheckTx call, only to immediately release it. Restructure the
code so that this redundant acquire is no longer necessary.
2022-07-12 08:00:29 -07:00
M. J. Fromberger
cb93d3b587 mempool: don't log message type mismatch in the default callback (#8969) 2022-07-11 18:06:49 -07:00
M. J. Fromberger
f98de20f7e p2p: ensure closed channels stop receiving service (#8979)
Once these channels are closed, we should not continue to service them, as they
will never again deliver nonzero values.
2022-07-11 16:34:05 -07:00
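A receive from a closed Go channel always succeeds immediately with the zero value, so a `select` loop that keeps a closed channel in play will spin on it. One common idiom for the behaviour f98de20f7e describes is to set the channel variable to `nil` once it is closed, since a `nil` channel is never ready. The sketch below is a generic illustration with a hypothetical `serviceUntilClosed` function, not the router code itself.

```go
package main

import "fmt"

// serviceUntilClosed sums values from a and b until both are closed.
// When a channel reports closed, it is replaced by nil so that the
// select statement stops servicing it.
func serviceUntilClosed(a, b <-chan int) int {
	sum := 0
	for a != nil || b != nil {
		select {
		case v, ok := <-a:
			if !ok {
				a = nil // closed: never select this case again
				continue
			}
			sum += v
		case v, ok := <-b:
			if !ok {
				b = nil
				continue
			}
			sum += v
		}
	}
	return sum
}

func main() {
	a := make(chan int, 2)
	b := make(chan int, 1)
	a <- 1
	a <- 2
	b <- 10
	close(a)
	close(b)
	fmt.Println(serviceUntilClosed(a, b))
}
```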
M. J. Fromberger
451e697331 Update generated mocks after upgrade of Mockery v2. (#8973) 2022-07-11 09:18:36 -04:00
mergify[bot]
e3292a48e3 p2p: simpler priority queue (backport #8929) (#8956) 2022-07-08 13:29:42 -04:00
mergify[bot]
1daf7b939d p2p: make peer gossiping coinflip safer (#8949) (#8963)
Closes #8948

(cherry picked from commit 61ce384d75)

Co-authored-by: Sam Kleinman <garen@tychoish.com>
2022-07-08 12:32:12 -04:00
mergify[bot]
156c305b08 p2p: delete cruft (#8958) (#8959)
I think the decision in #8806 is that we shouldn't do this yet, so I think it's best to just drop this.

(cherry picked from commit 636320f901)

Co-authored-by: Sam Kleinman <garen@tychoish.com>
2022-07-08 09:59:57 -04:00
M. J. Fromberger
bc49f66c35 Add more unit tests for the priority mempool. (#8961)
- Add a test for time-based (TTL) expiration.
- Add tests for eviction based on size and priority.
2022-07-07 14:56:34 -07:00
M. J. Fromberger
9b02094827 Fix unbounded heap growth in the priority mempool. (#8944)
The primary effect of this change is to simplify the implementation of the
priority mempool to eliminate an unbounded heap growth observed by the Vega team
when it was enabled in their testnet. It updates and fixes #8775.

The main body of this change is to remove the auxiliary indexing structures,
and use only the concurrent list structure (the same as the legacy mempool) to
maintain both gossip order and priority.

This means that operations that require priority information, such as block
updates and insert-time evictions, require a linear scan over the mempool.
This tradeoff greatly simplifies the code and eliminates the long-term heap
load, at the cost of some extra CPU and short-lived working memory during
CheckTx and Update calls.

Rough benchmark results:

 - This PR:
   BenchmarkTxMempool_CheckTx-10             486373              2271 ns/op
 - Original priority mempool implementation:
   BenchmarkTxMempool_CheckTx-10             500302              2113 ns/op
 - Legacy (v0) mempool:
   BenchmarkCheckTx-10                       364591              3571 ns/op

These benchmarks are not a good proxy for production load, but at least suggest
that the overhead of the implementation changes is not cause for concern.

In addition:

- Rework synchronization so that access to shared data structures is safe.
  Previously shared locks were used to exclude block updates during calls that
  update mempool state. Now access is properly exclusive where necessary.

- Fix a bug in the recheck flow, where priority updates from the application
  were not correctly reflected in the index structures.

- Eliminate the need for separate recheck cursors during block update. This
  avoids the need to explicitly invalidate elements of the concurrent list,
  which averts the dependency cycle that led to objects being pinned.

- Clean up, clarify, and fix inaccuracies in documentation comments throughout
  the package.

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
2022-07-07 07:15:08 -07:00
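The linear scan that 9b02094827 trades for the removed index structures can be illustrated for insert-time eviction as below. This is a simplified sketch under assumptions: the `tx` type, the `evictionVictims` function, and the exact selection policy (evict the lowest-priority candidates first until enough bytes are freed) are hypothetical and may differ from the real implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// tx is a hypothetical mempool entry.
type tx struct {
	key      string
	priority int64
	size     int
}

// evictionVictims scans every transaction (a linear pass, with no auxiliary
// index) and picks lower-priority entries to evict so that at least need
// bytes are freed for an incoming transaction of priority prio. It returns
// nil if evicting all lower-priority entries would still not free enough.
func evictionVictims(txs []tx, prio int64, need int) []tx {
	var cands []tx
	for _, t := range txs {
		if t.priority < prio {
			cands = append(cands, t)
		}
	}
	// Evict the least valuable candidates first.
	sort.SliceStable(cands, func(i, j int) bool {
		return cands[i].priority < cands[j].priority
	})

	var victims []tx
	freed := 0
	for _, t := range cands {
		if freed >= need {
			break
		}
		victims = append(victims, t)
		freed += t.size
	}
	if freed < need {
		return nil
	}
	return victims
}

func main() {
	txs := []tx{{"a", 5, 10}, {"b", 1, 10}, {"c", 3, 10}}
	for _, v := range evictionVictims(txs, 4, 15) {
		fmt.Println(v.key, v.priority)
	}
}
```

The candidate slice is short-lived working memory that exists only for the duration of the call, which matches the tradeoff stated in the commit message: more CPU per call, less long-term heap load.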
William Banfield
da83edc588 p2p: return from conn send on stopped mconn (#8904)
Co-authored-by: Sam Kleinman <garen@tychoish.com>
2022-07-06 10:41:55 -04:00
mergify[bot]
047d7c927b p2p: fix flakey test due to disconnect cooldown (#8917) (#8918)
This test was made flakey by #8839. The cooldown period means that the node in the test will not try to reconnect as quickly as the test expects. This change makes the cooldown shorter in the test so that the node quickly reconnects.

(cherry picked from commit 5274f80de4)

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Co-authored-by: Sam Kleinman <garen@tychoish.com>
2022-07-05 19:11:38 -04:00
mergify[bot]
49788adde5 p2p: use correct context error (#8916) (#8920)
handshakeCtx is the internal context carrying the timeout. Its error should be used for the error return.

(cherry picked from commit 921530c352)

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Co-authored-by: Sam Kleinman <garen@tychoish.com>
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
2022-07-05 13:36:26 -04:00
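The fix in 49788adde5 concerns which context's error is reported when a derived context carries the timeout. A minimal sketch, assuming a hypothetical `handshake` helper rather than the actual router code: when only the timeout fires, the parent context has no error, so the error must come from the derived `handshakeCtx`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// handshake runs step under a timeout derived from ctx. On expiry it
// returns the error of handshakeCtx, the context that carries the timeout.
// Returning ctx.Err() instead would yield nil whenever the parent is
// still live, hiding the timeout from the caller.
func handshake(ctx context.Context, timeout time.Duration, step func(context.Context) error) error {
	handshakeCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	errCh := make(chan error, 1) // buffered so the goroutine never leaks
	go func() { errCh <- step(handshakeCtx) }()

	select {
	case err := <-errCh:
		return err
	case <-handshakeCtx.Done():
		return handshakeCtx.Err()
	}
}

// unresponsivePeer simulates a peer that never answers.
func unresponsivePeer(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// timedOut reports whether err is a deadline expiry.
func timedOut(err error) bool { return errors.Is(err, context.DeadlineExceeded) }

// demoHandshake runs a handshake against the unresponsive peer.
func demoHandshake() error {
	return handshake(context.Background(), 20*time.Millisecond, unresponsivePeer)
}

func main() {
	fmt.Println(timedOut(demoHandshake()))
}
```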
William Banfield
978f754ad3 p2p: set empty timeouts to configed values. (manual backport of #8847) (#8869)
* regenerate mocks using newer style

* p2p: set empty timeouts to small values. (#8847)

These timeouts default to 'do not time out' if they are not set. This ties up resources, potentially indefinitely. If the node on the other side of the handshake is up but unresponsive, the [handshake call](edec79448a/internal/p2p/router.go (L720)) will _never_ return.

* fix light client select statement
2022-06-28 16:07:15 -04:00
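The idea behind 978f754ad3, replacing unset (zero) timeouts with finite values so a stalled peer cannot hold resources forever, could be sketched as follows. The `routerOptions` type, the `applyDefaults` method, and the default durations are all hypothetical and chosen only for illustration. The real change sets the timeouts from configured values.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical fallback values, used only in this sketch.
const (
	defaultHandshakeTimeout = 20 * time.Second
	defaultDialTimeout      = 3 * time.Second
)

// routerOptions is a hypothetical options struct in which a zero timeout
// would otherwise mean "never time out".
type routerOptions struct {
	HandshakeTimeout time.Duration
	DialTimeout      time.Duration
}

// applyDefaults fills in any timeout left at zero, leaving explicitly
// set values untouched.
func (o *routerOptions) applyDefaults() {
	if o.HandshakeTimeout == 0 {
		o.HandshakeTimeout = defaultHandshakeTimeout
	}
	if o.DialTimeout == 0 {
		o.DialTimeout = defaultDialTimeout
	}
}

func main() {
	o := routerOptions{DialTimeout: 2 * defaultDialTimeout}
	o.applyDefaults()
	fmt.Println(o.HandshakeTimeout, o.DialTimeout)
}
```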
mergify[bot]
c4ef566071 p2p: remove dial sleep and provide disconnect cooldown (backport #8839) (#8875)
(cherry picked from commit 52b6dc19ba)
2022-06-27 10:49:51 -04:00
mergify[bot]
826f224c2d p2p: add eviction metrics and cleanup dialing error handling (backport #8819) (#8820) 2022-06-24 10:42:58 -04:00
mergify[bot]
6f4ef72964 p2p: track peers by address (#8841) (#8855)
(cherry picked from commit 436a38f876)

Co-authored-by: Sam Kleinman <garen@tychoish.com>
2022-06-23 13:21:46 -04:00
mergify[bot]
24701cd587 p2p: more dial routines (#8827) (#8828) 2022-06-21 21:27:28 -04:00
William Banfield
e9c87a3c49 remove dial wake change (#8824) 2022-06-21 20:20:04 -04:00
Callum Waters
4322f7d0b9 mempool: make error throwing for CheckTx consistent (#8817) 2022-06-21 18:51:50 +02:00
Sam Kleinman
83526cacbc p2p: peer store and dialing changes (0.35.x backport) (#8740)
* p2p: peer store and dialing changes

(cherry picked from commit 9dbb135152)

* reduce persistent peer max

(cherry picked from commit b213a2766f)

* don't gossip inactive peers

(cherry picked from commit cc28ce298f)

* fix small case

(cherry picked from commit 56a91642dc)

* fix error message

(cherry picked from commit 86db59f53b)

* remove seed flag

(cherry picked from commit 000aa05485)

* reduce logging level

(cherry picked from commit 4e2bc8f51e)

* make const

(cherry picked from commit e3068b50b2)

* update comment

(cherry picked from commit 31bd396c88)

* cleanup

(cherry picked from commit eddb23b5af)

* oops

* overflows

(cherry picked from commit 4c8651026a)

* Update internal/p2p/peermanager.go

Co-authored-by: M. J. Fromberger <michael.j.fromberger@gmail.com>
(cherry picked from commit f23f6e1089)

* Update internal/p2p/peermanager.go

Co-authored-by: M. J. Fromberger <michael.j.fromberger@gmail.com>
(cherry picked from commit 1c02758eaf)

* comment

(cherry picked from commit 9f604fd2ef)

* test: new scoring

(cherry picked from commit 930fd7f2be)

* fix scoring test

(cherry picked from commit 9abc55f3a0)

* cleanup peer manager

* fix panic

* add metrics

* fix compile

* fix test

* default metrics to noop

* noop metrics

* update metrics

(cherry picked from commit 720600ef62)

* rename metrics

* actually shuffle peers more

* fix up advertise

(cherry picked from commit 8195c97590)

* add max dialing attempts

* connection tracking

* comments mostly

(cherry picked from commit 053ecd9b8c)

* Apply suggestions from code review

Co-authored-by: M. J. Fromberger <michael.j.fromberger@gmail.com>

* comments

* fix lint

* cr feedback

* fixup cherrypick

* make wb happy

* more comments

* fixup

* fix lint

* iota fix

* add skip

* cleanup

* remove comment

* fix rand

* fix rand

* use numaddresses correctly

* advertise fixes

* remove some things

* cleanup comment

* more fixes

* toml

* fix comment

* fix spell

* dec limit

* fixes

* up the attempt max

* cr feedback

* probabilistic test

* fix spell

* add metrics for peers stored on startup

* p2p: peer score should not wrap around (#8790)

(cherry picked from commit 4d820ff4f5)

# Conflicts:
#	internal/p2p/peermanager.go

* fix

* wake more

* wake if we need to

Co-authored-by: M. J. Fromberger <michael.j.fromberger@gmail.com>
2022-06-20 13:13:21 -04:00
mergify[bot]
74c6d8100d p2p: fix typo (#8793) (#8794) 2022-06-19 11:52:43 -07:00
mergify[bot]
ce8284c027 p2p: accept should not abort on first error (backport #8759) (#8760) 2022-06-15 07:56:15 -04:00
Callum Waters
28c38522e0 do not log an error for duplicate txs (#8732) 2022-06-10 11:56:00 +02:00
mergify[bot]
af0590a819 consensus: switch timeout message to be debug and clarify meaning (#8694) (#8696)
(cherry picked from commit 75a12ea0c6)

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Co-authored-by: Sam Kleinman <garen@tychoish.com>
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
2022-06-09 09:45:58 -04:00
mergify[bot]
0e3a3fe58b p2p: shed peers from store from other networks (backport #8678) (#8681) 2022-06-02 12:15:55 -04:00
Callum Waters
e8ac37223f pex: align max address thresholds (#8657) 2022-05-31 14:07:25 -04:00
Sam Kleinman
a889f17e51 consensus: restructure peer catchup sleep (#8651) 2022-05-31 11:31:51 -04:00
mergify[bot]
4ee91663da p2p: reduce ability of SendError to disconnect peers (backport #8597) (#8603) 2022-05-25 04:12:43 -04:00
mergify[bot]
2f8483aa85 p2p: remove unused get height methods (backport #8569) (#8571) 2022-05-17 11:32:13 -04:00
mergify[bot]
12fed0ed53 blocksync: validate block before persisting it (backport #8493) (#8496) 2022-05-12 10:36:48 +02:00
Sam Kleinman
bdd59c892c statesync: avoid potential race (#8494) 2022-05-11 15:09:41 -04:00
mergify[bot]
14f0d60f24 p2p: fix setting in con-tracker (#8370) (#8371)
(cherry picked from commit 889341152a)
2022-04-19 23:32:54 -07:00
mergify[bot]
04c1f76569 rpc: avoid leaking threads during checktx (backport #8328) (#8333) 2022-04-17 09:17:03 -04:00
Ethan Reesor
226bc94c5f node: always close database engine (#7113) (#8330) 2022-04-15 14:37:34 -07:00