Commit Graph

9217 Commits

Author SHA1 Message Date
M. J. Fromberger
3e9ecd8197 Prepare changelog for v0.35.0-rc4. (#7181) v0.35.0-rc4 2021-10-29 09:57:23 -07:00
Sam Kleinman
e40a8468a4 config: backport file writing changes (#7182) 2021-10-29 06:38:52 -04:00
M. J. Fromberger
85086d7452 Fix metric cardinality left over from backport (#7180)
One of the patched uses in #7161 missed the message type field,
triggering panic failures from Prometheus.
2021-10-28 15:29:53 -07:00
mergify[bot]
8314f24d79 pubsub: Use distinct client IDs for test subscriptions. (#7178) (#7179)
Fixes #7176. Some of the benchmarks create a bunch of different subscriptions all sharing the same query. These were all using the same client ID, which violates one of the subscriber rules. Ensure each subscriber gets a unique ID.

This has been broken as long as this library has been in the repo—I tracked it back to bb9aa85d and it was already failing there, so I think this never really worked. I'm not sure these test anything useful, but at least now they run.

(cherry picked from commit 1fd7060542)

Co-authored-by: M. J. Fromberger <fromberger@interchain.io>
2021-10-28 05:59:39 -04:00
mergify[bot]
dd1471da91 p2p: add message type into the send/recv bytes metrics (backport #7155) (#7161)
* p2p: add message type into the send/recv bytes metrics (#7155)

This pull request adds a new "mesage_type" label to the send/recv bytes metrics calculated in the p2p code.

Below is a snippet of the updated metrics that includes the updated label:
```
tendermint_p2p_peer_receive_bytes_total{chID="32",chain_id="ci",message_type="consensus_HasVote",peer_id="2551a13ed720101b271a5df4816d1e4b3d3bd133"} 652
tendermint_p2p_peer_receive_bytes_total{chID="32",chain_id="ci",message_type="consensus_HasVote",peer_id="4b1068420ef739db63377250553562b9a978708a"} 631
tendermint_p2p_peer_receive_bytes_total{chID="32",chain_id="ci",message_type="consensus_HasVote",peer_id="927c50a5e508c747830ce3ba64a3f70fdda58ef2"} 631
tendermint_p2p_peer_receive_bytes_total{chID="32",chain_id="ci",message_type="consensus_NewRoundStep",peer_id="2551a13ed720101b271a5df4816d1e4b3d3bd133"} 393
tendermint_p2p_peer_receive_bytes_total{chID="32",chain_id="ci",message_type="consensus_NewRoundStep",peer_id="4b1068420ef739db63377250553562b9a978708a"} 357
tendermint_p2p_peer_receive_bytes_total{chID="32",chain_id="ci",message_type="consensus_NewRoundStep",peer_id="927c50a5e508c747830ce3ba64a3f70fdda58ef2"} 386
```

(cherry picked from commit b4bc6bb4e8)
2021-10-27 07:34:24 -04:00
mergify[bot]
c6d62cc8b2 docs: fix broken links and layout (#7154) (#7163)
This PR does a few minor touch ups to the docs

(cherry picked from commit ce89292712)

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
2021-10-27 05:14:35 -04:00
mergify[bot]
ce6014ddf5 docs: add reactor sections (backport #6510) (#7151) 2021-10-22 18:29:33 +02:00
mergify[bot]
e62a75b627 state: add height assertion to rollback function (#7143) (#7148)
(cherry picked from commit a8ff617773)

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
2021-10-21 18:07:51 +02:00
mergify[bot]
dbc72e0d69 mempool: remove panic when recheck-tx was not sent to ABCI application (#7134) (#7142)
This pull request fixes a panic that exists in both mempools. The panic occurs when the ABCI client misses a response from the ABCI application. This happen when the ABCI client drops the request as a result of a full client queue. The fix here was to loop through the ordered list of recheck-tx in the callback until one matches the currently observed recheck request.

(cherry picked from commit b0130c88fb)

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
2021-10-19 10:21:47 -04:00
mergify[bot]
57e4e18ba3 build: Fix build-docker to include the full context. (#7114) (#7116)
Fixes #7068. The build-docker rule relies on being able to run make
build-linux, but did not pull the Makefile into the build context.
There are various ways to fix this, but this was probably the smallest.

(cherry picked from commit 6538776e6a)

Co-authored-by: M. J. Fromberger <fromberger@interchain.io>
2021-10-12 16:50:34 -07:00
mergify[bot]
b7fe214b81 Revert "abci: change client to use multi-reader mutexes (#6306)" (backport #7106) (#7110)
* Revert "abci: change client to use multi-reader mutexes (#6306)" (#7106)

This reverts commit 1c4dbe30d4.

(cherry picked from commit 34a3fcd8fc)
2021-10-12 12:03:00 -04:00
mergify[bot]
66e8eec194 light: Update links in package docs. (#7099) (#7101)
Fixes #7098. The light client documentation moved to the spec repository.

I was not able to figure out what happened to light-client-protocol.md, it was removed in #5252 but no corresponding file exists in the spec repository. Since the spec also discusses the protocol, this change simply links to the spec and removes the non-functional reference.

Alternatively we could link to the top-level [light client doc](https://docs.tendermint.com/master/tendermint-core/light-client.html) if you think that's better.

(cherry picked from commit 48295955ed)

Co-authored-by: M. J. Fromberger <fromberger@interchain.io>
2021-10-11 19:49:25 -07:00
mergify[bot]
22e33aba98 e2e: light nodes should use builtin abci app (#7095) (#7097)
(cherry picked from commit befd669794)

Co-authored-by: Sam Kleinman <garen@tychoish.com>
2021-10-09 00:32:53 -04:00
mergify[bot]
af85f7e917 e2e: abci protocol should be consistent across networks (#7078) (#7086)
It seems weird in retrospect that we allow networks to contain
applications that use different ABCI protocols.

(cherry picked from commit f2a8f5e054)

Co-authored-by: Sam Kleinman <garen@tychoish.com>
2021-10-08 10:15:15 -04:00
mergify[bot]
f0cd54825f cli: allow node operator to rollback last state (backport #7033) (#7081) 2021-10-08 09:56:18 +02:00
M. J. Fromberger
98bc4f0e2b Update changelog for v0.35.0-rc3. (#7074) v0.35.0-rc3 2021-10-06 11:19:26 -07:00
mergify[bot]
bff85fc07b mempool,rpc: add removetx rpc method (#7047) (#7065)
Addresses one of the concerns with #7041.

Provides a mechanism (via the RPC interface) to delete a single transaction, described by its hash, from the mempool. The method returns an error if the transaction cannot be found. Once the transaction is removed it remains in the cache and cannot be resubmitted until the cache is cleared or it expires from the cache.

(cherry picked from commit 851d2e3bde)

Co-authored-by: Sam Kleinman <garen@tychoish.com>
2021-10-05 16:36:21 -04:00
mergify[bot]
4a952885c5 e2e: automatically prune old app snapshots (#7034) (#7063)
This PR tackles the case of using the e2e application in a long lived testnet. The application continually saves snapshots (usually every 100 blocks) which after a while bloats the size of the application. This PR prunes older snapshots so that only the most recent 10 snapshots remain.

(cherry picked from commit 5703ae2fb3)

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
2021-10-05 15:26:08 -04:00
William Banfield
42ed5d75a5 consensus: wait until peerUpdates channel is closed to close remaining peers (#7058) (#7060)
The race occurred as a result of a goroutine launched by `processPeerUpdate` racing with the `OnStop` method. The `processPeerUpdates` goroutine deletes from the map as `OnStop` is reading from it. This change updates the `OnStop` method to wait for the peer updates channel to be done before closing the peers. It also copies the map contents to a new map so that it will not conflict with the view of the map that the goroutine created in `processPeerUpdate` sees.
2021-10-05 10:49:26 -04:00
M. J. Fromberger
be684091ae Revert "Consolidate related changelog entries. (#7056)" (#7061)
This reverts commits:
  c16cd72c0a
  6ef847fdfe

We decided on another release candidate to sort out SDK merge issues.
2021-10-05 06:38:23 -07:00
M. J. Fromberger
c16cd72c0a Consolidate related changelog entries. (#7056) 2021-10-04 14:05:25 -07:00
M. J. Fromberger
6ef847fdfe Consolidate release candidate changelogs for v0.35. (#7052) 2021-10-04 12:22:32 -07:00
M. J. Fromberger
f361ce09b3 Update Go toolchains to 1.17 in Actions workflows. (#7049) v0.36.0-dev 2021-10-04 15:40:50 +00:00
William Banfield
243c62cc68 statesync: improve rare p2p race condition (#7042)
This is intended to fix a test failure that occurs in the p2p state provider. The issue presents as the state provider timing out waiting for the consensus params response. 

The reason that this can occur is because the statesync reactor has the possibility of attempting to respond to the params request before the state provider is ready to read it. This results in the reactor hitting the `default` case seen here and then never sending on the channel. The stateprovider will then block waiting for a response and never receive one because the reactor opted not to send it.
2021-10-01 20:33:12 +00:00
William Banfield
177850a2c9 statesync: remove deadlock on init fail (#7029)
When statesync is stopped during shutdown, it has the possibility of deadlocking. A dump of goroutines reveals that this is related to the peerUpdates channel not returning anything on its `Done()` channel when `OnStop` is called. As this is occuring, `processPeerUpdate` is attempting to acquire the reactor lock. It appears that this lock can never be acquired. I looked for the places where the lock may remain locked accidentally and cleaned them up in hopes to eradicate the issue. Dumps of the relevant goroutines may be found below. Note that the line numbers below are relative to the code in the `v0.35.0-rc1` tag.

```
goroutine 36 [chan receive]:
github.com/tendermint/tendermint/internal/statesync.(*Reactor).OnStop(0xc00058f200)
        github.com/tendermint/tendermint/internal/statesync/reactor.go:243 +0x117
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc00058f200, 0x0, 0x0)
        github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/node.(*nodeImpl).OnStop(0xc0001ea240)
        github.com/tendermint/tendermint/node/node.go:769 +0x132
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc0001ea240, 0x0, 0x0)
        github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/cmd/tendermint/commands.NewRunNodeCmd.func1.1()
        github.com/tendermint/tendermint/cmd/tendermint/commands/run_node.go:143 +0x62
github.com/tendermint/tendermint/libs/os.TrapSignal.func1(0xc000629500, 0x7fdb52f96358, 0xc0002b5030, 0xc00000daa0)
        github.com/tendermint/tendermint/libs/os/os.go:26 +0x102
created by github.com/tendermint/tendermint/libs/os.TrapSignal
        github.com/tendermint/tendermint/libs/os/os.go:22 +0xe6

goroutine 188 [semacquire]:
sync.runtime_SemacquireMutex(0xc00026b1cc, 0x0, 0x1)
        runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00026b1c8)
        sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        sync/mutex.go:81
sync.(*RWMutex).Lock(0xc00026b1c8)
        sync/rwmutex.go:111 +0x90
github.com/tendermint/tendermint/internal/statesync.(*Reactor).processPeerUpdate(0xc00026b080, 0xc000650008, 0x28, 0x124de90, 0x4)
        github.com/tendermint/tendermint/internal/statesync/reactor.go:849 +0x1a5
github.com/tendermint/tendermint/internal/statesync.(*Reactor).processPeerUpdates(0xc00026b080)
        github.com/tendermint/tendermint/internal/statesync/reactor.go:883 +0xab
created by github.com/tendermint/tendermint/internal/statesync.(*Reactor.OnStart
        github.com/tendermint/tendermint/internal/statesync/reactor.go:219 +0xcd)
```
2021-09-30 19:19:10 +00:00
M. J. Fromberger
bdd815ebc9 Align atomic struct field for compatibility in 32-bit ABIs. (#7037)
The layout of struct fields means that interior fields may not be properly
aligned for 64-bit access.

Fixes #7000.
2021-09-30 10:53:05 -07:00
M. J. Fromberger
77052370cc Update default config template to match mapstructure keys. (#7036)
Fix a couple of cases where we updated the keys in the config reader, but
forgot to update some of their uses in the default template.

Fixes #7031.
2021-09-30 17:13:32 +00:00
William Banfield
6a0d9c832a blocksync: fix shutdown deadlock issue (#7030)
When shutting down blocksync, it is observed that the process can hang completely. A dump of running goroutines reveals that this is due to goroutines not listening on the correct shutdown signal. Namely, the `poolRoutine` goroutine does not wait on `pool.Quit`. The `poolRoutine` does not receive any other shutdown signal during `OnStop` becuase it must stop before the `r.closeCh` is closed. Currently the `poolRoutine` listens in the `closeCh` which will not close until the `poolRoutine` stops and calls `poolWG.Done()`.

This change also puts the `requestRoutine()` in the `OnStart` method to make it more visible since it does not rely on anything that is spawned in the `poolRoutine`.

```
goroutine 183 [semacquire]:
sync.runtime_Semacquire(0xc0000d3bd8)
        runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc0000d3bd0)
        sync/waitgroup.go:130 +0x65
github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).OnStop(0xc0000d3a00)
        github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:193 +0x47
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc0000d3a00, 0x0, 0x0)
        github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/node.(*nodeImpl).OnStop(0xc00052c000)
        github.com/tendermint/tendermint/node/node.go:758 +0xc62
github.com/tendermint/tendermint/libs/service.(*BaseService).Stop(0xc00052c000, 0x0, 0x0)
        github.com/tendermint/tendermint/libs/service/service.go:171 +0x323
github.com/tendermint/tendermint/cmd/tendermint/commands.NewRunNodeCmd.func1.1()
        github.com/tendermint/tendermint/cmd/tendermint/commands/run_node.go:143 +0x62
github.com/tendermint/tendermint/libs/os.TrapSignal.func1(0xc000df6d20, 0x7f04a68da900, 0xc0004a8930, 0xc0005a72d8)
        github.com/tendermint/tendermint/libs/os/os.go:26 +0x102
created by github.com/tendermint/tendermint/libs/os.TrapSignal
        github.com/tendermint/tendermint/libs/os/os.go:22 +0xe6


goroutine 161 [select]:
github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).poolRoutine(0xc0000d3a00, 0x0)
        github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:464 +0x2b3
created by github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).OnStart
        github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:174 +0xf1

goroutine 162 [select]:
github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).processBlockSyncCh(0xc0000d3a00)
        github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:310 +0x151
created by github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).OnStart
        github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:177 +0x54

goroutine 163 [select]:
github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).processPeerUpdates(0xc0000d3a00)
        github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:363 +0x12b
created by github.com/tendermint/tendermint/internal/blocksync/v0.(*Reactor).OnStart
        github.com/tendermint/tendermint/internal/blocksync/v0/reactor.go:178 +0x76
```
2021-09-30 16:19:18 +00:00
Tess Rinearson
c9d92f5f19 .github: remove tessr and bez from codeowners (#7028) 2021-09-29 22:02:28 +02:00
Sam Kleinman
23fe6fd2f9 statesync: ensure test network properly configured (#7026)
This test reliably gets hung up on network configuration, (which may
be a real issue,) but it's network setup is handcranked and we should
ensure that the test focuses on it's core assertions and doesn't fail for 
test architecture reasons.
2021-09-29 16:38:27 +00:00
M. J. Fromberger
962caeae65 Make doc site index default to the latest release (#7023)
Fix the order of lines in docs/versions so that v0.34 is last (the current release).

Related changes:

- Update docs/DOCS_README.md to reflect the current state of how we publish the site.
- Fix the build-docs target in Makefile to not perturb the package-lock.json during the build.
- Fix the Makefile rule to not clobber package-lock.json.
2021-09-29 08:38:52 -07:00
Sam Kleinman
8758078786 consensus: avoid unbuffered channel in state test (#7025) 2021-09-29 11:19:34 -04:00
Sam Kleinman
b1dfbb8bc3 e2e: generator ensure p2p modes (#7021) 2021-09-28 17:04:37 -04:00
Sam Kleinman
c18470a5f1 e2e: use network size in load generator (#7019) 2021-09-28 16:47:35 -04:00
M. J. Fromberger
ea539dcb98 Update changelog for v0.35.0-rc2. (#7011)
Also, relinkify and update the bounty URL.
v0.35.0-rc2
2021-09-28 07:59:47 -07:00
Sam Kleinman
e35a42fc68 e2e: use smaller transactions (#7016)
75% of the failures in the last run all ran with the 10kb
transactions. I'd like to dial it back and see if things improve more.
2021-09-28 14:39:26 +00:00
lklimek
1bd1593f20 fix: race condition in p2p_switch and pex_reactor (#7015)
Closes https://github.com/tendermint/tendermint/issues/7014
2021-09-28 09:32:14 -04:00
Sam Kleinman
6be36613c9 e2e: reduce number of stateless nodes in test networks (#7010) 2021-09-27 17:00:05 -04:00
Sam Kleinman
9a16d930c6 statesync: add logging while waiting for peers (#7007) 2021-09-27 16:46:40 -04:00
Sam Kleinman
8023a2aeef e2e: add generator tests (#7008) 2021-09-27 15:38:03 -04:00
Sam Kleinman
6eaa3b24d6 ci: use cheaper codecov data collection (#7009) 2021-09-27 15:22:25 -04:00
Sam Kleinman
b150ea6b3e e2e: avoid seed nodes when statesyncing (#7006) 2021-09-27 14:08:08 -04:00
Sam Kleinman
b879f71e8e e2e: reduce log noise (#7004) 2021-09-27 13:27:08 -04:00
dependabot[bot]
bce7c2f73b build(deps): Bump google.golang.org/grpc from 1.40.0 to 1.41.0 (#7003) 2021-09-27 17:12:56 +02:00
Callum Waters
60a6c6fb1a e2e: allow running of single node using the e2e app (#6982) 2021-09-27 15:43:07 +02:00
Sam Kleinman
fb9eaf576a e2e: improve chances of statesyncing success (#7001)
This reduces this situation where a node will get stuck block syncing,
which seemed to happen a lot in last nights run.
2021-09-26 16:10:36 +00:00
Sam Kleinman
37ca98a544 e2e: reduce number of statesyncs in test networks (#6999) 2021-09-25 19:14:38 -04:00
Sam Kleinman
c101fa17ab e2e: add limit and sort to generator (#6998)
I observed a couple of problems with the generator in some recent tests: 

- there were a couple of hybrid test cases which did not have any
  legacy nodes (randomness and all.) I change the probability to
  produce more reliable results.

- added options to the generation to be able to add a max (to
  compliment the earlier min) number of nodes for local testing. 

- added an option to support reversing the sort order so "more
  complex" networks were first, as well as tweaked some of the point
  values. 

- this refactored the generators cli parsing to be a bit more clear.
2021-09-25 15:53:04 +00:00
M. J. Fromberger
118bfe2087 abci: Flush socket requests and responses immediately. (#6997)
The main effect of this change is to flush the socket client and server message
encoding buffers immediately once the message is fully and correctly encoded.
This allows us to remove the timer and some other special cases, without
changing the observed behaviour of the system.

-- Background

The socket protocol client and server each use a buffered writer to encode
request and response messages onto the underlying connection. This reduces the
possibility of a single message being split across multiple writes, but has the
side-effect that a request may remain buffered for some time.

The implementation worked around this by keeping a ticker that occasionally
triggers a flush, and by flushing the writer in response to an explicit request
baked into the client/server protocol (see also #6994).

These workarounds are both unnecessary: Once a message has been dequeued for
sending and fully encoded in wire format, there is no real use keeping all or
part of it buffered locally.  Moreover, using an asynchronous process to flush
the buffer makes the round-trip performance of the request unpredictable.

-- Benchmarks

Code: https://play.golang.org/p/0ChUOxJOiHt

I found no pre-existing performance benchmarks to justify the flush pattern,
but a natural question is whether this will significantly harm client/server
performance.  To test this, I implemented a simple benchmark that transfers
randomly-sized byte buffers from a no-op "client" to a no-op "server" over a
Unix-domain socket, using a buffered writer, both with and without explicit
flushes after each write.

As the following data show, flushing every time (FLUSH=true) does reduce raw
throughput, but not by a significant amount except for very small request
sizes, where the transfer time is already trivial (1.9μs).  Given that the
client is calibrated for 1MiB transactions, the overhead is not meaningful.

The percentage in each section is the speedup for flushing only when the buffer
is full, relative to flushing every block.  The benchmark uses the default
buffer size (4096 bytes), which is the same value used by the socket client and
server implementation:

  FLUSH  NBLOCKS  MAX      AVG     TOTAL       ELAPSED       TIME/BLOCK
  false  3957471  512      255     1011165416  2.00018873s   505ns
  true   1068568  512      255     273064368   2.000217051s  1.871µs
                                                             (73%)

  false  536096   4096     2048    1098066401  2.000229108s  3.731µs
  true   477911   4096     2047    978746731   2.000177825s  4.185µs
                                                             (10.8%)

  false  124595   16384    8181    1019340160  2.000235086s  16.053µs
  true   120995   16384    8179    989703064   2.000329349s  16.532µs
                                                             (2.9%)

  false  2114     1048576  525693  1111316541  2.000479928s  946.3µs
  true   2083     1048576  526379  1096449173  2.001817137s  961.025µs
                                                             (1.5%)

Note also that the FLUSH=false baseline is actually faster than the production
code, which flushes more often than is required by the buffer filling up.

Moreover, the timer slows down the overall transaction rate of the client and
server, indepenedent of how fast the socket transfer is, so the loss on a real
workload is probably much less.
2021-09-24 15:37:25 -07:00
Sam Kleinman
71c6682b57 statesync: clean up reactor/syncer lifecylce (#6995)
I've been noticing that there are a number of situations where the
statesync reactor blocks waiting for peers (or similar,) I've moved
things around to improve outcomes in local tests.
2021-09-24 21:40:12 +00:00