From 7ae27cb8562d00d00ba8b2fc2eff283f0de8b3a6 Mon Sep 17 00:00:00 2001 From: William Banfield Date: Fri, 10 Sep 2021 16:58:28 -0400 Subject: [PATCH] add todos for signature and event sections, cleanup based on pr feedback --- docs/rfc/rfc-003-performance-questions.md | 111 ++++++++++++++-------- 1 file changed, 72 insertions(+), 39 deletions(-) diff --git a/docs/rfc/rfc-003-performance-questions.md b/docs/rfc/rfc-003-performance-questions.md index 227075b88..32b2a187c 100644 --- a/docs/rfc/rfc-003-performance-questions.md +++ b/docs/rfc/rfc-003-performance-questions.md @@ -27,18 +27,6 @@ as we make changes to many of its subsystems. Performance is a central concern f upcoming decisions regarding the `p2p` protocol, RPC message encoding and structure, database usage and selection, and consensus protocol updates. -### References - -* [ADR 57](https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-057-RPC.md): -Discussion of alternative RPC frameworks for Tendermint. -* [Issue 1319](https://github.com/tendermint/tendermint/issues/1319): More context -on signature aggregation. -* [Issue 2187](https://github.com/tendermint/tendermint/issues/2187): More context -on discussions of faster hashing algorithms in Tendermint. -* [ABCI Commit description](https://github.com/tendermint/spec/blob/master/spec/abci/apps.md#commit): -More context on locking the mempool during commit. -* [ABCI local client code](https://github.com/tendermint/tendermint/blob/511bd3eb7f037855a793a27ff4c53c12f085b570/abci/client/local_client.go#L84): -Clear look at lack of serialization in the local client. ## Discussion @@ -47,9 +35,12 @@ that are often cited as having performance issues. It raises questions and sugge lines of inquiry that may be valuable for better understanding Tendermint's performance issues. As a note: We should avoid quickly adding many microbenchmarks or package level benchmarks. -These are prone to being worse than useless as they can obscure what _should_ be -focused on: performance of the system from the perspective of a user. We should, -instead, tune performance with an eye towards actions user needs. +These are prone to being worse than useless as they can obscure what _should_ be +focused on: performance of the system from the perspective of a user. We should, +instead, tune performance with an eye towards actions user needs. These users comprise +both operators of Tendermint chains and the people generating transactions for +Tendermint chains. Both of these sets of users are largely aligned in wanting an end-to-end +system that operates quickly and efficiently. REQUEST: The list below may be incomplete, if there are additional sections that are often cited as creating poor performance, please comment so that they may be included. @@ -71,13 +62,18 @@ issue for the application. We need more data to understand the problem directly. We want to drive the popularity and adoption of Tendermint and this will mean allowing for chains with more validators. -We should follow up with users experiencing this issue. We may then want to benchmark -the P2P layer in a few simple ways: -1. By starting a network with 1000 nodes and sending no messages to determine memory -and CPU consumption in the presence of 1000 nodes. -2. By testing two nodes connected to each other via P2P, and attempting to -generate as much load between the two nodes as possible. This will help inform -how much load is incurred from message sends in an otherwise unencumbered P2P stack. +We should follow up with users experiencing this issue. We may then want to add +a series of metrics to the P2P layer to better understand the inefficiencies it produces. + +The following metrics can help us understand the sources of latency in the Tendermint P2P stack: + +* Number of messages sent and received per second +* Time of a message spent on the P2P layer send and receive queues + +The following metrics exist and should be leveraged in addition to those added: + +* Number of peers node's connected to +* Number of bytes per channel sent and received from each peer ### Sync @@ -99,11 +95,16 @@ syncing from. We should calculate how quickly this operation _could possibly_ complete for common chains and nodes. To calculate how quickly this operation could possibly complete, we should assume that a node is reading at line-rate of the NIC and writing at the full drive speed to its -local storage. +local storage. Comparing this theoretical upper-limit to the actual sync times +observed by node operators will give us a good point of comparison for understanding +how much overhead Tendermint incurs. -Comparing this theoretical upper-limit to the actual sync times observed by node operators will -give us a good point of comparison for understanding how much room for improvement there is -within Tendermint. +We should additionally add metrics to the blocksync operation to more clearly pinpoint +slow operations. The following metrics should be added to the block syncing operation: + +* Time to fetch and validate each block +* Time to execute a block +* Blocks sync'd per unit time ### Application @@ -119,12 +120,13 @@ transaction delivery in Tendermint occurs asynchronously and therefore appears u form a bottleneck in ABCI. After delivering all transactions, Tendermint then calls the `Commit` ABCI method. -Tendermint locks all access to the mempool while `Commit` proceeds. This means that -an application that is slow to execute all of its transactions or finalize state during -the `Commit` method will prevent any new transactions from being added to the mempool. -Apps that are slow to commit will prevent consensus from proceeded to the next consensus height -since Tendermint cannot validate validate block proposals or produce block proposals -without the AppHash obtained from the `Commit` method. +Tendermint [locks all access to the mempool][abci-commit-description] while `Commit` +proceeds. This means that an application that is slow to execute all of its +transactions or finalize state during the `Commit` method will prevent any new +transactions from being added to the mempool. Apps that are slow to commit will +prevent consensus from proceeded to the next consensus height since Tendermint +cannot validate validate block proposals or produce block proposals without the +AppHash obtained from the `Commit` method. #### ABCI serialization overhead causes slowdown @@ -132,7 +134,8 @@ The most common way to run a Tendermint application is using the Cosmos-SDK. The Cosmos-SDK runs the ABCI application within the same process as Tendermint. When an application is run in the same process as Tendermint, a serialization penalty is not paid. This is because the local ABCI client does not serialize method calls -and instead passes the protobuf type through through directly. +and instead passes the protobuf type through through directly. This can be seen +in [local_client.go][abci-local-client-code]. Serialization and deserialization in the gRPC and socket protocol ABCI methods may cause slowdown. While these may cause issue, they are not part of the primary @@ -153,14 +156,15 @@ likely invoked quite frequently on popular networks. Being able to perform effic on this common and critical operation is very important. The current JSON-RPC implementation relies heavily on type introspection via reflection, which is known to be very slow in Go. We should therefore produce benchmarks of this method to determine how much overhead -we are added to what is likely to be a very common operation. +we are adding to what, is likely to be, a very common operation. The other JSON-RPC methods are much less critical to the core functionality of Tendermint. While there may other points of performance consideration within the RPC, methods that do not receive high volumes of requests should not be prioritized for performance consideration. -NOTE: There is ongoing work to inspect and alter the JSON-RPC framework, so much of these -RPC-related performance considerations can either wait until that work is done or be +NOTE: Previous discussion of the RPC framework was done in [ADR 57][adr-57] and +there is ongoing work to inspect and alter the JSON-RPC framework in [RFC 002][rfc-002]. +Much of these RPC-related performance considerations can either wait until the work of RFC 002 work is done or be considered concordantly with the in-flight changes to the JSON-RPC. ### Protocol @@ -179,15 +183,44 @@ performance of the protocol. Given this that number of messages communicated has identified as a bottleneck, it would be extremely valuable to gather data on how long it takes for popular chains with many validators to gather all votes within a step. +Metrics that would improve visibility into this include: + +* Amount of time for a node to gather votes in a step. +* Number of votes each node sends to gossip (i.e. not its own votes, but votes it is +transmitting for a peer). +* Total number of votes each node sends to receives (A node may receive duplicate votes +so understanding how frequently this occurs will be valuable in evaluating the performance +of the gossip system). + #### Tx hashes Using a faster hash algorithm for Tx hashes is currently a point of discussion -in Tendermint. It is currently unknown if hashing transactions in the Mempool forms -a significant, although it does not appear to be documented as slow and the open -github issues do not show it as a frequent source of user pain. +in Tendermint. Namely, it is being considered as part of the [modular hashing proposal][modular-hashing]. +It is currently unknown if hashing transactions in the Mempool forms a significant bottleneck. +Although it does not appear to be documented as slow, there are a few open github +issues that indicate a possible user preference for a faster hashing algorithm, +including [issue 2187][issue-2187] and [issue 2186][issue-2186]. It is likely worth investigating what order of magnitude Tx hashing takes in comparison to other aspects of adding a Tx to the mempool. It is not currently clear if the rate of adding Tx to the mempool is a source of user pain. We should not endeavor to make large changes to consensus critical components without first being certain that the change is highly valuable and impactful. + +### Digital Signatures + +// TODO: (@wbanfield) + +### Tendermint Event System + +// TODO: (@wbanfield) + +### References +[modular-hashing]: https://github.com/tendermint/tendermint/pull/6773 +[issue-2186]: https://github.com/tendermint/tendermint/issues/2186 +[issue-2187]: https://github.com/tendermint/tendermint/issues/2187 +[rfc-002]: https://github.com/tendermint/tendermint/pull/6913 +[adr-57]: https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-057-RPC.md +[issue-1319]: https://github.com/tendermint/tendermint/issues/1319 +[abci-commit-description]: https://github.com/tendermint/spec/blob/master/spec/abci/apps.md#commit +[abci-local-client-code]: https://github.com/tendermint/tendermint/blob/511bd3eb7f037855a793a27ff4c53c12f085b570/abci/client/local_client.go#L84