diff --git a/docs/rfc/README.md b/docs/rfc/README.md
index 02bc3d9ca..3af5a33e8 100644
--- a/docs/rfc/README.md
+++ b/docs/rfc/README.md
@@ -40,6 +40,7 @@ sections.
 - [RFC-000: P2P Roadmap](./rfc-000-p2p-roadmap.rst)
 - [RFC-001: Storage Engines](./rfc-001-storage-engine.rst)
 - [RFC-002: Interprocess Communication](./rfc-002-ipc-ecosystem.md)
+- [RFC-003: Performance Taxonomy](./rfc-003-performance-questions.md)
 - [RFC-004: E2E Test Framework Enhancements](./rfc-004-e2e-framework.md)
diff --git a/docs/rfc/rfc-003-performance-questions.md b/docs/rfc/rfc-003-performance-questions.md
new file mode 100644
index 000000000..b8111ffa2
--- /dev/null
+++ b/docs/rfc/rfc-003-performance-questions.md
@@ -0,0 +1,283 @@
+# RFC 003: Taxonomy of potential performance issues in Tendermint
+
+## Changelog
+
+- 2021-09-02: Created initial draft (@wbanfield)
+- 2021-09-14: Add discussion of the event system (@wbanfield)
+
+## Abstract
+
+This document discusses the various sources of performance issues in Tendermint and
+attempts to clarify what work may be required to understand and address them.
+
+## Background
+
+Performance, loosely defined as the ability of a software process to perform its work
+quickly and efficiently under load and within reasonable resource limits, is a frequent
+topic of discussion in the Tendermint project.
+To effectively address any issues with Tendermint performance, we need to
+categorize the various issues, understand their potential sources, and gauge their
+impact on users.
+
+Categorizing the different known performance issues will allow us to discuss and fix them
+more systematically. This document proposes a rough taxonomy of performance issues
+and highlights areas where more research into potential performance problems is required.
+
+Understanding Tendermint's performance limitations will also be critically important
+as we make changes to many of its subsystems. Performance is a central concern for
+upcoming decisions regarding the `p2p` protocol, RPC message encoding and structure,
+database usage and selection, and consensus protocol updates.
+
+## Discussion
+
+This section attempts to delineate the different sections of Tendermint functionality
+that are often cited as having performance issues. It raises questions and suggests
+lines of inquiry that may be valuable for better understanding Tendermint's performance.
+
+As a note: we should avoid quickly adding many microbenchmarks or package-level benchmarks.
+These are prone to being worse than useless, as they can obscure what _should_ be
+focused on: performance of the system from the perspective of a user. We should,
+instead, tune performance with an eye towards user needs and the actions users take. These users comprise
+both operators of Tendermint chains and the people generating transactions for
+Tendermint chains. Both of these sets of users are largely aligned in wanting an end-to-end
+system that operates quickly and efficiently.
+
+REQUEST: The list below may be incomplete; if there are additional sections that are often
+cited as causing poor performance, please comment so that they may be included.
+
+### P2P
+
+#### Claim: Tendermint cannot scale to large numbers of nodes
+
+A complaint has been reported that Tendermint networks cannot scale to large numbers of nodes.
+The number of nodes a user reported as causing issues was in the thousands.
+We don't currently have evidence of the upper limit on the number of nodes that Tendermint's
+P2P stack can scale to.
+
+We need to more concretely understand the source of these issues and determine which layer
+is causing the problem. It's possible that the P2P layer, in the absence of any reactors
+sending data, is perfectly capable of managing thousands of peer connections. For
+a reasonable networking and application setup, thousands of connections should not present any
+issue for the application.
+
+We need more data to understand the problem directly. We want to drive the popularity
+and adoption of Tendermint, and this will mean allowing for chains with more validators.
+We should follow up with users experiencing this issue. We may then want to add
+a series of metrics to the P2P layer to better understand the inefficiencies it produces.
+
+The following metrics can help us understand the sources of latency in the Tendermint P2P stack
+(a rough sketch of how new metrics like these might be defined follows the second list below):
+
+* Number of messages sent and received per second
+* Time a message spends on the P2P layer send and receive queues
+
+The following metrics exist and should be leveraged in addition to those added:
+
+* Number of peers a node is connected to
+* Number of bytes per channel sent and received from each peer
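+
+Below is a minimal sketch of what the proposed additions could look like. It is illustrative
+only: the metric names and labels are invented for this document, and it uses the Prometheus
+Go client directly rather than the go-kit wrappers that Tendermint's existing metrics are
+built on.
+
+```go
+package p2p
+
+import "github.com/prometheus/client_golang/prometheus"
+
+var (
+    // Counts messages crossing the P2P layer, labeled by direction and channel.
+    p2pMessages = prometheus.NewCounterVec(prometheus.CounterOpts{
+        Namespace: "tendermint",
+        Subsystem: "p2p",
+        Name:      "messages_total",
+        Help:      "Number of messages sent and received, by direction and channel.",
+    }, []string{"direction", "channel"})
+
+    // Measures how long a message waits on the send or receive queue before it
+    // is handed to the transport or to a reactor.
+    p2pQueueTime = prometheus.NewHistogramVec(prometheus.HistogramOpts{
+        Namespace: "tendermint",
+        Subsystem: "p2p",
+        Name:      "queue_time_seconds",
+        Help:      "Time a message spends on the P2P send/receive queues.",
+    }, []string{"queue"})
+)
+
+func init() {
+    prometheus.MustRegister(p2pMessages, p2pQueueTime)
+}
+```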
+
+### Sync
+
+#### Claim: Block Syncing is slow
+
+Bootstrapping a new node in a network to the height of the rest of the network is believed to
+take longer than users would like. Block sync requires fetching all of the blocks from
+peers and writing them to local disk for storage. A useful line of inquiry
+is understanding how quickly a perfectly tuned system _could_ fetch all of the state
+over a network, so that we understand how much overhead Tendermint actually adds.
+
+The operation is likely to be _incredibly_ dependent on the environment in which
+the node is being run. The factors that will influence syncing include:
+
+1. Number of peers that a syncing node may fetch from.
+2. Speed of the disk that a validator is writing to.
+3. Speed of the network connection between the different peers that the node is
+syncing from.
+
+We should calculate how quickly this operation _could possibly_ complete for common chains and nodes.
+To do so, we should assume that a node is reading at the line rate of its NIC and writing
+at full drive speed to its local storage. Comparing this theoretical upper limit to the actual sync times
+observed by node operators will give us a good point of comparison for understanding
+how much overhead Tendermint incurs.
+
+We should additionally add metrics to the block sync operation to more clearly pinpoint
+slow operations. The following metrics should be added to the block syncing operation:
+
+* Time to fetch and validate each block
+* Time to execute a block
+* Blocks synced per unit time
+
+### Application
+
+Applications performing complex state transitions have the potential to bottleneck
+the Tendermint node.
+
+#### Claim: ABCI block delivery could cause slowdown
+
+ABCI delivers blocks through a series of method calls: `BeginBlock`, `DeliverTx`, `EndBlock`, and `Commit`.
+
+Tendermint delivers transactions one-by-one via the `DeliverTx` call. Most of the
+transaction delivery in Tendermint occurs asynchronously and therefore appears unlikely to
+form a bottleneck in ABCI.
+
+After delivering all transactions, Tendermint then calls the `Commit` ABCI method.
+Tendermint [locks all access to the mempool][abci-commit-description] while `Commit`
+proceeds. This means that an application that is slow to execute all of its
+transactions or finalize state during the `Commit` method will prevent any new
+transactions from being added to the mempool. Apps that are slow to commit will also
+prevent consensus from proceeding to the next consensus height, since Tendermint
+cannot validate or produce block proposals without the
+AppHash obtained from the `Commit` method. We should add a metric for each
+step in the ABCI protocol to track the amount of time that a node spends communicating
+with the application at each step.
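+
+A rough sketch of how such a per-step measurement could be collected is shown below. The
+helper and metric names are hypothetical, and a real change would live inside Tendermint's
+ABCI client; the sketch only illustrates the shape of the measurement.
+
+```go
+package abciclient
+
+import (
+    "time"
+
+    "github.com/prometheus/client_golang/prometheus"
+)
+
+// Hypothetical histogram of time spent waiting on the application, labeled by
+// ABCI method.
+var abciStepDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
+    Namespace: "tendermint",
+    Subsystem: "abci",
+    Name:      "method_duration_seconds",
+    Help:      "Time spent waiting for the application in each ABCI method.",
+}, []string{"method"})
+
+func init() { prometheus.MustRegister(abciStepDuration) }
+
+// timeABCIStep wraps a single ABCI call (e.g. BeginBlock, DeliverTx, Commit)
+// and records how long the application took to respond.
+func timeABCIStep(method string, call func() error) error {
+    start := time.Now()
+    err := call()
+    abciStepDuration.WithLabelValues(method).Observe(time.Since(start).Seconds())
+    return err
+}
+```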
+
+#### Claim: ABCI serialization overhead causes slowdown
+
+The most common way to run a Tendermint application is using the Cosmos SDK.
+The Cosmos SDK runs the ABCI application within the same process as Tendermint.
+When an application is run in the same process as Tendermint, a serialization penalty
+is not paid. This is because the local ABCI client does not serialize method calls
+and instead passes the protobuf type through directly. This can be seen
+in [local_client.go][abci-local-client-code].
+
+Serialization and deserialization in the gRPC and socket protocol ABCI methods
+may cause slowdown. While these may cause issues, they are not part of the primary
+use case of Tendermint and do not necessarily need to be addressed at this time.
+
+### RPC
+
+#### Claim: The Query API is slow
+
+The query API locks a mutex across the ABCI connections. This causes consensus to
+slow during queries, as ABCI is no longer able to make progress. This is known
+to be causing issues in the Cosmos SDK and is being addressed [in the SDK][sdk-query-fix],
+but a more robust solution may be required. Adding metrics to each ABCI client connection
+and message, as described in the Application section of this document, would allow us
+to further introspect the issue here.
+
+#### Claim: RPC Serialization may cause slowdown
+
+The Tendermint RPC uses a modified version of JSON-RPC. This RPC powers the `broadcast_tx_*` methods,
+which are currently the critical methods for adding transactions to Tendermint. These methods are
+likely invoked quite frequently on popular networks. Being able to perform efficiently
+on this common and critical operation is very important. The current JSON-RPC implementation
+relies heavily on type introspection via reflection, which is known to be very slow in
+Go. We should therefore produce benchmarks of this method to determine how much overhead
+we are adding to what is likely to be a very common operation.
+
+The other JSON-RPC methods are much less critical to the core functionality of Tendermint.
+While there may be other points of performance consideration within the RPC, methods that do not
+receive high volumes of requests should not be prioritized for performance work.
+
+NOTE: Previous discussion of the RPC framework was done in [ADR 57][adr-57] and
+there is ongoing work to inspect and alter the JSON-RPC framework in [RFC 002][rfc-002].
+Many of these RPC-related performance considerations can either wait until the RFC 002 work is done or be
+considered in concert with the in-flight changes to the JSON-RPC.
+
+### Protocol
+
+#### Claim: Gossiping messages is a slow process
+
+Currently, for any validator to successfully vote in a consensus _step_, it must
+receive votes from greater than 2/3 of the validators on the network. In many cases,
+it's preferable to receive as many votes as possible from correct validators.
+
+This produces a quadratic increase in messages that are communicated as more validators join the network
+(each of the N validators must communicate with all other N-1 validators).
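+
+As a rough illustration of that growth, the snippet below computes the naive number of
+point-to-point vote deliveries per step for a few validator-set sizes. It deliberately
+ignores gossip deduplication and batching, so it overstates real traffic and only shows
+the quadratic shape.
+
+```go
+package main
+
+import "fmt"
+
+// Naive upper bound on per-step vote traffic: each of the N validators sends
+// its vote to the other N-1 validators, for each of the two vote types
+// (prevote and precommit).
+func main() {
+    for _, n := range []int{50, 150, 500} {
+        perVoteType := n * (n - 1)
+        fmt.Printf("validators=%d  deliveries per vote type ~ %d (x2 vote types = %d)\n",
+            n, perVoteType, 2*perVoteType)
+    }
+}
+```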
+
+This large number of messages communicated per step has been identified as impacting the
+performance of the protocol. Given that the number of messages communicated has been
+identified as a bottleneck, it would be extremely valuable to gather data on how long
+it takes for popular chains with many validators to gather all votes within a step.
+
+Metrics that would improve visibility into this include:
+
+* Amount of time for a node to gather votes in a step.
+* Amount of time for a node to gather all block parts.
+* Number of votes each node sends to gossip (i.e., not its own votes, but votes it is
+transmitting for a peer).
+* Total number of votes each node receives (a node may receive duplicate votes,
+so understanding how frequently this occurs will be valuable in evaluating the performance
+of the gossip system).
+
+#### Claim: Hashing Txs causes slowdown in Tendermint
+
+Using a faster hash algorithm for Tx hashes is currently a point of discussion
+in Tendermint. Namely, it is being considered as part of the [modular hashing proposal][modular-hashing].
+It is currently unknown if hashing transactions in the mempool forms a significant bottleneck.
+Although it does not appear to be documented as slow, there are a few open GitHub
+issues that indicate a possible user preference for a faster hashing algorithm,
+including [issue 2187][issue-2187] and [issue 2186][issue-2186].
+
+It is likely worth investigating what order of magnitude of time Tx hashing takes in comparison to other
+aspects of adding a Tx to the mempool. It is not currently clear if the rate of adding Txs
+to the mempool is a source of user pain. We should not endeavor to make large changes to
+consensus-critical components without first being certain that the change is highly
+valuable and impactful.
+
+### Digital Signatures
+
+#### Claim: Verification of digital signatures may cause slowdown in Tendermint
+
+Working with cryptographic signatures can be computationally expensive. The Cosmos
+Hub uses [ed25519 signatures][hub-signature]. The library performing signature
+verification on votes in Tendermint is [benchmarked][ed25519-bench] to be able to verify an `ed25519`
+signature in roughly 75μs on a decently fast CPU. A validator in the Cosmos Hub performs
+3 sets of verifications on the signatures of the 140 validators in the Hub
+in a consensus round: during block verification, when verifying the prevotes, and
+when verifying the precommits. With no batching, this comes to roughly `30ms` per
+round (3 × 140 × 75μs). It is quite unlikely, therefore, that this accounts for any serious amount
+of the ~7 seconds of block time per height in the Hub.
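+
+A simple way to sanity-check the per-signature cost on a particular machine is a standard Go
+benchmark. The sketch below uses the standard library's `crypto/ed25519` as a stand-in for
+illustration; Tendermint's own ed25519 package (and the batching mentioned above) will
+produce different numbers.
+
+```go
+package signing
+
+// This benchmark is intended to live in a *_test.go file and be run with
+// `go test -bench=Ed25519Verify`.
+
+import (
+    "crypto/ed25519"
+    "crypto/rand"
+    "testing"
+)
+
+// BenchmarkEd25519Verify measures single-signature verification cost using the
+// standard library implementation.
+func BenchmarkEd25519Verify(b *testing.B) {
+    pub, priv, err := ed25519.GenerateKey(rand.Reader)
+    if err != nil {
+        b.Fatal(err)
+    }
+    msg := []byte("canonical vote bytes would go here")
+    sig := ed25519.Sign(priv, msg)
+
+    b.ResetTimer()
+    for i := 0; i < b.N; i++ {
+        if !ed25519.Verify(pub, msg, sig) {
+            b.Fatal("verification failed")
+        }
+    }
+}
+```
+
+Multiplying the measured per-verification time by 3 × (number of validators) gives a rough
+per-round cost to compare against observed block times.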
+
+This may cause slowdown when syncing, since the process needs to constantly verify
+signatures. It's possible that improved signature aggregation will lead to improved
+light client or other syncing performance. In general, a metric should be added
+to track block rate while block syncing.
+
+#### Claim: Our use of digital signatures in the consensus protocol contributes to performance issues
+
+Currently, Tendermint's digital signature verification requires that all validators
+receive all vote messages. Each validator must receive the complete digital signature
+along with the vote message that it corresponds to. This means that all N validators
+must receive messages from at least 2/3 of the N validators in each consensus
+round. Given the potential for oddly shaped network topologies and the expected
+variable network round-trip times of a few hundred milliseconds in a blockchain,
+it is highly likely that this amount of gossiping is leading to a significant amount
+of the slowdown in the Cosmos Hub and in Tendermint consensus.
+
+### Tendermint Event System
+
+#### Claim: The event system is a bottleneck in Tendermint
+
+The Tendermint event system is used to communicate and store information about
+internal Tendermint execution. The system uses channels internally to send messages
+to different subscribers. Sending an event [blocks on the internal channel][event-send].
+The default configuration is to [use an unbuffered channel for event publishes][event-buffer-capacity].
+Several consumers of the event system also use an unbuffered channel for reads.
+An example of this is the [event indexer][event-indexer-unbuffered], which takes an
+unbuffered subscription to the event system. The result is that these unbuffered readers
+can cause writes to the event system to block or slow down, depending on contention in the
+event system. This has implications for the consensus system, which [publishes events][consensus-event-send].
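+
+The blocking behaviour is easy to reproduce in isolation. The snippet below is not the
+Tendermint pubsub API; it is a generic illustration of how a single unbuffered (or slow)
+subscriber stalls a publisher, which is the failure mode described above.
+
+```go
+package main
+
+import (
+    "fmt"
+    "time"
+)
+
+// A publisher sending events to subscribers over channels. With an unbuffered
+// subscriber channel, every send blocks until the subscriber is ready to
+// receive, so one slow consumer throttles the producer.
+func main() {
+    unbuffered := make(chan string)    // subscriber with no buffer
+    buffered := make(chan string, 100) // subscriber with headroom
+
+    go func() { // slow subscriber draining the unbuffered channel
+        for range unbuffered {
+            time.Sleep(10 * time.Millisecond) // simulate slow indexing
+        }
+    }()
+
+    start := time.Now()
+    for i := 0; i < 10; i++ {
+        unbuffered <- fmt.Sprintf("event-%d", i) // blocks on the slow reader
+    }
+    fmt.Println("unbuffered publish took:", time.Since(start))
+
+    start = time.Now()
+    for i := 0; i < 10; i++ {
+        buffered <- fmt.Sprintf("event-%d", i) // returns immediately while buffer has space
+    }
+    fmt.Println("buffered publish took:  ", time.Since(start))
+}
+```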
+
+To better understand the performance of the event system, we should add metrics to track the timing of
+event sends. The following metrics would be a good start for tracking this performance:
+
+* Time in event send, labeled by event type
+* Time in event receive, labeled by subscriber
+* Event throughput, measured in events per unit time
+
+### References
+
+[modular-hashing]: https://github.com/tendermint/tendermint/pull/6773
+[issue-2186]: https://github.com/tendermint/tendermint/issues/2186
+[issue-2187]: https://github.com/tendermint/tendermint/issues/2187
+[rfc-002]: https://github.com/tendermint/tendermint/pull/6913
+[adr-57]: https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-057-RPC.md
+[issue-1319]: https://github.com/tendermint/tendermint/issues/1319
+[abci-commit-description]: https://github.com/tendermint/spec/blob/master/spec/abci/apps.md#commit
+[abci-local-client-code]: https://github.com/tendermint/tendermint/blob/511bd3eb7f037855a793a27ff4c53c12f085b570/abci/client/local_client.go#L84
+[hub-signature]: https://github.com/cosmos/gaia/blob/0ecb6ed8a244d835807f1ced49217d54a9ca2070/docs/resources/genesis.md#consensus-parameters
+[ed25519-bench]: https://github.com/oasisprotocol/curve25519-voi/blob/d2e7fc59fe38c18ca990c84c4186cba2cc45b1f9/PERFORMANCE.md
+[event-send]: https://github.com/tendermint/tendermint/blob/5bd3b286a2b715737f6d6c33051b69061d38f8ef/libs/pubsub/pubsub.go#L338
+[event-buffer-capacity]: https://github.com/tendermint/tendermint/blob/5bd3b286a2b715737f6d6c33051b69061d38f8ef/types/event_bus.go#L14
+[event-indexer-unbuffered]: https://github.com/tendermint/tendermint/blob/5bd3b286a2b715737f6d6c33051b69061d38f8ef/state/indexer/indexer_service.go#L39
+[consensus-event-send]: https://github.com/tendermint/tendermint/blob/5bd3b286a2b715737f6d6c33051b69061d38f8ef/internal/consensus/state.go#L1573
+[sdk-query-fix]: https://github.com/cosmos/cosmos-sdk/pull/10045
diff --git a/internal/consensus/state.go b/internal/consensus/state.go
index 6379b71d5..16f2bb0bd 100644
--- a/internal/consensus/state.go
+++ b/internal/consensus/state.go
@@ -137,7 +137,7 @@ type State struct {
 	done chan struct{}
 
 	// synchronous pubsub between consensus state and reactor.
-	// state only emits EventNewRoundStep and EventVote
+	// state only emits EventNewRoundStep, EventValidBlock, and EventVote
 	evsw tmevents.EventSwitch
 
 	// for reporting metrics