mirror of
https://github.com/tendermint/tendermint.git
synced 2026-02-07 12:30:45 +00:00
Remove ADR and RFC docs from the v0.35.x backport branch. (#7866)
This commit is contained in:
@@ -1,47 +0,0 @@
|
||||
---
|
||||
order: 1
|
||||
parent:
|
||||
order: false
|
||||
---
|
||||
|
||||
# Requests for Comments
|
||||
|
||||
A Request for Comments (RFC) is a record of discussion on an open-ended topic
|
||||
related to the design and implementation of Tendermint Core, for which no
|
||||
immediate decision is required.
|
||||
|
||||
The purpose of an RFC is to serve as a historical record of a high-level
|
||||
discussion that might otherwise only be recorded in an ad hoc way (for example,
|
||||
via gists or Google docs) that are difficult to discover for someone after the
|
||||
fact. An RFC _may_ give rise to more specific architectural _decisions_ for
|
||||
Tendermint, but those decisions must be recorded separately in [Architecture
|
||||
Decision Records (ADR)](./../architecture).
|
||||
|
||||
As a rule of thumb, if you can articulate a specific question that needs to be
|
||||
answered, write an ADR. If you need to explore the topic and get input from
|
||||
others to know what questions need to be answered, an RFC may be appropriate.
|
||||
|
||||
## RFC Content
|
||||
|
||||
An RFC should provide:
|
||||
|
||||
- A **changelog**, documenting when and how the RFC has changed.
|
||||
- An **abstract**, briefly summarizing the topic so the reader can quickly tell
|
||||
whether it is relevant to their interest.
|
||||
- Any **background** a reader will need to understand and participate in the
|
||||
substance of the discussion (links to other documents are fine here).
|
||||
- The **discussion**, the primary content of the document.
|
||||
|
||||
The [rfc-template.md](./rfc-template.md) file includes placeholders for these
|
||||
sections.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [RFC-000: P2P Roadmap](./rfc-000-p2p-roadmap.rst)
|
||||
- [RFC-001: Storage Engines](./rfc-001-storage-engine.rst)
|
||||
- [RFC-002: Interprocess Communication](./rfc-002-ipc-ecosystem.md)
|
||||
- [RFC-003: Performance Taxonomy](./rfc-003-performance-questions.md)
|
||||
- [RFC-004: E2E Test Framework Enhancements](./rfc-004-e2e-framework.md)
|
||||
- [RFC-005: Event System](./rfc-005-event-system.rst)
|
||||
|
||||
<!-- - [RFC-NNN: Title](./rfc-NNN-title.md) -->
|
||||
@@ -1,316 +0,0 @@
|
||||
====================
|
||||
RFC 000: P2P Roadmap
|
||||
====================
|
||||
|
||||
Changelog
|
||||
---------
|
||||
|
||||
- 2021-08-20: Completed initial draft and distributed via a gist
|
||||
- 2021-08-25: Migrated as an RFC and changed format
|
||||
|
||||
Abstract
|
||||
--------
|
||||
|
||||
This document discusses the future of peer network management in Tendermint, with
|
||||
a particular focus on features, semantics, and a proposed roadmap.
|
||||
Specifically, we consider libp2p as a tool kit for implementing some fundamentals.
|
||||
|
||||
Background
|
||||
----------
|
||||
|
||||
For the 0.35 release cycle the switching/routing layer of Tendermint was
|
||||
replaced. This work was done "in place," and produced a version of Tendermint
|
||||
that was backward-compatible and interoperable with previous versions of the
|
||||
software. While there are new p2p/peer management constructs in the new
|
||||
version (e.g. ``PeerManager`` and ``Router``), the main effect of this change
|
||||
was to simplify the ways that other components within Tendermint interacted with
|
||||
the peer management layer, and to make it possible for higher-level components
|
||||
(specifically the reactors), to be used and tested more independently.
|
||||
|
||||
This refactoring, which was a major undertaking, was entirely necessary to
|
||||
enable areas for future development and iteration on this aspect of
|
||||
Tendermint. There are also a number of potential user-facing features that
|
||||
depend heavily on the p2p layer: additional transport protocols, transport
|
||||
compression, improved resilience to network partitions. These improvements to
|
||||
modularity, stability, and reliability of the p2p system will also make
|
||||
ongoing maintenance and feature development easier in the rest of Tendermint.
|
||||
|
||||
Critique of Current Peer-to-Peer Infrastructure
|
||||
---------------------------------------
|
||||
|
||||
The current (refactored) P2P stack is an improvement on the previous iteration
|
||||
(legacy), but as of 0.35, there remains room for improvement in the design and
|
||||
implementation of the P2P layer.
|
||||
|
||||
Some limitations of the current stack include:
|
||||
|
||||
- heavy reliance on buffering to avoid backups in the flow of components,
|
||||
which is fragile to maintain and can lead to unexpected memory usage
|
||||
patterns and forces the routing layer to make decisions about when messages
|
||||
should be discarded.
|
||||
|
||||
- the current p2p stack relies on convention (rather than the compiler) to
|
||||
enforce the API boundaries and conventions between reactors and the router,
|
||||
making it very easy to write "wrong" reactor code or introduce a bad
|
||||
dependency.
|
||||
|
||||
- the current stack is probably more complex and difficult to maintain because
|
||||
the legacy system must coexist with the new components in 0.35. When the
|
||||
legacy stack is removed there are some simple changes that will become
|
||||
possible and could reduce the complexity of the new system. (e.g. `#6598
|
||||
<https://github.com/tendermint/tendermint/issues/6598>`_.)
|
||||
|
||||
- the current stack encapsulates a lot of information about peers, and makes it
|
||||
difficult to expose that information to monitoring/observability tools. This
|
||||
general opacity also makes it difficult to interact with the peer system
|
||||
from other areas of the code base (e.g. tests, reactors).
|
||||
|
||||
- the legacy stack provided some control to operators to force the system to
|
||||
dial new peers or seed nodes or manipulate the topology of the system _in
|
||||
situ_. The current stack can't easily provide this, and while the new stack
|
||||
may have better behavior, it does leave operators hands tied.
|
||||
|
||||
Some of these issues will be resolved early in the 0.36 cycle, with the
|
||||
removal of the legacy components.
|
||||
|
||||
The 0.36 release also provides the opportunity to make changes to the
|
||||
protocol, as the release will not be compatible with previous releases.
|
||||
|
||||
Areas for Development
|
||||
---------------------
|
||||
|
||||
These sections describe features that may make sense to include in a Phase 2 of
|
||||
a P2P project.
|
||||
|
||||
Internal Message Passing
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Currently, there's no provision for intranode communication using the P2P
|
||||
layer, which means when two reactors need to interact with each other they
|
||||
have to have dependencies on each other's interfaces, and
|
||||
initialization. Changing these interactions (e.g. transitions between
|
||||
blocksync and consensus) from procedure calls to message passing.
|
||||
|
||||
This is a relatively simple change and could be implemented with the following
|
||||
components:
|
||||
|
||||
- a constant to represent "local" delivery as the ``To`` field on
|
||||
``p2p.Envelope``.
|
||||
|
||||
- special path for routing local messages that doesn't require message
|
||||
serialization (protobuf marshalling/unmarshaling).
|
||||
|
||||
Adding these semantics, particularly if in conjunction with synchronous
|
||||
semantics provides a solution to dependency graph problems currently present
|
||||
in the Tendermint codebase, which will simplify development, make it possible
|
||||
to isolate components for testing.
|
||||
|
||||
Eventually, this will also make it possible to have a logical Tendermint node
|
||||
running in multiple processes or in a collection of containers, although the
|
||||
usecase of this may be debatable.
|
||||
|
||||
Synchronous Semantics (Paired Request/Response)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
In the current system, all messages are sent with fire-and-forget semantics,
|
||||
and there's no coupling between a request sent via the p2p layer, and a
|
||||
response. These kinds of semantics would simplify the implementation of
|
||||
state and block sync reactors, and make intra-node message passing more
|
||||
powerful.
|
||||
|
||||
For some interactions, like gossiping transactions between the mempools of
|
||||
different nodes, fire-and-forget semantics make sense, but for other
|
||||
operations the missing link between requests/responses leads to either
|
||||
inefficiency when a node fails to respond or becomes unavailable, or code that
|
||||
is just difficult to follow.
|
||||
|
||||
To support this kind of work, the protocol would need to accommodate some kind
|
||||
of request/response ID to allow identifying out-of-order responses over a
|
||||
single connection. Additionally, expanded the programming model of the
|
||||
``p2p.Channel`` to accommodate some kind of _future_ or similar paradigm to
|
||||
make it viable to write reactor code without needing for the reactor developer
|
||||
to wrestle with lower level concurrency constructs.
|
||||
|
||||
|
||||
Timeout Handling (QoS)
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Currently, all timeouts, buffering, and QoS features are handled at the router
|
||||
layer, and the reactors are implemented in ways that assume/require
|
||||
asynchronous operation. This both increases the required complexity at the
|
||||
routing layer, and means that misbehavior at the reactor level is difficult to
|
||||
detect or attribute. Additionally, the current system provides three main
|
||||
parameters to control quality of service:
|
||||
|
||||
- buffer sizes for channels and queues.
|
||||
|
||||
- priorities for channels
|
||||
|
||||
- queue implementation details for shedding load.
|
||||
|
||||
These end up being quite coarse controls, and changing the settings are
|
||||
difficult because as the queues and channels are able to buffer large numbers
|
||||
of messages it can be hard to see the impact of a given change, particularly
|
||||
in our extant test environment. In general, we should endeavor to:
|
||||
|
||||
- set real timeouts, via contexts, on most message send operations, so that
|
||||
senders rather than queues can be responsible for timeout
|
||||
logic. Additionally, this will make it possible to avoid sending messages
|
||||
during shutdown.
|
||||
|
||||
- reduce (to the greatest extent possible) the amount of buffering in
|
||||
channels and the queues, to more readily surface backpressure and reduce the
|
||||
potential for buildup of stale messages.
|
||||
|
||||
Stream Based Connection Handling
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Currently the transport layer is message based, which makes sense from a
|
||||
mental model of how the protocol works, but makes it more difficult to
|
||||
implement transports and connection types, as it forces a higher level view of
|
||||
the connection and interaction which makes it harder to implement for novel
|
||||
transport types and makes it more likely that message-based caching and rate
|
||||
limiting will be implemented at the transport layer rather than at a more
|
||||
appropriate level.
|
||||
|
||||
The transport then, would be responsible for negotiating the connection and the
|
||||
handshake and otherwise behave like a socket/file descriptor with ``Read`` and
|
||||
``Write`` methods.
|
||||
|
||||
While this was included in the initial design for the new P2P layer, it may be
|
||||
obviated entirely if the transport and peer layer is replaced with libp2p,
|
||||
which is primarily stream based.
|
||||
|
||||
Service Discovery
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
In the current system, Tendermint assumes that all nodes in a network are
|
||||
largely equivalent, and nodes tend to be "chatty" making many requests of
|
||||
large numbers of peers and waiting for peers to (hopefully) respond. While
|
||||
this works and has allowed Tendermint to get to a certain point, this both
|
||||
produces a theoretical scaling bottle neck and makes it harder to test and
|
||||
verify components of the system.
|
||||
|
||||
In addition to peer's identity and connection information, peers should be
|
||||
able to advertise a number of services or capabilities, and node operators or
|
||||
developers should be able to specify peer capability requirements (e.g. target
|
||||
at least <x>-percent of peers with <y> capability.)
|
||||
|
||||
These capabilities may be useful in selecting peers to send messages to, it
|
||||
may make sense to extend Tendermint's message addressing capability to allow
|
||||
reactors to send messages to groups of peers based on role rather than only
|
||||
allowing addressing to one or all peers.
|
||||
|
||||
Having a good service discovery mechanism may pair well with the synchronous
|
||||
semantics (request/response) work, as it allows reactors to "make a request of
|
||||
a peer with <x> capability and wait for the response," rather force the
|
||||
reactors to need to track the capabilities or state of specific peers.
|
||||
|
||||
Solutions
|
||||
---------
|
||||
|
||||
Continued Homegrown Implementation
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The current peer system is homegrown and is conceptually compatible with the
|
||||
needs of the project, and while there are limitations to the system, the p2p
|
||||
layer is not (currently as of 0.35) a major source of bugs or friction during
|
||||
development.
|
||||
|
||||
However, the current implementation makes a number of allowances for
|
||||
interoperability, and there are a collection of iterative improvements that
|
||||
should be considered in the next couple of releases. To maintain the current
|
||||
implementation, upcoming work would include:
|
||||
|
||||
- change the ``Transport`` mechanism to facilitate easier implementations.
|
||||
|
||||
- implement different ``Transport`` handlers to be able to manage peer
|
||||
connections using different protocols (e.g. QUIC, etc.)
|
||||
|
||||
- entirely remove the constructs and implementations of the legacy peer
|
||||
implementation.
|
||||
|
||||
- establish and enforce clearer chains of responsibility for connection
|
||||
establishment (e.g. handshaking, setup,) which is currently shared between
|
||||
three components.
|
||||
|
||||
- report better metrics regarding the into the state of peers and network
|
||||
connectivity, which are opaque outside of the system. This is constrained at
|
||||
the moment as a side effect of the split responsibility for connection
|
||||
establishment.
|
||||
|
||||
- extend the PEX system to include service information so that nodes in the
|
||||
network weren't necessarily homogeneous.
|
||||
|
||||
While maintaining a bespoke peer management layer would seem to distract from
|
||||
development of core functionality, the truth is that (once the legacy code is
|
||||
removed,) the scope of the peer layer is relatively small from a maintenance
|
||||
perspective, and having control at this layer might actually afford the
|
||||
project with the ability to more rapidly iterate on some features.
|
||||
|
||||
LibP2P
|
||||
~~~~~~
|
||||
|
||||
LibP2P provides components that, approximately, account for the
|
||||
``PeerManager`` and ``Transport`` components of the current (new) P2P
|
||||
stack. The Go APIs seem reasonable, and being able to externalize the
|
||||
implementation details of peer and connection management seems like it could
|
||||
provide a lot of benefits, particularly in supporting a more active ecosystem.
|
||||
|
||||
In general the API provides the kind of stream-based, multi-protocol
|
||||
supporting, and idiomatic baseline for implementing a peer layer. Additionally
|
||||
because it handles peer exchange and connection management at a lower
|
||||
level, by using libp2p it'd be possible to remove a good deal of code in favor
|
||||
of just using libp2p. Having said that, Tendermint's P2P layer covers a
|
||||
greater scope (e.g. message routing to different peers) and that layer is
|
||||
something that Tendermint might want to retain.
|
||||
|
||||
The are a number of unknowns that require more research including how much of
|
||||
a peer database the Tendermint engine itself needs to maintain, in order to
|
||||
support higher level operations (consensus, statesync), but it might be the
|
||||
case that our internal systems need to know much less about peers than
|
||||
otherwise specified. Similarly, the current system has a notion of peer
|
||||
scoring that cannot be communicated to libp2p, which may be fine as this is
|
||||
only used to support peer exchange (PEX,) which would become a property libp2p
|
||||
and not expressed in it's current higher-level form.
|
||||
|
||||
In general, the effort to switch to libp2p would involve:
|
||||
|
||||
- timing it during an appropriate protocol-breaking window, as it doesn't seem
|
||||
viable to support both libp2p *and* the current p2p protocol.
|
||||
|
||||
- providing some in-memory testing network to support the use case that the
|
||||
current ``p2p.MemoryNetwork`` provides.
|
||||
|
||||
- re-homing the ``p2p.Router`` implementation on top of libp2p components to
|
||||
be able to maintain the current reactor implementations.
|
||||
|
||||
Open question include:
|
||||
|
||||
- how much local buffering should we be doing? It sort of seems like we should
|
||||
figure out what the expected behavior is for libp2p for QoS-type
|
||||
functionality, and if our requirements mean that we should be implementing
|
||||
this on top of things ourselves?
|
||||
|
||||
- if Tendermint was going to use libp2p, how would libp2p's stability
|
||||
guarantees (protocol, etc.) impact/constrain Tendermint's stability
|
||||
guarantees?
|
||||
|
||||
- what kind of introspection does libp2p provide, and to what extend would
|
||||
this change or constrain the kind of observability that Tendermint is able
|
||||
to provide?
|
||||
|
||||
- how do efforts to select "the best" (healthy, close, well-behaving, etc.)
|
||||
peers work out if Tendermint is not maintaining a local peer database?
|
||||
|
||||
- would adding additional higher level semantics (internal message passing,
|
||||
request/response pairs, service discovery, etc.) facilitate removing some of
|
||||
the direct linkages between constructs/components in the system and reduce
|
||||
the need for Tendermint nodes to maintain state about its peers?
|
||||
|
||||
References
|
||||
----------
|
||||
|
||||
- `Tracking Ticket for P2P Refactor Project <https://github.com/tendermint/tendermint/issues/5670>`_
|
||||
- `ADR 61: P2P Refactor Scope <../architecture/adr-061-p2p-refactor-scope.md>`_
|
||||
- `ADR 62: P2P Architecture and Abstraction <../architecture/adr-061-p2p-architecture.md>`_
|
||||
@@ -1,179 +0,0 @@
|
||||
===========================================
|
||||
RFC 001: Storage Engines and Database Layer
|
||||
===========================================
|
||||
|
||||
Changelog
|
||||
---------
|
||||
|
||||
- 2021-04-19: Initial Draft (gist)
|
||||
- 2021-09-02: Migrated to RFC folder, with some updates
|
||||
|
||||
Abstract
|
||||
--------
|
||||
|
||||
The aspect of Tendermint that's responsible for persistence and storage (often
|
||||
"the database" internally) represents a bottle neck in the architecture of the
|
||||
platform, that the 0.36 release presents a good opportunity to correct. The
|
||||
current storage engine layer provides a great deal of flexibility that is
|
||||
difficult for users to leverage or benefit from, while also making it harder
|
||||
for Tendermint Core developers to deliver improvements on storage engine. This
|
||||
RFC discusses the possible improvements to this layer of the system.
|
||||
|
||||
Background
|
||||
----------
|
||||
|
||||
Tendermint has a very thin common wrapper that makes Tendermint itself
|
||||
(largely) agnostic to the data storage layer (within the realm of the popular
|
||||
key-value/embedded databases.) This flexibility is not particularly useful:
|
||||
the benefits of a specific database engine in the context of Tendermint is not
|
||||
particularly well understood, and the maintenance burden for multiple backends
|
||||
is not commensurate with the benefit provided. Additionally, because the data
|
||||
storage layer is handled generically, and most tests run with an in-memory
|
||||
framework, it's difficult to take advantage of any higher-level features of a
|
||||
database engine.
|
||||
|
||||
Ideally, developers within Tendermint will be able to interact with persisted
|
||||
data via an interface that can function, approximately like an object
|
||||
store, and this storage interface will be able to accommodate all existing
|
||||
persistence workloads (e.g. block storage, local peer management information
|
||||
like the "address book", crash-recovery log like the WAL.) In addition to
|
||||
providing a more ergonomic interface and new semantics, by selecting a single
|
||||
storage engine tendermint can use native durability and atomicity features of
|
||||
the storage engine and simplify its own implementations.
|
||||
|
||||
Data Access Patterns
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Tendermint's data access patterns have the following characteristics:
|
||||
|
||||
- aggregate data size often exceeds memory.
|
||||
|
||||
- data is rarely mutated after it's written for most data (e.g. blocks), but
|
||||
small amounts of working data is persisted by nodes and is frequently
|
||||
mutated (e.g. peer information, validator information.)
|
||||
|
||||
- read patterns can be quite random.
|
||||
|
||||
- crash resistance and crash recovery, provided by write-ahead-logs (in
|
||||
consensus, and potentially for the mempool) should allow the system to
|
||||
resume work after an unexpected shut down.
|
||||
|
||||
Project Goals
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
As we think about replacing the current persistence layer, we should consider
|
||||
the following high level goals:
|
||||
|
||||
- drop dependencies on storage engines that have a CGo dependency.
|
||||
|
||||
- encapsulate data format and data storage from higher-level services
|
||||
(e.g. reactors) within tendermint.
|
||||
|
||||
- select a storage engine that does not incur any additional operational
|
||||
complexity (e.g. database should be embedded.)
|
||||
|
||||
- provide database semantics with sufficient ACID, snapshots, and
|
||||
transactional support.
|
||||
|
||||
Open Questions
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
The following questions remain:
|
||||
|
||||
- what kind of data-access concurrency does tendermint require?
|
||||
|
||||
- would tendermint users SDK/etc. benefit from some shared database
|
||||
infrastructure?
|
||||
|
||||
- In earlier conversations it seemed as if the SDK has selected Badger and
|
||||
RocksDB for their storage engines, and it might make sense to be able to
|
||||
(optionally) pass a handle to a Badger instance between the libraries in
|
||||
some cases.
|
||||
|
||||
- what are typical data sizes, and what kinds of memory sizes can we expect
|
||||
operators to be able to provide?
|
||||
|
||||
- in addition to simple persistence, what kind of additional semantics would
|
||||
tendermint like to enjoy (e.g. transactional semantics, unique constraints,
|
||||
indexes, in-place-updates, etc.)?
|
||||
|
||||
Decision Framework
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Given the constraint of removing the CGo dependency, the decision is between
|
||||
"badger" and "boltdb" (in the form of the etcd/CoreOS fork,) as low level. On
|
||||
top of this and somewhat orthogonally, we must also decide on the interface to
|
||||
the database and how the larger application will have to interact with the
|
||||
database layer. Users of the data layer shouldn't ever need to interact with
|
||||
raw byte slices from the database, and should mostly have the experience of
|
||||
interacting with Go-types.
|
||||
|
||||
Badger is more consistently developed and has a broader feature set than
|
||||
Bolt. At the same time, Badger is likely more memory intensive and may have
|
||||
more overhead in terms of open file handles given it's model. At first glance,
|
||||
Badger is the obvious choice: it's actively developed and it has a lot of
|
||||
features that could be useful. Bolt is not without some benefits: it's stable
|
||||
and is maintained by the etcd folks, it's simpler model (single memory mapped
|
||||
file, etc,) may be easier to reason about.
|
||||
|
||||
I propose that we consider the following specific questions about storage
|
||||
engines:
|
||||
|
||||
- does Badger's evolving development, which may result in data file format
|
||||
changes in the future, and could restrict our access to using the latest
|
||||
version of the library between major upgrades, present a problem?
|
||||
|
||||
- do we do we have goals/concerns about memory footprint that Badger may
|
||||
prevent us from hitting, particularly as data sets grow over time?
|
||||
|
||||
- what kind of additional tooling might we need/like to build (dump/restore,
|
||||
etc.)?
|
||||
|
||||
- do we want to run unit/integration tests against a data files on disk rather
|
||||
than relying exclusively on the memory database?
|
||||
|
||||
Project Scope
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
This project will consist of the following aspects:
|
||||
|
||||
- selecting a storage engine, and modifying the tendermint codebase to
|
||||
disallow any configuration of the storage engine outside of the tendermint.
|
||||
|
||||
- remove the dependency on the current tm-db interfaces and replace with some
|
||||
internalized, safe, and ergonomic interface for data persistence with all
|
||||
required database semantics.
|
||||
|
||||
- update core tendermint code to use the new interface and data tools.
|
||||
|
||||
Next Steps
|
||||
~~~~~~~~~~
|
||||
|
||||
- circulate the RFC, and discuss options with appropriate stakeholders.
|
||||
|
||||
- write brief ADR to summarize decisions around technical decisions reached
|
||||
during the RFC phase.
|
||||
|
||||
References
|
||||
----------
|
||||
|
||||
- `bolddb <https://github.com/etcd-io/bbolt>`_
|
||||
- `badger <https://github.com/dgraph-io/badger>`_
|
||||
- `badgerdb overview <https://dbdb.io/db/badgerdb>`_
|
||||
- `botldb overview <https://dbdb.io/db/boltdb>`_
|
||||
- `boltdb vs badger <https://tech.townsourced.com/post/boltdb-vs-badger>`_
|
||||
- `bolthold <https://github.com/timshannon/bolthold>`_
|
||||
- `badgerhold <https://github.com/timshannon/badgerhold>`_
|
||||
- `Pebble <https://github.com/cockroachdb/pebble>`_
|
||||
- `SDK Issue Regarding IVAL <https://github.com/cosmos/cosmos-sdk/issues/7100>`_
|
||||
- `SDK Discussion about SMT/IVAL <https://github.com/cosmos/cosmos-sdk/discussions/8297>`_
|
||||
|
||||
Discussion
|
||||
----------
|
||||
|
||||
- All things being equal, my tendency would be to use badger, with badgerhold
|
||||
(if that makes sense) for its ergonomics and indexing capabilities, which
|
||||
will require some small selection of wrappers for better write transaction
|
||||
support. This is a weakly held tendency/belief and I think it would be
|
||||
useful for the RFC process to build consensus (or not) around this basic
|
||||
assumption.
|
||||
@@ -1,420 +0,0 @@
|
||||
# RFC 002: Interprocess Communication (IPC) in Tendermint
|
||||
|
||||
## Changelog
|
||||
|
||||
- 08-Sep-2021: Initial draft (@creachadair).
|
||||
|
||||
|
||||
## Abstract
|
||||
|
||||
Communication in Tendermint among consensus nodes, applications, and operator
|
||||
tools all use different message formats and transport mechanisms. In some
|
||||
cases there are multiple options. Having all these options complicates both the
|
||||
code and the developer experience, and hides bugs. To support a more robust,
|
||||
trustworthy, and usable system, we should document which communication paths
|
||||
are essential, which could be removed or reduced in scope, and what we can
|
||||
improve for the most important use cases.
|
||||
|
||||
This document proposes a variety of possible improvements of varying size and
|
||||
scope. Specific design proposals should get their own documentation.
|
||||
|
||||
|
||||
## Background
|
||||
|
||||
The Tendermint state replication engine has a complex IPC footprint.
|
||||
|
||||
1. Consensus nodes communicate with each other using a networked peer-to-peer
|
||||
message-passing protocol.
|
||||
|
||||
2. Consensus nodes communicate with the application whose state is being
|
||||
replicated via the [Application BlockChain Interface (ABCI)][abci].
|
||||
|
||||
3. Consensus nodes export a network-accessible [RPC service][rpc-service] to
|
||||
support operations (bootstrapping, debugging) and synchronization of [light clients][light-client].
|
||||
This interface is also used by the [`tendermint` CLI][tm-cli].
|
||||
|
||||
4. Consensus nodes export a gRPC service exposing a subset of the methods of
|
||||
the RPC service described by (3). This was intended to simplify the
|
||||
implementation of tools that already use gRPC to communicate with an
|
||||
application (via the Cosmos SDK), and wanted to also talk to the consensus
|
||||
node without implementing yet another RPC protocol.
|
||||
|
||||
The gRPC interface to the consensus node has been deprecated and is slated
|
||||
for removal in the forthcoming Tendermint v0.36 release.
|
||||
|
||||
5. Consensus nodes may optionally communicate with a "remote signer" that holds
|
||||
a validator key and can provide public keys and signatures to the consensus
|
||||
node. One of the stated goals of this configuration is to allow the signer
|
||||
to be run on a private network, separate from the consensus node, so that a
|
||||
compromise of the consensus node from the public network would be less
|
||||
likely to expose validator keys.
|
||||
|
||||
## Discussion: Transport Mechanisms
|
||||
|
||||
### Remote Signer Transport
|
||||
|
||||
A remote signer communicates with the consensus node in one of two ways:
|
||||
|
||||
1. "Raw": Using a TCP or Unix-domain socket which carries varint-prefixed
|
||||
protocol buffer messages. In this mode, the consensus node is the server,
|
||||
and the remote signer is the client.
|
||||
|
||||
This mode has been deprecated, and is intended to be removed.
|
||||
|
||||
2. gRPC: This mode uses the same protobuf messages as "Raw" node, but uses a
|
||||
standard encrypted gRPC HTTP/2 stub as the transport. In this mode, the
|
||||
remote signer is the server and the consensus node is the client.
|
||||
|
||||
|
||||
### ABCI Transport
|
||||
|
||||
In ABCI, the _application_ is the server, and the Tendermint consensus engine
|
||||
is the client. Most applications implement the server using the [Cosmos SDK][cosmos-sdk],
|
||||
which handles low-level details of the ABCI interaction and provides a
|
||||
higher-level interface to the rest of the application. The SDK is written in Go.
|
||||
|
||||
Beneath the SDK, the application communicates with Tendermint core in one of
|
||||
two ways:
|
||||
|
||||
- In-process direct calls (for applications written in Go and compiled against
|
||||
the Tendermint code). This is an optimization for the common case where an
|
||||
application is written in Go, to save on the overhead of marshaling and
|
||||
unmarshaling requests and responses within the same process:
|
||||
[`abci/client/local_client.go`][local-client]
|
||||
|
||||
- A custom remote procedure protocol built on wire-format protobuf messages
|
||||
using a socket (the "socket protocol"): [`abci/server/socket_server.go`][socket-server]
|
||||
|
||||
The SDK also provides a [gRPC service][sdk-grpc] accessible from outside the
|
||||
application, allowing transactions to be broadcast to the network, look up
|
||||
transactions, and simulate transaction costs.
|
||||
|
||||
|
||||
### RPC Transport
|
||||
|
||||
The consensus node RPC service allows callers to query consensus parameters
|
||||
(genesis data, transactions, commits), node status (network info, health
|
||||
checks), application state (abci_query, abci_info), mempool state, and other
|
||||
attributes of the node and its application. The service also provides methods
|
||||
allowing transactions and evidence to be injected ("broadcast") into the
|
||||
blockchain.
|
||||
|
||||
The RPC service is exposed in several ways:
|
||||
|
||||
- HTTP GET: Queries may be sent as URI parameters, with method names in the path.
|
||||
|
||||
- HTTP POST: Queries may be sent as JSON-RPC request messages in the body of an
|
||||
HTTP POST request. The server uses a custom implementation of JSON-RPC that
|
||||
is not fully compatible with the [JSON-RPC 2.0 spec][json-rpc], but handles
|
||||
the common cases.
|
||||
|
||||
- Websocket: Queries may be sent as JSON-RPC request messages via a websocket.
|
||||
This transport uses more or less the same JSON-RPC plumbing as the HTTP POST
|
||||
handler.
|
||||
|
||||
The websocket endpoint also includes three methods that are _only_ exported
|
||||
via websocket, which appear to support event subscription.
|
||||
|
||||
- gRPC: A subset of queries may be issued in protocol buffer format to the gRPC
|
||||
interface described above under (4). As noted, this endpoint is deprecated
|
||||
and will be removed in v0.36.
|
||||
|
||||
### Opportunities for Simplification
|
||||
|
||||
**Claim:** There are too many IPC mechanisms.
|
||||
|
||||
The preponderance of ABCI usage is via the Cosmos SDK, which means the
|
||||
application and the consensus node are compiled together into a single binary,
|
||||
and the consensus node calls the ABCI methods of the application directly as Go
|
||||
functions.
|
||||
|
||||
We also need a true IPC transport to support ABCI applications _not_ written in
|
||||
Go. There are also several known applications written in Rust, for example
|
||||
(including [Anoma](https://github.com/anoma/anoma), Penumbra,
|
||||
[Oasis](https://github.com/oasisprotocol/oasis-core), Twilight, and
|
||||
[Nomic](https://github.com/nomic-io/nomic)). Ideally we will have at most one
|
||||
such transport "built-in": More esoteric cases can be handled by a custom proxy.
|
||||
Pragmatically, gRPC is probably the right choice here.
|
||||
|
||||
The primary consumers of the multi-headed "RPC service" today are the light
|
||||
client and the `tendermint` command-line client. There is probably some local
|
||||
use via curl, but I expect that is mostly ad hoc. Ethan reports that nodes are
|
||||
often configured with the ports to the RPC service blocked, which is good for
|
||||
security but complicates use by the light client.
|
||||
|
||||
### Context: Remote Signer Issues
|
||||
|
||||
Since the remote signer needs a secure communication channel to exchange keys
|
||||
and signatures, and is expected to run truly remotely from the node (i.e., on a
|
||||
separate physical server), there is not a whole lot we can do here. We should
|
||||
finish the deprecation and removal of the "raw" socket protocol between the
|
||||
consensus node and remote signers, but the use of gRPC is appropriate.
|
||||
|
||||
The main improvement we can make is to simplify the implementation quite a bit,
|
||||
once we no longer need to support both "raw" and gRPC transports.
|
||||
|
||||
### Context: ABCI Issues
|
||||
|
||||
In the original design of ABCI, the presumption was that all access to the
|
||||
application should be mediated by the consensus node. The idea is that outside
|
||||
access could change application state and corrupt the consensus process, which
|
||||
relies on the application to be deterministic. Of course, even without outside
|
||||
access an application could behave nondeterministically, but allowing other
|
||||
programs to send it requests was seen as courting trouble.
|
||||
|
||||
Conversely, users noted that most of the time, tools written for a particular
|
||||
application don't want to talk to the consensus module directly. The
|
||||
application "owns" the state machine the consensus engine is replicating, so
|
||||
tools that care about application state should talk to the application.
|
||||
Otherwise, they would have to bake in knowledge about Tendermint (e.g., its
|
||||
interfaces and data structures) just because of the mediation.
|
||||
|
||||
For clients to talk directly to the application, however, there is another
|
||||
concern: The consensus node is the ABCI _client_, so it is inconvenient for the
|
||||
application to "push" work into the consensus module via ABCI itself. The
|
||||
current implementation works around this by calling the consensus node's RPC
|
||||
service, which exposes an `ABCIQuery` kitchen-sink method that allows the
|
||||
application a way to poke ABCI messages in the other direction.
|
||||
|
||||
Without this RPC method, you could work around this (at least in principle) by
|
||||
having the consensus module "poll" the application for work that needs done,
|
||||
but that has unsatisfactory implications for performance and robustness, as
|
||||
well as being harder to understand.
|
||||
|
||||
There has apparently been discussion about trying to make a more bidirectional
|
||||
communication between the consensus node and the application, but this issue
|
||||
seems to still be unresolved.
|
||||
|
||||
Another complication of ABCI is that it requires the application (server) to
|
||||
maintain [four separate connections][abci-conn]: One for "consensus" operations
|
||||
(BeginBlock, EndBlock, DeliverTx, Commit), one for "mempool" operations, one
|
||||
for "query" operations, and one for "snapshot" (state synchronization) operations.
|
||||
The rationale seems to have been that these groups of operations should be able
|
||||
to proceed concurrently with each other. In practice, it results in a very complex
|
||||
state management problem to coordinate state updates between the separate streams.
|
||||
While application authors in Go are mostly insulated from that complexity by the
|
||||
Cosmos SDK, the plumbing to maintain those separate streams is complicated, hard
|
||||
to understand, and we suspect it contains concurrency bugs and/or lock contention
|
||||
issues affecting performance that are subtle and difficult to pin down.
|
||||
|
||||
Even without changing the semantics of any ABCI operations, this code could be
|
||||
made smaller and easier to debug by separating the management of concurrency
|
||||
and locking from the IPC transport: If all requests and responses are routed
|
||||
through one connection, the server can explicitly maintain priority queues for
|
||||
requests and responses, and make less-conservative decisions about when locks
|
||||
are (or aren't) required to synchronize state access. With independent queues,
|
||||
the server must lock conservatively, and no optimistic scheduling is practical.
|
||||
|
||||
This would be a tedious implementation change, but should be achievable without
|
||||
breaking any of the existing interfaces. More importantly, it could potentially
|
||||
address a lot of difficult concurrency and performance problems we currently
|
||||
see anecdotally but have difficultly isolating because of how intertwined these
|
||||
separate message streams are at runtime.
|
||||
|
||||
TODO: Impact of ABCI++ for this topic?
|
||||
|
||||
### Context: RPC Issues
|
||||
|
||||
The RPC system serves several masters, and has a complex surface area. I
|
||||
believe there are some improvements that can be exposed by separating some of
|
||||
these concerns.
|
||||
|
||||
The Tendermint light client currently uses the RPC service to look up blocks
|
||||
and transactions, and to forward ABCI queries to the application. The light
|
||||
client proxy uses the RPC service via a websocket. The Cosmos IBC relayer also
|
||||
uses the RPC service via websocket to watch for transaction events, and uses
|
||||
the `ABCIQuery` method to fetch information and proofs for posted transactions.
|
||||
|
||||
Some work is already underway toward using P2P message passing rather than RPC
|
||||
to synchronize light client state with the rest of the network. IBC relaying,
|
||||
however, requires access to the event system, which is currently not accessible
|
||||
except via the RPC interface. Event subscription _could_ be exposed via P2P,
|
||||
but that is a larger project since it adds P2P communication load, and might
|
||||
thus have an impact on the performance of consensus.
|
||||
|
||||
If event subscription can be moved into the P2P network, we could entirely
|
||||
remove the websocket transport, even for clients that still need access to the
|
||||
RPC service. Until then, we may still be able to reduce the scope of the
|
||||
websocket endpoint to _only_ event subscription, by moving uses of the RPC
|
||||
server as a proxy to ABCI over to the gRPC interface.
|
||||
|
||||
Having the RPC server still makes sense for local bootstrapping and operations,
|
||||
but can be further simplified. Here are some specific proposals:
|
||||
|
||||
- Remove the HTTP GET interface entirely.
|
||||
|
||||
- Simplify JSON-RPC plumbing to remove unnecessary reflection and wrapping.
|
||||
|
||||
- Remove the gRPC interface (this is already planned for v0.36).
|
||||
|
||||
- Separate the websocket interface from the rest of the RPC service, and
|
||||
restrict it to only event subscription.
|
||||
|
||||
Eventually we should try to emove the websocket interface entirely, but we
|
||||
will need to revisit that (probably in a new RFC) once we've done some of the
|
||||
easier things.
|
||||
|
||||
These changes would preserve the ability of operators to issue queries with
|
||||
curl (but would require using JSON-RPC instead of URI parameters). That would
|
||||
be a little less user-friendly, but for a use case that should not be that
|
||||
prevalent.
|
||||
|
||||
These changes would also preserve compatibility with existing JSON-RPC based
|
||||
code paths like the `tendermint` CLI and the light client (even ahead of
|
||||
further work to remove that dependency).
|
||||
|
||||
**Design goal:** An operator should be able to disable non-local access to the
|
||||
RPC server on any node in the network without impairing the ability of the
|
||||
network to function for service of state replication, including light clients.
|
||||
|
||||
**Design principle:** All communication required to implement and monitor the
|
||||
consensus network should use P2P, including the various synchronizations.
|
||||
|
||||
### Options for ABCI Transport
|
||||
|
||||
The majority of current usage is in Go, and the majority of that is mediated by
|
||||
the Cosmos SDK, which uses the "direct call" interface. There is probably some
|
||||
opportunity to clean up the implementation of that code, notably by inverting
|
||||
which interface is at the "top" of the abstraction stack (currently it acts
|
||||
like an RPC interface, and escape-hatches into the direct call). However, this
|
||||
general approach works fine and doesn't need to be fundamentally changed.
|
||||
|
||||
For applications _not_ written in Go, the two remaining options are the
|
||||
"socket" protocol (another variation on varint-prefixed protobuf messages over
|
||||
an unstructured stream) and gRPC. It would be nice if we could get rid of one
|
||||
of these to reduce (unneeded?) optionality.
|
||||
|
||||
Since both the socket protocol and gRPC depend on protocol buffers, the
|
||||
"socket" protocol is the most obvious choice to remove. While gRPC is more
|
||||
complex, the set of languages that _have_ protobuf support but _lack_ gRPC
|
||||
support is small. Moreover, gRPC is already widely used in the rest of the
|
||||
ecosystem (including the Cosmos SDK).
|
||||
|
||||
If some use case did arise later that can't work with gRPC, it would not be too
|
||||
difficult for that application author to write a little proxy (in Go) that
|
||||
bridges the convenient SDK APIs into a simpler protocol than gRPC.
|
||||
|
||||
**Design principle:** It is better for an uncommon special case to carry the
|
||||
burdens of its specialness, than to bake an escape hatch into the infrastructure.
|
||||
|
||||
**Recommendation:** We should deprecate and remove the socket protocol.
|
||||
|
||||
### Options for RPC Transport
|
||||
|
||||
[ADR 057][adr-57] proposes using gRPC for the Tendermint RPC implementation.
|
||||
This is still possible, but if we are able to simplify and decouple the
|
||||
concerns as described above, I do not think it should be necessary.
|
||||
|
||||
While JSON-RPC is not the best possible RPC protocol for all situations, it has
|
||||
some advantages over gRPC for our domain. Specifically:
|
||||
|
||||
- It is easy to call JSON-RPC manually from the command-line, which helps with
|
||||
a common concern for the RPC service, local debugging and operations.
|
||||
|
||||
Relatedly: JSON is relatively easy for humans to read and write, and it can
|
||||
be easily copied and pasted to share sample queries and debugging results in
|
||||
chat, issue comments, and so on. Ideally, the RPC service will not be used
|
||||
for activities where the costs of a text protocol are important compared to
|
||||
its legibility and manual usability benefits.
|
||||
|
||||
- gRPC has an enormous dependency footprint for both clients and servers, and
|
||||
many of the features it provides to support security and performance
|
||||
(encryption, compression, streaming, etc.) are mostly irrelevant to local
|
||||
use. Tendermint already needs to include a gRPC client for the remote signer,
|
||||
but if we can avoid the need for a _client_ to depend on gRPC, that is a win
|
||||
for usability.
|
||||
|
||||
- If we intend to migrate light clients off RPC to use P2P entirely, there is
|
||||
no advantage to forcing a temporary migration to gRPC along the way; and once
|
||||
the light client is not dependent on the RPC service, the efficiency of the
|
||||
protocol is much less important.
|
||||
|
||||
- We can still get the benefits of generated data types using protocol buffers, even
|
||||
without using gRPC:
|
||||
|
||||
- Protobuf defines a standard JSON encoding for all message types so
|
||||
languages with protobuf support do not need to worry about type mapping
|
||||
oddities.
|
||||
|
||||
- Using JSON means that even languages _without_ good protobuf support can
|
||||
implement the protocol with a bit more work, and I expect this situation to
|
||||
be rare.
|
||||
|
||||
Even if a language lacks a good standard JSON-RPC mechanism, the protocol is
|
||||
lightweight and can be implemented by simple send/receive over TCP or
|
||||
Unix-domain sockets with no need for code generation, encryption, etc. gRPC
|
||||
uses a complex HTTP/2 based transport that is not easily replicated.
|
||||
|
||||
### Future Work
|
||||
|
||||
The background and proposals sketched above focus on the existing structure of
|
||||
Tendermint and improvements we can make in the short term. It is worthwhile to
|
||||
also consider options for longer-term broader changes to the IPC ecosystem.
|
||||
The following outlines some ideas at a high level:
|
||||
|
||||
- **Consensus service:** Today, the application and the consensus node are
|
||||
nominally connected only via ABCI. Tendermint was originally designed with
|
||||
the assumption that all communication with the application should be mediated
|
||||
by the consensus node. Based on further experience, however, the design goal
|
||||
is now that the _application_ should be the mediator of application state.
|
||||
|
||||
As noted above, however, ABCI is a client/server protocol, with the
|
||||
application as the server. For outside clients that turns out to have been a
|
||||
good choice, but it complicates the relationship between the application and
|
||||
the consensus node: Previously transactions were entered via the node, now
|
||||
they are entered via the app.
|
||||
|
||||
We have worked around this by using the Tendermint RPC service to give the
|
||||
application a "back channel" to the consensus node, so that it can push
|
||||
transactions back into the consensus network. But the RPC service exposes a
|
||||
lot of other functionality, too, including event subscription, block and
|
||||
transaction queries, and a lot of node status information.
|
||||
|
||||
Even if we can't easily "fix" the orientation of the ABCI relationship, we
|
||||
could improve isolation by splitting out the parts of the RPC service that
|
||||
the application needs as a back-channel, and sharing those _only_ with the
|
||||
application. By defining a "consensus service", we could give the application
|
||||
a way to talk back limited to only the capabilities it needs. This approach
|
||||
has the benefit that we could do it without breaking existing use, and if we
|
||||
later did "fix" the ABCI directionality, we could drop the special case
|
||||
without disrupting the rest of the RPC interface.
|
||||
|
||||
- **Event service:** Right now, the IBC relayer relies on the Tendermint RPC
|
||||
service to provide a stream of block and transaction events, which it uses to
|
||||
discover which transactions need relaying to other chains. While I think
|
||||
that event subscription should eventually be handled via P2P, we could gain
|
||||
some immediate benefit by splitting out event subscription from the rest of
|
||||
the RPC service.
|
||||
|
||||
In this model, an event subscription service would be exposed on the public
|
||||
network, but on a different endpoint. This would remove the need for the RPC
|
||||
service to support the websocket protocol, and would allow operators to
|
||||
isolate potentially sensitive status query results from the public network.
|
||||
|
||||
At the moment the relayers also use the RPC service to get block data for
|
||||
synchronization, but work is already in progress to handle that concern via
|
||||
the P2P layer. Once that's done, event subscription could be separated.
|
||||
|
||||
Separating parts of the existing RPC service is not without cost: It might
|
||||
require additional connection endpoints, for example, though it is also not too
|
||||
difficult for multiple otherwise-independent services to share a connection.
|
||||
|
||||
In return, though, it would become easier to reduce transport options and for
|
||||
operators to independently control access to sensitive data. Considering the
|
||||
viability and implications of these ideas is beyond the scope of this RFC, but
|
||||
they are documented here since they follow from the background we have already
|
||||
discussed.
|
||||
|
||||
## References
|
||||
|
||||
[abci]: https://github.com/tendermint/spec/tree/95cf253b6df623066ff7cd4074a94e7a3f147c7a/spec/abci
|
||||
[rpc-service]: https://docs.tendermint.com/master/rpc/
|
||||
[light-client]: https://docs.tendermint.com/master/tendermint-core/light-client.html
|
||||
[tm-cli]: https://github.com/tendermint/tendermint/tree/master/cmd/tendermint
|
||||
[cosmos-sdk]: https://github.com/cosmos/cosmos-sdk/
|
||||
[local-client]: https://github.com/tendermint/tendermint/blob/master/abci/client/local_client.go
|
||||
[socket-server]: https://github.com/tendermint/tendermint/blob/master/abci/server/socket_server.go
|
||||
[sdk-grpc]: https://pkg.go.dev/github.com/cosmos/cosmos-sdk/types/tx#ServiceServer
|
||||
[json-rpc]: https://www.jsonrpc.org/specification
|
||||
[abci-conn]: https://github.com/tendermint/spec/blob/master/spec/abci/apps.md#state
|
||||
[adr-57]: https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-057-RPC.md
|
||||
@@ -1,283 +0,0 @@
|
||||
# RFC 003: Taxonomy of potential performance issues in Tendermint
|
||||
|
||||
## Changelog
|
||||
|
||||
- 2021-09-02: Created initial draft (@wbanfield)
|
||||
- 2021-09-14: Add discussion of the event system (@wbanfield)
|
||||
|
||||
## Abstract
|
||||
|
||||
This document discusses the various sources of performance issues in Tendermint and
|
||||
attempts to clarify what work may be required to understand and address them.
|
||||
|
||||
## Background
|
||||
|
||||
Performance, loosely defined as the ability of a software process to perform its work
|
||||
quickly and efficiently under load and within reasonable resource limits, is a frequent
|
||||
topic of discussion in the Tendermint project.
|
||||
To effectively address any issues with Tendermint performance we need to
|
||||
categorize the various issues, understand their potential sources, and gauge their
|
||||
impact on users.
|
||||
|
||||
Categorizing the different known performance issues will allow us to discuss and fix them
|
||||
more systematically. This document proposes a rough taxonomy of performance issues
|
||||
and highlights areas where more research into potential performance problems is required.
|
||||
|
||||
Understanding Tendermint's performance limitations will also be critically important
|
||||
as we make changes to many of its subsystems. Performance is a central concern for
|
||||
upcoming decisions regarding the `p2p` protocol, RPC message encoding and structure,
|
||||
database usage and selection, and consensus protocol updates.
|
||||
|
||||
|
||||
## Discussion
|
||||
|
||||
This section attempts to delineate the different sections of Tendermint functionality
|
||||
that are often cited as having performance issues. It raises questions and suggests
|
||||
lines of inquiry that may be valuable for better understanding Tendermint's performance issues.
|
||||
|
||||
As a note: We should avoid quickly adding many microbenchmarks or package level benchmarks.
|
||||
These are prone to being worse than useless as they can obscure what _should_ be
|
||||
focused on: performance of the system from the perspective of a user. We should,
|
||||
instead, tune performance with an eye towards user needs and actions users make. These users comprise
|
||||
both operators of Tendermint chains and the people generating transactions for
|
||||
Tendermint chains. Both of these sets of users are largely aligned in wanting an end-to-end
|
||||
system that operates quickly and efficiently.
|
||||
|
||||
REQUEST: The list below may be incomplete, if there are additional sections that are often
|
||||
cited as creating poor performance, please comment so that they may be included.
|
||||
|
||||
### P2P
|
||||
|
||||
#### Claim: Tendermint cannot scale to large numbers of nodes
|
||||
|
||||
A complaint has been reported that Tendermint networks cannot scale to large numbers of nodes.
|
||||
The listed number of nodes a user reported as causing issue was in the thousands.
|
||||
We don't currently have evidence about what the upper-limit of nodes that Tendermint's
|
||||
P2P stack can scale to.
|
||||
|
||||
We need to more concretely understand the source of issues and determine what layer
|
||||
is causing a problem. It's possible that the P2P layer, in the absence of any reactors
|
||||
sending data, is perfectly capable of managing thousands of peer connections. For
|
||||
a reasonable networking and application setup, thousands of connections should not present any
|
||||
issue for the application.
|
||||
|
||||
We need more data to understand the problem directly. We want to drive the popularity
|
||||
and adoption of Tendermint and this will mean allowing for chains with more validators.
|
||||
We should follow up with users experiencing this issue. We may then want to add
|
||||
a series of metrics to the P2P layer to better understand the inefficiencies it produces.
|
||||
|
||||
The following metrics can help us understand the sources of latency in the Tendermint P2P stack:
|
||||
|
||||
* Number of messages sent and received per second
|
||||
* Time of a message spent on the P2P layer send and receive queues
|
||||
|
||||
The following metrics exist and should be leveraged in addition to those added:
|
||||
|
||||
* Number of peers node's connected to
|
||||
* Number of bytes per channel sent and received from each peer
|
||||
|
||||
### Sync
|
||||
|
||||
#### Claim: Block Syncing is slow
|
||||
|
||||
Bootstrapping a new node in a network to the height of the rest of the network is believed to
|
||||
take longer than users would like. Block sync requires fetching all of the blocks from
|
||||
peers and placing them into the local disk for storage. A useful line of inquiry
|
||||
is understanding how quickly a perfectly tuned system _could_ fetch all of the state
|
||||
over a network so that we understand how much overhead Tendermint actually adds.
|
||||
|
||||
The operation is likely to be _incredibly_ dependent on the environment in which
|
||||
the node is being run. The factors that will influence syncing include:
|
||||
1. Number of peers that a syncing node may fetch from.
|
||||
2. Speed of the disk that a validator is writing to.
|
||||
3. Speed of the network connection between the different peers that node is
|
||||
syncing from.
|
||||
|
||||
We should calculate how quickly this operation _could possibly_ complete for common chains and nodes.
|
||||
To calculate how quickly this operation could possibly complete, we should assume that
|
||||
a node is reading at line-rate of the NIC and writing at the full drive speed to its
|
||||
local storage. Comparing this theoretical upper-limit to the actual sync times
|
||||
observed by node operators will give us a good point of comparison for understanding
|
||||
how much overhead Tendermint incurs.
|
||||
|
||||
We should additionally add metrics to the blocksync operation to more clearly pinpoint
|
||||
slow operations. The following metrics should be added to the block syncing operation:
|
||||
|
||||
* Time to fetch and validate each block
|
||||
* Time to execute a block
|
||||
* Blocks sync'd per unit time
|
||||
|
||||
### Application
|
||||
|
||||
Applications performing complex state transitions have the potential to bottleneck
|
||||
the Tendermint node.
|
||||
|
||||
#### Claim: ABCI block delivery could cause slowdown
|
||||
|
||||
ABCI delivers blocks in several methods: `BeginBlock`, `DeliverTx`, `EndBlock`, `Commit`.
|
||||
|
||||
Tendermint delivers transactions one-by-one via the `DeliverTx` call. Most of the
|
||||
transaction delivery in Tendermint occurs asynchronously and therefore appears unlikely to
|
||||
form a bottleneck in ABCI.
|
||||
|
||||
After delivering all transactions, Tendermint then calls the `Commit` ABCI method.
|
||||
Tendermint [locks all access to the mempool][abci-commit-description] while `Commit`
|
||||
proceeds. This means that an application that is slow to execute all of its
|
||||
transactions or finalize state during the `Commit` method will prevent any new
|
||||
transactions from being added to the mempool. Apps that are slow to commit will
|
||||
prevent consensus from proceeded to the next consensus height since Tendermint
|
||||
cannot validate block proposals or produce block proposals without the
|
||||
AppHash obtained from the `Commit` method. We should add a metric for each
|
||||
step in the ABCI protocol to track the amount of time that a node spends communicating
|
||||
with the application at each step.
|
||||
|
||||
#### Claim: ABCI serialization overhead causes slowdown
|
||||
|
||||
The most common way to run a Tendermint application is using the Cosmos-SDK.
|
||||
The Cosmos-SDK runs the ABCI application within the same process as Tendermint.
|
||||
When an application is run in the same process as Tendermint, a serialization penalty
|
||||
is not paid. This is because the local ABCI client does not serialize method calls
|
||||
and instead passes the protobuf type through directly. This can be seen
|
||||
in [local_client.go][abci-local-client-code].
|
||||
|
||||
Serialization and deserialization in the gRPC and socket protocol ABCI methods
|
||||
may cause slowdown. While these may cause issue, they are not part of the primary
|
||||
usecase of Tendermint and do not necessarily need to be addressed at this time.
|
||||
|
||||
### RPC
|
||||
|
||||
#### Claim: The Query API is slow.
|
||||
|
||||
The query API locks a mutex across the ABCI connections. This causes consensus to
|
||||
slow during queries, as ABCI is no longer able to make progress. This is known
|
||||
to be causing issue in the cosmos-sdk and is being addressed [in the sdk][sdk-query-fix]
|
||||
but a more robust solution may be required. Adding metrics to each ABCI client connection
|
||||
and message as described in the Application section of this document would allow us
|
||||
to further introspect the issue here.
|
||||
|
||||
#### Claim: RPC Serialization may cause slowdown
|
||||
|
||||
The Tendermint RPC uses a modified version of JSON-RPC. This RPC powers the `broadcast_tx_*` methods,
|
||||
which is a critical method for adding transactions to Tendermint at the moment. This method is
|
||||
likely invoked quite frequently on popular networks. Being able to perform efficiently
|
||||
on this common and critical operation is very important. The current JSON-RPC implementation
|
||||
relies heavily on type introspection via reflection, which is known to be very slow in
|
||||
Go. We should therefore produce benchmarks of this method to determine how much overhead
|
||||
we are adding to what, is likely to be, a very common operation.
|
||||
|
||||
The other JSON-RPC methods are much less critical to the core functionality of Tendermint.
|
||||
While there may other points of performance consideration within the RPC, methods that do not
|
||||
receive high volumes of requests should not be prioritized for performance consideration.
|
||||
|
||||
NOTE: Previous discussion of the RPC framework was done in [ADR 57][adr-57] and
|
||||
there is ongoing work to inspect and alter the JSON-RPC framework in [RFC 002][rfc-002].
|
||||
Much of these RPC-related performance considerations can either wait until the work of RFC 002 work is done or be
|
||||
considered concordantly with the in-flight changes to the JSON-RPC.
|
||||
|
||||
### Protocol
|
||||
|
||||
#### Claim: Gossiping messages is a slow process
|
||||
|
||||
Currently, for any validator to successfully vote in a consensus _step_, it must
|
||||
receive votes from greater than 2/3 of the validators on the network. In many cases,
|
||||
it's preferable to receive as many votes as possible from correct validators.
|
||||
|
||||
This produces a quadratic increase in messages that are communicated as more validators join the network.
|
||||
(Each of the N validators must communicate with all other N-1 validators).
|
||||
|
||||
This large number of messages communicated per step has been identified to impact
|
||||
performance of the protocol. Given that the number of messages communicated has been
|
||||
identified as a bottleneck, it would be extremely valuable to gather data on how long
|
||||
it takes for popular chains with many validators to gather all votes within a step.
|
||||
|
||||
Metrics that would improve visibility into this include:
|
||||
|
||||
* Amount of time for a node to gather votes in a step.
|
||||
* Amount of time for a node to gather all block parts.
|
||||
* Number of votes each node sends to gossip (i.e. not its own votes, but votes it is
|
||||
transmitting for a peer).
|
||||
* Total number of votes each node sends to receives (A node may receive duplicate votes
|
||||
so understanding how frequently this occurs will be valuable in evaluating the performance
|
||||
of the gossip system).
|
||||
|
||||
#### Claim: Hashing Txs causes slowdown in Tendermint
|
||||
|
||||
Using a faster hash algorithm for Tx hashes is currently a point of discussion
|
||||
in Tendermint. Namely, it is being considered as part of the [modular hashing proposal][modular-hashing].
|
||||
It is currently unknown if hashing transactions in the Mempool forms a significant bottleneck.
|
||||
Although it does not appear to be documented as slow, there are a few open github
|
||||
issues that indicate a possible user preference for a faster hashing algorithm,
|
||||
including [issue 2187][issue-2187] and [issue 2186][issue-2186].
|
||||
|
||||
It is likely worth investigating what order of magnitude Tx hashing takes in comparison to other
|
||||
aspects of adding a Tx to the mempool. It is not currently clear if the rate of adding Tx
|
||||
to the mempool is a source of user pain. We should not endeavor to make large changes to
|
||||
consensus critical components without first being certain that the change is highly
|
||||
valuable and impactful.
|
||||
|
||||
### Digital Signatures
|
||||
|
||||
#### Claim: Verification of digital signatures may cause slowdown in Tendermint
|
||||
|
||||
Working with cryptographic signatures can be computationally expensive. The cosmos
|
||||
hub uses [ed25519 signatures][hub-signature]. The library performing signature
|
||||
verification in Tendermint on votes is [benchmarked][ed25519-bench] to be able to perform an `ed25519`
|
||||
signature in 75μs on a decently fast CPU. A validator in the Cosmos Hub performs
|
||||
3 sets of verifications on the signatures of the 140 validators in the Hub
|
||||
in a consensus round, during block verification, when verifying the prevotes, and
|
||||
when verifying the precommits. With no batching, this would be roughly `3ms` per
|
||||
round. It is quite unlikely, therefore, that this accounts for any serious amount
|
||||
of the ~7 seconds of block time per height in the Hub.
|
||||
|
||||
This may cause slowdown when syncing, since the process needs to constantly verify
|
||||
signatures. It's possible that improved signature aggregation will lead to improved
|
||||
light client or other syncing performance. In general, a metric should be added
|
||||
to track block rate while blocksyncing.
|
||||
|
||||
#### Claim: Our use of digital signatures in the consensus protocol contributes to performance issue
|
||||
|
||||
Currently, Tendermint's digital signature verification requires that all validators
|
||||
receive all vote messages. Each validator must receive the complete digital signature
|
||||
along with the vote message that it corresponds to. This means that all N validators
|
||||
must receive messages from at least 2/3 of the N validators in each consensus
|
||||
round. Given the potential for oddly shaped network topologies and the expected
|
||||
variable network roundtrip times of a few hundred milliseconds in a blockchain,
|
||||
it is highly likely that this amount of gossiping is leading to a significant amount
|
||||
of the slowdown in the Cosmos Hub and in Tendermint consensus.
|
||||
|
||||
### Tendermint Event System
|
||||
|
||||
#### Claim: The event system is a bottleneck in Tendermint
|
||||
|
||||
The Tendermint Event system is used to communicate and store information about
|
||||
internal Tendermint execution. The system uses channels internally to send messages
|
||||
to different subscribers. Sending an event [blocks on the internal channel][event-send].
|
||||
The default configuration is to [use an unbuffered channel for event publishes][event-buffer-capacity].
|
||||
Several consumers of the event system also use an unbuffered channel for reads.
|
||||
An example of this is the [event indexer][event-indexer-unbuffered], which takes an
|
||||
unbuffered subscription to the event system. The result is that these unbuffered readers
|
||||
can cause writes to the event system to block or slow down depending on contention in the
|
||||
event system. This has implications for the consensus system, which [publishes events][consensus-event-send].
|
||||
To better understand the performance of the event system, we should add metrics to track the timing of
|
||||
event sends. The following metrics would be a good start for tracking this performance:
|
||||
|
||||
* Time in event send, labeled by Event Type
|
||||
* Time in event receive, labeled by subscriber
|
||||
* Event throughput, measured in events per unit time.
|
||||
|
||||
### References
|
||||
[modular-hashing]: https://github.com/tendermint/tendermint/pull/6773
|
||||
[issue-2186]: https://github.com/tendermint/tendermint/issues/2186
|
||||
[issue-2187]: https://github.com/tendermint/tendermint/issues/2187
|
||||
[rfc-002]: https://github.com/tendermint/tendermint/pull/6913
|
||||
[adr-57]: https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-057-RPC.md
|
||||
[issue-1319]: https://github.com/tendermint/tendermint/issues/1319
|
||||
[abci-commit-description]: https://github.com/tendermint/spec/blob/master/spec/abci/apps.md#commit
|
||||
[abci-local-client-code]: https://github.com/tendermint/tendermint/blob/511bd3eb7f037855a793a27ff4c53c12f085b570/abci/client/local_client.go#L84
|
||||
[hub-signature]: https://github.com/cosmos/gaia/blob/0ecb6ed8a244d835807f1ced49217d54a9ca2070/docs/resources/genesis.md#consensus-parameters
|
||||
[ed25519-bench]: https://github.com/oasisprotocol/curve25519-voi/blob/d2e7fc59fe38c18ca990c84c4186cba2cc45b1f9/PERFORMANCE.md
|
||||
[event-send]: https://github.com/tendermint/tendermint/blob/5bd3b286a2b715737f6d6c33051b69061d38f8ef/libs/pubsub/pubsub.go#L338
|
||||
[event-buffer-capacity]: https://github.com/tendermint/tendermint/blob/5bd3b286a2b715737f6d6c33051b69061d38f8ef/types/event_bus.go#L14
|
||||
[event-indexer-unbuffered]: https://github.com/tendermint/tendermint/blob/5bd3b286a2b715737f6d6c33051b69061d38f8ef/state/indexer/indexer_service.go#L39
|
||||
[consensus-event-send]: https://github.com/tendermint/tendermint/blob/5bd3b286a2b715737f6d6c33051b69061d38f8ef/internal/consensus/state.go#L1573
|
||||
[sdk-query-fix]: https://github.com/cosmos/cosmos-sdk/pull/10045
|
||||
@@ -1,213 +0,0 @@
|
||||
========================================
|
||||
RFC 004: E2E Test Framework Enhancements
|
||||
========================================
|
||||
|
||||
Changelog
|
||||
---------
|
||||
|
||||
- 2021-09-14: started initial draft (@tychoish)
|
||||
|
||||
Abstract
|
||||
--------
|
||||
|
||||
This document discusses a series of improvements to the e2e test framework
|
||||
that we can consider during the next few releases to help boost confidence in
|
||||
Tendermint releases, and improve developer efficiency.
|
||||
|
||||
Background
|
||||
----------
|
||||
|
||||
During the 0.35 release cycle, the E2E tests were a source of great
|
||||
value, helping to identify a number of bugs before release. At the same time,
|
||||
the tests were not consistently passing during this time, thereby reducing
|
||||
their value, and forcing the core development team to allocate time and energy
|
||||
to maintaining and chasing down issues with the e2e tests and the test
|
||||
harness. The experience of this release cycle calls to mind a series of
|
||||
improvements to the test framework, and this document attempts to capture
|
||||
these improvements, along with motivations, and potential for impact.
|
||||
|
||||
Projects
|
||||
--------
|
||||
|
||||
Flexible Workload Generation
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Presently the e2e suite contains a single workload generation pattern, which
|
||||
exists simply to ensure that the test networks have some work during their
|
||||
runs. However, the shape and volume of the work is very consistent and is very
|
||||
gentle to help ensure test reliability.
|
||||
|
||||
We don't need a complex workload generation framework, but being able to have
|
||||
a few different workload shapes available for test networks, both generated and
|
||||
hand-crafted, would be useful.
|
||||
|
||||
Workload patterns/configurations might include:
|
||||
|
||||
- transaction targeting patterns (include light nodes, round robin, target
|
||||
individual nodes)
|
||||
|
||||
- variable transaction size over time.
|
||||
|
||||
- transaction broadcast option (synchronously, checked, fire-and-forget,
|
||||
mixed).
|
||||
|
||||
- number of transactions to submit.
|
||||
|
||||
- non-transaction workloads: (evidence submission, query, event subscription.)
|
||||
|
||||
Configurable Generator
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The nightly e2e suite is defined by the `testnet generator
|
||||
<https://github.com/tendermint/tendermint/blob/master/test/e2e/generator/generate.go#L13-L65>`_,
|
||||
and it's difficult to add dimensions or change the focus of the test suite in
|
||||
any way without modifying the implementation of the generator. If the
|
||||
generator were more configurable, potentially via a file rather than in
|
||||
the Go implementation, we could modify the focus of the test suite on the
|
||||
fly.
|
||||
|
||||
Features that we might want to configure:
|
||||
|
||||
- number of test networks to generate of various topologies, to improve
|
||||
coverage of different configurations.
|
||||
|
||||
- test application configurations (to modify the latency of ABCI calls, etc.)
|
||||
|
||||
- size of test networks.
|
||||
|
||||
- workload shape and behavior.
|
||||
|
||||
- initial sync and catch-up configurations.
|
||||
|
||||
The workload generator currently provides runtime options for limiting the
|
||||
generator to specific types of P2P stacks, and for generating multiple groups
|
||||
of test cases to support parallelism. The goal is to extend this pattern and
|
||||
avoid hardcoding the matrix of test cases in the generator code. Once the
|
||||
testnet configuration generation behavior is configurable at runtime,
|
||||
developers may be able to use the e2e framework to validate changes before
|
||||
landing changes that break e2e tests a day later.
|
||||
|
||||
In addition to the autogenerated suite, it might make sense to maintain a
|
||||
small collection of hand-crafted cases that exercise configurations of
|
||||
concern, to run as part of the nightly (or less frequent) loop.
|
||||
|
||||
Implementation Plan Structure
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
As a development team, we should determine the features should impact the e2e
|
||||
testing early in the development cycle, and if we intend to modify the e2e
|
||||
tests to exercise a feature, we should identify this early and begin the
|
||||
integration process as early as possible.
|
||||
|
||||
To facilitate this, we should adopt a practice whereby we exercise specific
|
||||
features that are currently under development more rigorously in the e2e
|
||||
suite, and then as development stabilizes we can reduce the number or weight
|
||||
of these features in the suite.
|
||||
|
||||
As of 0.35 there are essentially two end to end tests: the suite of 64
|
||||
generated test networks, and the hand crafted `ci.toml` test case. The
|
||||
generated test cases help provide systemtic coverage, while the `ci` run
|
||||
provides coverage for a large number of features.
|
||||
|
||||
Reduce Cycle Time
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
One of the barriers to leveraging the e2e framework, and one of the challenges
|
||||
in debugging failures, is the cycle time of running a single test iteration is
|
||||
quite high: 5 minutes to build the docker image, plus the time to run the test
|
||||
or tests.
|
||||
|
||||
There are a number of improvements and enhancements that can reduce the cycle
|
||||
time in practice:
|
||||
|
||||
- reduce the amount of time required to build the docker image used in these
|
||||
tests. Without the dependency on CGo, the tendermint binaries could be
|
||||
(cross) compiled outside of the docker container and then injected into
|
||||
them, which would take better advantage of docker's native caching,
|
||||
although, without the dependency on CGo there would be no hard requirement
|
||||
for the e2e tests to use docker.
|
||||
|
||||
- support test parallelism. Because of the way the testnets are orchestrated
|
||||
a single system can really only run one network at a time. For executions
|
||||
(local or remote) with more resources, there's no reason to run a few
|
||||
networks in parallel to reduce the feedback time.
|
||||
|
||||
- prune testnet configurations that are unlikely to provide good signal, to
|
||||
shorten the time to feedback.
|
||||
|
||||
- apply some kind of tiered approach to test execution, to improve the
|
||||
legibility of the test result. For example order tests by the dependency of
|
||||
their features, or run test networks without perturbations before running
|
||||
that configuration with perturbations, to be able to isolate the impact of
|
||||
specific features.
|
||||
|
||||
- orchestrate the test harness directly from go test rather than via a special
|
||||
harness and shell scripts so e2e tests may more naively fit into developers
|
||||
existing workflows.
|
||||
|
||||
Many of these improvements, particularly, reducing the build time will also
|
||||
reduce the time to get feedback during automated builds.
|
||||
|
||||
Deeper Insights
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
When a test network fails, it's incredibly difficult to understand _why_ the
|
||||
network failed, as the current system provides very little insight into the
|
||||
system outside of the process logs. When a test network stalls or fails
|
||||
developers should be able to quickly and easily get a sense of the state of
|
||||
the network and all nodes.
|
||||
|
||||
Improvements in persuit of this goal, include functionality that would help
|
||||
node operators in production environments by improving the quality and utility
|
||||
of the logging messages and other reported metrics, but also provide some
|
||||
tools to collect and aggregate this data for developers in the context of test
|
||||
networks.
|
||||
|
||||
- Interleave messages from all nodes in the network to be able to correlate
|
||||
events during the test run.
|
||||
|
||||
- Collect structured metrics of the system operation (CPU/MEM/IO) during the
|
||||
test run, as well as from each tendermint/application process.
|
||||
|
||||
- Build (simple) tools to be able to render and summarize the data collected
|
||||
during the test run to answer basic questions about test outcome.
|
||||
|
||||
Flexible Assertions
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Currently, all assertions run for every test network, which makes the
|
||||
assertions pretty bland, and the framework primarily useful as a smoke-test
|
||||
framework, but it might be useful to be able to write and run different
|
||||
tests for different configurations. This could allow us to test outside of the
|
||||
happy-path.
|
||||
|
||||
In general our existing assertions occupy a fraction of the total test time,
|
||||
so the relative cost of adding a few extra test assertions would be of limited
|
||||
cost, and could help build confidence.
|
||||
|
||||
Additional Kinds of Testing
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The existing e2e suite, exercises networks of nodes that have homogeneous
|
||||
tendermint version, stable configuration, that are expected to make
|
||||
progress. There are many other possible test configurations that may be
|
||||
interesting to engage with. These could include dimensions, such as:
|
||||
|
||||
- Multi-version testing to exercise our compatibility guarantees for networks
|
||||
that might have different tendermint versions.
|
||||
|
||||
- As a flavor or mult-version testing, include upgrade testing, to build
|
||||
confidence in migration code and procedures.
|
||||
|
||||
- Additional test applications, particularly practical-type applciations
|
||||
including some that use gaiad and/or the cosmos-sdk. Test-only applications
|
||||
that simulate other kinds of applications (e.g. variable application
|
||||
operation latency.)
|
||||
|
||||
- Tests of "non-viable" configurations that ensure that forbidden combinations
|
||||
lead to halts.
|
||||
|
||||
References
|
||||
----------
|
||||
|
||||
- `ADR 66: End-to-End Testing <../architecture/adr-66-e2e-testing.md>`_
|
||||
@@ -1,122 +0,0 @@
|
||||
=====================
|
||||
RFC 005: Event System
|
||||
=====================
|
||||
|
||||
Changelog
|
||||
---------
|
||||
|
||||
- 2021-09-17: Initial Draft (@tychoish)
|
||||
|
||||
Abstract
|
||||
--------
|
||||
|
||||
The event system within Tendermint, which supports a lot of core
|
||||
functionality, also represents a major infrastructural liability. As part of
|
||||
our upcoming review of the RPC interfaces and our ongoing thoughts about
|
||||
stability and performance, as well as the preparation for Tendermint 1.0, we
|
||||
should revisit the design and implementation of the event system. This
|
||||
document discusses both the current state of the system and potential
|
||||
directions for future improvement.
|
||||
|
||||
Background
|
||||
----------
|
||||
|
||||
Current State of Events
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The event system makes it possible for clients, both internal and external,
|
||||
to receive notifications of state replication events, such as new blocks,
|
||||
new transactions, validator set changes, as well as intermediate events during
|
||||
consensus. Because the event system is very cross cutting, the behavior and
|
||||
performance of the event publication and subscription system has huge impacts
|
||||
for all of Tendermint.
|
||||
|
||||
The subscription service is exposed over the RPC interface, but also powers
|
||||
the indexing (e.g. to an external database,) and is the mechanism by which
|
||||
`BroadcastTxCommit` is able to wait for transactions to land in a block.
|
||||
|
||||
The current pubsub mechanism relies on a couple of buffered channels,
|
||||
primarily between all event creators and subscribers, but also for each
|
||||
subscription. The result of this design is that, in some situations with the
|
||||
right collection of slow subscription consumers the event system can put
|
||||
backpressure on the consensus state machine and message gossiping in the
|
||||
network, thereby causing nodes to lag.
|
||||
|
||||
Improvements
|
||||
~~~~~~~~~~~~
|
||||
|
||||
The current system relies on implicit, bounded queues built by the buffered channels,
|
||||
and though threadsafe, can force all activity within Tendermint to serialize,
|
||||
which does not need to happen. Additionally, timeouts for subscription
|
||||
consumers related to the implementation of the RPC layer, may complicate the
|
||||
use of the system.
|
||||
|
||||
References
|
||||
~~~~~~~~~~
|
||||
|
||||
- Legacy Implementation
|
||||
- `publication of events <https://github.com/tendermint/tendermint/blob/master/libs/pubsub/pubsub.go#L333-L345>`_
|
||||
- `send operation <https://github.com/tendermint/tendermint/blob/master/libs/pubsub/pubsub.go#L489-L527>`_
|
||||
- `send loop <https://github.com/tendermint/tendermint/blob/master/libs/pubsub/pubsub.go#L381-L402>`_
|
||||
- Related RFCs
|
||||
- `RFC 002: IPC Ecosystem <./rfc-002-ipc-ecosystem.md>`_
|
||||
- `RFC 003: Performance Questions <./rfc-003-performance-questions.md>`_
|
||||
|
||||
Discussion
|
||||
----------
|
||||
|
||||
Changes to Published Events
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
As part of this process, the Tendermint team should do a study of the existing
|
||||
event types and ensure that there are viable production use cases for
|
||||
subscriptions to all event types. Instinctively it seems plausible that some
|
||||
of the events may not be useable outside of tendermint, (e.g. ``TimeoutWait``
|
||||
or ``NewRoundStep``) and it might make sense to remove them. Certainly, it
|
||||
would be good to make sure that we don't maintain infrastructure for unused or
|
||||
un-useful message indefinitely.
|
||||
|
||||
Blocking Subscription
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The blocking subscription mechanism makes it possible to have *send*
|
||||
operations into the subscription channel be un-buffered (the event processing
|
||||
channel is still buffered.) In the blocking case, events from one subscription
|
||||
can block processing that event for other non-blocking subscriptions. The main
|
||||
case, it seems for blocking subscriptions is ensuring that a transaction has
|
||||
been committed to a block for ``BroadcastTxCommit``. Removing blocking
|
||||
subscriptions entirely, and potentially finding another way to implement
|
||||
``BroadcastTxCommit``, could lead to important simplifications and
|
||||
improvements to throughput without requiring large changes.
|
||||
|
||||
Subscription Identification
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Before `#6386 <https://github.com/tendermint/tendermint/pull/6386>`_, all
|
||||
subscriptions were identified by the combination of a client ID and a query,
|
||||
and with that change, it became possible to identify all subscription given
|
||||
only an ID, but compatibility with the legacy identification means that there's a
|
||||
good deal of legacy code as well as client side efficiency that could be
|
||||
improved.
|
||||
|
||||
Pubsub Changes
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
The pubsub core should be implemented in a way that removes the possibility of
|
||||
backpressure from the event system to impact the core system *or* for one
|
||||
subscription to impact the behavior of another area of the
|
||||
system. Additionally, because the current system is implemented entirely in
|
||||
terms of a collection of buffered channels, the event system (and large
|
||||
numbers of subscriptions) can be a source of memory pressure.
|
||||
|
||||
These changes could include:
|
||||
|
||||
- explicit cancellation and timeouts promulgated from callers (e.g. RPC end
|
||||
points, etc,) this should be done using contexts.
|
||||
|
||||
- subscription system should be able to spill to disk to avoid putting memory
|
||||
pressure on the core behavior of the node (consensus, gossip).
|
||||
|
||||
- subscriptions implemented as cursors rather than channels, with either
|
||||
condition variables to simulate the existing "push" API or a client side
|
||||
iterator API with some kind of long polling-type interface.
|
||||
@@ -1,35 +0,0 @@
|
||||
# RFC {RFC-NUMBER}: {TITLE}
|
||||
|
||||
## Changelog
|
||||
|
||||
- {date}: {changelog}
|
||||
|
||||
## Abstract
|
||||
|
||||
> A brief high-level synopsis of the topic of discussion for this RFC, ideally
|
||||
> just a few sentences. This should help the reader quickly decide whether the
|
||||
> rest of the discussion is relevant to their interest.
|
||||
|
||||
## Background
|
||||
|
||||
> Any context or orientation needed for a reader to understand and participate
|
||||
> in the substance of the Discussion. If necessary, this section may include
|
||||
> links to other documentation or sources rather than restating existing
|
||||
> material, but should provide enough detail that the reader can tell what they
|
||||
> need to read to be up-to-date.
|
||||
|
||||
### References
|
||||
|
||||
> Links to external materials needed to follow the discussion may be added here.
|
||||
>
|
||||
> In addition, if the discussion in a request for comments leads to any design
|
||||
> decisions, it may be helpful to add links to the ADR documents here after the
|
||||
> discussion has settled.
|
||||
|
||||
## Discussion
|
||||
|
||||
> This section contains the core of the discussion.
|
||||
>
|
||||
> There is no fixed format for this section, but ideally changes to this
|
||||
> section should be updated before merging to reflect any discussion that took
|
||||
> place on the PR that made those changes.
|
||||
Reference in New Issue
Block a user