docs/rfc: add testnet RFC (#9124)

* docs/rfc: add testnet RFC

Following several discussions internal to the Tendermint engineering
team, I am posting an RFC discussing the high-level details of the
Tendermint team owning and operating a long-lived testnet in order to
build experience running Tendermint, and to demonstrate that Tendermint
is stable under production workloads.

The outcome of this RFC will be a new track of work to begin building
and maintaining a testnet associated with the main branch of tendermint.
See the "Testnet MVP" section specifically for some of the first
milestones.

Note, I added the RFC where it would live once #9115 is merged to
restore the RFC layout from the v0.36.x branch. docs/rfc/README.md will
need to be updated to include this RFC once #9115 is merged.

This RFC is related to #9078.

* docs/rfc: minor updates to testnet rfc

* docs/rfc: respond to more feedback on testnet RFC

* docs/rfc: add RFC 023 to rfc index
Mark Rushakoff
2022-08-01 11:33:04 -04:00
committed by GitHub
parent 07d242c461
commit b6a515a818
2 changed files with 264 additions and 0 deletions


@@ -59,5 +59,6 @@ sections.
- [RFC-019: Configuration File Versioning](./rfc-019-config-version.md)
- [RFC-020: Onboarding Projects](./rfc-020-onboarding-projects.rst)
- [RFC-021: The Future of the Socket Protocol](./rfc-021-socket-protocol.md)
- [RFC-023: Semi-permanent Testnet](./rfc-023-semi-permanent-testnet.md)
<!-- - [RFC-NNN: Title](./rfc-NNN-title.md) -->


@@ -0,0 +1,263 @@
# RFC 023: Semi-permanent Testnet
## Changelog
- 2022-07-28: Initial draft (@mark-rushakoff)
- 2022-07-29: Renumber to 023, minor clarifications (@mark-rushakoff)
## Abstract
This RFC discusses a long-lived testnet, owned and operated by the Tendermint engineers.
By owning and operating a production-like testnet,
the team who develops Tendermint becomes more capable of discovering bugs that
only arise in production-like environments.
They also build expertise in operating Tendermint;
this will help guide the development of Tendermint towards operator-friendly design.
The RFC details a rough roadmap towards a semi-permanent testnet, some of the considered tradeoffs,
and the expected outcomes from following this roadmap.
## Background
The author's understanding -- which is limited as a new contributor to the Tendermint project --
is that Tendermint has largely been developed as a library for other projects to consume.
Of course, effort has been spent on unit tests, end-to-end tests, and integration tests.
But whether developing a library or an application,
there is no substitute for putting the software under a production-like load.
First, there are classes of bugs that are unrealistic to discover in environments
that do not resemble production.
But perhaps more importantly, there are "operational features" that are best designed
by the authors of a given piece of software.
For instance, does the software have sufficient observability built-in?
Are the reported metrics useful?
Are the log messages clear and sufficiently detailed, without being too noisy?
Furthermore, if the library authors are not only building --
but also maintaining and operating -- an application built on top of their library,
the authors will have greatly increased confidence that their library's API
is appropriate for other application authors.
Once the decision has been made to run and operate a service,
one of the next strategic questions is that of deploying said service.
The author strongly holds the opinion that, when possible,
a continuous delivery model offers the most compelling set of advantages:
- The code on a particular branch (likely `main` or `master`) is exactly what is,
or what will very soon be, running in production
- There are no manual steps involved in deploying -- other than merging your pull request,
which you had to do anyway
- A bug discovered in production can be rapidly confirmed as fixed in production
In summary, if the Tendermint authors build, maintain, and continuously deliver an application
intended to serve as a long-lived testnet, they will be able to state with confidence:
- We operate the software in a production-like environment and we have observed it to be
stable and performant to our requirements
- We have discovered issues in production before any external parties have consumed our software,
and we have addressed said issues
- We have successfully used the observability tooling built into our software
(perhaps in conjunction with other off-the-shelf tooling)
to diagnose and debug issues in production
## Discussion
This section discusses several aspects of maintaining a testnet for Tendermint.
### Number of testnets
There should probably be one testnet per maintained branch of Tendermint,
i.e. one for the `main` branch
and one per `v0.N.x` branch that the authors maintain.
There may also exist testnets for long-lived feature branches.
We may eventually discover that there is good reason to run more than one testnet for a branch,
perhaps due to a significant configuration variation.
### Testnet lifecycle
The document has used the terms "long-lived" and "semi-permanent" somewhat interchangeably.
The intent of the testnet being discussed in this RFC is to exist indefinitely;
but there is a practical understanding that there will be testnet instances
which will be retired for a variety of reasons.
For instance, once a release branch is no longer supported,
its corresponding testnet should be torn down.
In general, new commits to branches with corresponding testnets
should result in an in-place upgrade of all nodes in the testnet
without any data loss and without requiring new configuration.
The mechanism for achieving this is outside the scope of this RFC.
However, it is also expected that there will be
breaking changes during the development of the `main` branch.
For instance, suppose there is an unreleased feature involving storage on disk,
and the developers need to change the storage format.
It should be at the developers' discretion whether it is feasible and worthwhile
to introduce an intermediate commit that translates the old format to the new format,
or if it would be preferable to just destroy the testnet and start from scratch
without any data in the old format.
Similarly, if a developer inadvertently pushed a breaking change to an unreleased feature,
they are free to make a judgement call between reverting the change,
adding a commit to allow a forward migration,
or simply forcing the testnet to be recreated.
### Testnet maintenance investment
While there is certainly engineering effort required to build the tooling and infrastructure
to get the testnets up and running,
the intent is that a running testnet requires no manual upkeep under normal conditions.
It is expected that a subset of the Tendermint engineers are familiar with and engaged in
writing the software to maintain and build the testnet infrastructure,
but the rest of the team should not need any involvement in authoring that code.
The testnets should be configured to send notifications for events requiring triage,
such as a chain halt or a node OOMing.
The time investment necessary to address the underlying issues for those kinds of events
is unpredictable.
Aside from triaging exceptional events, an engineer may choose to spend some time
collecting metrics or profiles from testnet nodes to check performance details
before and after a particular change;
or they may inspect logs associated with an expected behavior change.
But during day-to-day work, engineers are not expected to spend any considerable time
directly interacting with the testnets.
If we discover that there are any routine actions engineers must take against the testnet
that take any substantial focused time,
those actions should be automated to a one-line command as much as is reasonable.
### Testnet MVP
The minimum viable testnet meets the following requirements:
- The testnet self-updates following a new commit pushed to Tendermint's `main` branch on GitHub
(there are some omitted steps here, such as CI building appropriate binaries and
somehow notifying the testnet that a new build is available)
- The testnet runs the Tendermint KV store for MVP
- The testnet operators are notified if:
  - Any node's process exits for any reason other than a restart for a new binary
  - Any node stops updating blocks, and by extension if a chain halt occurs
- No other observability will be considered for MVP
- The testnet has a minimum of 1 full node and 3 validators
- The testnet has a reasonably low, constant throughput of transactions -- say 30 tx/min --
and the testnet operators are notified if that throughput drops below 75% of target
sustained over 5 minutes
- The testnet only needs to run in a single datacenter/cloud-region for MVP,
i.e. running in multiple datacenters is out of scope for MVP
- The testnet is running directly on VMs or compute instances;
while Kubernetes or other orchestration frameworks may offer many significant advantages,
the Tendermint engineers should not be required to learn those tools in order to
perform basic debugging
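
To make the throughput condition in the list above concrete, here is a minimal Go sketch of one possible reading of it: operators are alerted once every per-minute sample has stayed below 75% of the 30 tx/min target for a full 5 minutes. The names `ThroughputMonitor`, `Observe`, and `simulate` are hypothetical and are not part of Tendermint; how samples are collected and how operators are actually notified is left open by this RFC.

```go
package main

import (
	"fmt"
	"time"
)

// ThroughputMonitor flags a sustained drop in transaction throughput.
// Hypothetical type for illustration only; this is not a Tendermint API.
type ThroughputMonitor struct {
	TargetPerMin float64       // target rate, e.g. 30 tx/min
	Threshold    float64       // fraction of the target, e.g. 0.75
	Sustain      time.Duration // how long the drop must last, e.g. 5 minutes
	lowSince     time.Time     // zero while throughput is healthy
}

// Observe records one throughput sample taken at time now and reports
// whether the drop has lasted long enough to notify the operators.
func (m *ThroughputMonitor) Observe(now time.Time, txPerMin float64) bool {
	if txPerMin >= m.TargetPerMin*m.Threshold {
		m.lowSince = time.Time{} // healthy again: reset the timer
		return false
	}
	if m.lowSince.IsZero() {
		m.lowSince = now // first low sample of this episode
	}
	return now.Sub(m.lowSince) >= m.Sustain
}

// simulate feeds one sample per minute into a monitor configured with the
// MVP numbers from this RFC and returns the alert decision for each sample.
func simulate(samples []float64) []bool {
	m := &ThroughputMonitor{TargetPerMin: 30, Threshold: 0.75, Sustain: 5 * time.Minute}
	start := time.Date(2022, 8, 1, 0, 0, 0, 0, time.UTC)
	alerts := make([]bool, len(samples))
	for i, s := range samples {
		alerts[i] = m.Observe(start.Add(time.Duration(i)*time.Minute), s)
	}
	return alerts
}

func main() {
	samples := []float64{30, 20, 20, 20, 20, 20, 20}
	for i, alert := range simulate(samples) {
		fmt.Printf("minute %d: %.0f tx/min, alert=%v\n", i, samples[i], alert)
	}
}
```

With the samples in `main`, throughput first dips at minute 1, so the first alert is reported at minute 6. Another reasonable reading would compare a 5-minute moving average against the threshold instead; the RFC does not choose between the two.
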
### Testnet medium-term goals
The medium-term goals are intended to be achievable within the 6-12 month time range
following the launch of MVP.
These goals could realistically be roadmapped following the launch of the MVP testnet.
- The `main` testnet has more than 20 nodes (completely arbitrary -- 5x more than 1+3 at MVP)
- In addition to the `main` testnet,
there is at least one testnet associated with one release branch
- The testnet is no longer simply running the Tendermint KV store;
now it is built on a more complex, custom application
that deliberately exercises a greater portion of the Tendermint stack
- Each testnet is spread across at least two cloud providers,
in order to communicate over a network more closely resembling use of Tendermint in "real" chains
- The node updates have some "jitter",
with some nodes updating immediately when a new build is available,
and others delaying up to perhaps 30-60 minutes
- The team has published some form of dashboards that have served well for debugging,
  which external parties can copy/modify to their needs
  - The dashboards must include metrics published by Tendermint nodes;
    there should be both OS- or runtime-level metrics such as memory in use,
    and application-level metrics related to the underlying blockchain
  - "Published" in this context is more in the spirit of "shared with the community",
    not "produced a supported open source tool" --
    this could be published to GitHub with a warning that no support is offered,
    or it could simply be a blog post detailing what has worked for the Tendermint developers
  - The dashboards will likely be implemented on free and open source tooling,
    but that is not a hard requirement if paid software is more appropriate
- The team has produced a reference model of a log aggregation stack that external parties can use
  - Similar to the "published" dashboards, this only needs to be "shared" rather than "supported"
- Chaos engineering has begun to be integrated into the testnets
(this could be periodic CPU limiting or deliberate network interference, etc.,
but it probably would not be filesystem corruption)
- Each testnet has at least one node running a build with the Go race detector enabled
- The testnet contains some kind of generalized notification system built in:
  - Tendermint code gains built-in "watchdog" systems that validate conditions such as
    subsystems not having deadlocked; e.g. if the watchdog can't acquire and immediately release
    a particular mutex once in every 5-minute period, it is near certain that the target
    subsystem has deadlocked, and an alert must be sent to the engineering team.
    (Outside of the testnet, the watchdogs could be disabled, or they could panic on failure.)
  - The notification system does some deduplication to minimize spam on system failure
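
As a rough illustration of the mutex watchdog mentioned in the list above, the following Go sketch polls a mutex with `sync.Mutex.TryLock` (available since Go 1.18) and reports whether it could be acquired and immediately released within a given window. The function names `canLock` and `demo` are hypothetical, and the 50 ms window stands in for the 5-minute period only so the example finishes quickly. A real watchdog would run such a check once per period and hand failures to the notification system, which is not shown here.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// canLock reports whether mu could be acquired and immediately released
// at some point within window, retrying every interval.
// Hypothetical helper for illustration; this is not Tendermint code.
func canLock(mu *sync.Mutex, window, interval time.Duration) bool {
	deadline := time.Now().Add(window)
	for {
		if mu.TryLock() { // non-blocking attempt; requires Go 1.18+
			mu.Unlock()
			return true
		}
		if !time.Now().Before(deadline) {
			return false // the whole window passed without acquiring mu
		}
		time.Sleep(interval)
	}
}

// demo checks a free mutex, then a mutex that is held for the whole window,
// which is how a deadlocked subsystem would look to the watchdog.
func demo() (free, held bool) {
	var mu sync.Mutex
	free = canLock(&mu, 50*time.Millisecond, time.Millisecond)

	mu.Lock() // simulate a subsystem that never releases its lock
	held = canLock(&mu, 50*time.Millisecond, time.Millisecond)
	mu.Unlock()
	return free, held
}

func main() {
	free, held := demo()
	fmt.Println("free mutex acquired within window:", free)
	fmt.Println("held mutex acquired within window:", held)
}
```

Polling with `TryLock` avoids leaving a blocked goroutine behind when the subsystem really has deadlocked. The tradeoff is that a heavily contended but healthy mutex could in principle be missed by every poll, so the polling interval should be much shorter than the window.
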
### Testnet long-term vision
The long-term vision includes goals that are not necessary for short- or medium-term success,
but which would support building an increasingly stable and performant product.
These goals would generally be beyond the one-year plan,
and therefore they would not be part of initial planning.
- There is a centralized dashboard to get a quick overview of all testnets,
or at least one centralized dashboard per testnet,
showing TBD basic information
- Testnets include cloud spot instances which periodically and abruptly join and leave the network
- The testnets are a heterogeneous mixture of straight VMs and Docker containers,
thereby more closely representing production blockchains
- Testnets have some manner of continuous profiling,
so that we can produce an apples-to-apples comparison of CPU/memory cost of particular operations
### Testnet non-goals
There are some things we are explicitly not trying to achieve with long-lived testnets:
- The Tendermint engineers will NOT be responsible for the testnets' availability
outside of working hours; there will not be any kind of on-call schedule
- As a result of the 8x5 support noted in the previous point,
there will be NO guarantee of uptime or availability for any testnet
- The testnets will NOT be used to gate pull requests;
that responsibility belongs to unit tests, end-to-end tests, and integration tests
- Similarly, the testnet will NOT be used to automate any changes back into Tendermint source code;
we will not automatically create a revert commit due to a failed rollout, for instance
- The testnets are NOT intended to have participation from machines outside of the
Tendermint engineering team's control, as the Tendermint engineers are expected
to have full access to any instance where they may need to debug an issue
- While there will certainly be individuals within the Tendermint engineering team
who will continue to build out their individual "devops" skills to produce
the infrastructure for the testnet, it is NOT a goal that every Tendermint engineer
is even _familiar_ with the tech stack involved, whether it is Ansible, Terraform,
Kubernetes, etc.
As a rule of thumb, all engineers should be able to get shell access on any given instance
and should have access to the instance's logs.
Few if any further operational skills will be expected.
- The testnets are not intended to be _created_ for one-off experiments.
While there is nothing wrong with an engineer directly interacting with a testnet
to try something out,
a testnet comes with a considerable amount of "baggage", so end-to-end or integration tests
are closer to the intent for "trying something to see what happens".
Direct interaction should be limited to standard blockchain operations,
_not_ modifying configuration of nodes.
- Likewise, the purpose of the testnet is not to run specific "tests" per se,
but rather to demonstrate that Tendermint blockchains as a whole are stable
under a production load.
Of course we will inject faults periodically, but the intent is to observe and prove that
the testnet is resilient to those faults.
It would be the responsibility of a lower-level test to demonstrate e.g.
that the network continues when a single validator disappears without warning.
- The testnet descriptions in this document are scoped only to building directly on Tendermint;
integrating with the Cosmos SDK, or any other third-party library, is out of scope
### Team outcomes as a result of maintaining and operating a testnet
Finally, this section reiterates the team growth we expect from running semi-permanent testnets.
- Confidence that Tendermint is stable under a particular production-like load
- Familiarity with typical production behavior of Tendermint, e.g. what the logs look like,
what the memory footprint looks like, and what kind of throughput is reasonable
for a network of a particular size
- Comfort and familiarity in manually inspecting a misbehaving or failing node
- Confidence that Tendermint ships sufficient tooling for external users
to operate their nodes
- Confidence that Tendermint exposes useful metrics, and comfort interpreting those metrics
- Useful reference documentation that gives operators confidence to run Tendermint nodes