From beb9ff2ae3a1c20ff365627aeb9d553a4f2db2fd Mon Sep 17 00:00:00 2001
From: Mark Rushakoff
Date: Mon, 1 Aug 2022 11:33:04 -0400
Subject: [PATCH] docs/rfc: add testnet RFC (#9124)

* docs/rfc: add testnet RFC

Following several discussions internal to the Tendermint engineering team, I am posting an RFC discussing the high-level details of the Tendermint team owning and operating a long-lived testnet in order to build experience running Tendermint, and to demonstrate that Tendermint is stable under production workloads.

The outcome of this RFC will be a new track of work to begin building and maintaining a testnet associated with the main branch of tendermint. See the "Testnet MVP" section specifically for some of the first milestones.

Note, I added the RFC where it would live once #9115 is merged to restore the RFC layout from the v0.36.x branch. docs/rfc/README.md will need to be updated to include this RFC once #9115 is merged.

This RFC is related to #9078.

* docs/rfc: minor updates to testnet rfc

* docs/rfc: respond to more feedback on testnet RFC

* docs/rfc: add RFC 023 to rfc index
---
 docs/rfc/README.md                         |   1 +
 docs/rfc/rfc-023-semi-permanent-testnet.md | 263 +++++++++++++++++++++
 2 files changed, 264 insertions(+)
 create mode 100644 docs/rfc/rfc-023-semi-permanent-testnet.md

diff --git a/docs/rfc/README.md b/docs/rfc/README.md
index c1461e887..346318c7b 100644
--- a/docs/rfc/README.md
+++ b/docs/rfc/README.md
@@ -59,5 +59,6 @@ sections.
 - [RFC-019: Configuration File Versioning](./rfc-019-config-version.md)
 - [RFC-020: Onboarding Projects](./rfc-020-onboarding-projects.rst)
 - [RFC-021: The Future of the Socket Protocol](./rfc-021-socket-protocol.md)
+- [RFC-023: Semi-permanent Testnet](./rfc-023-semi-permanent-testnet.md)
diff --git a/docs/rfc/rfc-023-semi-permanent-testnet.md b/docs/rfc/rfc-023-semi-permanent-testnet.md
new file mode 100644
index 000000000..ddb31a908
--- /dev/null
+++ b/docs/rfc/rfc-023-semi-permanent-testnet.md
@@ -0,0 +1,263 @@
+# RFC 023: Semi-permanent Testnet
+
+## Changelog
+
+- 2022-07-28: Initial draft (@mark-rushakoff)
+- 2022-07-29: Renumber to 023, minor clarifications (@mark-rushakoff)
+
+## Abstract
+
+This RFC discusses a long-lived testnet, owned and operated by the Tendermint engineers.
+By owning and operating a production-like testnet,
+the team who develops Tendermint becomes more capable of discovering bugs that
+only arise in production-like environments.
+They also build expertise in operating Tendermint;
+this will help guide the development of Tendermint towards operator-friendly design.
+
+The RFC details a rough roadmap towards a semi-permanent testnet, some of the considered tradeoffs,
+and the expected outcomes from following this roadmap.
+
+## Background
+
+The author's understanding -- which is limited as a new contributor to the Tendermint project --
+is that Tendermint development has been largely treated as a library for other projects to consume.
+Of course effort has been spent on unit tests, end-to-end tests, and integration tests.
+But whether developing a library or an application,
+there is no substitute for putting the software under a production-like load.
+
+First, there are classes of bugs that are unrealistic to discover in environments
+that do not resemble production.
+But perhaps more importantly, there are "operational features" that are best designed
+by the authors of a given piece of software.
+For instance, does the software have sufficient observability built-in?
+Are the reported metrics useful?
+Are the log messages clear and sufficiently detailed, without being too noisy?
+
+Furthermore, if the library authors are not only building --
+but also maintaining and operating -- an application built on top of their library,
+the authors will have a greatly increased confidence that their library's API
+is appropriate for other application authors.
+
+Once the decision has been made to run and operate a service,
+one of the next strategic questions is that of deploying said service.
+The author strongly holds the opinion that, when possible,
+a continuous delivery model offers the most compelling set of advantages:
+- The code on a particular branch (likely `main` or `master`) is exactly what is,
+  or what will very soon be, running in production
+- There are no manual steps involved in deploying -- other than merging your pull request,
+  which you had to do anyway
+- A bug discovered in production can be rapidly confirmed as fixed in production
+
+In summary, if the Tendermint authors build, maintain, and continuously deliver an application
+intended to serve as a long-lived testnet, they will be able to state with confidence:
+- We operate the software in a production-like environment and we have observed it to be
+  stable and performant to our requirements
+- We have discovered issues in production before any external parties have consumed our software,
+  and we have addressed said issues
+- We have successfully used the observability tooling built into our software
+  (perhaps in conjunction with other off-the-shelf tooling)
+  to diagnose and debug issues in production
+
+## Discussion
+
+The Discussion Section proposes a variety of aspects of maintaining a testnet for Tendermint.
+
+### Number of testnets
+
+There should probably be one testnet per maintained branch of Tendermint,
+i.e. one for the `main` branch
+and one per `v0.N.x` branch that the authors maintain.
+
+There may also exist testnets for long-lived feature branches.
+
+We may eventually discover that there is good reason to run more than one testnet for a branch,
+perhaps due to a significant configuration variation.
+
+### Testnet lifecycle
+
+The document has used the terms "long-lived" and "semi-permanent" somewhat interchangeably.
+The intent of the testnet being discussed in this RFC is to exist indefinitely;
+but there is a practical understanding that there will be testnet instances
+which will be retired due to a variety of reasons.
+For instance, once a release branch is no longer supported,
+its corresponding testnet should be torn down.
+
+In general, new commits to branches with corresponding testnets
+should result in an in-place upgrade of all nodes in the testnet
+without any data loss and without requiring new configuration.
+The mechanism for achieving this is outside the scope of this RFC.
+
+However, it is also expected that there will be
+breaking changes during the development of the `main` branch.
+For instance, suppose there is an unreleased feature involving storage on disk,
+and the developers need to change the storage format.
+It should be at the developers' discretion whether it is feasible and worthwhile
+to introduce an intermediate commit that translates the old format to the new format,
+or if it would be preferable to just destroy the testnet and start from scratch
+without any data in the old format.
+
+Similarly, if a developer inadvertently pushed a breaking change to an unreleased feature,
+they are free to make a judgement call between reverting the change,
+adding a commit to allow a forward migration,
+or simply forcing the testnet to recreate.
+
+### Testnet maintenance investment
+
+While there is certainly engineering effort required to build the tooling and infrastructure
+to get the testnets up and running,
+the intent is that a running testnet requires no manual upkeep under normal conditions.
+
+It is expected that a subset of the Tendermint engineers are familiar with and engaged in
+writing the software to maintain and build the testnet infrastructure,
+but the rest of the team should not need any involvement in authoring that code.
+
+The testnets should be configured to send notifications for events requiring triage,
+such as a chain halt or a node OOMing.
+The time investment necessary to address the underlying issues for those kinds of events
+is unpredictable.
+
+Aside from triaging exceptional events, an engineer may choose to spend some time
+collecting metrics or profiles from testnet nodes to check performance details
+before and after a particular change;
+or they may inspect logs associated with an expected behavior change.
+But during day-to-day work, engineers are not expected to spend any considerable time
+directly interacting with the testnets.
+
+If we discover that there are any routine actions engineers must take against the testnet
+that take any substantial focused time,
+those actions should be automated to a one-line command as much as is reasonable.
+
+### Testnet MVP
+
+The minimum viable testnet meets this set of features:
+
+- The testnet self-updates following a new commit pushed to Tendermint's `main` branch on GitHub
+  (there are some omitted steps here, such as CI building appropriate binaries and
+  somehow notifying the testnet that a new build is available)
+- The testnet runs the Tendermint KV store for MVP
+- The testnet operators are notified if:
+  - Any node's process exits for any reason other than a restart for a new binary
+  - Any node stops updating blocks, and by extension if a chain halt occurs
+  - No other observability will be considered for MVP
+- The testnet has a minimum of 1 full node and 3 validators
+- The testnet has a reasonably low, constant throughput of transactions -- say 30 tx/min --
+  and the testnet operators are notified if that throughput drops below 75% of target
+  sustained over 5 minutes
+- The testnet only needs to run in a single datacenter/cloud-region for MVP,
+  i.e. running in multiple datacenters is out of scope for MVP
+- The testnet is running directly on VMs or compute instances;
+  while Kubernetes or other orchestration frameworks may offer many significant advantages,
+  the Tendermint engineers should not be required to learn those tools in order to
+  perform basic debugging
+
+### Testnet medium-term goals
+
+The medium-term goals are intended to be achievable within the 6-12 month time range
+following the launch of MVP.
+These goals could realistically be roadmapped following the launch of the MVP testnet.
+
+- The `main` testnet has more than 20 nodes (completely arbitrary -- 5x more than 1+3 at MVP)
+- In addition to the `main` testnet,
+  there is at least one testnet associated with one release branch
+- The testnet no longer is simply running the Tendermint KV store;
+  now it is built on a more complex, custom application
+  that deliberately exercises a greater portion of the Tendermint stack
+- Each testnet is spread across at least two cloud providers,
+  in order to communicate over a network more closely resembling use of Tendermint in "real" chains
+- The node updates have some "jitter",
+  with some nodes updating immediately when a new build is available,
+  and others delaying up to perhaps 30-60 minutes
+- The team has published some form of dashboards that have served well for debugging,
+  which external parties can copy/modify to their needs
+  - The dashboards must include metrics published by Tendermint nodes;
+    there should be both OS- or runtime-level metrics such as memory in use,
+    and application-level metrics related to the underlying blockchain
+  - "Published" in this context is more in the spirit of "shared with the community",
+    not "produced a supported open source tool" --
+    this could be published to GitHub with a warning that no support is offered,
+    or it could simply be a blog post detailing what has worked for the Tendermint developers
+  - The dashboards will likely be implemented on free and open source tooling,
+    but that is not a hard requirement if paid software is more appropriate
+- The team has produced a reference model of a log aggregation stack that external parties can use
+  - Similar to the "published" dashboards, this only needs to be "shared" rather than "supported"
+- Chaos engineering has begun being integrated into the testnets
+  (this could be periodic CPU limiting or deliberate network interference, etc.
+  but it probably would not be filesystem corruption)
+- Each testnet has at least one node running a build with the Go race detector enabled
+- The testnet contains some kind of generalized notification system built in:
+  - Tendermint code grows "watchdog" systems built in to validate things like
+    subsystems have not deadlocked; e.g. if the watchdog can't acquire and immediately release
+    a particular mutex once in every 5-minute period, it is near certain that the target
+    subsystem has deadlocked, and an alert must be sent to the engineering team.
+    (Outside of the testnet, the watchdogs could be disabled, or they could panic on failure.)
+  - The notification system does some deduplication to minimize spam on system failure
+
+### Testnet long-term vision
+
+The long-term vision includes goals that are not necessary for short- or medium-term success,
+but which would support building an increasingly stable and performant product.
+These goals would generally be beyond the one-year plan,
+and therefore they would not be part of initial planning.
+
+- There is a centralized dashboard to get a quick overview of all testnets,
+  or at least one centralized dashboard per testnet,
+  showing TBD basic information
+- Testnets include cloud spot instances which periodically and abruptly join and leave the network
+- The testnets are a heterogeneous mixture of straight VMs and Docker containers,
+  thereby more closely representing production blockchains
+- Testnets have some manner of continuous profiling,
+  so that we can produce an apples-to-apples comparison of CPU/memory cost of particular operations
+
+### Testnet non-goals
+
+There are some things we are explicitly not trying to achieve with long-lived testnets:
+
+- The Tendermint engineers will NOT be responsible for the testnets' availability
+  outside of working hours; there will not be any kind of on-call schedule
+- As a result of the 8x5 support noted in the previous point,
+  there will be NO guarantee of uptime or availability for any testnet
+- The testnets will NOT be used to gate pull requests;
+  that responsibility belongs to unit tests, end-to-end tests, and integration tests
+- Similarly, the testnet will NOT be used to automate any changes back into Tendermint source code;
+  we will not automatically create a revert commit due to a failed rollout, for instance
+- The testnets are NOT intended to have participation from machines outside of the
+  Tendermint engineering team's control, as the Tendermint engineers are expected
+  to have full access to any instance where they may need to debug an issue
+- While there will certainly be individuals within the Tendermint engineering team
+  who will continue to build out their individual "devops" skills to produce
+  the infrastructure for the testnet, it is NOT a goal that every Tendermint engineer
+  is even _familiar_ with the tech stack involved, whether it is Ansible, Terraform,
+  Kubernetes, etc.
+  As a rule of thumb, all engineers should be able to get shell access on any given instance
+  and should have access to the instance's logs.
+  Few if any further operational skills will be expected.
+- The testnets are not intended to be _created_ for one-off experiments.
+  While there is nothing wrong with an engineer directly interacting with a testnet
+  to try something out,
+  a testnet comes with a considerable amount of "baggage", so end-to-end or integration tests
+  are closer to the intent for "trying something to see what happens".
+  Direct interaction should be limited to standard blockchain operations,
+  _not_ modifying configuration of nodes.
+- Likewise, the purpose of the testnet is not to run specific "tests" per se,
+  but rather to demonstrate that Tendermint blockchains as a whole are stable
+  under a production load.
+  Of course we will inject faults periodically, but the intent is to observe and prove that
+  the testnet is resilient to those faults.
+  It would be the responsibility of a lower-level test to demonstrate e.g.
+  that the network continues when a single validator disappears without warning.
+- The testnet descriptions in this document are scoped only to building directly on Tendermint;
+  integrating with the Cosmos SDK, or any other third-party library, is out of scope
+
+### Team outcomes as a result of maintaining and operating a testnet
+
+Finally, this section reiterates what team growth we expect by running semi-permanent testnets.
+
+- Confidence that Tendermint is stable under a particular production-like load
+- Familiarity with typical production behavior of Tendermint, e.g. what the logs look like,
+  what the memory footprint looks like, and what kind of throughput is reasonable
+  for a network of a particular size
+- Comfort and familiarity in manually inspecting a misbehaving or failing node
+- Confidence that Tendermint ships sufficient tooling for external users
+  to operate their nodes
+- Confidence that Tendermint exposes useful metrics, and comfort interpreting those metrics
+- Useful reference documentation that gives operators confidence to run Tendermint nodes
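
As a rough illustration of the "watchdog" idea in the medium-term goals (a check that must acquire and immediately release a particular mutex within a fixed window), the probe at the core of such a watchdog might look like the Go sketch below. This is not part of the patch or of Tendermint: `probe`, `demo`, and the poll interval are hypothetical names and parameters chosen for illustration, and the RFC's 5-minute window is shortened to milliseconds so the example finishes quickly. It assumes Go 1.18 or later, which added `sync.Mutex.TryLock`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// probe reports whether mu could be acquired and immediately released
// at some point within window. It polls TryLock instead of blocking in
// Lock, so probing a deadlocked mutex never strands a goroutine.
// (Hypothetical helper; name and signature are illustrative only.)
func probe(mu *sync.Mutex, window, poll time.Duration) bool {
	deadline := time.Now().Add(window)
	for {
		if mu.TryLock() {
			mu.Unlock()
			return true
		}
		if !time.Now().Before(deadline) {
			return false
		}
		time.Sleep(poll)
	}
}

// demo probes a free mutex and then a held one, returning whether the
// free mutex passed the probe and whether the held mutex failed it.
func demo() (freePasses, heldFails bool) {
	var mu sync.Mutex
	freePasses = probe(&mu, 50*time.Millisecond, time.Millisecond)

	mu.Lock() // simulate a subsystem that never releases its lock
	heldFails = !probe(&mu, 50*time.Millisecond, time.Millisecond)
	mu.Unlock()
	return freePasses, heldFails
}

func main() {
	freePasses, heldFails := demo()
	fmt.Println("free mutex passes probe:", freePasses)
	fmt.Println("held mutex fails probe:", heldFails)
}
```

A watchdog built on this would presumably call `probe` once per window from its own goroutine and hand any `false` result to the notification system. One caveat of polling `TryLock`: a mutex that is healthy but heavily contended can also fail the probe, so the poll interval and window would need tuning before treating a failure as a near-certain deadlock.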