rfc: e2e improvements (#6941)

2026-08-02 05:16:10 +00:00 · 2021-09-15 15:26:39 -04:00
parent e932b469ed
commit f08f72e334
2 changed files with 214 additions and 0 deletions
@@ -40,5 +40,6 @@ sections.
 - [RFC-000: P2P Roadmap](./rfc-000-p2p-roadmap.rst)
 - [RFC-001: Storage Engines](./rfc-001-storage-engine.rst)
 - [RFC-002: Interprocess Communication](./rfc-002-ipc-ecosystem.md)
+- [RFC-004: E2E Test Framework Enhancements](./rfc-004-e2e-framework.md)

 <!-- - [RFC-NNN: Title](./rfc-NNN-title.md) -->
@@ -0,0 +1,213 @@
+========================================
+RFC 004: E2E Test Framework Enhancements
+========================================
+
+Changelog
+---------
+
+- 2021-09-14: started initial draft (@tychoish)
+
+Abstract
+--------
+
+This document discusses a series of improvements to the e2e test framework
+that we can consider during the next few releases to help boost confidence in
+Tendermint releases, and improve developer efficiency.
+
+Background
+----------
+
+During the 0.35 release cycle, the E2E tests were a source of great
+value, helping to identify a number of bugs before release. At the same time,
+the tests did not pass consistently during this period, which reduced
+their value and forced the core development team to spend time and energy
+maintaining the e2e tests and the test harness, and chasing down issues with
+both. The experience of this release cycle suggests a series of
+improvements to the test framework, and this document attempts to capture
+these improvements, along with their motivations and potential impact.
+
+Projects
+--------
+
+Flexible Workload Generation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Presently the e2e suite contains a single workload generation pattern, which
+exists simply to ensure that the test networks have some work during their
+runs. However, the shape and volume of the work is very consistent and is
+deliberately gentle, to help ensure test reliability.
+
+We don't need a complex workload generation framework, but having
+a few different workload shapes available for test networks, both generated and
+hand-crafted, would be useful.
+
+Workload patterns/configurations might include:
+
+- transaction targeting patterns (include light nodes, round robin, target
+  individual nodes)
+
+- variable transaction size over time.
+
+- transaction broadcast option (synchronously, checked, fire-and-forget,
+  mixed).
+
+- number of transactions to submit.
+
+- non-transaction workloads (evidence submission, query, event subscription).
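As an illustration of the patterns listed above, here is a minimal sketch of how a workload shape might be described and expanded into a submission plan. Every name in it (`Workload`, `PlannedTx`, `BroadcastMode`, `Plan`) is hypothetical and does not exist in the e2e framework today. It assumes round-robin targeting and a linear ramp in transaction size.

```go
package main

import "fmt"

// BroadcastMode is a hypothetical enumeration of the broadcast options above.
type BroadcastMode string

const (
	BroadcastSync          BroadcastMode = "sync"
	BroadcastChecked       BroadcastMode = "checked"
	BroadcastFireAndForget BroadcastMode = "fire-and-forget"
)

// Workload is a hypothetical description of a single workload shape.
type Workload struct {
	Name      string
	Targets   []string // node names, visited round robin
	TxCount   int      // number of transactions to submit
	MinTxSize int      // size in bytes of the first transaction
	MaxTxSize int      // size in bytes of the last transaction
	Broadcast BroadcastMode
}

// PlannedTx is one planned submission.
type PlannedTx struct {
	Target    string
	Size      int
	Broadcast BroadcastMode
}

// Plan expands the workload into an ordered list of submissions. Targets are
// assigned round robin and sizes grow linearly from MinTxSize to MaxTxSize.
func (w Workload) Plan() ([]PlannedTx, error) {
	if len(w.Targets) == 0 {
		return nil, fmt.Errorf("workload %q has no targets", w.Name)
	}
	if w.TxCount < 0 || w.MinTxSize > w.MaxTxSize {
		return nil, fmt.Errorf("workload %q has an invalid count or size range", w.Name)
	}
	plan := make([]PlannedTx, 0, w.TxCount)
	for i := 0; i < w.TxCount; i++ {
		size := w.MinTxSize
		if w.TxCount > 1 {
			size += (w.MaxTxSize - w.MinTxSize) * i / (w.TxCount - 1)
		}
		plan = append(plan, PlannedTx{
			Target:    w.Targets[i%len(w.Targets)],
			Size:      size,
			Broadcast: w.Broadcast,
		})
	}
	return plan, nil
}

func main() {
	w := Workload{
		Name:      "ramp",
		Targets:   []string{"validator01", "full01"},
		TxCount:   5,
		MinTxSize: 100,
		MaxTxSize: 500,
		Broadcast: BroadcastSync,
	}
	plan, err := w.Plan()
	if err != nil {
		panic(err)
	}
	for _, tx := range plan {
		fmt.Printf("%s %d bytes (%s)\n", tx.Target, tx.Size, tx.Broadcast)
	}
}
```

Separating the plan from its execution would let the same description drive both generated and hand-crafted workloads. Non-transaction workloads would need additional fields that this sketch does not attempt to model.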
+
+Configurable Generator
+~~~~~~~~~~~~~~~~~~~~~~
+
+The nightly e2e suite is defined by the `testnet generator
+<https://github.com/tendermint/tendermint/blob/master/test/e2e/generator/generate.go#L13-L65>`_,
+and it's difficult to add dimensions or change the focus of the test suite in
+any way without modifying the implementation of the generator. If the
+generator were more configurable, potentially via a file rather than in
+the Go implementation, we could modify the focus of the test suite on the
+fly.
+
+Features that we might want to configure:
+
+- number of test networks to generate of various topologies, to improve
+  coverage of different configurations.
+
+- test application configurations (to modify the latency of ABCI calls, etc.)
+
+- size of test networks.
+
+- workload shape and behavior.
+
+- initial sync and catch-up configurations.
+
+The testnet generator currently provides runtime options for limiting the
+generator to specific types of P2P stacks, and for generating multiple groups
+of test cases to support parallelism. The goal is to extend this pattern and
+avoid hardcoding the matrix of test cases in the generator code. Once the
+testnet configuration generation behavior is configurable at runtime,
+developers may be able to use the e2e framework to validate changes before
+they land, rather than discovering a day later that the nightly e2e tests
+are broken.
+
+In addition to the autogenerated suite, it might make sense to maintain a
+small collection of hand-crafted cases that exercise configurations of
+concern, to run as part of the nightly (or less frequent) loop.
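To make the idea of a configurable test matrix concrete, the following sketch expands a set of configured dimensions into testnet descriptions and splits them into groups. It is not the existing generator: `GeneratorConfig`, `Manifest`, `Expand`, and `Split` are hypothetical names, the dimension values are placeholders, and loading the configuration from a file is left out.

```go
package main

import "fmt"

// GeneratorConfig is a hypothetical set of dimensions that could be read from
// a file instead of being hardcoded in the generator.
type GeneratorConfig struct {
	Topologies   []string
	NetworkSizes []int
	P2PStacks    []string
}

// Manifest is a hypothetical, heavily simplified description of one testnet.
type Manifest struct {
	Topology string
	Size     int
	P2P      string
}

// Expand returns one manifest for every combination of the configured
// dimensions (the Cartesian product).
func (c GeneratorConfig) Expand() []Manifest {
	var out []Manifest
	for _, topology := range c.Topologies {
		for _, size := range c.NetworkSizes {
			for _, p2p := range c.P2PStacks {
				out = append(out, Manifest{Topology: topology, Size: size, P2P: p2p})
			}
		}
	}
	return out
}

// Split distributes manifests round robin over n groups, so that each group
// can be run by a separate job.
func Split(manifests []Manifest, n int) [][]Manifest {
	if n < 1 {
		n = 1
	}
	groups := make([][]Manifest, n)
	for i, m := range manifests {
		groups[i%n] = append(groups[i%n], m)
	}
	return groups
}

func main() {
	cfg := GeneratorConfig{
		Topologies:   []string{"topology-a", "topology-b", "topology-c"}, // placeholder values
		NetworkSizes: []int{2, 4},
		P2PStacks:    []string{"stack-x", "stack-y"},
	}
	manifests := cfg.Expand()
	for i, group := range Split(manifests, 4) {
		fmt.Printf("group %d: %d testnets\n", i, len(group))
	}
}
```

With a structure like this, changing the focus of the suite would mean editing the dimension lists rather than the generator code.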
+
+Implementation Plan Structure
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As a development team, we should determine which features should impact the
+e2e testing early in the development cycle, and if we intend to modify the e2e
+tests to exercise a feature, we should begin the integration process as early
+as possible.
+
+To facilitate this, we should adopt a practice whereby we exercise specific
+features that are currently under development more rigorously in the e2e
+suite, and then as development stabilizes we can reduce the number or weight
+of these features in the suite.
+
+As of 0.35 there are essentially two end-to-end tests: the suite of 64
+generated test networks, and the hand-crafted ``ci.toml`` test case. The
+generated test cases help provide systematic coverage, while the ``ci`` run
+provides coverage for a large number of features.
+
+Reduce Cycle Time
+~~~~~~~~~~~~~~~~~
+
+One of the barriers to leveraging the e2e framework, and one of the challenges
+in debugging failures, is that the cycle time of running a single test
+iteration is quite high: 5 minutes to build the docker image, plus the time to
+run the test or tests.
+
+There are a number of improvements and enhancements that can reduce the cycle
+time in practice:
+
+- reduce the amount of time required to build the docker image used in these
+  tests. Without the dependency on CGo, the tendermint binaries could be
+  (cross) compiled outside of the docker container and then injected into
+  it, which would take better advantage of docker's native caching.
+  Without the dependency on CGo there would also be no hard requirement
+  for the e2e tests to use docker at all.
+
+- support test parallelism. Because of the way the testnets are orchestrated,
+  a single system can really only run one network at a time. For executions
+  (local or remote) with more resources, there's no reason not to run a few
+  networks in parallel to reduce the feedback time.
+
+- prune testnet configurations that are unlikely to provide good signal, to
+  shorten the time to feedback.
+
+- apply some kind of tiered approach to test execution, to improve the
+  legibility of the test result. For example, order tests by the dependency of
+  their features, or run test networks without perturbations before running
+  that configuration with perturbations, to be able to isolate the impact of
+  specific features.
+
+- orchestrate the test harness directly from go test rather than via a special
+  harness and shell scripts, so e2e tests may more natively fit into
+  developers' existing workflows.
+
+Many of these improvements, particularly reducing the build time, will also
+reduce the time to get feedback during automated builds.
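The parallelism item above could look roughly like the following bounded worker pattern. This is a sketch under a significant assumption: that orchestration has already been changed so that several networks can coexist on one system. `RunFunc`, `RunParallel`, and `ErrStalled` are hypothetical names, and the run function here is a stand-in that starts no real network.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStalled is a hypothetical failure used only by this example.
var ErrStalled = errors.New("testnet stalled")

// RunFunc is a hypothetical stand-in for setting up, running, and tearing
// down a single testnet.
type RunFunc func(name string) error

// RunParallel runs every testnet with at most `workers` of them in flight at
// once, and returns the failures keyed by testnet name.
func RunParallel(names []string, workers int, run RunFunc) map[string]error {
	if workers < 1 {
		workers = 1
	}
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	failures := make(map[string]error)
	sem := make(chan struct{}, workers) // counting semaphore

	for _, name := range names {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` runs are in flight
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := run(name); err != nil {
				mu.Lock()
				failures[name] = err
				mu.Unlock()
			}
		}(name)
	}
	wg.Wait()
	return failures
}

func main() {
	names := []string{"net-0000", "net-0001", "net-0002", "net-0003"}
	failures := RunParallel(names, 2, func(name string) error {
		if name == "net-0002" {
			return ErrStalled
		}
		return nil
	})
	for name, err := range failures {
		fmt.Printf("%s: %v\n", name, err)
	}
}
```

Collecting failures instead of stopping at the first one keeps the feedback from a parallel run comparable to what a sequential run of the same networks would report.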
+
+Deeper Insights
+~~~~~~~~~~~~~~~
+
+When a test network fails, it's incredibly difficult to understand *why* the
+network failed, as the current system provides very little insight into the
+system outside of the process logs. When a test network stalls or fails,
+developers should be able to quickly and easily get a sense of the state of
+the network and all nodes.
+
+Improvements in pursuit of this goal include functionality that would help
+node operators in production environments, by improving the quality and
+utility of the logging messages and other reported metrics, as well as
+tools to collect and aggregate this data for developers in the context of test
+networks.
+
+- Interleave messages from all nodes in the network to be able to correlate
+  events during the test run.
+
+- Collect structured metrics of the system operation (CPU/MEM/IO) during the
+  test run, as well as from each tendermint/application process.
+
+- Build (simple) tools to be able to render and summarize the data collected
+  during the test run to answer basic questions about test outcome.
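Interleaving messages from all nodes amounts to merging per-node logs by timestamp. The sketch below shows one way that might be done. `LogEntry`, `Interleave`, and `at` are hypothetical names, and it assumes every log line has already been parsed into a timestamp and a message, and that the nodes' clocks are comparable (for example because they share a host).

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// LogEntry is a hypothetical parsed log line.
type LogEntry struct {
	Time    time.Time
	Node    string
	Message string
}

// Interleave merges per-node logs into a single timeline ordered by
// timestamp. Nodes are visited in name order and the sort is stable, so
// entries with equal timestamps come out in a deterministic order.
func Interleave(perNode map[string][]LogEntry) []LogEntry {
	nodes := make([]string, 0, len(perNode))
	for node := range perNode {
		nodes = append(nodes, node)
	}
	sort.Strings(nodes)

	var all []LogEntry
	for _, node := range nodes {
		all = append(all, perNode[node]...)
	}
	sort.SliceStable(all, func(i, j int) bool {
		return all[i].Time.Before(all[j].Time)
	})
	return all
}

// at builds a timestamp a given number of seconds after the Unix epoch; it
// exists only to keep the example short.
func at(sec int) time.Time {
	return time.Unix(int64(sec), 0).UTC()
}

func main() {
	timeline := Interleave(map[string][]LogEntry{
		"validator01": {
			{Time: at(1), Node: "validator01", Message: "proposed block"},
			{Time: at(4), Node: "validator01", Message: "committed block"},
		},
		"full01": {
			{Time: at(2), Node: "full01", Message: "received proposal"},
		},
	})
	for _, e := range timeline {
		fmt.Printf("%s %-12s %s\n", e.Time.Format("15:04:05"), e.Node, e.Message)
	}
}
```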
+
+Flexible Assertions
+~~~~~~~~~~~~~~~~~~~
+
+Currently, all assertions run for every test network, which makes the
+assertions pretty bland and the framework primarily useful as a smoke-test
+framework. It might be useful to be able to write and run different
+tests for different configurations, which could allow us to test outside of
+the happy path.
+
+In general, our existing assertions occupy a small fraction of the total test
+time, so adding a few extra test assertions would cost relatively little and
+could help build confidence.
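One possible shape for configuration-specific assertions is to attach a predicate to each assertion and select only those that apply to a given network. The sketch below is illustrative only: `Testnet`, `Assertion`, `Select`, and `Run` are hypothetical names, the fields on `Testnet` are placeholders, and the checks are stubs that always pass.

```go
package main

import "fmt"

// Testnet is a hypothetical summary of the properties of one test network.
type Testnet struct {
	Name          string
	Perturbations bool
	StateSync     bool
}

// Assertion is a hypothetical check with an optional applicability predicate.
// A nil AppliesTo means the assertion runs for every network.
type Assertion struct {
	Name      string
	AppliesTo func(Testnet) bool
	Check     func(Testnet) error
}

// Select returns the assertions that apply to the given network.
func Select(all []Assertion, net Testnet) []Assertion {
	var selected []Assertion
	for _, a := range all {
		if a.AppliesTo == nil || a.AppliesTo(net) {
			selected = append(selected, a)
		}
	}
	return selected
}

// Run executes the applicable assertions and returns a description of each
// failure.
func Run(all []Assertion, net Testnet) []string {
	var failed []string
	for _, a := range Select(all, net) {
		if err := a.Check(net); err != nil {
			failed = append(failed, fmt.Sprintf("%s: %v", a.Name, err))
		}
	}
	return failed
}

// pass is a stub check used only by this example.
func pass(Testnet) error { return nil }

func main() {
	assertions := []Assertion{
		{Name: "blocks-are-produced", Check: pass},
		{
			Name:      "nodes-recover-after-perturbation",
			AppliesTo: func(n Testnet) bool { return n.Perturbations },
			Check:     pass,
		},
		{
			Name:      "state-sync-reaches-head",
			AppliesTo: func(n Testnet) bool { return n.StateSync },
			Check:     pass,
		},
	}
	net := Testnet{Name: "net-0001", Perturbations: true}
	for _, a := range Select(assertions, net) {
		fmt.Println("would run:", a.Name)
	}
	fmt.Println("failures:", len(Run(assertions, net)))
}
```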
+
+Additional Kinds of Testing
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The existing e2e suite exercises networks of nodes that run a homogeneous
+tendermint version, have a stable configuration, and are expected to make
+progress. There are many other possible test configurations that may be
+interesting to engage with. These could include dimensions such as:
+
+- Multi-version testing to exercise our compatibility guarantees for networks
+  that might have different tendermint versions.
+
+- As a flavor of multi-version testing, include upgrade testing, to build
+  confidence in migration code and procedures.
+
+- Additional test applications, particularly practical-type applications,
+  including some that use gaiad and/or the cosmos-sdk, as well as test-only
+  applications that simulate other kinds of applications (e.g. variable
+  application operation latency).
+
+- Tests of "non-viable" configurations that ensure that forbidden combinations
+  lead to halts.
+
+References
+----------
+
+- `ADR 66: End-to-End Testing <../architecture/adr-66-e2e-testing.md>`_