mirror of
https://github.com/tendermint/tendermint.git
synced 2025-12-23 14:25:19 +00:00
Signed-off-by: 117503445 <t117503445@gmail.com> Signed-off-by: 117503445 <t117503445@gmail.com> Co-authored-by: Thane Thomson <connect@thanethomson.com>
241 lines
11 KiB
ReStructuredText
241 lines
11 KiB
ReStructuredText
=======================================
|
|
RFC 020: Tendermint Onboarding Projects
|
|
=======================================
|
|
|
|
.. contents::
|
|
:backlinks: none
|
|
|
|
Changelog
|
|
---------
|
|
|
|
- 2022-03-30: Initial draft. (@tychoish)
|
|
- 2022-04-25: Imported document to tendermint repository. (@tychoish)
|
|
|
|
Overview
|
|
--------
|
|
|
|
This document describes a collection of projects that might be good for new
|
|
engineers joining the Tendermint Core team. These projects mostly describe
|
|
features that we'd be very excited to see land in the code base, but that are
|
|
intentionally outside of the critical path of a release on the roadmap, and
|
|
have the following properties that we think make good on-boarding projects:
|
|
|
|
- require relatively little context for the project or its history beyond a
|
|
more isolated area of the code.
|
|
|
|
- provide exposure to different areas of the codebase, so new team members
|
|
will have reason to explore the code base, build relationships with people
|
|
on the team, and gain experience with more than one area of the system.
|
|
|
|
- be of moderate size, striking a healthy balance between trivial or
|
|
mechanical changes (which provide little insight) and large intractable
|
|
changes that require deeper insight than is available during onboarding to
|
|
address well. A good size project should have natural touchpoints or
|
|
check-ins.
|
|
|
|
Projects
|
|
--------
|
|
|
|
Before diving into one of these projects, have a conversation about the
|
|
project or aspects of Tendermint that you're excited to work on with your
|
|
onboarding buddy. This will help make sure that these issues are still
|
|
relevant, help you get any context, underatnding known pitfalls, and to
|
|
confirm a high level approach or design (if relevant.) On-boarding buddies
|
|
should be prepared to do some design work before someone joins the team.
|
|
|
|
The descriptions that follow provide some basic background and attempt to
|
|
describe the user stories and the potential impact of these project.
|
|
|
|
E2E Test Systems
|
|
~~~~~~~~~~~~~~~~
|
|
|
|
Tendermint's E2E framework makes it possible to run small test networks with
|
|
different Tendermint configurations, and make sure that the system works. The
|
|
tests run Tendermint in a separate binary, and the system provides some very
|
|
high level protection against making changes that could break Tendermint in
|
|
otherwise difficult to detect ways.
|
|
|
|
Working on the E2E system is a good place to get introduced to the Tendermint
|
|
codebase, particularly for developers who are newer to Go, as the E2E
|
|
system (generator, runner, etc.) is distinct from the rest of Tendermint and
|
|
comparatively quite small, so it may be easier to begin making changes in this
|
|
area. At the same time, because the E2E system exercises *all* of Tendermint,
|
|
work in this area is a good way to get introduced to various components of the
|
|
system.
|
|
|
|
Configurable E2E Workloads
|
|
++++++++++++++++++++++++++
|
|
|
|
All E2E tests use the same workload (e.g. generated transactions, submitted to
|
|
different nodes in the network,) which has been tuned empirically to provide a
|
|
gentle but consistent parallel load that all E2E tests can pass. Ideally, the
|
|
workload generator could be configurable to have different shapes of work
|
|
(bursty, different transaction sizes, weighted to different nodes, etc.) and
|
|
even perhaps further parameterized within a basic shape, which would make it
|
|
possible to use our existing test infrastructure to answer different questions
|
|
about the performance or capability of the system.
|
|
|
|
The work would involve adding a new parameter to the E2E test manifest, and
|
|
creating an option (e.g. "legacy") for the current load generation model,
|
|
extract configurations options for the current load generation, and then
|
|
prototype implementations of alternate load generation, and also run some
|
|
preliminary using the tools.
|
|
|
|
Byzantine E2E Workloads
|
|
+++++++++++++++++++++++
|
|
|
|
There are two main kinds of integration tests in Tendermint: the E2E test
|
|
framework, and then a collection of integration tests that masquerade as
|
|
unit-tests. While some of this expansion of test scope is (potentially)
|
|
inevitable, the masquerading unit tests (e.g ``consensus.byzantine_test.go``)
|
|
end up being difficult to understand, difficult to maintain, and unreliable.
|
|
|
|
One solution to this, would be to modify the E2E ABCI application to allow it
|
|
to inject byzantine behavior, and then have this be a configurable aspect of
|
|
a test network to be able to provoke Byzantine behavior in a "real" system and
|
|
then observe that evidence is constructed. This would make it possible to
|
|
remove the legacy tests entirely once the new tests have proven themselves.
|
|
|
|
Abstract Orchestration Framework
|
|
++++++++++++++++++++++++++++++++
|
|
|
|
The orchestration of e2e test processes is presently done using docker
|
|
compose, which works well, but has proven a bit limiting as all processes need
|
|
to run on a single machine, and the log aggregation functions are confusing at
|
|
best.
|
|
|
|
This project would replace the current orchestration with something more
|
|
generic, potentially maintaining the current system, but also allowing the e2e
|
|
tests to manage processes using k8s. There are a few "local" k8s frameworks
|
|
(e.g. kind and k3s,) which might be able to be useful for our current testing
|
|
model, but hopefully, we could use this new implementation with other k8s
|
|
systems for more flexible distribute test orchestration.
|
|
|
|
Improve Operationalize Experience of ``run-multiple.sh``
|
|
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
|
|
|
The e2e test runner currently runs a single test, and in most cases we manage
|
|
the test cases using a shell script that ensure cleanup of entire test
|
|
suites. This is a bit difficult to maintain and makes reproduction of test
|
|
cases more awkward than it should be. The e2e ``runner`` itself should provide
|
|
equivalent functionality to ``run-multiple.sh``: ensure cleanup of test cases,
|
|
collect and process output, and be able to manage entire suites of cases.
|
|
|
|
It might also be useful to implement an e2e test orchestrator that runs all
|
|
tendermint instances in a single process, using "real" networks for faster
|
|
feedback and iteration during development.
|
|
|
|
In addition to being a bit easier to maintain, having a more capable runner
|
|
implementation would make it easier to collect data from test runs, improve
|
|
debugability and reporting.
|
|
|
|
Fan-Out For CI E2E Tests
|
|
++++++++++++++++++++++++
|
|
|
|
While there are some parallelism in the execution of e2e tests, each e2e test
|
|
job must build a tendermint e2e image, which takes about 5 minutes of CPU time
|
|
per-task, which given the size of each of the runs.
|
|
|
|
We'd like to be able to reduce the amount of overhead per-e2e tests while
|
|
keeping the cycle time for working with the tests very low, while also
|
|
maintaining a reasonable level of test coverage. This is an impossible
|
|
tradeoff, in some ways, and the percentage of overhead at the moment is large
|
|
enough that we can make some material progress with a moderate amount of time.
|
|
|
|
Most of this work has to do with modifying github actions configuration and
|
|
e2e artifact (docker) building to reduce redundant work. Eventually, when we
|
|
can drop the requirement for CGo storage engines, it will be possible to move
|
|
(cross) compile tendermint locally, and then inject the binary into the docker
|
|
container, which would reduce a lot of the build-time complexity, although we
|
|
can move more in this direction or have runtime flags to disable CGo
|
|
dependencies for local development.
|
|
|
|
Remove Panics
|
|
~~~~~~~~~~~~~
|
|
|
|
There are lots of places in the code base which can panic, and would not be
|
|
particularly well handled. While in some cases, panics are the right answer,
|
|
in many cases the panics were just added to simplify downstream error
|
|
checking, and could easily be converted to errors.
|
|
|
|
The `Don't Panic RFC
|
|
<./rfc-008-do-not-panic.md>`_
|
|
covers some of the background and approach.
|
|
|
|
While the changes are in this project are relatively rote, this will provide
|
|
exposure to lots of different areas of the codebase as well as insight into
|
|
how different areas of the codebase interact with eachother, as well as
|
|
experience with the test suites and infrastructure.
|
|
|
|
Implement more Expressive ABCI Applications
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Tendermint maintains two very simple ABCI applications (a KV application used
|
|
for basic testing, and slightly more advanced test application used in the
|
|
end-to-end tests). Writing an application would provide a new engineer with
|
|
useful experiences using Tendermint that mirrors the expierence of downstream
|
|
users.
|
|
|
|
This is more of an exploratory project, but could include providing common
|
|
interfaces on top of Tendermint consensus for other well known protocols or
|
|
tools (e.g. ``etcd``) or a DNS server or some other tool.
|
|
|
|
Self-Regulating Reactors
|
|
~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Currently reactors (the internal processes that are responsible for the higher
|
|
level behavior of Tendermint) can be started and stopped, but have no
|
|
provision for being paused. These additional semantics may allow Tendermint to
|
|
pause reactors (and avoid processing their messhages, etc.) and allow better
|
|
coordination in the future.
|
|
|
|
While this is a big project, it's possible to break this apart into many
|
|
smaller projects: make p2p channels pauseable, add pause/UN-pause hooks to the
|
|
service implementation and machinery, and finally to modify the reactor
|
|
implementations to take advantage of these additional semantics
|
|
|
|
This project would give an engineer some exposure to the p2p layer of the
|
|
code, as well as to various aspects of the reactor implementations.
|
|
|
|
Metrics
|
|
~~~~~~~
|
|
|
|
Tendermint has a metrics system that is relatively underutilized, and figuring
|
|
out ways to capture and organize the metrics to provide value to users might
|
|
provide an interesting set of projects for new engineers on Tendermint.
|
|
|
|
Convert Logs to Metrics
|
|
+++++++++++++++++++++++
|
|
|
|
Because the tendermint logs tend to be quite verbose and not particularly
|
|
actionable, most users largely ignore the logging or run at very low
|
|
verbosity. While the log statements in the code do describe useful events,
|
|
taken as a whole the system is not particularly tractable, and particularly at
|
|
the Debug level, not useful. One solution to this problem is to identify log
|
|
messages that might be (e.g. increment a counter for certian kinds of errors)
|
|
|
|
One approach might be to look at various logging statements, particularly
|
|
debug statements or errors that are logged but not returned, and see if
|
|
they're convertable to counters or other metrics.
|
|
|
|
Expose Metrics to Tests
|
|
+++++++++++++++++++++++
|
|
|
|
The existing Tendermint test suites replace the metrics infrastructure with
|
|
no-op implementations, which means that tests can neither verify that metrics
|
|
are ever recorded, nor can tests use metrics to observe events in the
|
|
system. Writing an implementation, for testing, that makes it possible to
|
|
record metrics and provides an API for introspecting this data, as well as
|
|
potentially writing tests that take advantage of this type, could be useful.
|
|
|
|
Logging Metrics
|
|
+++++++++++++++
|
|
|
|
In some systems, the logging system itself can provide some interesting
|
|
insights for operators: having metrics that track the number of messages at
|
|
different levels as well as the total number of messages, can act as a canary
|
|
for the system as a whole.
|
|
|
|
This should be achievable by adding an interceptor layer within the logging
|
|
package itself that can add metrics to the existing system.
|