mirror of
https://github.com/tendermint/tendermint.git
synced 2026-01-05 04:55:18 +00:00
rfc: onboarding projects (#8413)
This is meant as a supporting recruiting document. The idea is to describe a bunch of projects scoped and selected as teaching projects for new engineers joining the team. This isn't meant to replace "neweng" or "good-first-ticket" tags on issues, but provide a higher level set of examples of the kinds of things that someone joining the team could tackle.
@@ -57,5 +57,6 @@ sections.
- [RFC-017: ABCI++ Vote Extension Propagation](./rfc-017-abci++-vote-extension-propag.md)
- [RFC-018: BLS Signature Aggregation Exploration](./rfc-018-bls-agg-exploration.md)
- [RFC-019: Configuration File Versioning](./rfc-019-config-version.md)
- [RFC-020: Onboarding Projects](./rfc-020-onboarding-projects.rst)

<!-- - [RFC-NNN: Title](./rfc-NNN-title.md) -->
240
docs/rfc/rfc-020-onboarding-projects.rst
Normal file
@@ -0,0 +1,240 @@
=======================================
RFC 020: Tendermint Onboarding Projects
=======================================

.. contents::
   :backlinks: none
|
||||
|
||||
Changelog
---------

- 2022-03-30: Initial draft. (@tychoish)
- 2022-04-25: Imported document to tendermint repository. (@tychoish)
|
||||
|
||||
Overview
--------

This document describes a collection of projects that might be good for new
engineers joining the Tendermint Core team. These projects mostly describe
features that we'd be very excited to see land in the codebase, but that are
intentionally outside of the critical path of a release on the roadmap. They
share the following properties, which we think make for good onboarding
projects:

- They require relatively little context for the project or its history
  beyond a more isolated area of the code.

- They provide exposure to different areas of the codebase, so new team
  members will have reason to explore the codebase, build relationships with
  people on the team, and gain experience with more than one area of the
  system.

- They are of moderate size, striking a healthy balance between trivial or
  mechanical changes (which provide little insight) and large intractable
  changes that require deeper insight than is available during onboarding to
  address well. A good size project should have natural touchpoints or
  check-ins.
|
||||
|
||||
Projects
--------

Before diving into one of these projects, have a conversation with your
onboarding buddy about the project or the aspects of Tendermint that you're
excited to work on. This will help make sure that these issues are still
relevant, give you any missing context, surface known pitfalls, and confirm a
high-level approach or design (if relevant). Onboarding buddies should be
prepared to do some design work before someone joins the team.

The descriptions that follow provide some basic background and attempt to
describe the user stories and the potential impact of these projects.
|
||||
|
||||
E2E Test Systems
~~~~~~~~~~~~~~~~

Tendermint's E2E framework makes it possible to run small test networks with
different Tendermint configurations, and make sure that the system works. The
tests run Tendermint in a separate binary, and the system provides some very
high level protection against making changes that could break Tendermint in
otherwise difficult to detect ways.

Working on the E2E system is a good place to get introduced to the Tendermint
codebase, particularly for developers who are newer to Go, as the E2E
system (generator, runner, etc.) is distinct from the rest of Tendermint and
comparatively quite small, so it may be easier to begin making changes in this
area. At the same time, because the E2E system exercises *all* of Tendermint,
work in this area is a good way to get introduced to various components of the
system.
|
||||
|
||||
Configurable E2E Workloads
++++++++++++++++++++++++++

All E2E tests use the same workload (e.g. generated transactions, submitted to
different nodes in the network), which has been tuned empirically to provide a
gentle but consistent parallel load that all E2E tests can pass. Ideally, the
workload generator could be configurable to have different shapes of work
(bursty, different transaction sizes, weighted to different nodes, etc.) and
even perhaps further parameterized within a basic shape, which would make it
possible to use our existing test infrastructure to answer different questions
about the performance or capability of the system.

The work would involve adding a new parameter to the E2E test manifest,
creating an option (e.g. "legacy") for the current load generation model,
extracting configuration options for the current load generation, prototyping
implementations of alternate load generation, and running some preliminary
experiments using the tools.
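As a rough illustration of what a shape parameter could look like, the sketch
below models a "legacy" constant-rate shape next to a "bursty" one. Every name
here (``LoadShape``, ``LoadConfig``, ``TxForTick``, the field names) is
hypothetical and is not taken from the existing E2E manifest or generator.

```go
package main

import "fmt"

// LoadShape is a hypothetical manifest value selecting a load generation model.
type LoadShape string

const (
	ShapeLegacy LoadShape = "legacy" // constant rate on every tick
	ShapeBursty LoadShape = "bursty" // idle ticks followed by one large batch
)

// LoadConfig is a hypothetical set of options extracted from the generator.
type LoadConfig struct {
	Shape      LoadShape
	TxPerTick  int // average number of transactions per tick
	BurstEvery int // bursty only: submit a batch on every Nth tick
}

// TxForTick returns how many transactions to submit on the given tick.
// An empty shape is treated as "legacy" so existing manifests keep working.
func (c LoadConfig) TxForTick(tick int) (int, error) {
	switch c.Shape {
	case ShapeLegacy, "":
		return c.TxPerTick, nil
	case ShapeBursty:
		if c.BurstEvery <= 0 {
			return 0, fmt.Errorf("bursty load requires BurstEvery > 0, got %d", c.BurstEvery)
		}
		if tick%c.BurstEvery == 0 {
			// Same average rate as legacy, concentrated into one tick.
			return c.TxPerTick * c.BurstEvery, nil
		}
		return 0, nil
	default:
		return 0, fmt.Errorf("unknown load shape %q", c.Shape)
	}
}

func main() {
	cfg := LoadConfig{Shape: ShapeBursty, TxPerTick: 10, BurstEvery: 4}
	for tick := 0; tick < 8; tick++ {
		n, err := cfg.TxForTick(tick)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("tick %d: %d txs\n", tick, n)
	}
}
```

Keeping the average rate the same across shapes is one possible design choice:
it would let two runs differ only in how the load is distributed over time.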
|
||||
|
||||
Byzantine E2E Workloads
+++++++++++++++++++++++

There are two main kinds of integration tests in Tendermint: the E2E test
framework, and then a collection of integration tests that masquerade as
unit tests. While some of this expansion of test scope is (potentially)
inevitable, the masquerading unit tests (e.g. ``consensus.byzantine_test.go``)
end up being difficult to understand, difficult to maintain, and unreliable.

One solution to this would be to modify the E2E ABCI application to allow it
to inject Byzantine behavior, and to make this a configurable aspect of a test
network, so that we can provoke Byzantine behavior in a "real" system and
then observe that evidence is constructed. This would make it possible to
remove the legacy tests entirely once the new tests have proven themselves.
|
||||
|
||||
Abstract Orchestration Framework
++++++++++++++++++++++++++++++++

The orchestration of e2e test processes is presently done using docker
compose, which works well, but has proven a bit limiting as all processes need
to run on a single machine, and the log aggregation functions are confusing at
best.

This project would replace the current orchestration with something more
generic, potentially maintaining the current system, but also allowing the e2e
tests to manage processes using k8s. There are a few "local" k8s frameworks
(e.g. kind and k3s), which might be useful for our current testing model, but
hopefully we could use this new implementation with other k8s systems for
more flexible distributed test orchestration.
|
||||
|
||||
Improve Operational Experience of ``run-multiple.sh``
+++++++++++++++++++++++++++++++++++++++++++++++++++++

The e2e test runner currently runs a single test, and in most cases we manage
the test cases using a shell script that ensures cleanup of entire test
suites. This is a bit difficult to maintain and makes reproduction of test
cases more awkward than it should be. The e2e ``runner`` itself should provide
equivalent functionality to ``run-multiple.sh``: ensure cleanup of test cases,
collect and process output, and be able to manage entire suites of cases.

It might also be useful to implement an e2e test orchestrator that runs all
tendermint instances in a single process, using "real" networks for faster
feedback and iteration during development.

In addition to being a bit easier to maintain, a more capable runner
implementation would make it easier to collect data from test runs and would
improve debuggability and reporting.
|
||||
|
||||
Fan-Out For CI E2E Tests
++++++++++++++++++++++++

While there is some parallelism in the execution of e2e tests, each e2e test
job must build a tendermint e2e image, which takes about 5 minutes of CPU time
per task, a significant overhead given the size of each of the runs.

We'd like to reduce the amount of overhead per e2e test while keeping the
cycle time for working with the tests very low and maintaining a reasonable
level of test coverage. This is an impossible tradeoff, in some ways, but the
percentage of overhead at the moment is large enough that we can make some
material progress with a moderate amount of time.

Most of this work has to do with modifying GitHub Actions configuration and
e2e artifact (docker) building to reduce redundant work. Eventually, when we
can drop the requirement for CGo storage engines, it will be possible to
(cross) compile tendermint locally and then inject the binary into the docker
container, which would reduce a lot of the build-time complexity. In the
meantime, we can move more in this direction or have runtime flags to disable
CGo dependencies for local development.
|
||||
|
||||
Remove Panics
~~~~~~~~~~~~~

There are lots of places in the codebase where the code can panic, and where
those panics would not be handled particularly well. While in some cases
panics are the right answer, in many cases the panics were just added to
simplify downstream error checking, and could easily be converted to errors.

The `Don't Panic RFC
<https://github.com/tendermint/tendermint/blob/master/docs/rfc/rfc-008-do-not-panic.MD>`_
covers some of the background and approach.

While the changes in this project are relatively rote, this will provide
exposure to lots of different areas of the codebase as well as insight into
how different areas of the codebase interact with each other, as well as
experience with the test suites and infrastructure.
|
||||
|
||||
Implement More Expressive ABCI Applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tendermint maintains two very simple ABCI applications (a KV application used
for basic testing, and a slightly more advanced test application used in the
end-to-end tests). Writing an application would provide a new engineer with
useful experience using Tendermint that mirrors the experience of downstream
users.

This is more of an exploratory project, but could include providing common
interfaces on top of Tendermint consensus for other well known protocols or
tools (e.g. ``etcd``) or a DNS server or some other tool.
|
||||
|
||||
Self-Regulating Reactors
~~~~~~~~~~~~~~~~~~~~~~~~

Currently reactors (the internal processes that are responsible for the higher
level behavior of Tendermint) can be started and stopped, but have no
provision for being paused. These additional semantics may allow Tendermint to
pause reactors (and avoid processing their messages, etc.) and allow better
coordination in the future.

While this is a big project, it's possible to break this apart into many
smaller projects: make p2p channels pauseable, add pause/unpause hooks to the
service implementation and machinery, and finally modify the reactor
implementations to take advantage of these additional semantics.

This project would give an engineer some exposure to the p2p layer of the
code, as well as to various aspects of the reactor implementations.
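To make the pause semantics concrete, here is a minimal sketch of a component
that can be paused and resumed. It assumes that messages arriving while paused
are dropped and counted, which is only one possible policy (buffering is
another). ``Pausable`` and ``gate`` are hypothetical names, not part of the
existing service machinery.

```go
package main

import (
	"fmt"
	"sync"
)

// Pausable is a hypothetical set of hooks that a service could expose in
// addition to start and stop.
type Pausable interface {
	Pause()
	Unpause()
	IsPaused() bool
}

// gate is a toy message handler that skips work while paused.
type gate struct {
	mu      sync.Mutex
	paused  bool
	handled int
	dropped int
}

var _ Pausable = (*gate)(nil)

func (g *gate) Pause() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.paused = true
}

func (g *gate) Unpause() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.paused = false
}

func (g *gate) IsPaused() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.paused
}

// Handle processes a message unless the gate is paused, and reports whether
// the message was processed.
func (g *gate) Handle(msg string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.paused {
		g.dropped++
		return false
	}
	g.handled++
	return true
}

func main() {
	g := &gate{}
	fmt.Println(g.Handle("vote"))
	g.Pause()
	fmt.Println(g.Handle("vote"))
	g.Unpause()
	fmt.Println(g.Handle("vote"))
}
```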
|
||||
|
||||
Metrics
~~~~~~~

Tendermint has a metrics system that is relatively underutilized, and figuring
out ways to capture and organize the metrics to provide value to users might
provide an interesting set of projects for new engineers on Tendermint.
|
||||
|
||||
Convert Logs to Metrics
+++++++++++++++++++++++

Because the tendermint logs tend to be quite verbose and not particularly
actionable, most users largely ignore the logging or run at very low
verbosity. While the log statements in the code do describe useful events,
taken as a whole the system is not particularly tractable, and particularly at
the Debug level, not useful. One solution to this problem is to identify log
messages that might be better expressed as metrics (e.g. incrementing a
counter for certain kinds of errors).

One approach might be to look at various logging statements, particularly
debug statements or errors that are logged but not returned, and see if
they're convertible to counters or other metrics.
|
||||
|
||||
Expose Metrics to Tests
+++++++++++++++++++++++

The existing Tendermint test suites replace the metrics infrastructure with
no-op implementations, which means that tests can neither verify that metrics
are ever recorded nor use metrics to observe events in the system. Writing an
implementation, for testing, that makes it possible to record metrics and
provides an API for introspecting this data, as well as potentially writing
tests that take advantage of this type, could be useful.
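A recording implementation could be as small as the sketch below, which
contrasts a no-op counter with one that remembers what it was given. The
``Counter`` interface here is a deliberately tiny, hypothetical one rather than
the interface of the metrics library that Tendermint uses, and ``mempool`` is
a toy component invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Counter is a hypothetical, minimal metrics interface.
type Counter interface {
	Add(delta float64)
}

// nopCounter discards everything, like the implementations used in tests today.
type nopCounter struct{}

func (nopCounter) Add(float64) {}

// RecordingCounter keeps what it receives so that tests can inspect it.
type RecordingCounter struct {
	mu    sync.Mutex
	total float64
	calls int
}

func (r *RecordingCounter) Add(delta float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.total += delta
	r.calls++
}

// Total returns the sum of all recorded deltas.
func (r *RecordingCounter) Total() float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.total
}

// Calls returns how many times Add was invoked.
func (r *RecordingCounter) Calls() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.calls
}

// mempool is a toy component that reports rejected transactions.
type mempool struct {
	rejected Counter
}

// CheckTx rejects empty transactions and records each rejection.
func (m *mempool) CheckTx(tx []byte) bool {
	if len(tx) == 0 {
		m.rejected.Add(1)
		return false
	}
	return true
}

func main() {
	quiet := &mempool{rejected: nopCounter{}}
	quiet.CheckTx(nil) // nothing can be observed here

	rec := &RecordingCounter{}
	observed := &mempool{rejected: rec}
	observed.CheckTx(nil)
	observed.CheckTx([]byte("tx"))
	fmt.Println("rejections recorded:", rec.Total())
}
```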
|
||||
|
||||
Logging Metrics
+++++++++++++++

In some systems, the logging system itself can provide some interesting
insights for operators: metrics that track the number of messages at
different levels, as well as the total number of messages, can act as a canary
for the system as a whole.

This should be achievable by adding an interceptor layer within the logging
package itself that can add metrics to the existing system.
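The interceptor idea can be sketched as a wrapper that counts each call before
delegating to the wrapped logger. The ``Logger`` interface below is a
simplified, hypothetical one (it omits key/value pairs, for instance), and
``CountingLogger`` keeps its counts in a map rather than in a real metrics
backend.

```go
package main

import (
	"fmt"
	"sync"
)

// Logger is a simplified, hypothetical logging interface.
type Logger interface {
	Debug(msg string)
	Info(msg string)
	Error(msg string)
}

// discardLogger drops every message; it stands in for the real logger.
type discardLogger struct{}

func (discardLogger) Debug(string) {}
func (discardLogger) Info(string)  {}
func (discardLogger) Error(string) {}

// CountingLogger wraps another Logger and counts messages per level.
type CountingLogger struct {
	next   Logger
	mu     sync.Mutex
	counts map[string]int
}

func NewCountingLogger(next Logger) *CountingLogger {
	return &CountingLogger{next: next, counts: map[string]int{}}
}

// record bumps the per-level counter and the overall total.
func (l *CountingLogger) record(level string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.counts[level]++
	l.counts["total"]++
}

func (l *CountingLogger) Debug(msg string) { l.record("debug"); l.next.Debug(msg) }
func (l *CountingLogger) Info(msg string)  { l.record("info"); l.next.Info(msg) }
func (l *CountingLogger) Error(msg string) { l.record("error"); l.next.Error(msg) }

// Count returns the number of messages seen at a level, or overall for "total".
func (l *CountingLogger) Count(level string) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.counts[level]
}

func main() {
	var logger Logger = NewCountingLogger(discardLogger{})
	logger.Info("starting node")
	logger.Debug("received vote")
	logger.Error("peer disconnected")

	counted := logger.(*CountingLogger)
	fmt.Println("errors:", counted.Count("error"), "total:", counted.Count("total"))
}
```

Because the wrapper satisfies the same interface as the logger it wraps,
existing call sites would not need to change.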
|
||||