rfc: onboarding projects (#8413)

This is meant as a supporting recruiting document. The idea is to describe a bunch of projects scoped and selected as teaching projects for new engineers joining the team. This isn't meant to replace "neweng" or "good-first-ticket" tags on issues, but provide a higher level set of examples of the kinds of things that someone joining the team could tackle.
2026-01-09 06:33:16 +00:00 · 2022-05-16 10:15:06 -04:00
parent fb7229135a
commit 7f79661c2e
2 changed files with 241 additions and 0 deletions
--- a/docs/rfc/README.md
+++ b/docs/rfc/README.md
@@ -57,5 +57,6 @@ sections.
 - [RFC-017: ABCI++ Vote Extension Propagation](./rfc-017-abci++-vote-extension-propag.md)
 - [RFC-018: BLS Signature Aggregation Exploration](./rfc-018-bls-agg-exploration.md)
 - [RFC-019: Configuration File Versioning](./rfc-019-config-version.md)
+- [RFC-020: Onboarding Projects](./rfc-020-onboarding-projects.rst)

 <!-- - [RFC-NNN: Title](./rfc-NNN-title.md) -->
--- a/docs/rfc/rfc-020-onboarding-projects.rst
+++ b/docs/rfc/rfc-020-onboarding-projects.rst
@@ -0,0 +1,240 @@
+=======================================
+RFC 020: Tendermint Onboarding Projects
+=======================================
+
+.. contents::
+   :backlinks: none
+
+Changelog
+---------
+
+- 2022-03-30: Initial draft. (@tychoish)
+- 2022-04-25: Imported document to tendermint repository. (@tychoish)
+
+Overview
+--------
+
+This document describes a collection of projects that might be good for new
+engineers joining the Tendermint Core team. These projects mostly describe
+features that we'd be very excited to see land in the code base, but that are
+intentionally outside of the critical path of a release on the roadmap, and
+have the following properties that we think make good on-boarding projects:
+
+- require relatively little context for the project or its history beyond a
+  more isolated area of the code.
+
+- provide exposure to different areas of the codebase, so new team members
+  will have reason to explore the code base, build relationships with people
+  on the team, and gain experience with more than one area of the system.
+
+- are of moderate size, striking a healthy balance between trivial or
+  mechanical changes (which provide little insight) and large intractable
+  changes that require deeper insight than is available during onboarding to
+  address well. A good-sized project should have natural touchpoints or
+  check-ins.
+
+Projects
+--------
+
+Before diving into one of these projects, have a conversation with your
+onboarding buddy about the project or the aspects of Tendermint that you're
+excited to work on. This will help make sure that these issues are still
+relevant, help you get any context and understand known pitfalls, and
+confirm a high level approach or design (if relevant). On-boarding buddies
+should be prepared to do some design work before someone joins the team.
+
+The descriptions that follow provide some basic background and attempt to
+describe the user stories and the potential impact of these projects.
+
+E2E Test Systems
+~~~~~~~~~~~~~~~~
+
+Tendermint's E2E framework makes it possible to run small test networks with
+different Tendermint configurations, and make sure that the system works. The
+tests run Tendermint in a separate binary, and the system provides some very
+high level protection against making changes that could break Tendermint in
+otherwise difficult to detect ways.
+
+Working on the E2E system is a good place to get introduced to the Tendermint
+codebase, particularly for developers who are newer to Go, as the E2E
+system (generator, runner, etc.) is distinct from the rest of Tendermint and
+comparatively quite small, so it may be easier to begin making changes in this
+area. At the same time, because the E2E system exercises *all* of Tendermint,
+work in this area is a good way to get introduced to various components of the
+system.
+
+Configurable E2E Workloads
++++++++++++++++++++++++++
+
+All E2E tests use the same workload (e.g. generated transactions, submitted to
+different nodes in the network), which has been tuned empirically to provide a
+gentle but consistent parallel load that all E2E tests can pass. Ideally, the
+workload generator could be configurable to have different shapes of work
+(bursty, different transaction sizes, weighted to different nodes, etc.) and
+even perhaps further parameterized within a basic shape, which would make it
+possible to use our existing test infrastructure to answer different questions
+about the performance or capability of the system.
+
+The work would involve adding a new parameter to the E2E test manifest,
+creating an option (e.g. "legacy") for the current load generation model,
+extracting configuration options for the current load generation,
+prototyping implementations of alternate load generation, and running some
+preliminary experiments using the tools.
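+
+As a rough illustration of the idea, the sketch below models a manifest-level
+load option with two shapes. Every name here (``LoadShape``, ``LoadConfig``,
+``TxsForTick``, the ``bursty`` value) is hypothetical and is not part of the
+existing E2E manifest or generator; only the ``legacy`` option name comes from
+the text above.
+
```go
package main

import "fmt"

// LoadShape is a hypothetical manifest value selecting a workload model.
type LoadShape string

const (
	ShapeLegacy LoadShape = "legacy" // constant rate, standing in for today's generator
	ShapeBursty LoadShape = "bursty" // idle most ticks, then one large batch
)

// LoadConfig is a hypothetical set of load options read from a manifest.
type LoadConfig struct {
	Shape      LoadShape
	Rate       int // average transactions per tick
	BurstEvery int // bursty only: ticks between bursts
}

// TxsForTick returns how many transactions to submit on the given tick.
// Both shapes produce the same average rate; only the distribution differs.
func (c LoadConfig) TxsForTick(tick int) (int, error) {
	switch c.Shape {
	case "", ShapeLegacy:
		return c.Rate, nil
	case ShapeBursty:
		if c.BurstEvery <= 0 {
			return 0, fmt.Errorf("bursty load requires BurstEvery > 0")
		}
		if tick%c.BurstEvery == 0 {
			return c.Rate * c.BurstEvery, nil
		}
		return 0, nil
	default:
		return 0, fmt.Errorf("unknown load shape %q", c.Shape)
	}
}

func main() {
	cfg := LoadConfig{Shape: ShapeBursty, Rate: 10, BurstEvery: 4}
	for tick := 0; tick < 8; tick++ {
		n, err := cfg.TxsForTick(tick)
		if err != nil {
			panic(err)
		}
		fmt.Println("tick", tick, "txs", n)
	}
}
```
+
+Treating an empty shape as ``legacy`` is one way to keep existing manifests
+working unchanged while new shapes are prototyped.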
+
+Byzantine E2E Workloads
+++++++++++++++++++++++
+
+There are two main kinds of integration tests in Tendermint: the E2E test
+framework, and then a collection of integration tests that masquerade as
+unit-tests. While some of this expansion of test scope is (potentially)
+inevitable, the masquerading unit tests (e.g. ``consensus.byzantine_test.go``)
+end up being difficult to understand, difficult to maintain, and unreliable.
+
+One solution to this would be to modify the E2E ABCI application to allow it
+to inject byzantine behavior, and then make this a configurable aspect of
+a test network, so that tests can provoke Byzantine behavior in a "real"
+system and then observe that evidence is constructed. This would make it
+possible to remove the legacy tests entirely once the new tests have proven
+themselves.
+
+Abstract Orchestration Framework
++++++++++++++++++++++++++++++++
+
+The orchestration of e2e test processes is presently done using docker
+compose, which works well, but has proven a bit limiting as all processes need
+to run on a single machine, and the log aggregation functions are confusing at
+best.
+
+This project would replace the current orchestration with something more
+generic, potentially maintaining the current system, but also allowing the e2e
+tests to manage processes using k8s. There are a few "local" k8s frameworks
+(e.g. kind and k3s), which might be useful for our current testing
+model, but hopefully we could also use this new implementation with other k8s
+systems for more flexible distributed test orchestration.
+
+Improve Operational Experience of ``run-multiple.sh``
++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+The e2e test runner currently runs a single test, and in most cases we manage
+the test cases using a shell script that ensures cleanup of entire test
+suites. This is a bit difficult to maintain and makes reproduction of test
+cases more awkward than it should be. The e2e ``runner`` itself should provide
+equivalent functionality to ``run-multiple.sh``: ensure cleanup of test cases,
+collect and process output, and be able to manage entire suites of cases.
+
+It might also be useful to implement an e2e test orchestrator that runs all
+tendermint instances in a single process, using "real" networks for faster
+feedback and iteration during development.
+
+In addition to being a bit easier to maintain, a more capable runner
+implementation would make it easier to collect data from test runs, and
+would improve debuggability and reporting.
+
+Fan-Out For CI E2E Tests
++++++++++++++++++++++++
+
+While there is some parallelism in the execution of e2e tests, each e2e test
+job must build a tendermint e2e image, which takes about 5 minutes of CPU time
+per task, which is significant overhead given the size of each of the runs.
+
+We'd like to be able to reduce the amount of overhead per e2e test while
+keeping the cycle time for working with the tests very low, while also
+maintaining a reasonable level of test coverage.  This is an impossible
+tradeoff, in some ways, but the percentage of overhead at the moment is large
+enough that we can make some material progress with a moderate amount of time.
+
+Most of this work has to do with modifying github actions configuration and
+e2e artifact (docker) building to reduce redundant work. Eventually, when we
+can drop the requirement for CGo storage engines, it will be possible to
+(cross) compile tendermint locally, and then inject the binary into the docker
+container, which would reduce a lot of the build-time complexity. In the
+meantime, we can move more in this direction or have runtime flags to disable
+CGo dependencies for local development.
+
+Remove Panics
+~~~~~~~~~~~~~
+
+There are lots of places in the code base that can panic, and where the
+panic would not be particularly well handled. While in some cases panics are
+the right answer, in many cases the panics were just added to simplify
+downstream error checking, and could easily be converted to errors.
+
+The `Don't Panic RFC
+<https://github.com/tendermint/tendermint/blob/master/docs/rfc/rfc-008-do-not-panic.MD>`_
+covers some of the background and approach.
+
+While the changes in this project are relatively rote, this will provide
+exposure to lots of different areas of the codebase, insight into how
+different areas of the codebase interact with each other, and experience
+with the test suites and infrastructure.
+
+Implement more Expressive ABCI Applications
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Tendermint maintains two very simple ABCI applications (a KV application used
+for basic testing, and a slightly more advanced test application used in the
+end-to-end tests). Writing an application would provide a new engineer with
+useful experience using Tendermint that mirrors the experience of downstream
+users.
+
+This is more of an exploratory project, but could include providing common
+interfaces on top of Tendermint consensus for other well known protocols or
+tools (e.g. ``etcd``) or a DNS server or some other tool.
+
+Self-Regulating Reactors
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Currently reactors (the internal processes that are responsible for the higher
+level behavior of Tendermint) can be started and stopped, but have no
+provision for being paused. These additional semantics may allow Tendermint to
+pause reactors (and avoid processing their messages, etc.) and allow better
+coordination in the future.
+
+While this is a big project, it's possible to break this apart into many
+smaller projects: make p2p channels pauseable, add pause/unpause hooks to the
+service implementation and machinery, and finally modify the reactor
+implementations to take advantage of these additional semantics.
+
+This project would give an engineer some exposure to the p2p layer of the
+code, as well as to various aspects of the reactor implementations.
+
+Metrics
+~~~~~~~
+
+Tendermint has a metrics system that is relatively underutilized, and figuring
+out ways to capture and organize the metrics to provide value to users might
+provide an interesting set of projects for new engineers on Tendermint.
+
+Convert Logs to Metrics
+++++++++++++++++++++++
+
+Because the tendermint logs tend to be quite verbose and not particularly
+actionable, most users largely ignore the logging or run at very low
+verbosity. While the log statements in the code do describe useful events,
+taken as a whole the system is not particularly tractable, and particularly at
+the Debug level, not useful. One solution to this problem is to identify log
+messages that might be better expressed as metrics (e.g. increment a counter
+for certain kinds of errors).
+
+One approach might be to look at various logging statements, particularly
+debug statements or errors that are logged but not returned, and see if
+they're convertible to counters or other metrics.
+
+Expose Metrics to Tests
+++++++++++++++++++++++
+
+The existing Tendermint test suites replace the metrics infrastructure with
+no-op implementations, which means that tests can neither verify that metrics
+are ever recorded, nor can tests use metrics to observe events in the
+system. Writing an implementation, for testing, that makes it possible to
+record metrics and provides an API for introspecting this data, as well as
+potentially writing tests that take advantage of this type, could be useful.
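+
+A recording implementation might look roughly like the sketch below. The
+``Counter`` interface is a simplified, locally defined stand-in for whatever
+counter interface the code under test depends on, and ``Recorder``,
+``Total``, and ``handleMessages`` are hypothetical names used only for
+illustration.
+
```go
package main

import (
	"fmt"
	"sync"
)

// Counter is a simplified stand-in for the counter interface used by the
// code under test. It is defined locally so the sketch is self-contained.
type Counter interface {
	Add(delta float64)
}

// Recorder is a hypothetical test helper that hands out named counters and
// remembers their totals so tests can inspect them afterwards.
type Recorder struct {
	mu     sync.Mutex
	totals map[string]float64
}

// NewRecorder returns an empty Recorder.
func NewRecorder() *Recorder {
	return &Recorder{totals: make(map[string]float64)}
}

type recordingCounter struct {
	rec  *Recorder
	name string
}

// Add accumulates delta under the counter's name.
func (c recordingCounter) Add(delta float64) {
	c.rec.mu.Lock()
	defer c.rec.mu.Unlock()
	c.rec.totals[c.name] += delta
}

// Counter returns a Counter whose updates are recorded under name.
func (r *Recorder) Counter(name string) Counter {
	return recordingCounter{rec: r, name: name}
}

// Total reports the accumulated value for name (zero if never used).
func (r *Recorder) Total(name string) float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.totals[name]
}

// handleMessages stands in for code under test: it counts empty messages
// as rejected instead of only logging them.
func handleMessages(msgs []string, rejected Counter) {
	for _, m := range msgs {
		if m == "" {
			rejected.Add(1)
		}
	}
}

func main() {
	rec := NewRecorder()
	handleMessages([]string{"a", "", "b", ""}, rec.Counter("rejected_messages"))
	fmt.Println("rejected:", rec.Total("rejected_messages"))
}
```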
+
+Logging Metrics
+++++++++++++++
+
+In some systems, the logging system itself can provide some interesting
+insights for operators: metrics that track the number of messages at
+different levels, as well as the total number of messages, can act as a canary
+for the system as a whole.
+
+This should be achievable by adding an interceptor layer within the logging
+package itself that can add metrics to the existing system.
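+
+One possible shape for such an interceptor is a wrapper that counts each
+message by level before delegating to the real logger. The ``Logger``
+interface below is a reduced, locally defined assumption rather than the
+actual interface of the logging package, and ``CountingLogger`` is a
+hypothetical name; a real version would increment metrics counters instead
+of an in-memory map.
+
```go
package main

import (
	"fmt"
	"sync"
)

// Logger is a reduced, hypothetical logging interface used for this sketch.
type Logger interface {
	Debug(msg string, keyvals ...interface{})
	Info(msg string, keyvals ...interface{})
	Error(msg string, keyvals ...interface{})
}

// nopLogger discards every message.
type nopLogger struct{}

func (nopLogger) Debug(string, ...interface{}) {}
func (nopLogger) Info(string, ...interface{})  {}
func (nopLogger) Error(string, ...interface{}) {}

// CountingLogger wraps another Logger and counts messages per level.
type CountingLogger struct {
	next   Logger
	mu     sync.Mutex
	counts map[string]int
}

// NewCountingLogger returns a CountingLogger that delegates to next.
func NewCountingLogger(next Logger) *CountingLogger {
	return &CountingLogger{next: next, counts: make(map[string]int)}
}

func (l *CountingLogger) record(level string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.counts[level]++
}

func (l *CountingLogger) Debug(msg string, keyvals ...interface{}) {
	l.record("debug")
	l.next.Debug(msg, keyvals...)
}

func (l *CountingLogger) Info(msg string, keyvals ...interface{}) {
	l.record("info")
	l.next.Info(msg, keyvals...)
}

func (l *CountingLogger) Error(msg string, keyvals ...interface{}) {
	l.record("error")
	l.next.Error(msg, keyvals...)
}

// Count reports how many messages were logged at the given level.
func (l *CountingLogger) Count(level string) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.counts[level]
}

// Total reports how many messages were logged across all levels.
func (l *CountingLogger) Total() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	total := 0
	for _, n := range l.counts {
		total += n
	}
	return total
}

func main() {
	logger := NewCountingLogger(nopLogger{})
	logger.Info("starting")
	logger.Error("failed to dial peer")
	logger.Error("failed to dial peer")
	fmt.Println("errors:", logger.Count("error"), "total:", logger.Total())
}
```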