From ef314708e74c64bfc8fb40f677ed5493c67186c0 Mon Sep 17 00:00:00 2001
From: "M. J. Fromberger" <fromberger@interchain.io>
Date: Thu, 21 Apr 2022 10:08:47 -0700
Subject: [PATCH] RFC 019: Configuration File Versioning (#8379)

This RFC discusses issues in how we migrate configuration data across
Tendermint versions, and some options for how to improve the experience for
node operators in the future.
---
 docs/rfc/README.md                 |   1 +
 docs/rfc/rfc-019-config-version.md | 400 +++++++++++++++++++++++++++++
 2 files changed, 401 insertions(+)
 create mode 100644 docs/rfc/rfc-019-config-version.md

diff --git a/docs/rfc/README.md b/docs/rfc/README.md
index b5c42ea85..f2ad6ad69 100644
--- a/docs/rfc/README.md
+++ b/docs/rfc/README.md
@@ -53,5 +53,6 @@ sections.
 - [RFC-013: ABCI++](./rfc-013-abci++.md)
 - [RFC-014: Semantic Versioning](./rfc-014-semantic-versioning.md)
 - [RFC-015: ABCI++ Tx Mutation](./rfc-015-abci++-tx-mutation.md)
+- [RFC-019: Configuration File Versioning](./rfc-019-config-version.md)
 
 <!-- - [RFC-NNN: Title](./rfc-NNN-title.md) -->
diff --git a/docs/rfc/rfc-019-config-version.md b/docs/rfc/rfc-019-config-version.md
new file mode 100644
index 000000000..3dfd7840b
--- /dev/null
+++ b/docs/rfc/rfc-019-config-version.md
@@ -0,0 +1,400 @@
+# RFC 019: Configuration File Versioning
+
+## Changelog
+
+- 19-Apr-2022: Initial draft (@creachadair)
+- 20-Apr-2022: Updates from review feedback (@creachadair)
+
+## Abstract
+
+Updating configuration settings is an essential part of upgrading an existing
+node to a new version of the Tendermint software.  Unfortunately, it is also
+currently a very manual process. This document discusses some of the history of
+changes to the config format, actions we've taken to improve the tooling for
+configuration upgrades, and additional steps we may want to consider.
+
+## Background
+
+A Tendermint node reads configuration settings at startup from a TOML formatted
+text file, typically named `config.toml`. The contents of this file are defined
+by the [`github.com/tendermint/tendermint/config`][config-pkg].
+
+Although many settings in this file remain valid from one version of Tendermint
+to the next, new versions of Tendermint often add, update, and remove settings.
+These changes often require manual intervention by operators who are upgrading
+their nodes.
+
+I propose we should provide better tools and documentation to help operators
+make configuration changes correctly during version upgrades.  Ideally, as much
+as possible of any configuration file update should be automated, and where
+that is not possible or practical, we should provide clear, explicit directions
+for what steps need to be taken manually. Moreover, when the node discovers
+incorrect or invalid configuration, we should improve the diagnostics it emits
+so that the operator can quickly and easily find the relevant documentation,
+without having to grep through source code.
+
+## Discussion
+
+By convention, we are supposed to document required changes to the config file
+in the `UPGRADING.md` file for the release that introduces them.  Although we
+have mostly done this, the level of detail in the upgrading instructions is
+often insufficient for an operator to correctly update their file.
+
+The updates vary widely in complexity: Operators may need to add new required
+settings, update obsolete values for existing settings, move or rename existing
+settings within the file, or remove obsolete settings (which are thus invalid).
+Here are a few examples of each of these cases:
+
+- **New required settings:** Tendermint v0.35 added a new top-level `mode`
+  setting that determines whether a node runs as a validator, a full node, or a
+  seed node.  The default value is `"full"`, which means the operator of a
+  validator must manually add `mode = "validator"` (or set the `--mode` flag on
+  the command line) for their node to come up in the correct mode.
+
+- **Updated obsolete values:** Tendermint v0.35 removed support for versions
+  `"v1"` and `"v2"` of the blocksync (formerly "fastsync") protocol, requiring
+  any node using either of those values to update to `"v0"`.
+
+- **Moved/renamed settings:** Version v0.34 moved the top-level `pprof_laddr`
+  setting under the `[rpc]` section.
+
+  Version v0.35 renamed every setting in the file from `snake_case` to
+  `kebab-case`, moved the top-level `fast_sync` setting into the `[blocksync]`
+  section as (itself renamed from `[fastsync]`), and moved all the top-level
+  `priv-validator-*` settings under a new `[priv-validator]` section with their
+  prefix trimmed off.
+
+- **Removed obsolete settings:** Version v0.34 removed the `index_all_keys` and
+  `index_keys` settings from the `[tx_index]` section; version v0.35 removed
+  the `wal-dir` setting from the `[mempool]` section, and version v0.36 removed
+  the `[blocksync]` section entirely.
+
+While many of these changes are mentioned in the config section of the upgrade
+instructions, some are not mentioned at all, or are hidden in other parts of
+the doc. For instance, the v0.34 `pprof_laddr` change was documented only as an
+RPC flag change. (A savvy reader might realize that the flag `--rpc.pprof_laddr`
+implies a corresponding config section, but it omits the related detail that
+there was a top-level setting that's been renamed).  The lesson here is not
+that the docs are bad, but to point out that prose is not the most efficient
+format to convey detailed changes like this. The upgrading instructions are
+still valuable for the human reader to understand what to expect.
+
+### Concrete Steps
+
+As part of the v0.36 development cycle, we spent some time reverse-engineering
+the configuration changes since the v0.34 release and built an experimental
+command-line tool called [`confix`][confix], whose job it is to automatically
+update the settings in a `config.toml` file to the latest version.  We also
+backported a version of this tool into the v0.35.x branch at release v0.35.4.
+
+This tool should work fine for configuration files created by Tendermint v0.34
+and later, but does not (yet) know how to handle changes from prior versions of
+Tendermint. Part of the difficulty for older versions is simply logistical: To
+figure out which changes to apply, we need to understand something about the
+version that made the file, as well as the version we're converting it to.
+
+> **Discussion point:** In the future we might want to consider incorporating
+> this into the node CLI directly, but we're keeping it separate for now until
+> we can get some feedback from operators.
+
+For the experiment, we handled this by carefully searching the history of
+config format changes for shibboleths to bound the version: For example, the
+`[fastsync]` section was added in Tendermint v0.32 and renamed `[blocksync]` in
+Tendermint v0.35. So if we see a `[fastsync]` section, we have some confidence
+that the file was created by v0.32, v0.33, or v0.34.
+
+But such signals are delicate: The `[blocksync]` section was removed in v0.36,
+so if we do not find `[fastsync]`, we cannot conclude from that alone that the
+file is from v0.31 or earlier -- we have to look for corroborating details.
+While such "sniffing" tactics are fine for an experiment, they aren't as robust
+as we might like.
+
+This is especially relevant for configuration files that may have already been
+manually upgraded across several versions by the time we are asked to update
+them again.  Another related concern is that we'd like to make sure conversion
+is idempotent, so that it would be safe to rerun the tool over an
+already-converted file without breaking anything.
+
+### Config Versioning
+
+One obvious tactic we could use for future releases is add a version marker to
+the config file. This would give tools like `confix` (and the node itself) a
+way to calibrate their expectations. Rather than being a version for the file
+itself, however, this version marker would indicate which version of Tendermint
+is needed to read the file.
+
+Provisionally, this might look something like:
+
+```toml
+# THe minimum version of Tendermint compatible with the contents of
+# this configuration file.
+config-version = 'v0.35'
+```
+
+When initializing a new node, Tendermint would populate this field with its own
+version (e.g., `v0.36`). When conducting an upgrade, tools like `confix` can
+then use this to decide which conversions are valid, and then update the value
+accordingly. After converting a file marked `'v0.35'` to`'v0.37'`, the
+conversion tool sets the file's `config-version` to reflect its compatibility.
+
+> **Discussion point:** This example presumes we would keep config files
+> compatible within a given release cycle, e.g., all of v0.36.x. We could also
+> use patch numbers here, if we think there's some reason to permit changes
+> that would require config file edits at that granularity. I don't think we
+> should, but that's a design question to consider.
+
+Upon seeing an up-to-date version marker, the conversion tool can simply exit
+with a diagnostic like "this file is already up-to-date", rather than sniffing
+the keyspace and potentially introducing errors. In addition, this would let a
+tool detect config files that are _newer_ than the one it understands, and
+issue a safe diagnostic rather than doing something wrong.  Plus, besides
+avoiding potentially unsafe conversions, this would also serve as
+human-readable documentation that the file is up-to-date for a given version.
+
+Adding a config version would not address the problem of how to convert files
+created by older versions of Tendermint, but it would at least help us build
+more robust config tooling going forward.
+
+### Stability and Change
+
+In light of the discussion so far, it is natural to examine why we make so many
+changes to the configuration file from one version to the next, and whether we
+could reduce friction by being more conservative about what we make
+configurable, what config changes we make over time, and how we roll them out.
+
+Some changes, like renaming everything from snake case to kebab case, are
+entirely gratuitous. We could safely agree not to make those kinds of changes.
+Apart from that obvious case, however, many other configuration settings
+provide value to node operators in cases where there is no simple, universal
+setting that matches every application.
+
+Taking a high-level view, there are several broad reasons why we might want to
+make changes to configuration settings:
+
+- **Lessons learned:** Configuration settings are a good way to try things out
+  in production, before making more invasive changes to the consensus protocol.
+
+  For example, up until Tendermint v0.35, consensus timeouts were specified as
+  per-node configuration settings (e.g., `timeout-precommit` et al.).  This
+  allowed operators to tune these values for the needs of their network, but
+  had the downside that individually-misconfigured nodes could stall consensus.
+
+  Based on that experience, these timeouts have been deprecated in Tendermint
+  v0.36 and converted to consensus parameters, to be consistent across all
+  nodes in the network.
+
+- **Migration & experimentation:** Introducing new features and updating old
+  features can complicate migration for existing users of the software.
+  Temporary or "experimental" configuration settings can be a valuable way to
+  mitigate that friction.
+
+  For example, Tendermint v0.36 introduces a new RPC event subscription
+  endpoint (see [ADR 075][adr075]) that will eventually replace the existing
+  webwocket-based interface. To give users time to migrate, v0.36 adds an
+  `experimental-disable-websocket` setting, defaulted to `false`, that allows
+  operators to selectively disable the websocket API for testing purposes
+  during the conversion. This setting is designed to be removed in v0.37, when
+  the old interface is no longer supported.
+
+- **Ongoing maintenance:** Sometimes configuration settings become obsolete,
+  and the cost of removing them trades off against the potential risks of
+  leaving a non-functional or deprecated knob hooked up indefinitely.
+
+  For example, Tendermint v0.35 deprecated two alternate implementations of the
+  blocksync protocol, one of which was deleted entirely (`v1`) and one of which
+  was scheduled for removal (`v2`). The `blocksync.version` setting, which had
+  been added as a migration aid, became obsolete and needed to be updated.
+
+  Despite our best intentions, sometimes engineering designs do not work out.
+  It's just as important to leave room to back out of changes we have since
+  reconsidered, as it is to support migrations forward onto new and improved
+  code.
+
+- **Clarity and legibility:** Besides configuring the software, another
+  important purpose of a config file is to document intent for the humans who
+  operate and maintain the software. Operators need adjust settings to keep the
+  node running, and developers need to know what options were in use when
+  something goes wrong so they can diagnose and fix bugs.  The legibility of a
+  config file as a _human_ artifact is also thus important.
+
+  For example, Tendermint v0.35 moved settings related to validator private
+  keys from the top-level section of the configuration file to their own
+  designated `[priv-validator]` section. Although this change did not make any
+  difference to the meaning of those settings, it made the organization of the
+  file easier to understand, and allowed the names of the individual settings
+  to be simplified (e.g., `priv-validator-key-file` became simply `key-file` in
+  the new section).
+
+  Although such changes are "gratuitous" with respect to the software, there is
+  often value in making things more legible for the humans. While there is no
+  simple rule to define the line, the Potter Stewart principle can be used with
+  due care.
+
+Keeping these examples in mind, we can and should take reasonable steps to
+avoid churn in the configuration file across versions where we can. However, we
+must also accept that part of the reason for _having_ a config file is to allow
+us flexibility elsewhere in the design.  On that basis, we should not attempt
+to be too dogmatic about config changes either. Unlike changes in the block
+protocol, for example, which affect every user of every network that adopts
+them, config changes are relatively self-contained.
+
+There are few guiding principles I think we can use to strike a sensible
+balance:
+
+1. **No gratuitous changes.** Aesthetic changes that do not enhance legibility,
+   avert confusion, or clarity documentation, should be entirely avoided.
+
+2. **Prefer mechanical changes.** Whenever it is practical, change settings in
+   a way that can be updated by a tool without operator judgement. This implies
+   finding safe, universal defaults for new settings, and not changing the
+   default values of existing settings.
+
+   Even if that means we have to make multiple changes (e.g., add a new setting
+   in the current version, deprecate the old one, and remove the old one in the
+   next version) it's preferable if we can mechanize each step.
+
+3. **Clearly signal intent.** When adding temporary or experimental settings,
+   they should be clearly named and documented as such. Use long names and
+   suggestive prefixes (e.g., `experimental-*`) so that they stand out when
+   read in the config file or printed in logs.
+
+   Relatedly, using temporary or experimental settings should cause the
+   software to emit diagnostic logs at runtime. These log messages should be
+   easy to grep for, and should contain pointers to more complete documentation
+   (say, issue numbers or URLs) that the operator can read, as well as a hint
+   about when the setting is expected to become invalid. For example:
+
+   ```
+   WARNING: Websocket RPC access is deprecated and will be removed in
+   Tendermint v0.37. See https://tinyurl.com/adr075 for more information.
+   ```
+
+4. **Consider both directions.** When adding a configuration setting, take some
+   time during the implementation process to think about how the setting could
+   be removed, as well as how it will be rolled out. This applies even for
+   settings we imagine should be permanent. Experience may cause is to rethink
+   our original design intent more broadly than we expected.
+
+   This does not mean we have to spend a long time picking nits over the design
+   of every setting; merely that we should convince ourselves we _could_ undo
+   it without making too big a mess later. Even a little extra effort up front
+   can sometimes save a lot.
+
+## References
+
+- [Tendermint `config` package][config-pkg]
+- [`confix` command-line tool][confix]
+- [`condiff` command-line tool][condiff]
+- [Configuration update plan][plan]
+- [ADR 075: RPC Event Subscription Interface][adr075]
+
+[config-pkg]: https://godoc.org/github.com/tendermint/tendermint/config
+[confix]: https://github.com/tendermint/tendermint/blob/master/scripts/confix
+[condiff]: https://github.com/tendermint/tendermint/blob/master/scripts/confix/condiff
+[plan]: https://github.com/tendermint/tendermint/blob/master/scripts/confix/plan.go
+[testdata]: https://github.com/tendermint/tendermint/blob/master/scripts/confix/testdata
+[adr075]: https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-075-rpc-subscription.md
+
+## Appendix: Research Notes
+
+Discovering when various configuration settings were added, updated, and
+removed turns out to be surprisingly tedious.  To solve this puzzle, we had to
+answer the following questions:
+
+1. What changes were made between v0.x and v0.y? This is further complicated by
+   cases where we have backported config changes into the middle of an earlier
+   release cycle (e.g., `psql-conn` from v0.35.x into v0.34.13).
+
+2. When during the development cycle were those changes made? This allows us to
+   recognize features that were backported into a previous release.
+
+3. What were the default values of the changed settings, and did they change at
+   all during or across the release boundary?
+
+Each step of the [configuration update plan][plan] is commented with a link to
+one or more PRs where that change was made. The sections below discuss how we
+found these references.
+
+### Tracking Changes Across Releases
+
+To figure out what changed between two releases, we built a tool called
+[`condiff`][condiff], which performs a "keyspace" diff of two TOML documents.
+This diff respects the structure of the TOML file, but ignores comments, blank
+lines, and configuration values, so that we can see what was added and removed.
+
+To use it, run:
+
+```shell
+go run ./scripts/confix/condiff old.toml new.toml
+```
+
+This tool works on any TOML documents, but for our purposes we needed
+Tendermint `config.toml` files. The easiest way to get these is to build the
+node binary for your version of interest, run `tendermint init` on a clean home
+directory, and copy the generated config file out. The [`testdata`][testdata]
+directory for the `confix` tool has configs generated from the heads of each
+release branch from v0.31 through v0.35.
+
+If you want to reproduce this yourself, it looks something like this:
+
+```shell
+# Example for Tendermint v0.32.
+git checkout --track origin/v0.32.x
+go get golang.org/x/sys/unix
+go mod tidy
+make build
+rm -fr -- tmhome
+./build/tendermint --home=tmhome init
+cp tmhome/config/config.toml config-v32.toml
+```
+
+Be advised that the further back you go, the more idiosyncrasies you will
+encounter. For example, Tendermint v0.31 and earlier predate Go modules (v0.31
+used dep), and lack backport branches. And you may need to do some editing of
+Makefile rules once you get back into the 20s.
+
+Note that when diffing config files across the v0.34/v0.35 gap, the swap from
+`snake_case` to `kebab-case` makes it look like everything changed. The
+`condiff` tool has a `-desnake` flag that normalizes all the keys to kebab case
+in both inputs before comparison.
+
+### Locating Additions and Deletions
+
+To figure out when a configuration setting was added or removed, your tool of
+choice is `git bisect`. The only tricky part is finding the endpoints for the
+search.  If the transition happened within a release, you can use that
+release's backport branch as the endpoint (if it has one, e.g., `v0.35.x`).
+
+However, the start point can be more problematic. The backport branches are not
+ancestors of `master` or of each other, which means you need to find some point
+in history _prior_ to the change but still attached to the mainline. For recent
+releases there is a dev root (e.g., `v0.35.0-dev`, `v0.34.0-dev1`, etc.). These
+are not named consistently, but you can usually grep the output of `git tag` to
+find them.
+
+In the worst case you could try starting from the root commit of the repo, but
+that turns out not to work in all cases. We've done some branching shenanigans
+over the years that mean the root is not a direct ancestor of all our release
+branches. When you find this you will probably swear a lot. I did.
+
+Once you have a start and end point (say, `v0.35.0-dev` and `master`), you can
+bisect in the usual way. I use `git grep` on the `config` directory to check
+whether the case I am looking for is present. For example, to find when the
+`[fastsync]` section was removed:
+
+```shell
+# Setup:
+git checkout master
+git bisect start
+git bisect bad                 # it's not present on tip of master.
+git bisect good v0.34.0-dev1   # it was present at the start of v0.34.
+```
+
+```shell
+# Now repeat this until it gives you a specific commit:
+if git grep -q '\[fastsync\]' config ; then git bisect good ; else git bisect bad ; fi
+```
+
+The above example finds where a config was removed: To find where a setting was
+added, do the same thing except reverse the sense of the test (`if ! git grep -q
+...`).