From ef314708e74c64bfc8fb40f677ed5493c67186c0 Mon Sep 17 00:00:00 2001 From: "M. J. Fromberger" Date: Thu, 21 Apr 2022 10:08:47 -0700 Subject: [PATCH] RFC 019: Configuration File Versioning (#8379) This RFC discusses issues in how we migrate configuration data across Tendermint versions, and some options for how to improve the experience for node operators in the future. --- docs/rfc/README.md | 1 + docs/rfc/rfc-019-config-version.md | 400 +++++++++++++++++++++++++++++ 2 files changed, 401 insertions(+) create mode 100644 docs/rfc/rfc-019-config-version.md diff --git a/docs/rfc/README.md b/docs/rfc/README.md index b5c42ea85..f2ad6ad69 100644 --- a/docs/rfc/README.md +++ b/docs/rfc/README.md @@ -53,5 +53,6 @@ sections. - [RFC-013: ABCI++](./rfc-013-abci++.md) - [RFC-014: Semantic Versioning](./rfc-014-semantic-versioning.md) - [RFC-015: ABCI++ Tx Mutation](./rfc-015-abci++-tx-mutation.md) +- [RFC-019: Configuration File Versioning](./rfc-019-config-version.md) diff --git a/docs/rfc/rfc-019-config-version.md b/docs/rfc/rfc-019-config-version.md new file mode 100644 index 000000000..3dfd7840b --- /dev/null +++ b/docs/rfc/rfc-019-config-version.md @@ -0,0 +1,400 @@ +# RFC 019: Configuration File Versioning + +## Changelog + +- 19-Apr-2022: Initial draft (@creachadair) +- 20-Apr-2022: Updates from review feedback (@creachadair) + +## Abstract + +Updating configuration settings is an essential part of upgrading an existing +node to a new version of the Tendermint software. Unfortunately, it is also +currently a very manual process. This document discusses some of the history of +changes to the config format, actions we've taken to improve the tooling for +configuration upgrades, and additional steps we may want to consider. + +## Background + +A Tendermint node reads configuration settings at startup from a TOML formatted +text file, typically named `config.toml`. The contents of this file are defined +by the [`github.com/tendermint/tendermint/config`][config-pkg]. + +Although many settings in this file remain valid from one version of Tendermint +to the next, new versions of Tendermint often add, update, and remove settings. +These changes often require manual intervention by operators who are upgrading +their nodes. + +I propose we should provide better tools and documentation to help operators +make configuration changes correctly during version upgrades. Ideally, as much +as possible of any configuration file update should be automated, and where +that is not possible or practical, we should provide clear, explicit directions +for what steps need to be taken manually. Moreover, when the node discovers +incorrect or invalid configuration, we should improve the diagnostics it emits +so that the operator can quickly and easily find the relevant documentation, +without having to grep through source code. + +## Discussion + +By convention, we are supposed to document required changes to the config file +in the `UPGRADING.md` file for the release that introduces them. Although we +have mostly done this, the level of detail in the upgrading instructions is +often insufficient for an operator to correctly update their file. + +The updates vary widely in complexity: Operators may need to add new required +settings, update obsolete values for existing settings, move or rename existing +settings within the file, or remove obsolete settings (which are thus invalid). +Here are a few examples of each of these cases: + +- **New required settings:** Tendermint v0.35 added a new top-level `mode` + setting that determines whether a node runs as a validator, a full node, or a + seed node. The default value is `"full"`, which means the operator of a + validator must manually add `mode = "validator"` (or set the `--mode` flag on + the command line) for their node to come up in the correct mode. + +- **Updated obsolete values:** Tendermint v0.35 removed support for versions + `"v1"` and `"v2"` of the blocksync (formerly "fastsync") protocol, requiring + any node using either of those values to update to `"v0"`. + +- **Moved/renamed settings:** Version v0.34 moved the top-level `pprof_laddr` + setting under the `[rpc]` section. + + Version v0.35 renamed every setting in the file from `snake_case` to + `kebab-case`, moved the top-level `fast_sync` setting into the `[blocksync]` + section as (itself renamed from `[fastsync]`), and moved all the top-level + `priv-validator-*` settings under a new `[priv-validator]` section with their + prefix trimmed off. + +- **Removed obsolete settings:** Version v0.34 removed the `index_all_keys` and + `index_keys` settings from the `[tx_index]` section; version v0.35 removed + the `wal-dir` setting from the `[mempool]` section, and version v0.36 removed + the `[blocksync]` section entirely. + +While many of these changes are mentioned in the config section of the upgrade +instructions, some are not mentioned at all, or are hidden in other parts of +the doc. For instance, the v0.34 `pprof_laddr` change was documented only as an +RPC flag change. (A savvy reader might realize that the flag `--rpc.pprof_laddr` +implies a corresponding config section, but it omits the related detail that +there was a top-level setting that's been renamed). The lesson here is not +that the docs are bad, but to point out that prose is not the most efficient +format to convey detailed changes like this. The upgrading instructions are +still valuable for the human reader to understand what to expect. + +### Concrete Steps + +As part of the v0.36 development cycle, we spent some time reverse-engineering +the configuration changes since the v0.34 release and built an experimental +command-line tool called [`confix`][confix], whose job it is to automatically +update the settings in a `config.toml` file to the latest version. We also +backported a version of this tool into the v0.35.x branch at release v0.35.4. + +This tool should work fine for configuration files created by Tendermint v0.34 +and later, but does not (yet) know how to handle changes from prior versions of +Tendermint. Part of the difficulty for older versions is simply logistical: To +figure out which changes to apply, we need to understand something about the +version that made the file, as well as the version we're converting it to. + +> **Discussion point:** In the future we might want to consider incorporating +> this into the node CLI directly, but we're keeping it separate for now until +> we can get some feedback from operators. + +For the experiment, we handled this by carefully searching the history of +config format changes for shibboleths to bound the version: For example, the +`[fastsync]` section was added in Tendermint v0.32 and renamed `[blocksync]` in +Tendermint v0.35. So if we see a `[fastsync]` section, we have some confidence +that the file was created by v0.32, v0.33, or v0.34. + +But such signals are delicate: The `[blocksync]` section was removed in v0.36, +so if we do not find `[fastsync]`, we cannot conclude from that alone that the +file is from v0.31 or earlier -- we have to look for corroborating details. +While such "sniffing" tactics are fine for an experiment, they aren't as robust +as we might like. + +This is especially relevant for configuration files that may have already been +manually upgraded across several versions by the time we are asked to update +them again. Another related concern is that we'd like to make sure conversion +is idempotent, so that it would be safe to rerun the tool over an +already-converted file without breaking anything. + +### Config Versioning + +One obvious tactic we could use for future releases is add a version marker to +the config file. This would give tools like `confix` (and the node itself) a +way to calibrate their expectations. Rather than being a version for the file +itself, however, this version marker would indicate which version of Tendermint +is needed to read the file. + +Provisionally, this might look something like: + +```toml +# THe minimum version of Tendermint compatible with the contents of +# this configuration file. +config-version = 'v0.35' +``` + +When initializing a new node, Tendermint would populate this field with its own +version (e.g., `v0.36`). When conducting an upgrade, tools like `confix` can +then use this to decide which conversions are valid, and then update the value +accordingly. After converting a file marked `'v0.35'` to`'v0.37'`, the +conversion tool sets the file's `config-version` to reflect its compatibility. + +> **Discussion point:** This example presumes we would keep config files +> compatible within a given release cycle, e.g., all of v0.36.x. We could also +> use patch numbers here, if we think there's some reason to permit changes +> that would require config file edits at that granularity. I don't think we +> should, but that's a design question to consider. + +Upon seeing an up-to-date version marker, the conversion tool can simply exit +with a diagnostic like "this file is already up-to-date", rather than sniffing +the keyspace and potentially introducing errors. In addition, this would let a +tool detect config files that are _newer_ than the one it understands, and +issue a safe diagnostic rather than doing something wrong. Plus, besides +avoiding potentially unsafe conversions, this would also serve as +human-readable documentation that the file is up-to-date for a given version. + +Adding a config version would not address the problem of how to convert files +created by older versions of Tendermint, but it would at least help us build +more robust config tooling going forward. + +### Stability and Change + +In light of the discussion so far, it is natural to examine why we make so many +changes to the configuration file from one version to the next, and whether we +could reduce friction by being more conservative about what we make +configurable, what config changes we make over time, and how we roll them out. + +Some changes, like renaming everything from snake case to kebab case, are +entirely gratuitous. We could safely agree not to make those kinds of changes. +Apart from that obvious case, however, many other configuration settings +provide value to node operators in cases where there is no simple, universal +setting that matches every application. + +Taking a high-level view, there are several broad reasons why we might want to +make changes to configuration settings: + +- **Lessons learned:** Configuration settings are a good way to try things out + in production, before making more invasive changes to the consensus protocol. + + For example, up until Tendermint v0.35, consensus timeouts were specified as + per-node configuration settings (e.g., `timeout-precommit` et al.). This + allowed operators to tune these values for the needs of their network, but + had the downside that individually-misconfigured nodes could stall consensus. + + Based on that experience, these timeouts have been deprecated in Tendermint + v0.36 and converted to consensus parameters, to be consistent across all + nodes in the network. + +- **Migration & experimentation:** Introducing new features and updating old + features can complicate migration for existing users of the software. + Temporary or "experimental" configuration settings can be a valuable way to + mitigate that friction. + + For example, Tendermint v0.36 introduces a new RPC event subscription + endpoint (see [ADR 075][adr075]) that will eventually replace the existing + webwocket-based interface. To give users time to migrate, v0.36 adds an + `experimental-disable-websocket` setting, defaulted to `false`, that allows + operators to selectively disable the websocket API for testing purposes + during the conversion. This setting is designed to be removed in v0.37, when + the old interface is no longer supported. + +- **Ongoing maintenance:** Sometimes configuration settings become obsolete, + and the cost of removing them trades off against the potential risks of + leaving a non-functional or deprecated knob hooked up indefinitely. + + For example, Tendermint v0.35 deprecated two alternate implementations of the + blocksync protocol, one of which was deleted entirely (`v1`) and one of which + was scheduled for removal (`v2`). The `blocksync.version` setting, which had + been added as a migration aid, became obsolete and needed to be updated. + + Despite our best intentions, sometimes engineering designs do not work out. + It's just as important to leave room to back out of changes we have since + reconsidered, as it is to support migrations forward onto new and improved + code. + +- **Clarity and legibility:** Besides configuring the software, another + important purpose of a config file is to document intent for the humans who + operate and maintain the software. Operators need adjust settings to keep the + node running, and developers need to know what options were in use when + something goes wrong so they can diagnose and fix bugs. The legibility of a + config file as a _human_ artifact is also thus important. + + For example, Tendermint v0.35 moved settings related to validator private + keys from the top-level section of the configuration file to their own + designated `[priv-validator]` section. Although this change did not make any + difference to the meaning of those settings, it made the organization of the + file easier to understand, and allowed the names of the individual settings + to be simplified (e.g., `priv-validator-key-file` became simply `key-file` in + the new section). + + Although such changes are "gratuitous" with respect to the software, there is + often value in making things more legible for the humans. While there is no + simple rule to define the line, the Potter Stewart principle can be used with + due care. + +Keeping these examples in mind, we can and should take reasonable steps to +avoid churn in the configuration file across versions where we can. However, we +must also accept that part of the reason for _having_ a config file is to allow +us flexibility elsewhere in the design. On that basis, we should not attempt +to be too dogmatic about config changes either. Unlike changes in the block +protocol, for example, which affect every user of every network that adopts +them, config changes are relatively self-contained. + +There are few guiding principles I think we can use to strike a sensible +balance: + +1. **No gratuitous changes.** Aesthetic changes that do not enhance legibility, + avert confusion, or clarity documentation, should be entirely avoided. + +2. **Prefer mechanical changes.** Whenever it is practical, change settings in + a way that can be updated by a tool without operator judgement. This implies + finding safe, universal defaults for new settings, and not changing the + default values of existing settings. + + Even if that means we have to make multiple changes (e.g., add a new setting + in the current version, deprecate the old one, and remove the old one in the + next version) it's preferable if we can mechanize each step. + +3. **Clearly signal intent.** When adding temporary or experimental settings, + they should be clearly named and documented as such. Use long names and + suggestive prefixes (e.g., `experimental-*`) so that they stand out when + read in the config file or printed in logs. + + Relatedly, using temporary or experimental settings should cause the + software to emit diagnostic logs at runtime. These log messages should be + easy to grep for, and should contain pointers to more complete documentation + (say, issue numbers or URLs) that the operator can read, as well as a hint + about when the setting is expected to become invalid. For example: + + ``` + WARNING: Websocket RPC access is deprecated and will be removed in + Tendermint v0.37. See https://tinyurl.com/adr075 for more information. + ``` + +4. **Consider both directions.** When adding a configuration setting, take some + time during the implementation process to think about how the setting could + be removed, as well as how it will be rolled out. This applies even for + settings we imagine should be permanent. Experience may cause is to rethink + our original design intent more broadly than we expected. + + This does not mean we have to spend a long time picking nits over the design + of every setting; merely that we should convince ourselves we _could_ undo + it without making too big a mess later. Even a little extra effort up front + can sometimes save a lot. + +## References + +- [Tendermint `config` package][config-pkg] +- [`confix` command-line tool][confix] +- [`condiff` command-line tool][condiff] +- [Configuration update plan][plan] +- [ADR 075: RPC Event Subscription Interface][adr075] + +[config-pkg]: https://godoc.org/github.com/tendermint/tendermint/config +[confix]: https://github.com/tendermint/tendermint/blob/master/scripts/confix +[condiff]: https://github.com/tendermint/tendermint/blob/master/scripts/confix/condiff +[plan]: https://github.com/tendermint/tendermint/blob/master/scripts/confix/plan.go +[testdata]: https://github.com/tendermint/tendermint/blob/master/scripts/confix/testdata +[adr075]: https://github.com/tendermint/tendermint/blob/master/docs/architecture/adr-075-rpc-subscription.md + +## Appendix: Research Notes + +Discovering when various configuration settings were added, updated, and +removed turns out to be surprisingly tedious. To solve this puzzle, we had to +answer the following questions: + +1. What changes were made between v0.x and v0.y? This is further complicated by + cases where we have backported config changes into the middle of an earlier + release cycle (e.g., `psql-conn` from v0.35.x into v0.34.13). + +2. When during the development cycle were those changes made? This allows us to + recognize features that were backported into a previous release. + +3. What were the default values of the changed settings, and did they change at + all during or across the release boundary? + +Each step of the [configuration update plan][plan] is commented with a link to +one or more PRs where that change was made. The sections below discuss how we +found these references. + +### Tracking Changes Across Releases + +To figure out what changed between two releases, we built a tool called +[`condiff`][condiff], which performs a "keyspace" diff of two TOML documents. +This diff respects the structure of the TOML file, but ignores comments, blank +lines, and configuration values, so that we can see what was added and removed. + +To use it, run: + +```shell +go run ./scripts/confix/condiff old.toml new.toml +``` + +This tool works on any TOML documents, but for our purposes we needed +Tendermint `config.toml` files. The easiest way to get these is to build the +node binary for your version of interest, run `tendermint init` on a clean home +directory, and copy the generated config file out. The [`testdata`][testdata] +directory for the `confix` tool has configs generated from the heads of each +release branch from v0.31 through v0.35. + +If you want to reproduce this yourself, it looks something like this: + +```shell +# Example for Tendermint v0.32. +git checkout --track origin/v0.32.x +go get golang.org/x/sys/unix +go mod tidy +make build +rm -fr -- tmhome +./build/tendermint --home=tmhome init +cp tmhome/config/config.toml config-v32.toml +``` + +Be advised that the further back you go, the more idiosyncrasies you will +encounter. For example, Tendermint v0.31 and earlier predate Go modules (v0.31 +used dep), and lack backport branches. And you may need to do some editing of +Makefile rules once you get back into the 20s. + +Note that when diffing config files across the v0.34/v0.35 gap, the swap from +`snake_case` to `kebab-case` makes it look like everything changed. The +`condiff` tool has a `-desnake` flag that normalizes all the keys to kebab case +in both inputs before comparison. + +### Locating Additions and Deletions + +To figure out when a configuration setting was added or removed, your tool of +choice is `git bisect`. The only tricky part is finding the endpoints for the +search. If the transition happened within a release, you can use that +release's backport branch as the endpoint (if it has one, e.g., `v0.35.x`). + +However, the start point can be more problematic. The backport branches are not +ancestors of `master` or of each other, which means you need to find some point +in history _prior_ to the change but still attached to the mainline. For recent +releases there is a dev root (e.g., `v0.35.0-dev`, `v0.34.0-dev1`, etc.). These +are not named consistently, but you can usually grep the output of `git tag` to +find them. + +In the worst case you could try starting from the root commit of the repo, but +that turns out not to work in all cases. We've done some branching shenanigans +over the years that mean the root is not a direct ancestor of all our release +branches. When you find this you will probably swear a lot. I did. + +Once you have a start and end point (say, `v0.35.0-dev` and `master`), you can +bisect in the usual way. I use `git grep` on the `config` directory to check +whether the case I am looking for is present. For example, to find when the +`[fastsync]` section was removed: + +```shell +# Setup: +git checkout master +git bisect start +git bisect bad # it's not present on tip of master. +git bisect good v0.34.0-dev1 # it was present at the start of v0.34. +``` + +```shell +# Now repeat this until it gives you a specific commit: +if git grep -q '\[fastsync\]' config ; then git bisect good ; else git bisect bad ; fi +``` + +The above example finds where a config was removed: To find where a setting was +added, do the same thing except reverse the sense of the test (`if ! git grep -q +...`).