mirror of
https://github.com/tendermint/tendermint.git
synced 2026-01-05 04:55:18 +00:00
Major thanks to @creachadair for guidance in regards to the proto spec and for quick prototyping of a serialized-proto-canonicalizer.
141 lines
6.8 KiB
Markdown
141 lines
6.8 KiB
Markdown
# RFC 007 : Deterministic Proto Byte Serialization
|
|
|
|
## Changelog
|
|
|
|
- 09-Dec-2021: Initial draft (@williambanfield).
|
|
|
|
## Abstract
|
|
|
|
This document discusses the issue of stable byte-representation of serialized messages
|
|
within Tendermint and describes a few possible routes that could be taken to address it.
|
|
|
|
## Background
|
|
|
|
We use the byte representations of wire-format proto messages to produce
|
|
and verify hashes of data within the Tendermint codebase as well as for
|
|
producing and verifying cryptographic signatures over these signed bytes.
|
|
|
|
The protocol buffer [encoding spec][proto-spec-encoding] does not guarantee that the byte representation
|
|
of a protocol buffer message will be the same between two calls to an encoder.
|
|
While there is a mode to force the encoder to produce the same byte representation
|
|
of messages within a single binary, these guarantees are not good enough for our
|
|
use case in Tendermint. We require multiple different versions of a binary running
|
|
Tendermint to be able to inter-operate. Additionally, we require that multiple different
|
|
systems written in _different languages_ be able to participate in different aspects
|
|
of the protocols of Tendermint and be able to verify the integrity of the messages
|
|
they each produce.
|
|
|
|
While this has not yet created a problem that we know of in a running network, we should
|
|
make sure to provide stronger guarantees around the serialized representation of the messages
|
|
used within the Tendermint consensus algorithm to prevent any issue from occurring.
|
|
|
|
|
|
## Discussion
|
|
|
|
Proto has the following points of variability that can produce non-deterministic byte representation:
|
|
|
|
1. Encoding order of fields within a message.
|
|
|
|
Proto allows fields to be encoded in any order and even be repeated.
|
|
|
|
2. Encoding order of elements of a repeated field.
|
|
|
|
`repeated` fields in a proto message can be serialized in any order.
|
|
|
|
3. Presence or absence of default values.
|
|
|
|
Types in proto have defined default values similar to Go's zero values.
|
|
Writing or omitting a default value are both legal ways of encoding a wire message.
|
|
|
|
4. Serialization of 'unknown' fields.
|
|
|
|
Unknown fields can be present when a message is created by a binary with a newer
|
|
version of the proto that contains fields that the deserializer in a different
|
|
binary does not yet know about. Deserializers in binaries that do not know about the field
|
|
will maintain the bytes of the unknown field but not place them into the deserialized structure.
|
|
|
|
We have a few options to consider when producing this stable representation.
|
|
|
|
### Options for deterministic byte representation
|
|
|
|
#### Use only compliant serializers and constrain field usage
|
|
|
|
According to [Cosmos-SDK ADR-27][cosmos-sdk-adr-27], when message types obey a simple
|
|
set of rules, gogoproto produces a consistent byte representation of serialized messages.
|
|
This seems promising, although more research is needed to guarantee gogoproto always
|
|
produces a consistent set of bytes on serialized messages. This would solve the problem
|
|
within Tendermint as written in Go, but would require ensuring that there are similar
|
|
serializers written in other languages that produce the same output as gogoproto.
|
|
|
|
#### Reorder serialized bytes to ensure determinism.
|
|
|
|
The serialized form of a proto message can be transformed into a canonical representation
|
|
by applying simple rules to the serialized bytes. Re-ordering the serialized bytes
|
|
would allow Tendermint to produce a canonical byte representation without having to
|
|
simultaneously maintain a custom proto marshaller.
|
|
|
|
This could be implemented as a function in many languages that performed the following
|
|
producing bytes to sign or hashing:
|
|
|
|
1. Does not add any of the data from unknown fields into the type to hash.
|
|
|
|
Tendermint should not run into a case where it needs to verify the integrity of
|
|
data with unknown fields for the following reasons:
|
|
|
|
The purpose of checking hash equality within Tendermint is to ensure that
|
|
its local copy of data matches the data that the network agreed on. There should
|
|
therefore not be a case where a process is checking hash equality using data that it did not expect
|
|
to receive. What the data represent may be opaque to the process, such as when checking the
|
|
transactions in a block, _but the process will still have expected to receive this data_,
|
|
despite not understanding what their internal structure is. It's not clear what it would
|
|
mean to verify that a block contains data that a process does not know about.
|
|
|
|
The same reasoning applies for signature verification within Tendermint. Processes
|
|
verify that a digital signature signed over a set of bytes by locally reconstructing the
|
|
data structure that the digital signature signed using the process's local data.
|
|
|
|
2. Reordered all message fields to be in tag-sorted order.
|
|
|
|
Tag-sorting top-level fields will place all fields of the same tag in a adjacent
|
|
to eachother within the serialized representation.
|
|
|
|
3. Reordered the contents of all `repeated` fields to be in lexicographically sorted order.
|
|
|
|
`repeated` fields will appear in a message as having the same tag but will contain different
|
|
contents. Therefore, lexicographical sorting will produce a stable ordering of
|
|
fields with the same tag.
|
|
|
|
4. Deleted all default values from the byte representation.
|
|
|
|
Encoders can include default values or omit them. Most encoders appear to omit them
|
|
but we may wish to delete them just to be safe.
|
|
|
|
5. Recursively performed these operations on any length-delimited subfields.
|
|
|
|
Length delimited fields may contain messages, strings, or just bytes. However,
|
|
it's not possible to know what data is being represented by such a field.
|
|
A 'string' may happen to have the same structure as an embedded message and we cannot
|
|
disambiguate. For this reason, we must apply these same rules to all subfields that
|
|
may contain messages. Because we cannot know if we have totally mangled the interior 'string'
|
|
or not, this data should never be deserialized or used for anything beyond hashing.
|
|
|
|
A **prototype** implementation by @creachadair of this can be found in [the wirepb repo][wire-pb].
|
|
This could be implemented in multiple languages more simply than ensuring that there are
|
|
canonical proto serializers that match in each language.
|
|
|
|
### Future work
|
|
|
|
We should add clear documentation to the Tendermint codebase every time we
|
|
compare hashes of proto messages or use proto serialized bytes to produces a
|
|
digital signatures that we have been careful to ensure that the hashes are performed
|
|
properly.
|
|
|
|
### References
|
|
|
|
[proto-spec-encoding]: https://developers.google.com/protocol-buffers/docs/encoding
|
|
[spec-issue]: https://github.com/tendermint/tendermint/issues/5005
|
|
[cosmos-sdk-adr-27]: https://github.com/cosmos/cosmos-sdk/blob/master/docs/architecture/adr-027-deterministic-protobuf-serialization.md
|
|
[cer-proto-3]: https://github.com/regen-network/canonical-proto3
|
|
[wire-pb]: https://github.com/creachadair/wirepb
|
|
|