mirror of
https://github.com/tendermint/tendermint.git
synced 2026-01-05 04:55:18 +00:00
docs: rename tendermint-core to system (#6515)
@@ -8,6 +8,14 @@ order: 3
|
||||
|
||||
To download pre-built binaries, see the [releases page](https://github.com/tendermint/tendermint/releases).
|
||||
|
||||
## Using Homebrew
|
||||
|
||||
You can also install the Tendermint binary with Homebrew:
|
||||
|
||||
```
brew install tendermint
```
|
||||
|
||||
## From Source
|
||||
|
||||
You'll need `go` [installed](https://golang.org/doc/install) and the required
|
||||
@@ -18,14 +26,14 @@ echo export GOPATH=\"\$HOME/go\" >> ~/.bash_profile
|
||||
echo export PATH=\"\$PATH:\$GOPATH/bin\" >> ~/.bash_profile
|
||||
```
|
||||
|
||||
### Get Source Code
|
||||
Get the source code:
|
||||
|
||||
```sh
git clone https://github.com/tendermint/tendermint.git
cd tendermint
```
|
||||
|
||||
### Compile
|
||||
Then run:
|
||||
|
||||
```sh
|
||||
make install
|
||||
|
||||
@@ -7,27 +7,8 @@ order: 2
|
||||
## Overview
|
||||
|
||||
This is a quick start guide. If you have a vague idea about how Tendermint
|
||||
works and want to get started right away, continue.
|
||||
|
||||
## Install
|
||||
|
||||
### Quick Install
|
||||
|
||||
To quickly get Tendermint installed on a fresh
|
||||
Ubuntu 16.04 machine, use [this script](https://git.io/fFfOR).
|
||||
|
||||
> :warning: Do not copy scripts to run on your machine without knowing what they do.
|
||||
|
||||
```sh
curl -L https://git.io/fFfOR | bash
source ~/.profile
```
|
||||
|
||||
The script is also used to facilitate cluster deployment below.
|
||||
|
||||
### Manual Install
|
||||
|
||||
For manual installation, see the [install instructions](install.md)
|
||||
works and want to get started right away, continue. Make sure you've installed the binary.
|
||||
Check out [install](./install.md) if you haven't.
|
||||
|
||||
## Initialization
|
||||
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
order: 1
|
||||
parent:
|
||||
title: Networks
|
||||
order: 5
|
||||
order: 6
|
||||
---
|
||||
|
||||
# Overview
|
||||
|
||||
@@ -5,13 +5,17 @@ parent:
|
||||
order: 4
|
||||
---
|
||||
|
||||
# Overview
|
||||
|
||||
This section will focus on how to operate full nodes, validators and light clients.
|
||||
|
||||
- [Node Types](#node-types)
|
||||
- [Configuration](./configuration.md)
|
||||
- [Configure State sync](./state_sync.md)
|
||||
- [Configure State sync](./state-sync.md)
|
||||
- [Validator Guides](./validators.md)
|
||||
- [Running in Production](./running-in-production.md)
|
||||
- [How to secure your keys](./validators.md#validator_keys)
|
||||
- [Remote Signer](./remote-signer.md)
|
||||
- [Light Client guides](./light-client.md)
|
||||
- [How to sync a light client](./light-client.md#)
|
||||
- [Metrics](./metrics.md)
|
||||
|
||||
374
docs/nodes/running-in-production.md
Normal file
@@ -0,0 +1,374 @@
|
||||
---
|
||||
order: 4
|
||||
---
|
||||
|
||||
# Running in production
|
||||
|
||||
If you are building Tendermint from source for use in production, make sure to check out an appropriate Git tag instead of a branch.
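
One way to pick a tag is sketched below: list the repository's tags, keep only plain `vX.Y.Z` names, and take the highest. The helper name `latest_release_tag` is hypothetical, and the idea that release tags follow the `vX.Y.Z` pattern (with pre-releases carrying a suffix such as `-rc1`) is an assumption to verify against the actual tag list. It also assumes a `sort` that supports version sorting (`-V`).

```sh
# Hypothetical helper: read tag names on stdin, print the highest vX.Y.Z tag.
# Tags with a suffix (for example v0.35.0-rc1) are skipped as pre-releases.
latest_release_tag() {
  grep -E '^v[0-9]+\.[0-9]+\.[0-9]+$' | sort -V | tail -n 1
}

# Usage inside a clone of the repository:
#   git fetch --tags
#   git checkout "$(git tag --list | latest_release_tag)"
```

Always review the chosen tag before building; the newest tag is not necessarily the one your network runs.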
|
||||
|
||||
## Database
|
||||
|
||||
By default, Tendermint uses the `syndtr/goleveldb` package for its in-process
|
||||
key-value database. If you want maximal performance, it may be best to install
|
||||
the real C-implementation of LevelDB and compile Tendermint to use that using
|
||||
`make build TENDERMINT_BUILD_OPTIONS=cleveldb`. See the [install
|
||||
instructions](../introduction/install.md) for details.
|
||||
|
||||
Tendermint keeps multiple distinct databases in the `$TMROOT/data`:
|
||||
|
||||
- `blockstore.db`: Keeps the entire blockchain - stores blocks,
|
||||
block commits, and block meta data, each indexed by height. Used to sync new
|
||||
peers.
|
||||
- `evidence.db`: Stores all verified evidence of misbehaviour.
|
||||
- `state.db`: Stores the current blockchain state (i.e. height, validators,
|
||||
consensus params). Only grows if consensus params or validators change. Also
|
||||
used to temporarily store intermediate results during block processing.
|
||||
- `tx_index.db`: Indexes txs (and their results) by tx hash and by DeliverTx result events.
|
||||
|
||||
By default, Tendermint will only index txs by their hash and height, not by their DeliverTx
|
||||
result events. See [indexing transactions](../app-dev/indexing-transactions.md) for
|
||||
details.
|
||||
|
||||
Applications can expose block pruning strategies to the node operator. Please read the documentation of your application
|
||||
to find out more details.
|
||||
|
||||
Applications can use [state sync](state-sync.md) to help nodes bootstrap quickly.
|
||||
|
||||
## Logging
|
||||
|
||||
Default logging level (`log-level = "main:info,state:info,statesync:info,*:error"`) should suffice for
|
||||
normal operation mode. Read [this
|
||||
post](https://blog.cosmos.network/one-of-the-exciting-new-features-in-0-10-0-release-is-smart-log-level-flag-e2506b4ab756)
|
||||
for details on how to configure `log-level` config variable. Some of the
|
||||
modules can be found [here](../nodes/logging#list-of-modules). If
|
||||
you're trying to debug Tendermint or asked to provide logs with debug
|
||||
logging level, you can do so by running Tendermint with
|
||||
`--log-level="*:debug"`.
|
||||
|
||||
### Consensus WAL
|
||||
|
||||
Tendermint uses a write ahead log (WAL) for consensus. The `consensus.wal` is used to ensure we can recover from a crash at any point
|
||||
in the consensus state machine. It writes all consensus messages (timeouts, proposals, block parts, and votes)
|
||||
to a single file, flushing to disk before processing messages from its own
|
||||
validator. Since Tendermint validators are expected to never sign a conflicting vote, the
|
||||
WAL ensures we can always recover deterministically to the latest state of the consensus without
|
||||
using the network or re-signing any consensus messages. The consensus WAL has a max size of 1GB and is automatically rotated.
|
||||
|
||||
If your `consensus.wal` is corrupted, see [below](#wal-corruption).
|
||||
|
||||
## DOS Exposure and Mitigation
|
||||
|
||||
Validators are supposed to set up a [Sentry Node
|
||||
Architecture](./validators.md)
|
||||
to prevent Denial-of-service attacks.
|
||||
|
||||
### P2P
|
||||
|
||||
The core of the Tendermint peer-to-peer system is `MConnection`. Each
|
||||
connection has `MaxPacketMsgPayloadSize`, which is the maximum packet
|
||||
size, and bounded send & receive queues. One can impose restrictions on
|
||||
send & receive rate per connection (`SendRate`, `RecvRate`).
|
||||
|
||||
The number of open P2P connections can become quite large, and hit the operating system's open
|
||||
file limit (since TCP connections are considered files on UNIX-based systems). Nodes should be
|
||||
given a sizable open file limit, e.g. 8192, via `ulimit -n 8192` or other deployment-specific
|
||||
mechanisms.
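
A start script can check the limit before launching the node. The following is a minimal sketch, assuming a POSIX-style shell whose `ulimit` supports `-n`; the function name `nofile_ok` is illustrative, and 8192 is simply the example value used above.

```sh
# Succeeds when the soft open-file limit of this shell is at least $1.
nofile_ok() {
  required="$1"
  current="$(ulimit -n)"
  # Some systems report "unlimited" instead of a number.
  [ "$current" = "unlimited" ] && return 0
  [ "$current" -ge "$required" ]
}

if ! nofile_ok 8192; then
  echo "open file limit is $(ulimit -n); consider raising it (e.g. ulimit -n 8192)" >&2
fi
```

Note that `ulimit -n` only affects the current shell and its children, so the check has to run in the same environment that starts the node.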
|
||||
|
||||
### RPC
|
||||
|
||||
Endpoints returning multiple entries are limited by default to return 30
|
||||
elements (100 max). See the [RPC Documentation](https://docs.tendermint.com/master/rpc/)
|
||||
for more information.
|
||||
|
||||
Rate limiting and authentication are other key measures that help protect
|
||||
against DOS attacks. Validators are supposed to use external tools like
|
||||
[NGINX](https://www.nginx.com/blog/rate-limiting-nginx/) or
|
||||
[traefik](https://docs.traefik.io/middlewares/ratelimit/)
|
||||
to achieve this.
|
||||
|
||||
## Debugging Tendermint
|
||||
|
||||
If you ever have to debug Tendermint, the first thing you should probably do is
|
||||
check out the logs. See [Logging](../nodes/logging.md), where we
|
||||
explain what certain log statements mean.
|
||||
|
||||
If, after skimming through the logs, things are still not clear, the next thing
|
||||
to try is querying the `/status` RPC endpoint. It provides the necessary info:
|
||||
whether the node is syncing or not, what height it is on, etc.
|
||||
|
||||
```bash
curl http(s)://{ip}:{rpcPort}/status
```
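
For scripting, the sync state can be pulled out of the response. This is a minimal sketch, assuming the response carries a `catching_up` boolean under `sync_info` (check the RPC documentation for your version) and that `sed` is acceptable in place of a real JSON parser. The names `catching_up_from_status` and `RPC_ADDR` are illustrative.

```sh
# Print the value of "catching_up" (true/false) from a /status response on stdin.
catching_up_from_status() {
  sed -n 's/.*"catching_up": *\([a-z]*\).*/\1/p'
}

# Usage against a node (RPC_ADDR is a placeholder for the node's RPC address):
#   curl -s "$RPC_ADDR/status" | catching_up_from_status
```

A value of `true` would mean the node is still syncing; for anything beyond a quick check, a proper JSON tool is the safer choice.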
|
||||
|
||||
`/dump_consensus_state` will give you a detailed overview of the consensus
|
||||
state (proposer, latest validators, peers states). From it, you should be able
|
||||
to figure out why, for example, the network has halted.
|
||||
|
||||
```bash
curl http(s)://{ip}:{rpcPort}/dump_consensus_state
```
|
||||
|
||||
There is a reduced version of this endpoint - `/consensus_state`, which returns
|
||||
just the votes seen at the current height.
|
||||
|
||||
If, after consulting with the logs and above endpoints, you still have no idea
|
||||
what's happening, consider using the `tendermint debug kill` sub-command. This
|
||||
command will scrape all the available info and kill the process. See
|
||||
[Debugging](../tools/debugging.md) for the exact format.
|
||||
|
||||
You can inspect the resulting archive yourself or create an issue on
|
||||
[GitHub](https://github.com/tendermint/tendermint). Before opening an issue
|
||||
however, be sure to check if there's [no existing
|
||||
issue](https://github.com/tendermint/tendermint/issues) already.
|
||||
|
||||
## Monitoring Tendermint
|
||||
|
||||
Each Tendermint instance has a standard `/health` RPC endpoint, which responds
|
||||
with 200 (OK) if everything is fine and 500 (or no response) if something is
|
||||
wrong.
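
A monitoring probe built on this endpoint could look like the sketch below. The function name `health_verdict` and the variable `RPC_ADDR` are illustrative; the mapping itself only restates the rule above (200 is fine, everything else is not).

```sh
# Map the HTTP status code from /health to a short verdict.
# curl reports 000 for %{http_code} when it received no response at all.
health_verdict() {
  case "$1" in
    200) echo "ok" ;;
    *)   echo "unhealthy" ;;
  esac
}

# Usage (RPC_ADDR is a placeholder for the node's RPC address):
#   code="$(curl -s -o /dev/null -w '%{http_code}' "$RPC_ADDR/health")"
#   health_verdict "$code"
```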
|
||||
|
||||
Other useful endpoints include the previously mentioned `/status`, as well as `/net_info` and
|
||||
`/validators`.
|
||||
|
||||
Tendermint can also report and serve Prometheus metrics. See
|
||||
[Metrics](./metrics.md).
|
||||
|
||||
The `tendermint debug dump` sub-command can be used to periodically dump useful
|
||||
information into an archive. See [Debugging](../tools/debugging.md) for more
|
||||
information.
|
||||
|
||||
## What happens when my app dies
|
||||
|
||||
You are supposed to run Tendermint under a [process
|
||||
supervisor](https://en.wikipedia.org/wiki/Process_supervision) (like
|
||||
systemd or runit). It will ensure Tendermint is always running (despite
|
||||
possible errors).
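
As an illustration for systemd, the sketch below prints a minimal unit file. Everything specific in it is an assumption to adapt: the binary path, the service user, the `start` sub-command, the restart delay, and the `LimitNOFILE` value (taken from the open file limit example earlier in this document). The function name `print_unit` is hypothetical.

```sh
# Print a minimal systemd unit for a Tendermint node to stdout.
# $1: path to the tendermint binary, $2: user to run as (both are assumptions).
print_unit() {
  cat <<EOF
[Unit]
Description=Tendermint node
After=network-online.target

[Service]
User=$2
ExecStart=$1 start
Restart=on-failure
RestartSec=3
LimitNOFILE=8192

[Install]
WantedBy=multi-user.target
EOF
}

print_unit /usr/local/bin/tendermint tendermint
```

The output would typically be saved under `/etc/systemd/system/` and enabled with `systemctl enable --now`; `Restart=on-failure` is what brings the node back after a panic.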
|
||||
|
||||
Getting back to the original question, if your application dies,
|
||||
Tendermint will panic. After a process supervisor restarts your
|
||||
application, Tendermint should be able to reconnect successfully. The
|
||||
order of restart does not matter for it.
|
||||
|
||||
## Signal handling
|
||||
|
||||
We catch SIGINT and SIGTERM and try to clean up nicely. For other
|
||||
signals we use the default behavior in Go: [Default behavior of signals
|
||||
in Go
|
||||
programs](https://golang.org/pkg/os/signal/#hdr-Default_behavior_of_signals_in_Go_programs).
|
||||
|
||||
## Corruption
|
||||
|
||||
**NOTE:** Make sure you have a backup of the Tendermint data directory.
|
||||
|
||||
### Possible causes
|
||||
|
||||
Remember that most corruption is caused by hardware issues:
|
||||
|
||||
- RAID controllers with faulty / worn out battery backup, and an unexpected power loss
|
||||
- Hard disk drives with write-back cache enabled, and an unexpected power loss
|
||||
- Cheap SSDs with insufficient power-loss protection, and an unexpected power-loss
|
||||
- Defective RAM
|
||||
- Defective or overheating CPU(s)
|
||||
|
||||
Other causes can be:
|
||||
|
||||
- Database systems configured with fsync=off and an OS crash or power loss
|
||||
- Filesystems configured to use write barriers plus a storage layer that ignores write barriers. LVM is a particular culprit.
|
||||
- Tendermint bugs
|
||||
- Operating system bugs
|
||||
- Admin error (e.g., directly modifying Tendermint data-directory contents)
|
||||
|
||||
(Source: <https://wiki.postgresql.org/wiki/Corruption>)
|
||||
|
||||
### WAL Corruption
|
||||
|
||||
If the consensus WAL is corrupted at the latest height and you are trying to start
|
||||
Tendermint, replay will fail with a panic.
|
||||
|
||||
Recovering from data corruption can be hard and time-consuming. Here are two approaches you can take:
|
||||
|
||||
1. Delete the WAL file and restart Tendermint. It will attempt to sync with other peers.
2. Try to repair the WAL file manually:

1) Create a backup of the corrupted WAL file:

```sh
cp "$TMHOME/data/cs.wal/wal" /tmp/corrupted_wal_backup
```

2) Use `./scripts/wal2json` to create a human-readable version:

```sh
./scripts/wal2json/wal2json "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal
```

3) Search for a "CORRUPTED MESSAGE" line.
4) By looking at the previous message and the message after the corrupted one
and looking at the logs, try to rebuild the message. If the subsequent
messages are marked as corrupted too (this may happen if the length header
got corrupted or some writes did not make it to the WAL, i.e. truncation),
then remove all the lines starting from the corrupted one and restart
Tendermint.

```sh
$EDITOR /tmp/corrupted_wal
```

5) After editing, convert this file back into binary form by running:

```sh
./scripts/json2wal/json2wal /tmp/corrupted_wal "$TMHOME/data/cs.wal/wal"
```
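
Both approaches discard or overwrite the WAL, so it can help to keep timestamped copies rather than a single backup file. The helper below is a sketch of that idea; the name `backup_file` and the backup directory are assumptions, not part of Tendermint.

```sh
# Copy a file into a backup directory under a timestamped name and print the
# path of the copy. Fails without touching anything if the source is missing.
backup_file() {
  src="$1"
  dest_dir="$2"
  [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
  mkdir -p "$dest_dir" || return 1
  dest="$dest_dir/$(basename "$src").$(date +%Y%m%d%H%M%S).bak"
  cp -p "$src" "$dest" && echo "$dest"
}

# Usage (the backup directory is an assumption):
#   backup_file "$TMHOME/data/cs.wal/wal" /tmp/wal_backups
```

Stop the node before copying so that the WAL is not being written to while the backup is taken.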
|
||||
|
||||
## Hardware
|
||||
|
||||
### Processor and Memory
|
||||
|
||||
While actual specs vary depending on the load and validator count, the minimal
|
||||
requirements are:
|
||||
|
||||
- 1GB RAM
|
||||
- 25GB of disk space
|
||||
- 1.4 GHz CPU
|
||||
|
||||
SSD disks are preferable for applications with high transaction throughput.
|
||||
|
||||
Recommended:
|
||||
|
||||
- 2GB RAM
|
||||
- 100GB SSD
|
||||
- x64 2.0 GHz 2 vCPU
|
||||
|
||||
While for now, Tendermint stores all the history and it may require significant
|
||||
disk space over time, we are planning to implement state syncing (See [this
|
||||
issue](https://github.com/tendermint/tendermint/issues/828)). So, storing all
|
||||
the past blocks will not be necessary.
|
||||
|
||||
### Validator signing on 32 bit architectures (or ARM)
|
||||
|
||||
Both our `ed25519` and `secp256k1` implementations require constant time
|
||||
`uint64` multiplication. Non-constant time crypto can leak (and has leaked)
|
||||
private keys on both `ed25519` and `secp256k1`. Constant-time multiplication doesn't exist in hardware
|
||||
on 32 bit x86 platforms ([source](https://bearssl.org/ctmul.html)), and it
|
||||
depends on the compiler to enforce that it is constant time. It's unclear at
|
||||
this point whether the Golang compiler does this correctly for all
|
||||
implementations.
|
||||
|
||||
**We do not support nor recommend running a validator on 32 bit architectures OR
|
||||
the "VIA Nano 2000 Series", and the architectures in the ARM section rated
|
||||
"S-".**
|
||||
|
||||
### Operating Systems
|
||||
|
||||
Tendermint can be compiled for a wide range of operating systems thanks to the Go
|
||||
language (the list of \$OS/\$ARCH pairs can be found
|
||||
[here](https://golang.org/doc/install/source#environment)).
|
||||
|
||||
While we do not favor any operating system, more secure and stable Linux server
|
||||
distributions (like CentOS) should be preferred over desktop operating systems
|
||||
(like Mac OS).
|
||||
|
||||
### Miscellaneous
|
||||
|
||||
NOTE: if you are going to use Tendermint in a public domain, make sure
|
||||
you read [hardware recommendations](https://cosmos.network/validators) for a validator in the
|
||||
Cosmos network.
|
||||
|
||||
## Configuration parameters
|
||||
|
||||
- `p2p.flush-throttle-timeout`
|
||||
- `p2p.max-packet-msg-payload-size`
|
||||
- `p2p.send-rate`
|
||||
- `p2p.recv-rate`
|
||||
|
||||
If you are going to use Tendermint in a private domain and you have a
|
||||
private high-speed network among your peers, it makes sense to lower
|
||||
flush throttle timeout and increase other params.
|
||||
|
||||
```toml
[p2p]
send-rate=20000000 # 20MB/s
recv-rate=20000000 # 20MB/s
flush-throttle-timeout=10
max-packet-msg-payload-size=10240 # 10KB
```
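
To sanity-check values like these, the arithmetic can be done in the shell. The sketch assumes the rates are expressed in bytes per second and the payload size in bytes; the helper names `bytes_to_mb` and `bytes_to_kib` are illustrative.

```sh
# Convert bytes (or bytes per second) to whole decimal megabytes.
bytes_to_mb() {
  echo $(( $1 / 1000000 ))
}

# Convert bytes to whole kibibytes (1 KiB = 1024 bytes).
bytes_to_kib() {
  echo $(( $1 / 1024 ))
}

bytes_to_mb 20000000   # prints 20
bytes_to_kib 10240     # prints 10
```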
|
||||
|
||||
- `mempool.recheck`
|
||||
|
||||
After every block, Tendermint rechecks every transaction left in the
|
||||
mempool to see if transactions committed in that block affected the
|
||||
application state, so some of the transactions left may become invalid.
|
||||
If that does not apply to your application, you can disable it by
|
||||
setting `mempool.recheck=false`.
|
||||
|
||||
- `mempool.broadcast`
|
||||
|
||||
Setting this to false will stop the mempool from relaying transactions
|
||||
to other peers until they are included in a block. It means only the
|
||||
peer you send the tx to will see it until it is included in a block.
|
||||
|
||||
- `consensus.skip-timeout-commit`
|
||||
|
||||
We want `skip-timeout-commit=false` when there is economics on the line
|
||||
because proposers should wait to hear more votes. But if you don't
|
||||
care about that and want the fastest consensus, you can skip it. It will
|
||||
be kept false by default for public deployments (e.g. [Cosmos
|
||||
Hub](https://cosmos.network/intro/hub)) while for enterprise
|
||||
applications, setting it to true is not a problem.
|
||||
|
||||
- `consensus.peer-gossip-sleep-duration`
|
||||
|
||||
You can try to reduce the time your node sleeps before checking if
|
||||
there's something to send to its peers.
|
||||
|
||||
- `consensus.timeout-commit`
|
||||
|
||||
You can also try lowering `timeout-commit` (time we sleep before
|
||||
proposing the next block).
|
||||
|
||||
- `p2p.addr-book-strict`
|
||||
|
||||
By default, Tendermint checks whether a peer's address is routable before
|
||||
saving it to the address book. The address is considered routable if the IP
|
||||
is [valid and within allowed
|
||||
ranges](https://github.com/tendermint/tendermint/blob/27bd1deabe4ba6a2d9b463b8f3e3f1e31b993e61/p2p/netaddress.go#L209).
|
||||
|
||||
This may not be the case for private or local networks, where your IP range is usually
|
||||
strictly limited and private. In that case, you need to set `addr-book-strict`
|
||||
to `false` (turn it off).
|
||||
|
||||
- `rpc.max-open-connections`
|
||||
|
||||
By default, the number of simultaneous connections is limited because most operating systems
|
||||
give you a limited number of file descriptors.
|
||||
|
||||
If you want to accept a greater number of connections, you will need to increase
|
||||
these limits.
|
||||
|
||||
[Sysctls to tune the system to be able to open more connections](https://github.com/satori-com/tcpkali/blob/master/doc/tcpkali.man.md#sysctls-to-tune-the-system-to-be-able-to-open-more-connections)
|
||||
|
||||
The process file limits must also be increased, e.g. via `ulimit -n 8192`.
|
||||
|
||||
...for N connections, such as 50k:
|
||||
|
||||
```md
kern.maxfiles=10000+2*N         # BSD
kern.maxfilesperproc=100+2*N    # BSD
kern.ipc.maxsockets=10000+2*N   # BSD
fs.file-max=10000+2*N           # Linux
net.ipv4.tcp_max_orphans=N      # Linux

# For load-generating clients.
net.ipv4.ip_local_port_range="10000 65535"  # Linux.
net.inet.ip.portrange.first=10000  # BSD/Mac.
net.inet.ip.portrange.last=65535   # (Enough for N < 55535)
net.ipv4.tcp_tw_reuse=1         # Linux
net.inet.tcp.maxtcptw=2*N       # BSD

# If using netfilter on Linux:
net.netfilter.nf_conntrack_max=N
echo $((N/8)) > /sys/module/nf_conntrack/parameters/hashsize
```
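
To turn the formulas above into concrete numbers, the Linux entries can be evaluated for a given N. This is a sketch only: the function name `linux_sysctls` is illustrative, it prints values rather than applying them, and it covers just the server-side Linux settings from the list.

```sh
# Print the Linux values from the list above for N connections ($1).
linux_sysctls() {
  n="$1"
  echo "fs.file-max=$((10000 + 2 * $n))"
  echo "net.ipv4.tcp_max_orphans=$n"
  echo "net.netfilter.nf_conntrack_max=$n"
  echo "nf_conntrack hashsize=$(($n / 8))"
}

linux_sysctls 50000
```

For N = 50000 this gives `fs.file-max=110000` and a conntrack hash size of 6250.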
|
||||
|
||||
A similar option exists for limiting the number of gRPC connections -
|
||||
`rpc.grpc-max-open-connections`.
|
||||
@@ -1,8 +1,8 @@
|
||||
---
|
||||
order: 1
|
||||
parent:
|
||||
title: Tendermint Core
|
||||
order: 4
|
||||
title: System
|
||||
order: 5
|
||||
---
|
||||
|
||||
# Overview
|
||||
|
||||
@@ -1,394 +1,7 @@
|
||||
---
|
||||
order: 4
|
||||
order: false
|
||||
---
|
||||
|
||||
# Running in production
|
||||
# Running In Production
|
||||
|
||||
If you are building Tendermint from source for use in production, make sure to check out an appropriate Git tag instead of a branch.
|
||||
|
||||
## Database
|
||||
|
||||
By default, Tendermint uses the `syndtr/goleveldb` package for its in-process
|
||||
key-value database. If you want maximal performance, it may be best to install
|
||||
the real C-implementation of LevelDB and compile Tendermint to use that using
|
||||
`make build TENDERMINT_BUILD_OPTIONS=cleveldb`. See the [install
|
||||
instructions](../introduction/install.md) for details.
|
||||
|
||||
Tendermint keeps multiple distinct databases in the `$TMROOT/data`:
|
||||
|
||||
- `blockstore.db`: Keeps the entire blockchain - stores blocks,
|
||||
block commits, and block meta data, each indexed by height. Used to sync new
|
||||
peers.
|
||||
- `evidence.db`: Stores all verified evidence of misbehaviour.
|
||||
- `state.db`: Stores the current blockchain state (i.e. height, validators,
|
||||
consensus params). Only grows if consensus params or validators change. Also
|
||||
used to temporarily store intermediate results during block processing.
|
||||
- `tx_index.db`: Indexes txs (and their results) by tx hash and by DeliverTx result events.
|
||||
|
||||
By default, Tendermint will only index txs by their hash and height, not by their DeliverTx
|
||||
result events. See [indexing transactions](../app-dev/indexing-transactions.md) for
|
||||
details.
|
||||
|
||||
Applications can expose block pruning strategies to the node operator. Please read the documentation of your application
|
||||
to find out more details.
|
||||
|
||||
Applications can use [state sync](state-sync.md) to help nodes bootstrap quickly.
|
||||
|
||||
## Logging
|
||||
|
||||
Default logging level (`log-level = "main:info,state:info,statesync:info,*:error"`) should suffice for
|
||||
normal operation mode. Read [this
|
||||
post](https://blog.cosmos.network/one-of-the-exciting-new-features-in-0-10-0-release-is-smart-log-level-flag-e2506b4ab756)
|
||||
for details on how to configure `log-level` config variable. Some of the
|
||||
modules can be found [here](../nodes/logging#list-of-modules). If
|
||||
you're trying to debug Tendermint or asked to provide logs with debug
|
||||
logging level, you can do so by running Tendermint with
|
||||
`--log-level="*:debug"`.
|
||||
|
||||
## Write Ahead Logs (WAL)
|
||||
|
||||
Tendermint uses write ahead logs for the consensus (`cs.wal`) and the mempool
|
||||
(`mempool.wal`). Both WALs have a max size of 1GB and are automatically rotated.
|
||||
|
||||
### Consensus WAL
|
||||
|
||||
The `consensus.wal` is used to ensure we can recover from a crash at any point
|
||||
in the consensus state machine.
|
||||
It writes all consensus messages (timeouts, proposals, block part, or vote)
|
||||
to a single file, flushing to disk before processing messages from its own
|
||||
validator. Since Tendermint validators are expected to never sign a conflicting vote, the
|
||||
WAL ensures we can always recover deterministically to the latest state of the consensus without
|
||||
using the network or re-signing any consensus messages.
|
||||
|
||||
If your `consensus.wal` is corrupted, see [below](#wal-corruption).
|
||||
|
||||
### Mempool WAL
|
||||
|
||||
The `mempool.wal` logs all incoming txs before running CheckTx, but is
|
||||
otherwise not used in any programmatic way. It's just a kind of manual
|
||||
safeguard. Note the mempool provides no durability guarantees - a tx sent to one or many nodes
|
||||
may never make it into the blockchain if those nodes crash before being able to
|
||||
propose it. Clients must monitor their txs by subscribing over websockets,
|
||||
polling for them, or using `/broadcast_tx_commit`. In the worst case, txs can be
|
||||
resent from the mempool WAL manually.
|
||||
|
||||
For the above reasons, the `mempool.wal` is disabled by default. To enable, set
|
||||
`mempool.wal-dir` to where you want the WAL to be located (e.g.
|
||||
`data/mempool.wal`).
|
||||
|
||||
## DOS Exposure and Mitigation
|
||||
|
||||
Validators are supposed to set up a [Sentry Node
|
||||
Architecture](./validators.md)
|
||||
to prevent Denial-of-service attacks.
|
||||
|
||||
### P2P
|
||||
|
||||
The core of the Tendermint peer-to-peer system is `MConnection`. Each
|
||||
connection has `MaxPacketMsgPayloadSize`, which is the maximum packet
|
||||
size and bounded send & receive queues. One can impose restrictions on
|
||||
send & receive rate per connection (`SendRate`, `RecvRate`).
|
||||
|
||||
The number of open P2P connections can become quite large, and hit the operating system's open
|
||||
file limit (since TCP connections are considered files on UNIX-based systems). Nodes should be
|
||||
given a sizable open file limit, e.g. 8192, via `ulimit -n 8192` or other deployment-specific
|
||||
mechanisms.
|
||||
|
||||
### RPC
|
||||
|
||||
Endpoints returning multiple entries are limited by default to return 30
|
||||
elements (100 max). See the [RPC Documentation](https://docs.tendermint.com/master/rpc/)
|
||||
for more information.
|
||||
|
||||
Rate limiting and authentication are other key measures that help protect
|
||||
against DOS attacks. Validators are supposed to use external tools like
|
||||
[NGINX](https://www.nginx.com/blog/rate-limiting-nginx/) or
|
||||
[traefik](https://docs.traefik.io/middlewares/ratelimit/)
|
||||
to achieve the same things.
|
||||
|
||||
## Debugging Tendermint
|
||||
|
||||
If you ever have to debug Tendermint, the first thing you should probably do is
|
||||
check out the logs. See [Logging](../nodes/logging.md), where we
|
||||
explain what certain log statements mean.
|
||||
|
||||
If, after skimming through the logs, things are not clear still, the next thing
|
||||
to try is querying the `/status` RPC endpoint. It provides the necessary info:
|
||||
whether the node is syncing or not, what height it is on, etc.
|
||||
|
||||
```bash
|
||||
curl http(s)://{ip}:{rpcPort}/status
|
||||
```
|
||||
|
||||
`/dump_consensus_state` will give you a detailed overview of the consensus
|
||||
state (proposer, latest validators, peers states). From it, you should be able
|
||||
to figure out why, for example, the network had halted.
|
||||
|
||||
```bash
|
||||
curl http(s)://{ip}:{rpcPort}/dump_consensus_state
|
||||
```
|
||||
|
||||
There is a reduced version of this endpoint - `/consensus_state`, which returns
|
||||
just the votes seen at the current height.
|
||||
|
||||
If, after consulting with the logs and above endpoints, you still have no idea
|
||||
what's happening, consider using `tendermint debug kill` sub-command. This
|
||||
command will scrape all the available info and kill the process. See
|
||||
[Debugging](../tools/debugging.md) for the exact format.
|
||||
|
||||
You can inspect the resulting archive yourself or create an issue on
|
||||
[Github](https://github.com/tendermint/tendermint). Before opening an issue
|
||||
however, be sure to check if there's [no existing
|
||||
issue](https://github.com/tendermint/tendermint/issues) already.
|
||||
|
||||
## Monitoring Tendermint
|
||||
|
||||
Each Tendermint instance has a standard `/health` RPC endpoint, which responds
|
||||
with 200 (OK) if everything is fine and 500 (or no response) - if something is
|
||||
wrong.
|
||||
|
||||
Other useful endpoints include the previously mentioned `/status`, as well as `/net_info` and
|
||||
`/validators`.
|
||||
|
||||
Tendermint also can report and serve Prometheus metrics. See
|
||||
[Metrics](./metrics.md).
|
||||
|
||||
`tendermint debug dump` sub-command can be used to periodically dump useful
|
||||
information into an archive. See [Debugging](../tools/debugging.md) for more
|
||||
information.
|
||||
|
||||
## What happens when my app dies
|
||||
|
||||
You are supposed to run Tendermint under a [process
|
||||
supervisor](https://en.wikipedia.org/wiki/Process_supervision) (like
|
||||
systemd or runit). It will ensure Tendermint is always running (despite
|
||||
possible errors).
|
||||
|
||||
Getting back to the original question, if your application dies,
|
||||
Tendermint will panic. After a process supervisor restarts your
|
||||
application, Tendermint should be able to reconnect successfully. The
|
||||
order of restart does not matter for it.
|
||||
|
||||
## Signal handling
|
||||
|
||||
We catch SIGINT and SIGTERM and try to clean up nicely. For other
|
||||
signals we use the default behavior in Go: [Default behavior of signals
|
||||
in Go
|
||||
programs](https://golang.org/pkg/os/signal/#hdr-Default_behavior_of_signals_in_Go_programs).
|
||||
|
||||
## Corruption
|
||||
|
||||
**NOTE:** Make sure you have a backup of the Tendermint data directory.
|
||||
|
||||
### Possible causes
|
||||
|
||||
Remember that most corruption is caused by hardware issues:
|
||||
|
||||
- RAID controllers with faulty / worn out battery backup, and an unexpected power loss
|
||||
- Hard disk drives with write-back cache enabled, and an unexpected power loss
|
||||
- Cheap SSDs with insufficient power-loss protection, and an unexpected power-loss
|
||||
- Defective RAM
|
||||
- Defective or overheating CPU(s)
|
||||
|
||||
Other causes can be:
|
||||
|
||||
- Database systems configured with fsync=off and an OS crash or power loss
|
||||
- Filesystems configured to use write barriers plus a storage layer that ignores write barriers. LVM is a particular culprit.
|
||||
- Tendermint bugs
|
||||
- Operating system bugs
|
||||
- Admin error (e.g., directly modifying Tendermint data-directory contents)
|
||||
|
||||
(Source: <https://wiki.postgresql.org/wiki/Corruption>)
|
||||
|
||||
### WAL Corruption
|
||||
|
||||
If consensus WAL is corrupted at the latest height and you are trying to start
|
||||
Tendermint, replay will fail with panic.
|
||||
|
||||
Recovering from data corruption can be hard and time-consuming. Here are two approaches you can take:
|
||||
|
||||
1. Delete the WAL file and restart Tendermint. It will attempt to sync with other peers.
|
||||
2. Try to repair the WAL file manually:
|
||||
|
||||
1) Create a backup of the corrupted WAL file:
|
||||
|
||||
```sh
cp "$TMHOME/data/cs.wal/wal" /tmp/corrupted_wal_backup
```
|
||||
|
||||
2) Use `./scripts/wal2json` to create a human-readable version:
|
||||
|
||||
```sh
|
||||
./scripts/wal2json/wal2json "$TMHOME/data/cs.wal/wal" > /tmp/corrupted_wal
|
||||
```
|
||||
|
||||
3) Search for a "CORRUPTED MESSAGE" line.
|
||||
4) By looking at the previous message and the message after the corrupted one
|
||||
and looking at the logs, try to rebuild the message. If the consequent
|
||||
messages are marked as corrupted too (this may happen if length header
|
||||
got corrupted or some writes did not make it to the WAL ~ truncation),
|
||||
then remove all the lines starting from the corrupted one and restart
|
||||
Tendermint.
|
||||
|
||||
```sh
|
||||
$EDITOR /tmp/corrupted_wal
|
||||
```
|
||||
|
||||
5) After editing, convert this file back into binary form by running:
|
||||
|
||||
```sh
|
||||
./scripts/json2wal/json2wal /tmp/corrupted_wal $TMHOME/data/cs.wal/wal
|
||||
```
|
||||
|
||||
## Hardware
|
||||
|
||||
### Processor and Memory
|
||||
|
||||
While actual specs vary depending on the load and validators count, minimal
|
||||
requirements are:
|
||||
|
||||
- 1GB RAM
|
||||
- 25GB of disk space
|
||||
- 1.4 GHz CPU
|
||||
|
||||
SSD disks are preferable for applications with high transaction throughput.
|
||||
|
||||
Recommended:
|
||||
|
||||
- 2GB RAM
|
||||
- 100GB SSD
|
||||
- x64 2.0 GHz 2v CPU
|
||||
|
||||
While for now, Tendermint stores all the history and it may require significant
|
||||
disk space over time, we are planning to implement state syncing (See [this
|
||||
issue](https://github.com/tendermint/tendermint/issues/828)). So, storing all
|
||||
the past blocks will not be necessary.
|
||||
|
||||
### Validator signing on 32 bit architectures (or ARM)
|
||||
|
||||
Both our `ed25519` and `secp256k1` implementations require constant-time
`uint64` multiplication. Non-constant-time crypto can leak (and has leaked)
private keys on both `ed25519` and `secp256k1`. Constant-time `uint64`
multiplication doesn't exist in hardware on 32 bit x86 platforms
([source](https://bearssl.org/ctmul.html)), so it is up to the compiler to
enforce that the operation is constant time. It's unclear at this point
whether the Golang compiler does this correctly for all implementations.
|
||||
|
||||
**We neither support nor recommend running a validator on 32 bit architectures,
the "VIA Nano 2000 Series", or the architectures in the ARM section rated
"S-".**
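
As a rough pre-flight check, the target architecture of the Go toolchain can be compared against known 32 bit values. This is a sketch under stated assumptions: the function name `is_32bit_goarch` is hypothetical, and the list (`386`, `arm`, `mips`, `mipsle`) covers common 32 bit `GOARCH` values but is not exhaustive. A 64 bit result does not by itself mean the platform is supported; the other exclusions above still apply.

```sh
# Hypothetical helper: succeed (exit status 0) if the given GOARCH value
# is one of the common 32 bit targets. The list is not exhaustive.
is_32bit_goarch() {
  case "$1" in
    386|arm|mips|mipsle) return 0 ;;
    *) return 1 ;;
  esac
}
```

With Go installed, it could be used as `is_32bit_goarch "$(go env GOARCH)" && echo "32 bit target: do not run a validator here"`.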
|
||||
|
||||
### Operating Systems
|
||||
|
||||
Tendermint can be compiled for a wide range of operating systems thanks to the
Go language (the list of \$OS/\$ARCH pairs can be found
[here](https://golang.org/doc/install/source#environment)).
|
||||
|
||||
While we do not favor any operating system, more secure and stable Linux server
distributions (like CentOS) should be preferred over desktop operating systems
(like macOS).
|
||||
|
||||
### Miscellaneous
|
||||
|
||||
NOTE: if you are going to use Tendermint on a public network, make sure
you read the [hardware recommendations](https://cosmos.network/validators) for
a validator in the Cosmos network.
|
||||
|
||||
## Configuration parameters
|
||||
|
||||
- `p2p.flush-throttle-timeout`
|
||||
- `p2p.max-packet-msg-payload-size`
|
||||
- `p2p.send-rate`
|
||||
- `p2p.recv-rate`
|
||||
|
||||
If you are going to use Tendermint on a private network and you have a
private high-speed network among your peers, it makes sense to lower the
flush throttle timeout and increase the other parameters.
|
||||
|
||||
```toml
[p2p]
send-rate=20000000 # 20MB/s
recv-rate=20000000 # 20MB/s
flush-throttle-timeout=10
max-packet-msg-payload-size=10240 # 10KB
```
|
||||
|
||||
- `mempool.recheck`
|
||||
|
||||
After every block, Tendermint rechecks every transaction left in the
mempool, because the transactions committed in that block may have changed
the application state and made some of the remaining transactions invalid.
If that does not apply to your application, you can disable it by
setting `mempool.recheck=false`.
|
||||
|
||||
- `mempool.broadcast`
|
||||
|
||||
Setting this to `false` stops the mempool from relaying transactions
to other peers. Only the peer you send a transaction to will see it
until it is included in a block.
|
||||
|
||||
- `consensus.skip-timeout-commit`
|
||||
|
||||
We want `skip-timeout-commit=false` when there are economic stakes on the
line, because proposers should wait to hear more votes. If you don't
care about that and want the fastest consensus, you can skip the wait. It
will be kept `false` by default for public deployments (e.g. [Cosmos
Hub](https://cosmos.network/intro/hub)), while for enterprise
applications, setting it to `true` is not a problem.
|
||||
|
||||
- `consensus.peer-gossip-sleep-duration`
|
||||
|
||||
You can try to reduce the time your node sleeps before checking if
there is something to send to its peers.
|
||||
|
||||
- `consensus.timeout-commit`
|
||||
|
||||
You can also try lowering `timeout-commit` (time we sleep before
|
||||
proposing the next block).
|
||||
|
||||
- `p2p.addr-book-strict`
|
||||
|
||||
By default, Tendermint checks whether a peer's address is routable before
saving it to the address book. The address is considered routable if the IP
is [valid and within allowed
ranges](https://github.com/tendermint/tendermint/blob/27bd1deabe4ba6a2d9b463b8f3e3f1e31b993e61/p2p/netaddress.go#L209).
|
||||
|
||||
This may not be the case for private or local networks, where your IP range is
usually strictly limited and private. In that case, you need to set
`addr-book-strict` to `false` (turn it off).
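
To judge whether your peers fall into that category, you can check their addresses against the RFC 1918 private IPv4 ranges (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`). The helper below is a hypothetical sketch, not Tendermint's own routability check: it matches only those three ranges by prefix, does no validation of the input, and ignores IPv6.

```sh
# Hypothetical helper: succeed (exit status 0) if the dotted-quad IPv4
# address falls in one of the RFC 1918 private ranges.
is_private_ipv4() {
  case "$1" in
    10.*) return 0 ;;                                        # 10.0.0.0/8
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;       # 172.16.0.0/12
    192.168.*) return 0 ;;                                   # 192.168.0.0/16
    *) return 1 ;;
  esac
}
```

If `is_private_ipv4` succeeds for the addresses your peers advertise, the strict address book is likely to reject them, which is the situation where turning `addr-book-strict` off applies.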
|
||||
|
||||
- `rpc.max-open-connections`
|
||||
|
||||
By default, the number of simultaneous connections is limited because most
operating systems give you a limited number of file descriptors.

If you want to accept a greater number of connections, you will need to
increase these limits.
|
||||
|
||||
[Sysctls to tune the system to be able to open more connections](https://github.com/satori-com/tcpkali/blob/master/doc/tcpkali.man.md#sysctls-to-tune-the-system-to-be-able-to-open-more-connections)
|
||||
|
||||
The process file limits must also be increased, e.g. via `ulimit -n 8192`.
|
||||
|
||||
For N connections (such as 50k), use the following settings:
|
||||
|
||||
```md
kern.maxfiles=10000+2*N # BSD
kern.maxfilesperproc=100+2*N # BSD
kern.ipc.maxsockets=10000+2*N # BSD
fs.file-max=10000+2*N # Linux
net.ipv4.tcp_max_orphans=N # Linux

# For load-generating clients.
net.ipv4.ip_local_port_range="10000 65535" # Linux.
net.inet.ip.portrange.first=10000 # BSD/Mac.
net.inet.ip.portrange.last=65535 # (Enough for N < 55535)
net.ipv4.tcp_tw_reuse=1 # Linux
net.inet.tcp.maxtcptw=2*N # BSD

# If using netfilter on Linux:
net.netfilter.nf_conntrack_max=N
echo $((N/8)) > /sys/module/nf_conntrack/parameters/hashsize
```
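
The formulas in the listing above are easy to evaluate with shell arithmetic. The following sketch prints the Linux values for a given N; the function name `linux_sysctls_for` is hypothetical, and the output is meant to be reviewed before being applied with `sysctl -w` or added to your sysctl configuration.

```sh
# Hypothetical helper: print the Linux sysctl values from the listing
# above for N simultaneous connections.
linux_sysctls_for() {
  n=$1
  echo "fs.file-max=$((10000 + 2 * n))"
  echo "net.ipv4.tcp_max_orphans=$n"
  echo "net.netfilter.nf_conntrack_max=$n"   # only relevant with netfilter
}

linux_sysctls_for 50000
```

For N = 50000 this gives `fs.file-max=110000`, with the orphan and conntrack limits both at 50000. The conntrack hash size from the last line of the listing would be 50000/8 = 6250.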
|
||||
|
||||
A similar option exists for limiting the number of gRPC connections:
`rpc.grpc-max-open-connections`.
|
||||
This file has moved to the [nodes section](../nodes/running-in-production.md).
|
||||
@@ -2,7 +2,7 @@
|
||||
order: 1
|
||||
parent:
|
||||
title: Tooling
|
||||
order: 6
|
||||
order: 8
|
||||
---
|
||||
|
||||
# Overview
|
||||
|
||||