QA Process report for v0.37.x (and baseline for v0.34.x) (backport #9499) (#9578)

* QA Process report for v0.37.x (and baseline for v0.34.x) (#9499)

* 1st version. 200 nodes. Missing rotating node

* Small fixes

* Addressed @jmalicevic's comment

* Explain in method how to set the tmint version to test. Improve result section

* 1st version of how to run the 'rotating node' testnet

* Apply suggestions from @williambanfield

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>

* Addressed @williambanfield's comments

* Added reference to Unix load metric

* Added total TXs

* Fixed some 'png's that got swapped. Excluded '.*-node-exporter' processes from memory plots

* Report for rotating node

* Adressed remaining comments from @williambanfield

* Cosmetic

* Addressed some of @thanethomson's comments

* Re-executed the 200 node tests and updated the corresponding sections of the report

* Ignore Python virtualenv directories

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Add latency vs throughput script

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Add README for latency vs throughput script

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Fix local links to folders

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* v034: only have one level-1 heading

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Adjust headings

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* v0.37.x: add links to issues/PRs

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* v0.37.x: add note about bug being present in v0.34

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* method: adjust heading depths

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Show data points on latency vs throughput plot

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Add latency vs throughput plots

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Correct mentioning of v0.34.21 and add heading

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Refactor latency vs throughput script

Update the latency vs throughput script to rather generate plots from
the "raw" CSV output from the loadtime reporting tool as opposed to the
separated CSV files from the experimental method.

Also update the relevant documentation, and regenerate the images from
the raw CSV data (resulting in pretty much the same plots as the
previous ones).

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Remove unused default duration const

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Adjust experiment start time to be more accurate and re-plot latency vs throughput

Signed-off-by: Thane Thomson <connect@thanethomson.com>

* Addressed @williambanfield's comments

* Apply suggestions from code review

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>

* scripts: Update latency vs throughput readme for clarity

Signed-off-by: Thane Thomson <connect@thanethomson.com>

Signed-off-by: Thane Thomson <connect@thanethomson.com>
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Co-authored-by: Thane Thomson <connect@thanethomson.com>
(cherry picked from commit b06e1cea54)

* Remove v037 dir

* Removed reference to v0.37 testnets

Co-authored-by: Sergio Mena <sergio@informal.systems>
This commit is contained in:
mergify[bot]
2022-10-18 06:26:07 -04:00
committed by GitHub
parent 3ec6e424d6
commit dbf22de42b
29 changed files with 797 additions and 0 deletions

278
docs/qa/v034/README.md Normal file
View File

@@ -0,0 +1,278 @@
---
order: 1
parent:
title: Tendermint Quality Assurance Results for v0.34.x
description: This is a report on the results obtained when running v0.34.x on testnets
order: 2
---
# v0.34.x
## 200 Node Testnet
### Finding the Saturation Point
The first goal when examining the results of the tests is identifying the saturation point.
The saturation point is a setup with a transaction load big enough to prevent the testnet
from being stable: the load runner tries to produce slightly more transactions than can
be processed by the testnet.
The following table summarizes the results for v0.34.x, for the different experiments
(extracted from file [`v034_report_tabbed.txt`](./img/v034_report_tabbed.txt)).
The X axis of this table is `c`, the number of connections created by the load runner process to the target node.
The Y axis of this table is `r`, the rate or number of transactions issued per second.
| | c=1 | c=2 | c=4 |
| :--- | ----: | ----: | ----: |
| r=25 | 2225 | 4450 | 8900 |
| r=50 | 4450 | 8900 | 17800 |
| r=100 | 8900 | 17800 | 35600 |
| r=200 | 17800 | 35600 | 38660 |
The table shows the number of 1024-byte-long transactions that were produced by the load runner,
and processed by Tendermint, during the 90 seconds of the experiment's duration.
Each cell in the table refers to an experiment with a particular number of websocket connections (`c`)
to a chosen validator, and the number of transactions per second that the load runner
tries to produce (`r`). Note that the overall load that the tool attempts to generate is $c \cdot r$.
We can see that the saturation point is beyond the diagonal that spans cells
* `r=200,c=2`
* `r=100,c=4`
given that the total transactions should be close to the product of the rate, the number of connections,
and the experiment time (89 seconds, since the last batch never gets sent).
All experiments below the saturation diagonal (`r=200,c=4`) have in common that the total
number of transactions processed is noticeably less than the product $c \cdot r \cdot 89$,
which is the expected number of transactions when the system is able to deal well with the
load.
With `r=200,c=4`, we obtained 38660 whereas the theoretical number of transactions should
have been $200 \cdot 4 \cdot 89 = 71200$.
At this point, we chose an experiment at the limit of the saturation diagonal,
in order to further study the performance of this release.
**The chosen experiment is `r=200,c=2`**.
This is a plot of the CPU load (average over 1 minute, as output by `top`) of the load runner for `r=200,c=2`,
where we can see that the load stays close to 0 most of the time.
![load-load-runner](./img/v034_r200c2_load-runner.png)
### Examining latencies
The method described [here](../method.md) allows us to plot the latencies of transactions
for all experiments.
![all-latencies](./img/v034_200node_latencies.png)
As we can see, even the experiments beyond the saturation diagonal managed to keep
transaction latency stable (i.e. not constantly increasing).
Our interpretation for this is that contention within Tendermint was propagated,
via the websockets, to the load runner,
hence the load runner could not produce the target load, but a fraction of it.
Further examination of the Prometheus data (see below), showed that the mempool contained many transactions
at steady state, but did not grow much without quickly returning to this steady state. This demonstrates
that the transactions were able to be processed by the Tendermint network at least as quickly as they
were submitted to the mempool. Finally, the test script made sure that, at the end of an experiment, the
mempool was empty so that all transactions submitted to the chain were processed.
Finally, the number of points present in the plot appears to be much less than expected given the
number of transactions in each experiment, particularly close to or above the saturation diagonal.
This is a visual effect of the plot; what appear to be points in the plot are actually potentially huge
clusters of points. To corroborate this, we have zoomed in the plot above by setting (carefully chosen)
tiny axis intervals. The cluster shown below looks like a single point in the plot above.
![all-latencies-zoomed](./img/v034_200node_latencies_zoomed.png)
The plot of latencies can we used as a baseline to compare with other releases.
The following plot summarizes average latencies versus overall throughputs
across different numbers of WebSocket connections to the node into which
transactions are being loaded.
![latency-vs-throughput](./img/v034_latency_throughput.png)
### Prometheus Metrics on the Chosen Experiment
As mentioned [above](#finding-the-saturation-point), the chosen experiment is `r=200,c=2`.
This section further examines key metrics for this experiment extracted from Prometheus data.
#### Mempool Size
The mempool size, a count of the number of transactions in the mempool, was shown to be stable and homogeneous
at all full nodes. It did not exhibit any unconstrained growth.
The plot below shows the evolution over time of the cumulative number of transactions inside all full nodes' mempools
at a given time.
The two spikes that can be observed correspond to a period where consensus instances proceeded beyond the initial round
at some nodes.
![mempool-cumulative](./img/v034_r200c2_mempool_size.png)
The plot below shows evolution of the average over all full nodes, which oscillates between 1500 and 2000
outstanding transactions.
![mempool-avg](./img/v034_r200c2_mempool_size_avg.png)
The peaks observed coincide with the moments when some nodes proceeded beyond the initial round of consensus (see below).
#### Peers
The number of peers was stable at all nodes.
It was higher for the seed nodes (around 140) than for the rest (between 21 and 74).
The fact that non-seed nodes reach more than 50 peers is due to #9548.
![peers](./img/v034_r200c2_peers.png)
#### Consensus Rounds per Height
Most heights took just one round, but some nodes needed to advance to round 1 at some point.
![rounds](./img/v034_r200c2_rounds.png)
#### Blocks Produced per Minute, Transactions Processed per Minute
The blocks produced per minute are the slope of this plot.
![heights](./img/v034_r200c2_heights.png)
Over a period of 2 minutes, the height goes from 530 to 569.
This results in an average of 19.5 blocks produced per minute.
The transactions processed per minute are the slope of this plot.
![total-txs](./img/v034_r200c2_total-txs.png)
Over a period of 2 minutes, the total goes from 64525 to 100125 transactions,
resulting in 17800 transactions per minute. However, we can see in the plot that
all transactions in the load are processed long before the two minutes.
If we adjust the time window when transactions are processed (approx. 105 seconds),
we obtain 20343 transactions per minute.
#### Memory Resident Set Size
Resident Set Size of all monitored processes is plotted below.
![rss](./img/v034_r200c2_rss.png)
The average over all processes oscillates around 1.2 GiB and does not demonstrate unconstrained growth.
![rss-avg](./img/v034_r200c2_rss_avg.png)
#### CPU utilization
The best metric from Prometheus to gauge CPU utilization in a Unix machine is `load1`,
as it usually appears in the
[output of `top`](https://www.digitalocean.com/community/tutorials/load-average-in-linux).
![load1](./img/v034_r200c2_load1.png)
It is contained in most cases below 5, which is generally considered acceptable load.
### Test Result
**Result: N/A** (v0.34.x is the baseline)
Date: 2022-10-14
Version: 3ec6e424d6ae4c96867c2dcf8310572156068bb6
## Rotating Node Testnet
For this testnet, we will use a load that can safely be considered below the saturation
point for the size of this testnet (between 13 and 38 full nodes): `c=4,r=800`.
N.B.: The version of Tendermint used for these tests is affected by #9539.
However, the reduced load that reaches the mempools is orthogonal to functionality
we are focusing on here.
### Latencies
The plot of all latencies can be seen in the following plot.
![rotating-all-latencies](./img/v034_rotating_latencies.png)
We can observe there are some very high latencies, towards the end of the test.
Upon suspicion that they are duplicate transactions, we examined the latencies
raw file and discovered there are more than 100K duplicate transactions.
The following plot shows the latencies file where all duplicate transactions have
been removed, i.e., only the first occurrence of a duplicate transaction is kept.
![rotating-all-latencies-uniq](./img/v034_rotating_latencies_uniq.png)
This problem, existing in `v0.34.x`, will need to be addressed, perhaps in the same way
we addressed it when running the 200 node test with high loads: increasing the `cache_size`
configuration parameter.
### Prometheus Metrics
The set of metrics shown here are less than for the 200 node experiment.
We are only interested in those for which the catch-up process (blocksync) may have an impact.
#### Blocks and Transactions per minute
Just as shown for the 200 node test, the blocks produced per minute are the gradient of this plot.
![rotating-heights](./img/v034_rotating_heights.png)
Over a period of 5229 seconds, the height goes from 2 to 3638.
This results in an average of 41 blocks produced per minute.
The following plot shows only the heights reported by ephemeral nodes
(which are also included in the plot above). Note that the _height_ metric
is only showed _once the node has switched to consensus_, hence the gaps
when nodes are killed, wiped out, started from scratch, and catching up.
![rotating-heights-ephe](./img/v034_rotating_heights_ephe.png)
The transactions processed per minute are the gradient of this plot.
![rotating-total-txs](./img/v034_rotating_total-txs.png)
The small lines we see periodically close to `y=0` are the transactions that
ephemeral nodes start processing when they are caught up.
Over a period of 5229 minutes, the total goes from 0 to 387697 transactions,
resulting in 4449 transactions per minute. We can see some abrupt changes in
the plot's gradient. This will need to be investigated.
#### Peers
The plot below shows the evolution in peers throughout the experiment.
The periodic changes observed are due to the ephemeral nodes being stopped,
wiped out, and recreated.
![rotating-peers](./img/v034_rotating_peers.png)
The validators' plots are concentrated at the higher part of the graph, whereas the ephemeral nodes
are mostly at the lower part.
#### Memory Resident Set Size
The average Resident Set Size (RSS) over all processes seems stable, and slightly growing toward the end.
This might be related to the increased in transaction load observed above.
![rotating-rss-avg](./img/v034_rotating_rss_avg.png)
The memory taken by the validators and the ephemeral nodes (when they are up) is comparable.
#### CPU utilization
The plot shows metric `load1` for all nodes.
![rotating-load1](./img/v034_rotating_load1.png)
It is contained under 5 most of the time, which is considered normal load.
The purple line, which follows a different pattern is the validator receiving all
transactions, via RPC, from the load runner process.
### Test Result
**Result: N/A**
Date: 2022-10-10
Version: a28c987f5a604ff66b515dd415270063e6fb069d

Binary file not shown.

After

Width:  |  Height:  |  Size: 42 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 34 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 35 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 378 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 150 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 759 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 2.4 MiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 192 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 130 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.0 MiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 926 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 157 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 534 KiB

View File

@@ -0,0 +1,52 @@
Experiment ID: 3d5cf4ef-1a1a-4b46-aa2d-da5643d2e81e │Experiment ID: 80e472ec-13a1-4772-a827-3b0c907fb51d │Experiment ID: 07aca6cf-c5a4-4696-988f-e3270fc6333b
│ │
Connections: 1 │ Connections: 2 │ Connections: 4
Rate: 25 │ Rate: 25 │ Rate: 25
Size: 1024 │ Size: 1024 │ Size: 1024
│ │
Total Valid Tx: 2225 │ Total Valid Tx: 4450 │ Total Valid Tx: 8900
Total Negative Latencies: 0 │ Total Negative Latencies: 0 │ Total Negative Latencies: 0
Minimum Latency: 599.404362ms │ Minimum Latency: 448.145181ms │ Minimum Latency: 412.485729ms
Maximum Latency: 3.539686885s │ Maximum Latency: 3.237392049s │ Maximum Latency: 12.026665368s
Average Latency: 1.441485349s │ Average Latency: 1.441267946s │ Average Latency: 2.150192457s
Standard Deviation: 541.049869ms │ Standard Deviation: 525.040007ms │ Standard Deviation: 2.233852478s
│ │
Experiment ID: 953dc544-dd40-40e8-8712-20c34c3ce45e │Experiment ID: d31fc258-16e7-45cd-9dc8-13ab87bc0b0a │Experiment ID: 15d90a7e-b941-42f4-b411-2f15f857739e
│ │
Connections: 1 │ Connections: 2 │ Connections: 4
Rate: 50 │ Rate: 50 │ Rate: 50
Size: 1024 │ Size: 1024 │ Size: 1024
│ │
Total Valid Tx: 4450 │ Total Valid Tx: 8900 │ Total Valid Tx: 17800
Total Negative Latencies: 0 │ Total Negative Latencies: 0 │ Total Negative Latencies: 0
Minimum Latency: 482.046942ms │ Minimum Latency: 435.458913ms │ Minimum Latency: 510.746448ms
Maximum Latency: 3.761483455s │ Maximum Latency: 7.175583584s │ Maximum Latency: 6.551497882s
Average Latency: 1.450408183s │ Average Latency: 1.681673116s │ Average Latency: 1.738083875s
Standard Deviation: 587.560056ms │ Standard Deviation: 1.147902047s │ Standard Deviation: 943.46522ms
│ │
Experiment ID: 9a0b9980-9ce6-4db5-a80a-65ca70294b87 │Experiment ID: df8fa4f4-80af-4ded-8a28-356d15018b43 │Experiment ID: d0e41c2c-89c0-4f38-8e34-ca07adae593a
│ │
Connections: 1 │ Connections: 2 │ Connections: 4
Rate: 100 │ Rate: 100 │ Rate: 100
Size: 1024 │ Size: 1024 │ Size: 1024
│ │
Total Valid Tx: 8900 │ Total Valid Tx: 17800 │ Total Valid Tx: 35600
Total Negative Latencies: 0 │ Total Negative Latencies: 0 │ Total Negative Latencies: 0
Minimum Latency: 477.417219ms │ Minimum Latency: 564.29247ms │ Minimum Latency: 840.71089ms
Maximum Latency: 6.63744785s │ Maximum Latency: 6.988553219s │ Maximum Latency: 9.555312398s
Average Latency: 1.561216103s │ Average Latency: 1.76419063s │ Average Latency: 3.200941683s
Standard Deviation: 1.011333552s │ Standard Deviation: 1.068459423s │ Standard Deviation: 1.732346601s
│ │
Experiment ID: 493df3ee-4a36-4bce-80f8-6d65da66beda │Experiment ID: 13060525-f04f-46f6-8ade-286684b2fe50 │Experiment ID: 1777cbd2-8c96-42e4-9ec7-9b21f2225e4d
│ │
Connections: 1 │ Connections: 2 │ Connections: 4
Rate: 200 │ Rate: 200 │ Rate: 200
Size: 1024 │ Size: 1024 │ Size: 1024
│ │
Total Valid Tx: 17800 │ Total Valid Tx: 35600 │ Total Valid Tx: 38660
Total Negative Latencies: 0 │ Total Negative Latencies: 0 │ Total Negative Latencies: 0
Minimum Latency: 493.705261ms │ Minimum Latency: 955.090573ms │ Minimum Latency: 1.9485821s
Maximum Latency: 7.440921872s │ Maximum Latency: 10.086673491s │ Maximum Latency: 17.73103976s
Average Latency: 1.875510582s │ Average Latency: 3.438130099s │ Average Latency: 8.143862237s
Standard Deviation: 1.304336995s │ Standard Deviation: 1.966391574s │ Standard Deviation: 3.943140002s

Binary file not shown.

After

Width:  |  Height:  |  Size: 157 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 140 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 22 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 22 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.5 MiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 486 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 193 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 197 KiB