* QA Process report for v0.37.x (and baseline for v0.34.x) (#9499)
* 1st version. 200 nodes. Missing rotating node
* Small fixes
* Addressed @jmalicevic's comment
* Explain in method how to set the Tendermint version to test. Improve result section
* 1st version of how to run the 'rotating node' testnet
* Apply suggestions from @williambanfield
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
* Addressed @williambanfield's comments
* Added reference to Unix load metric
* Added total TXs
* Fixed some 'png's that got swapped. Excluded '.*-node-exporter' processes from memory plots
* Report for rotating node
* Addressed remaining comments from @williambanfield
* Cosmetic
* Addressed some of @thanethomson's comments
* Re-executed the 200 node tests and updated the corresponding sections of the report
* Ignore Python virtualenv directories
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Add latency vs throughput script
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Add README for latency vs throughput script
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Fix local links to folders
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* v034: only have one level-1 heading
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Adjust headings
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* v0.37.x: add links to issues/PRs
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* v0.37.x: add note about bug being present in v0.34
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* method: adjust heading depths
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Show data points on latency vs throughput plot
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Add latency vs throughput plots
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Correct mentioning of v0.34.21 and add heading
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Refactor latency vs throughput script
Update the latency vs throughput script to rather generate plots from
the "raw" CSV output from the loadtime reporting tool as opposed to the
separated CSV files from the experimental method.
Also update the relevant documentation, and regenerate the images from
the raw CSV data (resulting in pretty much the same plots as the
previous ones).
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Remove unused default duration const
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Adjust experiment start time to be more accurate and re-plot latency vs throughput
Signed-off-by: Thane Thomson <connect@thanethomson.com>
* Addressed @williambanfield's comments
* Apply suggestions from code review
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
* scripts: Update latency vs throughput readme for clarity
Signed-off-by: Thane Thomson <connect@thanethomson.com>
Signed-off-by: Thane Thomson <connect@thanethomson.com>
Co-authored-by: William Banfield <4561443+williambanfield@users.noreply.github.com>
Co-authored-by: Thane Thomson <connect@thanethomson.com>
(cherry picked from commit b06e1cea54)
* Remove v037 dir
* Removed reference to v0.37 testnets
Co-authored-by: Sergio Mena <sergio@informal.systems>

`.gitignore` (+2)

```diff
@@ -65,3 +65,5 @@ test/fuzz/**/*.zip
 *.pdf
 *.gz
 *.dvi
+# Python virtual environments
+.venv
```

`docs/qa/README.md` (new file, +22)

---
order: 1
parent:
  title: Tendermint Quality Assurance
  description: This is a report on the process followed and results obtained when running v0.34.x on testnets
  order: 2
---

# Tendermint Quality Assurance

This directory keeps track of the process followed by the Tendermint Core team
for Quality Assurance before cutting a release.
This directory lives in multiple branches. On each release branch,
the contents of this directory reflect the status of the process
at the time the Quality Assurance process was applied for that release.

The file [method](./method.md) documents the process followed to obtain the results
used to decide whether a release passes the Quality Assurance process.
The results obtained for each release are stored in their own directory.
The following releases have undergone the Quality Assurance process:

* [v0.34.x](./v034/), which was tested just before releasing v0.34.22

`docs/qa/method.md` (new file, +214)

---
order: 1
title: Method
---

# Method

This document provides a detailed description of the QA process.
It is intended to be used by engineers reproducing the experimental setup for future tests of Tendermint.

The (first iteration of the) QA process as described [in the RELEASES.md document][releases]
was applied to version v0.34.x in order to have a set of results acting as benchmarking baseline.
This baseline is then compared with results obtained in later versions.

Out of the testnet-based test cases described in [the releases document][releases], we focused on two:
the _200 Node Test_ and the _Rotating Nodes Test_.

[releases]: https://github.com/tendermint/tendermint/blob/v0.37.x/RELEASES.md#large-scale-testnets

## Software Dependencies

### Infrastructure Requirements to Run the Tests

* An account at Digital Ocean (DO), with a high droplet limit (>202)
* The machine to orchestrate the tests should have the following installed:
    * A clone of the [testnet repository][testnet-repo]
        * This repository contains all the scripts mentioned in the remainder of this section
    * [Digital Ocean CLI][doctl]
    * [Terraform CLI][Terraform]
    * [Ansible CLI][Ansible]

[testnet-repo]: https://github.com/interchainio/tendermint-testnet
[Ansible]: https://docs.ansible.com/ansible/latest/index.html
[Terraform]: https://www.terraform.io/docs
[doctl]: https://docs.digitalocean.com/reference/doctl/how-to/install/
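
Before provisioning anything, it is worth confirming that the orchestration tooling is actually installed and on the `PATH`. A minimal sanity check, using the standard CLI names:

```bash
# Verify that the orchestration tooling is installed, and report versions
doctl version
terraform version
ansible --version
```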

### Requirements for Result Extraction

* Matlab or Octave
* [Prometheus][prometheus] server installed
* blockstore DB of one of the full nodes in the testnet
* Prometheus DB

[prometheus]: https://prometheus.io/

## 200 Node Testnet

### Running the test

This section explains how the tests were carried out, for reproducibility purposes.

1. [If you haven't done it before]
   Follow steps 1-4 of the `README.md` at the top of the testnet repository to configure Terraform and `doctl`.
2. Copy file `testnets/testnet200.toml` onto `testnet.toml` (do NOT commit this change).
3. Set the variable `VERSION_TAG` in the `Makefile` to the git hash that is to be tested.
4. Follow steps 5-10 of the `README.md` to configure and start the 200 node testnet.
    * WARNING: Do NOT forget to run `make terraform-destroy` as soon as you are done with the tests (see step 9).
5. As a sanity check, connect to the Prometheus node's web interface and check the graph for the `tendermint_consensus_height` metric:
   all nodes should be increasing their heights (a scriptable version of this check is sketched after this list).
6. `ssh` into the `testnet-load-runner` node, copy script `script/200-node-loadscript.sh` to it, and run it from there.
    * Before running it, you need to edit the script to provide the IP address of a full node.
      This node will receive all transactions from the load runner node.
    * The script takes about 40 minutes to run.
    * It runs 90-second-long experiments in a loop with different loads.
7. Run `make retrieve-data` to gather all relevant data from the testnet into the orchestrating machine.
8. Verify that the data was collected without errors:
    * at least one blockstore DB for a Tendermint validator
    * the Prometheus database from the Prometheus node
    * for extra care, you can run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s)
9. **Run `make terraform-destroy`**
    * Don't forget to type `yes`! Otherwise you're in trouble.
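
A scriptable variant of the sanity check in step 5 queries Prometheus's HTTP API directly. A sketch, where `$PROMETHEUS_IP` is a placeholder for the Prometheus node's address:

```bash
# Fetch the current consensus height of every node via the Prometheus HTTP API.
# Run it twice, a few seconds apart: all reported heights should be increasing.
curl -s "http://$PROMETHEUS_IP:9090/api/v1/query?query=tendermint_consensus_height"
```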

### Result Extraction

The method for extracting the results described here is highly manual (and exploratory) at this stage.
The Core team should improve it at every iteration to increase the amount of automation.

#### Steps

1. Unzip the blockstore into a directory.
2. Extract the latency report and the raw latencies for all the experiments. Run these commands from the directory containing the blockstore:
    * `go run github.com/tendermint/tendermint/test/loadtime/cmd/report@3ec6e424d --database-type goleveldb --data-dir ./ > results/report.txt`
    * `go run github.com/tendermint/tendermint/test/loadtime/cmd/report@3ec6e424d --database-type goleveldb --data-dir ./ --csv results/raw.csv`
3. File `report.txt` contains an unordered list of experiments with varying concurrent connections and transaction rate.
    * Create files `report01.txt`, `report02.txt`, and `report04.txt` and, for each experiment in file `report.txt`,
      copy its related lines to the file whose name matches the number of connections.
    * Sort the experiments in `report01.txt` in ascending tx rate order. Likewise for `report02.txt` and `report04.txt`.
4. Generate file `report_tabbed.txt` by displaying the contents of `report01.txt`, `report02.txt`, and `report04.txt` side by side (one way to do this is sketched below).
    * This effectively creates a table where rows are a particular tx rate and columns are a particular number of websocket connections.
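
A possible way to produce the side-by-side file, assuming the three sorted report files are in the current directory, is the `paste` utility:

```bash
# Join the three per-connection reports column-wise (tab-separated)
paste report01.txt report02.txt report04.txt > report_tabbed.txt
```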

5. Extract the raw latencies from file `raw.csv` using the following bash loop. This creates a `.csv` file and a `.dat` file per experiment.
   The format of the `.dat` files is amenable to loading them as matrices in Octave.

```bash
uuids=($(cat report01.txt report02.txt report04.txt | grep '^Experiment ID: ' | awk '{ print $3 }'))
c=0  # bash arrays are zero-indexed, so start at 0 to cover all 12 experiments
for i in 01 02 04; do
  for j in 0025 0050 0100 0200; do
    echo $i $j $c "${uuids[$c]}"
    filename=c${i}_r${j}
    grep ${uuids[$c]} raw.csv > ${filename}.csv
    cat ${filename}.csv | tr , ' ' | awk '{ print $2, $3 }' > ${filename}.dat
    c=$(expr $c + 1)
  done
done
```

6. Enter Octave.
7. Load all `.dat` files generated in step 5 into matrices using this Octave code snippet.
   Each file is loaded into a matrix variable named after the file (e.g., `c01_r0025`).

```octave
conns = { "01"; "02"; "04" };
rates = { "0025"; "0050"; "0100"; "0200" };
for i = 1:length(conns)
  for j = 1:length(rates)
    filename = strcat("c", conns{i}, "_r", rates{j}, ".dat");
    load("-ascii", filename);
  endfor
endfor
```

8. Set variable `release` to the current release undergoing QA:

```octave
release = "v0.34.x";
```

9. Generate a plot with all (or some) experiments, where the X axis is the experiment time
   and the Y axis is the latency of transactions.
   The following snippet plots all experiments.

```octave
legends = {};
hold off;
for i = 1:length(conns)
  for j = 1:length(rates)
    data_name = strcat("c", conns{i}, "_r", rates{j});
    l = strcat("c=", conns{i}, " r=", rates{j});
    m = eval(data_name); plot((m(:,1) - min(m(:,1))) / 1e+9, m(:,2) / 1e+9, ".");
    hold on;
    legends(1, end+1) = l;
  endfor
endfor
legend(legends, "location", "northeastoutside");
xlabel("experiment time (s)");
ylabel("latency (s)");
t = sprintf("200-node testnet - %s", release);
title(t);
```

10. Consider adjusting the axes, for instance if you want to compare your results against the baseline:

```octave
axis([0, 100, 0, 30], "tic");
```

11. Use Octave's GUI menu to save the plot (e.g. as `.png`).

12. Repeat steps 9 and 10 to obtain as many plots as deemed necessary.

13. To generate a latency vs throughput plot, using the raw CSV file generated
    in step 2, follow the instructions for the [`latency_throughput.py`] script.

[`latency_throughput.py`]: ../../scripts/qa/reporting/README.md

#### Extracting Prometheus Metrics

1. Stop the Prometheus server if it is running as a service (e.g. a `systemd` unit).
2. Unzip the Prometheus database retrieved from the testnet, and move it to replace the
   local Prometheus database.
3. Start the Prometheus server and make sure no error logs appear at startup.
4. Introduce the metrics you want to gather or plot into Prometheus's query interface (a command-line sketch follows this list).
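
Once the server is up, metrics can be explored in the web UI or queried from the command line. A minimal sketch, assuming the testnet database was unzipped into `./data` and that Tendermint's default metric names are in use:

```bash
# Start Prometheus against the restored testnet database
prometheus --storage.tsdb.path=./data &

# Query the average mempool size over all nodes. Since the data is historical,
# pass an explicit evaluation timestamp (Unix seconds, within the test window).
curl -s "http://localhost:9090/api/v1/query?query=avg(tendermint_mempool_size)&time=$TIMESTAMP"
```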

## Rotating Node Testnet

### Running the test

This section explains how the tests were carried out, for reproducibility purposes.

1. [If you haven't done it before]
   Follow steps 1-4 of the `README.md` at the top of the testnet repository to configure Terraform and `doctl`.
2. Copy file `testnet_rotating.toml` onto `testnet.toml` (do NOT commit this change).
3. Set variable `VERSION_TAG` to the git hash that is to be tested.
4. Run `make terraform-apply EPHEMERAL_SIZE=25`
    * WARNING: Do NOT forget to run `make terraform-destroy` as soon as you are done with the tests.
5. Follow steps 6-10 of the `README.md` to configure and start the "stable" part of the rotating node testnet.
6. As a sanity check, connect to the Prometheus node's web interface and check the graph for the `tendermint_consensus_height` metric:
   all nodes should be increasing their heights.
7. On a different shell,
    * run `make runload ROTATE_CONNECTIONS=X ROTATE_TX_RATE=Y`
    * `X` and `Y` should reflect a load below the saturation point (see, e.g.,
      [this paragraph](./v034/README.md#finding-the-saturation-point) for further info)
8. Run `make rotate` to start the script that creates the ephemeral nodes, and kills them when they are caught up.
    * WARNING: If you run this command from your laptop, the laptop needs to be up and connected for the full length
      of the experiment.
9. When the height of the chain reaches 3000, stop the `make runload` script.
10. When the rotate script has made two iterations (i.e., all ephemeral nodes have caught up twice)
    after height 3000 was reached, stop `make rotate`.
11. Run `make retrieve-data` to gather all relevant data from the testnet into the orchestrating machine.
12. Verify that the data was collected without errors:
    * at least one blockstore DB for a Tendermint validator
    * the Prometheus database from the Prometheus node
    * for extra care, you can run `zip -T` on the `prometheus.zip` file and (one of) the `blockstore.db.zip` file(s)
13. **Run `make terraform-destroy`**

Steps 8 to 10 are highly manual at the moment and will be improved in future iterations.

### Result Extraction

In order to obtain a latency plot, follow the instructions above for the 200 node experiment, but:

* The `report.txt` file contains only one experiment
* Therefore, there is no need for any `for` loops

As for Prometheus, the same method as for the 200 node experiment can be applied. A minimal single-experiment plot is sketched below.
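
For the single rotating-node experiment, a minimal Octave sketch (assuming the raw latencies were extracted into a file `rotating.dat`, a hypothetical name, with the same two-column format as in step 5 of the 200 node instructions):

```octave
% Octave names the loaded matrix after the file, i.e., "rotating"
load("-ascii", "rotating.dat");
m = rotating;
plot((m(:,1) - min(m(:,1))) / 1e+9, m(:,2) / 1e+9, ".");
xlabel("experiment time (s)");
ylabel("latency (s)");
title("rotating node testnet - v0.34.x");
```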

`docs/qa/v034/README.md` (new file, +278)

---
order: 1
parent:
  title: Tendermint Quality Assurance Results for v0.34.x
  description: This is a report on the results obtained when running v0.34.x on testnets
  order: 2
---

# v0.34.x

## 200 Node Testnet

### Finding the Saturation Point

The first goal when examining the results of the tests is identifying the saturation point.
The saturation point is a setup with a transaction load big enough to prevent the testnet
from being stable: the load runner tries to produce slightly more transactions than can
be processed by the testnet.

The following table summarizes the results for v0.34.x for the different experiments
(extracted from file [`v034_report_tabbed.txt`](./img/v034_report_tabbed.txt)).

The X axis of this table is `c`, the number of connections created by the load runner process to the target node.
The Y axis of this table is `r`, the rate or number of transactions issued per second.

|       |   c=1 |   c=2 |   c=4 |
| :---- | ----: | ----: | ----: |
| r=25  |  2225 |  4450 |  8900 |
| r=50  |  4450 |  8900 | 17800 |
| r=100 |  8900 | 17800 | 35600 |
| r=200 | 17800 | 35600 | 38660 |

The table shows the number of 1024-byte-long transactions that were produced by the load runner,
and processed by Tendermint, during the 90 seconds of the experiment's duration.
Each cell in the table refers to an experiment with a particular number of websocket connections (`c`)
to a chosen validator, and the number of transactions per second that the load runner
tries to produce (`r`). Note that the overall load that the tool attempts to generate is $c \cdot r$.
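
As a quick sanity check of the table: the `r=25,c=1` cell reports $2225 = 25 \cdot 1 \cdot 89$ transactions, i.e., the full target load for the 89 seconds of effective experiment duration (the last batch never gets sent, as explained below).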

We can see that the saturation point is beyond the diagonal that spans cells

* `r=200,c=2`
* `r=100,c=4`

given that the total transactions should be close to the product of the rate, the number of connections,
and the experiment time (89 seconds, since the last batch never gets sent).

All experiments beyond the saturation diagonal (here, only `r=200,c=4`) have in common that the total
number of transactions processed is noticeably less than the product $c \cdot r \cdot 89$,
which is the expected number of transactions when the system is able to deal well with the
load.
With `r=200,c=4`, we obtained 38660 whereas the theoretical number of transactions should
have been $200 \cdot 4 \cdot 89 = 71200$.

At this point, we chose an experiment at the limit of the saturation diagonal,
in order to further study the performance of this release.
**The chosen experiment is `r=200,c=2`**.

This is a plot of the CPU load (average over 1 minute, as output by `top`) of the load runner for `r=200,c=2`,
where we can see that the load stays close to 0 most of the time.

![load-load-runner](./img/v034_r200c2_load-runner.png)

### Examining latencies

The method described [here](../method.md) allows us to plot the latencies of transactions
for all experiments.

![all-latencies](./img/v034_200node_latencies.png)

As we can see, even the experiments beyond the saturation diagonal managed to keep
transaction latency stable (i.e. not constantly increasing).
Our interpretation for this is that contention within Tendermint was propagated,
via the websockets, to the load runner,
hence the load runner could not produce the target load, but only a fraction of it.

Further examination of the Prometheus data (see below) showed that the mempool contained many transactions
at steady state, but did not grow much without quickly returning to this steady state. This demonstrates
that the transactions were processed by the Tendermint network at least as quickly as they
were submitted to the mempool. Finally, the test script made sure that, at the end of an experiment, the
mempool was empty, so that all transactions submitted to the chain were processed.

Finally, the number of points present in the plot appears to be much smaller than expected given the
number of transactions in each experiment, particularly close to or above the saturation diagonal.
This is a visual effect of the plot; what appear to be single points are actually potentially huge
clusters of points. To corroborate this, we have zoomed into the plot above by setting (carefully chosen)
tiny axis intervals. The cluster shown below looks like a single point in the plot above.

![all-latencies-zoomed](./img/v034_200node_latencies_zoomed.png)

The plot of latencies can be used as a baseline to compare with other releases.

The following plot summarizes average latencies versus overall throughputs
across different numbers of WebSocket connections to the node into which
transactions are being loaded.

![latency-vs-throughput](./img/v034_latency_throughput.png)

### Prometheus Metrics on the Chosen Experiment

As mentioned [above](#finding-the-saturation-point), the chosen experiment is `r=200,c=2`.
This section further examines key metrics for this experiment, extracted from Prometheus data.

#### Mempool Size

The mempool size, a count of the number of transactions in the mempool, was shown to be stable and homogeneous
at all full nodes. It did not exhibit any unconstrained growth.
The plot below shows the evolution over time of the cumulative number of transactions inside all full nodes' mempools
at a given time.
The two spikes that can be observed correspond to a period where consensus instances proceeded beyond the initial round
at some nodes.

![mempool-cumulative](./img/v034_r200c2_mempool_size.png)

The plot below shows the evolution of the average over all full nodes, which oscillates between 1500 and 2000
outstanding transactions.

![mempool-avg](./img/v034_r200c2_mempool_size_avg.png)

The peaks observed coincide with the moments when some nodes proceeded beyond the initial round of consensus (see below).

#### Peers

The number of peers was stable at all nodes.
It was higher for the seed nodes (around 140) than for the rest (between 21 and 74).
The fact that non-seed nodes reach more than 50 peers is due to #9548.

![peers](./img/v034_r200c2_peers.png)

#### Consensus Rounds per Height

Most heights took just one round (round 0), but some nodes needed to advance to round 1 at some point.

![rounds](./img/v034_r200c2_rounds.png)

#### Blocks Produced per Minute, Transactions Processed per Minute

The blocks produced per minute are the slope of this plot.

![heights](./img/v034_r200c2_heights.png)

Over a period of 2 minutes, the height goes from 530 to 569.
This results in an average of 19.5 blocks produced per minute.
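
The same figure can also be computed directly in Prometheus rather than reading the slope off the plot; a sketch of the PromQL expression, assuming the default `tendermint_consensus_height` metric name used earlier:

```
# Blocks produced per minute, averaged over a 2-minute window
rate(tendermint_consensus_height[2m]) * 60
```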

The transactions processed per minute are the slope of this plot.

![total-txs](./img/v034_r200c2_total-txs.png)

Over a period of 2 minutes, the total goes from 64525 to 100125 transactions,
resulting in 17800 transactions per minute. However, we can see in the plot that
all transactions in the load are processed long before the two minutes elapse.
If we adjust the time window to when transactions are actually processed (approx. 105 seconds),
we obtain 20343 transactions per minute ($35600 \cdot 60 / 105 \approx 20343$).

#### Memory Resident Set Size

The Resident Set Size of all monitored processes is plotted below.

![rss](./img/v034_r200c2_rss.png)

The average over all processes oscillates around 1.2 GiB and does not demonstrate unconstrained growth.

![rss-avg](./img/v034_r200c2_rss_avg.png)

#### CPU utilization

The best metric from Prometheus to gauge CPU utilization in a Unix machine is `load1`,
as it usually appears in the
[output of `top`](https://www.digitalocean.com/community/tutorials/load-average-in-linux).

![load1](./img/v034_r200c2_load1.png)

It stays below 5 in most cases, which is generally considered acceptable load.

### Test Result

**Result: N/A** (v0.34.x is the baseline)

Date: 2022-10-14

Version: 3ec6e424d6ae4c96867c2dcf8310572156068bb6

## Rotating Node Testnet

For this testnet, we will use a load that can safely be considered below the saturation
point for the size of this testnet (between 13 and 38 full nodes): `c=4,r=800`.

N.B.: The version of Tendermint used for these tests is affected by #9539.
However, the reduced load that reaches the mempools is orthogonal to the functionality
we are focusing on here.

### Latencies

All latencies are shown in the following plot.

![rotating-all-latencies](./img/v034_rotating_latencies.png)

We can observe some very high latencies towards the end of the test.
Suspecting that they were duplicate transactions, we examined the raw latencies
file and discovered more than 100K duplicate transactions.

The following plot shows the latencies with all duplicate transactions
removed, i.e., only the first occurrence of each duplicated transaction is kept.

![rotating-all-latencies-uniq](./img/v034_rotating_latencies_uniq.png)

This problem, which exists in `v0.34.x`, will need to be addressed, perhaps in the same way
we addressed it when running the 200 node test with high loads: increasing the `cache_size`
configuration parameter.

### Prometheus Metrics

The set of metrics shown here is smaller than for the 200 node experiment.
We are only interested in those that the catch-up process (blocksync) may impact.

#### Blocks and Transactions per minute

Just as shown for the 200 node test, the blocks produced per minute are the gradient of this plot.

![rotating-heights](./img/v034_rotating_heights.png)

Over a period of 5229 seconds, the height goes from 2 to 3638.
This results in an average of 41 blocks produced per minute.

The following plot shows only the heights reported by ephemeral nodes
(which are also included in the plot above). Note that the _height_ metric
is only shown _once the node has switched to consensus_, hence the gaps
when nodes are killed, wiped out, started from scratch, and catching up.

![rotating-heights-ephe](./img/v034_rotating_heights_ephe.png)

The transactions processed per minute are the gradient of this plot.

![rotating-total-txs](./img/v034_rotating_total-txs.png)

The small lines we see periodically close to `y=0` are the transactions that
ephemeral nodes start processing when they are caught up.

Over a period of 5229 seconds, the total goes from 0 to 387697 transactions,
resulting in 4449 transactions per minute. We can see some abrupt changes in
the plot's gradient. This will need to be investigated.

#### Peers

The plot below shows the evolution of peers throughout the experiment.
The periodic changes observed are due to the ephemeral nodes being stopped,
wiped out, and recreated.

![rotating-peers](./img/v034_rotating_peers.png)

The validators' plots are concentrated in the upper part of the graph, whereas the ephemeral nodes'
are mostly in the lower part.

#### Memory Resident Set Size

The average Resident Set Size (RSS) over all processes seems stable, slightly growing toward the end.
This might be related to the increase in transaction load observed above.

![rotating-rss-avg](./img/v034_rotating_rss_avg.png)

The memory taken by the validators and the ephemeral nodes (when they are up) is comparable.

#### CPU utilization

The plot shows metric `load1` for all nodes.

![rotating-load1](./img/v034_rotating_load1.png)

It is contained under 5 most of the time, which is considered normal load.
The purple line, which follows a different pattern, is the validator receiving all
transactions, via RPC, from the load runner process.

### Test Result

**Result: N/A**

Date: 2022-10-10

Version: a28c987f5a604ff66b515dd415270063e6fb069d

New binary files under `docs/qa/v034/img/` (images referenced by the report above):

* `v034_200node_latencies.png` (42 KiB)
* `v034_200node_latencies_zoomed.png` (34 KiB)
* `v034_latency_throughput.png` (35 KiB)
* `v034_r200c2_heights.png` (378 KiB)
* `v034_r200c2_load-runner.png` (150 KiB)
* `v034_r200c2_load1.png` (759 KiB)
* `v034_r200c2_mempool_size.png` (2.4 MiB)
* `v034_r200c2_mempool_size_avg.png` (192 KiB)
* `v034_r200c2_peers.png` (130 KiB)
* `v034_r200c2_rounds.png` (1.0 MiB)
* `v034_r200c2_rss.png` (926 KiB)
* `v034_r200c2_rss_avg.png` (157 KiB)
* `v034_r200c2_total-txs.png` (534 KiB)

`docs/qa/v034/img/v034_report_tabbed.txt` (new file, +52)

```
Experiment ID: 3d5cf4ef-1a1a-4b46-aa2d-da5643d2e81e │Experiment ID: 80e472ec-13a1-4772-a827-3b0c907fb51d │Experiment ID: 07aca6cf-c5a4-4696-988f-e3270fc6333b
│ │
Connections: 1 │ Connections: 2 │ Connections: 4
Rate: 25 │ Rate: 25 │ Rate: 25
Size: 1024 │ Size: 1024 │ Size: 1024
│ │
Total Valid Tx: 2225 │ Total Valid Tx: 4450 │ Total Valid Tx: 8900
Total Negative Latencies: 0 │ Total Negative Latencies: 0 │ Total Negative Latencies: 0
Minimum Latency: 599.404362ms │ Minimum Latency: 448.145181ms │ Minimum Latency: 412.485729ms
Maximum Latency: 3.539686885s │ Maximum Latency: 3.237392049s │ Maximum Latency: 12.026665368s
Average Latency: 1.441485349s │ Average Latency: 1.441267946s │ Average Latency: 2.150192457s
Standard Deviation: 541.049869ms │ Standard Deviation: 525.040007ms │ Standard Deviation: 2.233852478s
│ │
Experiment ID: 953dc544-dd40-40e8-8712-20c34c3ce45e │Experiment ID: d31fc258-16e7-45cd-9dc8-13ab87bc0b0a │Experiment ID: 15d90a7e-b941-42f4-b411-2f15f857739e
│ │
Connections: 1 │ Connections: 2 │ Connections: 4
Rate: 50 │ Rate: 50 │ Rate: 50
Size: 1024 │ Size: 1024 │ Size: 1024
│ │
Total Valid Tx: 4450 │ Total Valid Tx: 8900 │ Total Valid Tx: 17800
Total Negative Latencies: 0 │ Total Negative Latencies: 0 │ Total Negative Latencies: 0
Minimum Latency: 482.046942ms │ Minimum Latency: 435.458913ms │ Minimum Latency: 510.746448ms
Maximum Latency: 3.761483455s │ Maximum Latency: 7.175583584s │ Maximum Latency: 6.551497882s
Average Latency: 1.450408183s │ Average Latency: 1.681673116s │ Average Latency: 1.738083875s
Standard Deviation: 587.560056ms │ Standard Deviation: 1.147902047s │ Standard Deviation: 943.46522ms
│ │
Experiment ID: 9a0b9980-9ce6-4db5-a80a-65ca70294b87 │Experiment ID: df8fa4f4-80af-4ded-8a28-356d15018b43 │Experiment ID: d0e41c2c-89c0-4f38-8e34-ca07adae593a
│ │
Connections: 1 │ Connections: 2 │ Connections: 4
Rate: 100 │ Rate: 100 │ Rate: 100
Size: 1024 │ Size: 1024 │ Size: 1024
│ │
Total Valid Tx: 8900 │ Total Valid Tx: 17800 │ Total Valid Tx: 35600
Total Negative Latencies: 0 │ Total Negative Latencies: 0 │ Total Negative Latencies: 0
Minimum Latency: 477.417219ms │ Minimum Latency: 564.29247ms │ Minimum Latency: 840.71089ms
Maximum Latency: 6.63744785s │ Maximum Latency: 6.988553219s │ Maximum Latency: 9.555312398s
Average Latency: 1.561216103s │ Average Latency: 1.76419063s │ Average Latency: 3.200941683s
Standard Deviation: 1.011333552s │ Standard Deviation: 1.068459423s │ Standard Deviation: 1.732346601s
│ │
Experiment ID: 493df3ee-4a36-4bce-80f8-6d65da66beda │Experiment ID: 13060525-f04f-46f6-8ade-286684b2fe50 │Experiment ID: 1777cbd2-8c96-42e4-9ec7-9b21f2225e4d
│ │
Connections: 1 │ Connections: 2 │ Connections: 4
Rate: 200 │ Rate: 200 │ Rate: 200
Size: 1024 │ Size: 1024 │ Size: 1024
│ │
Total Valid Tx: 17800 │ Total Valid Tx: 35600 │ Total Valid Tx: 38660
Total Negative Latencies: 0 │ Total Negative Latencies: 0 │ Total Negative Latencies: 0
Minimum Latency: 493.705261ms │ Minimum Latency: 955.090573ms │ Minimum Latency: 1.9485821s
Maximum Latency: 7.440921872s │ Maximum Latency: 10.086673491s │ Maximum Latency: 17.73103976s
Average Latency: 1.875510582s │ Average Latency: 3.438130099s │ Average Latency: 8.143862237s
Standard Deviation: 1.304336995s │ Standard Deviation: 1.966391574s │ Standard Deviation: 3.943140002s
```

New binary files under `docs/qa/v034/img/` (images referenced by the rotating node report):

* `v034_rotating_heights.png` (157 KiB)
* `v034_rotating_heights_ephe.png` (140 KiB)
* `v034_rotating_latencies.png` (22 KiB)
* `v034_rotating_latencies_uniq.png` (22 KiB)
* `v034_rotating_load1.png` (1.5 MiB)
* `v034_rotating_peers.png` (486 KiB)
* `v034_rotating_rss_avg.png` (193 KiB)
* `v034_rotating_total-txs.png` (197 KiB)

`scripts/qa/reporting/README.md` (new file, +48)

# Reporting Scripts

This directory currently contains a single utility script used in QA reporting.

## Latency vs Throughput Plotting

[`latency_throughput.py`](./latency_throughput.py) is a Python script that uses
[matplotlib] to plot a graph of transaction latency vs throughput rate based on
the CSV output generated by the [loadtime reporting
tool](../../../test/loadtime/cmd/report/).

### Setup

Execute the following within this directory (the same directory as the
`latency_throughput.py` file).

```bash
# Create a virtual environment into which to install your dependencies
python3 -m venv .venv

# Activate the virtual environment
source .venv/bin/activate

# Install dependencies listed in requirements.txt
pip install -r requirements.txt

# Show usage instructions and parameters
./latency_throughput.py --help
```

### Running

```bash
# Do the following while ensuring that the virtual environment is activated
# (see the Setup steps).
#
# This will generate a plot in a PNG file called 'tm034.png' in the current
# directory based on the reporting tool CSV output in the "raw.csv" file. The
# '-t' flag overrides the default title at the top of the plot.

./latency_throughput.py \
  -t 'Tendermint v0.34.x Latency vs Throughput' \
  ./tm034.png \
  /path/to/csv/files/raw.csv
```

[matplotlib]: https://matplotlib.org/

`scripts/qa/reporting/latency_throughput.py` (new executable file, +170)

```python
#!/usr/bin/env python3
"""
A simple script to parse the CSV output from the loadtime reporting tool (see
https://github.com/tendermint/tendermint/tree/main/test/loadtime/cmd/report).

Produces a plot of average transaction latency vs total transaction throughput
according to the number of load testing tool WebSocket connections to the
Tendermint node.
"""

import argparse
import csv
import logging
import sys
import matplotlib.pyplot as plt
import numpy as np

DEFAULT_TITLE = "Tendermint latency vs throughput"


def main():
    parser = argparse.ArgumentParser(
        description="Renders a latency vs throughput diagram "
        "for a set of transactions provided by the loadtime reporting tool",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('-t',
                        '--title',
                        default=DEFAULT_TITLE,
                        help='Plot title')
    parser.add_argument('output_image',
                        help='Output image file (in PNG format)')
    parser.add_argument(
        'input_csv_file',
        nargs='+',
        help="CSV input file from which to read transaction data "
        "- must have been generated by the loadtime reporting tool")
    args = parser.parse_args()

    logging.basicConfig(format='%(levelname)s\t%(message)s',
                        stream=sys.stdout,
                        level=logging.INFO)
    plot_latency_vs_throughput(args.input_csv_file,
                               args.output_image,
                               title=args.title)


def plot_latency_vs_throughput(input_files, output_image, title=DEFAULT_TITLE):
    avg_latencies, throughput_rates = process_input_files(input_files)

    fig, ax = plt.subplots()

    connections = sorted(avg_latencies.keys())
    for c in connections:
        tr = np.array(throughput_rates[c])
        al = np.array(avg_latencies[c])
        label = '%d connection%s' % (c, '' if c == 1 else 's')
        ax.plot(tr, al, 'o-', label=label)

    ax.set_title(title)
    ax.set_xlabel('Throughput rate (tx/s)')
    ax.set_ylabel('Average transaction latency (s)')

    plt.legend(loc='upper left')
    plt.savefig(output_image)


def process_input_files(input_files):
    # Experimental data from which we will derive the latency vs throughput
    # statistics
    experiments = {}

    for input_file in input_files:
        logging.info('Reading %s...' % input_file)

        with open(input_file, 'rt') as inf:
            reader = csv.DictReader(inf)
            for tx in reader:
                experiments = process_tx(experiments, tx)

    return compute_experiments_stats(experiments)


def process_tx(experiments, tx):
    exp_id = tx['experiment_id']
    # Block time is nanoseconds from the epoch - convert to seconds
    block_time = float(tx['block_time']) / (10**9)
    # Duration is also in nanoseconds - convert to seconds
    duration = float(tx['duration_ns']) / (10**9)
    connections = int(tx['connections'])
    rate = int(tx['rate'])

    if exp_id not in experiments:
        experiments[exp_id] = {
            'connections': connections,
            'rate': rate,
            'block_time_min': block_time,
            # We keep track of the latency associated with the minimum block
            # time to estimate the start time of the experiment
            'block_time_min_duration': duration,
            'block_time_max': block_time,
            'total_latencies': duration,
            'tx_count': 1,
        }
        logging.info('Found experiment %s with rate=%d, connections=%d' %
                     (exp_id, rate, connections))
    else:
        # Validation
        for field in ['connections', 'rate']:
            val = int(tx[field])
            if val != experiments[exp_id][field]:
                raise Exception(
                    'Found multiple distinct values for field '
                    '"%s" for the same experiment (%s): %d and %d' %
                    (field, exp_id, val, experiments[exp_id][field]))

        if block_time < experiments[exp_id]['block_time_min']:
            experiments[exp_id]['block_time_min'] = block_time
            experiments[exp_id]['block_time_min_duration'] = duration
        if block_time > experiments[exp_id]['block_time_max']:
            experiments[exp_id]['block_time_max'] = block_time

        experiments[exp_id]['total_latencies'] += duration
        experiments[exp_id]['tx_count'] += 1

    return experiments


def compute_experiments_stats(experiments):
    """Compute average latency vs throughput rate statistics from the given
    experiments"""
    stats = {}

    # Compute average latency and throughput rate for each experiment
    for exp_id, exp in experiments.items():
        conns = exp['connections']
        avg_latency = exp['total_latencies'] / exp['tx_count']
        exp_start_time = exp['block_time_min'] - exp['block_time_min_duration']
        exp_duration = exp['block_time_max'] - exp_start_time
        throughput_rate = exp['tx_count'] / exp_duration
        if conns not in stats:
            stats[conns] = []

        stats[conns].append({
            'avg_latency': avg_latency,
            'throughput_rate': throughput_rate,
        })

    # Sort stats for each number of connections in order of increasing
    # throughput rate, and then extract average latencies and throughput rates
    # as separate data series.
    conns = sorted(stats.keys())
    avg_latencies = {}
    throughput_rates = {}
    for c in conns:
        stats[c] = sorted(stats[c], key=lambda s: s['throughput_rate'])
        avg_latencies[c] = []
        throughput_rates[c] = []
        for s in stats[c]:
            avg_latencies[c].append(s['avg_latency'])
            throughput_rates[c].append(s['throughput_rate'])
            logging.info('For %d connection(s): '
                         'throughput rate = %.6f tx/s\t'
                         'average latency = %.6fs' %
                         (c, s['throughput_rate'], s['avg_latency']))

    return (avg_latencies, throughput_rates)


if __name__ == "__main__":
    main()
```

`scripts/qa/reporting/requirements.txt` (new file, +11)

```
contourpy==1.0.5
cycler==0.11.0
fonttools==4.37.4
kiwisolver==1.4.4
matplotlib==3.6.1
numpy==1.23.4
packaging==21.3
Pillow==9.2.0
pyparsing==3.0.9
python-dateutil==2.8.2
six==1.16.0
```