This should address last night's failure. We've taken the perspective
of "the load generator shouldn't cause tests to fail" in recent
days/weeks, and I think this is just a next step along that line. The
e2e tests shouldn't test performance.
I included some comments indicating the ways that this isn't ideal (it
is perhaps not), and I think that if test networks could make
assertions about the required rate, that might be a cool future
improvement (and good, perhaps, for system benchmarking.)
This document attempts to capture and discuss some of the areas of Tendermint that seem to be cited as causing performance issue. I'm hoping to continue to gather feedback and input on this document to better understand what issues Tendermint performance may cause for our users.
The overall goal of this document is to allow the maintainers and community to get a better sense of these issues and to be more capably able to discuss them and weight trade-offs about any proposed performance-focused changes. This document does not aim to propose any performance improvements. It does suggest useful places for benchmarks and places where additional metrics would be useful for diagnosing and further understanding Tendermint performance.
Please comment with areas where my reasoning seems off or with additional areas that Tendermint performance may be causing user pain.
I think the `Sync` check covers our primary use case, and perhaps we
can turn this back on in the future after some kind of event-system
rewrite, or RPC rewrite that will avoid the serverside timeout.
These are mostly the timeouts that I think we're still hitting in CI.
At this point, the tests (on master) pass on my local machine (which is quite beefy) so I think this is just the first in (perhaps?) a sequence of changes that attempt to change timeouts and load patterns so that the tests pass in CI more reliably.
Communication in Tendermint among consensus nodes, applications, and operator
tools all use different message formats and transport mechanisms. In some
cases there are multiple options. Having all these options complicates both the
code and the developer experience, and hides bugs. To support a more robust,
trustworthy, and usable system, we should document which communication paths
are essential, which could be removed or reduced in scope, and what we can
improve for the most important use cases.
This document proposes a variety of possible improvements of varying size and
scope. Specific design proposals should get their own documentation.
The previous implemention of hybrid set testing, which was entirely my
own creation, was a bit peculiar, and I think this probably clears thins up.
The previous implementation had far fewer legacy nodes in hybrid
networks, *and* also for some reason that I can't quite explain,
caused a test case to fail.