The current LWT implementation uses at least three network round trips:
- first, execute the PAXOS prepare phase
- second, query the current value of the updated key
- third, propose the change to the participating replicas
(there is also a learn phase, but we don't wait for it to complete).
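For illustration, here is a rough Go-flavoured sketch of that sequence. The helper names (paxosPrepare, readCurrentValue, paxosPropose, conditionHolds) are hypothetical stand-ins for the coordinator's RPCs, which actually live in ScyllaDB's C++ code; the sketch only marks where the coordinator has to wait on the network.

    package lwt

    // Hypothetical stand-ins for the coordinator's prepare/read/propose RPCs;
    // they exist only so this sketch compiles.
    func paxosPrepare(key []byte) (ballot int64, err error)   { return 0, nil }
    func readCurrentValue(key []byte) (row []byte, err error) { return nil, nil }
    func paxosPropose(ballot int64, key, row []byte) error    { return nil }
    func conditionHolds(row, expected []byte) bool            { return true }

    // casBefore marks where the coordinator waits on the network in the
    // pre-patch flow: three round trips, plus a learn phase it does not wait for.
    func casBefore(key, expected, update []byte) (applied bool, err error) {
        // Round trip 1: PAXOS prepare phase.
        ballot, err := paxosPrepare(key)
        if err != nil {
            return false, err
        }
        // Round trip 2: read the current value of the updated key.
        current, err := readCurrentValue(key)
        if err != nil {
            return false, err
        }
        if !conditionHolds(current, expected) {
            return false, nil // condition not met: nothing is proposed
        }
        // Round trip 3: propose the change to the participating replicas.
        if err := paxosPropose(ballot, key, update); err != nil {
            return false, err
        }
        // The learn phase starts here, but the coordinator replies to the
        // client without waiting for it to complete.
        return true, nil
    }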
The idea behind the optimization implemented by this patch is simple:
piggyback the current value of the updated key on the prepare response
to eliminate one round trip.
To generate less network traffic, only the replica closest to the coordinator
sends the full data, while the other participating replicas send digests,
which are used to check data consistency.
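As a minimal sketch of that digest check, assuming each prepare reply carries either the full row (from the closest replica) or a digest of it (from the rest), the coordinator could validate the piggybacked read roughly as follows; the type and function names and the use of MD5 are illustrative, not taken from the actual implementation.

    package lwt

    import (
        "bytes"
        "crypto/md5"
    )

    // prepareReply is an illustrative shape of the piggybacked prepare response.
    type prepareReply struct {
        promised bool
        row      []byte // full row, set only by the replica closest to the coordinator
        digest   []byte // digest of the same row, set by the other replicas
    }

    // resolvePiggybackedRead returns the row if every digest matches the full
    // copy; on a mismatch the replicas are inconsistent and the coordinator
    // falls back to a separate read round trip.
    func resolvePiggybackedRead(replies []prepareReply) (row []byte, ok bool) {
        var digests [][]byte
        for _, r := range replies {
            if !r.promised {
                continue
            }
            if r.row != nil {
                row = r.row
            } else if r.digest != nil {
                digests = append(digests, r.digest)
            }
        }
        if row == nil {
            return nil, false // no full copy received
        }
        sum := md5.Sum(row)
        for _, d := range digests {
            if !bytes.Equal(d, sum[:]) {
                return nil, false // digest mismatch
            }
        }
        return row, true
    }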
Note: this patch changes the API of some RPC calls used by PAXOS, but
this should be okay as long as the feature is in an early development
stage and marked experimental.
To assess the impact of this optimization on LWT performance, I ran a
simple benchmark that starts a number of concurrent clients, each of
which updates its own key (the uncontended case) stored in a cluster of
three AWS i3.2xlarge nodes located in the same region (us-west-1), and
measures the aggregate bandwidth and latency. The test uses the
shard-aware gocql driver (a rough sketch of the client loop is appended
at the end of this message). Here are the results:
          latency 99% (ms)   bandwidth (rq/s)   timeouts (rq/s)
clients    before    after    before    after    before   after
      1         2        2       626      637         0       0
      5         4        3      2616     2843         0       0
     10         3        3      4493     4767         0       0
     50         7        7     10567    10833         0       0
    100        15       15     12265    12934         0       0
    200        48       30     13593    14317         0       0
    400       185       60     14796    15549         0       0
    600       290       94     14416    15669         0       0
    800       568      118     14077    15820         2       0
   1000       710      118     13088    15830         9       0
   2000      1388      232     13342    15658        85       0
   3000      1110      363     13282    15422       233       0
   4000      1735      454     13387    15385       329       0
That is, this optimization improves max LWT bandwidth by about 15%
and makes it possible to run 3-4x more clients while maintaining the
same level of system responsiveness.
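For reference, a client loop with the shape described in the benchmark (N concurrent clients, each issuing uncontended conditional updates against its own key) could be sketched with gocql roughly as follows. The host names, the ks.tbl schema (k int PRIMARY KEY, v int), and the counters are assumptions made for this sketch, latency percentiles are omitted, and this is not the actual benchmark code.

    package main

    import (
        "fmt"
        "log"
        "sync"
        "sync/atomic"
        "time"

        "github.com/gocql/gocql" // the shard-aware fork can be used as a drop-in replacement
    )

    const (
        clients  = 100
        duration = 60 * time.Second
    )

    func main() {
        cluster := gocql.NewCluster("node1", "node2", "node3") // assumed host names
        cluster.Consistency = gocql.Quorum
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        var requests, failed int64
        var wg sync.WaitGroup
        deadline := time.Now().Add(duration)

        for c := 0; c < clients; c++ {
            wg.Add(1)
            go func(key int) {
                defer wg.Done()
                // Each client owns its key, so the updates are uncontended.
                if err := session.Query(
                    `INSERT INTO ks.tbl (k, v) VALUES (?, ?)`, key, 0).Exec(); err != nil {
                    log.Println(err)
                    return
                }
                value := 0
                for time.Now().Before(deadline) {
                    // The conditional UPDATE is what takes the LWT (PAXOS) path.
                    var current int
                    applied, err := session.Query(
                        `UPDATE ks.tbl SET v = ? WHERE k = ? IF v = ?`,
                        value+1, key, value).ScanCAS(&current)
                    if err != nil {
                        atomic.AddInt64(&failed, 1) // timeouts end up here
                        continue
                    }
                    if applied {
                        value++
                    } else {
                        value = current // resync with the stored value
                    }
                    atomic.AddInt64(&requests, 1)
                }
            }(c)
        }
        wg.Wait()
        fmt.Printf("bandwidth: %.0f rq/s, errors: %d\n",
            float64(atomic.LoadInt64(&requests))/duration.Seconds(),
            atomic.LoadInt64(&failed))
    }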