The xen protocol works by filling positions in a circular ring. The
indexes become free to be used again when they are processed by the other side.
There is a problem, however: those indexes must be sequential, because all the
sides share is a produced / consumed index. But there are situations in which
we call get_index() - which produces an index X, but the .then() clause
schedules some other caller of send() to run in our place. That one, in turn,
can call get_index(), then create a packet with index X + 1 that will be put in
the ring before the packet with index X.
If the other end processes this packet very fast, it will respond saying "I
have processed packets up to X + 1". We will take that to mean X has been
processed as well, since it comes before X + 1, and when X is really
processed, chaos will ensue.
The solution is to have the semaphore just count how many free slots we
have in the ring. Once we guarantee that the current caller has space, we then
call get_index() inside the .then() clause. This works well because the
indexes are all sequential anyway.
For the same reason, we are actually able to remove the queue, and resort to a
simple counter. Once we know there is room, we just get the next index,
whatever it may be.
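A minimal synchronous sketch of the idea, not the actual driver code: the
class and method names below are hypothetical, and a plain counter stands in
for the semaphore. Room is reserved first, and the index is taken only at the
moment the packet is placed in the ring, so indexes come out in ring order no
matter which caller runs first.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the frontend's view of the shared ring.
class ring_slots {
    std::size_t _free;    // plays the role of the semaphore count
    unsigned _next = 0;   // simple counter that replaces the queue of free indexes
    unsigned _size;
public:
    explicit ring_slots(unsigned size) : _free(size), _size(size) {}

    // Step 1: reserve room only. No index is chosen yet.
    bool try_reserve() {
        if (_free == 0) {
            return false;  // a real semaphore would make the caller wait here
        }
        --_free;
        return true;
    }

    // Step 2: called right when the packet goes into the ring, so the
    // indexes handed out always match the order of insertion.
    unsigned take_index() {
        return _next++ % _size;
    }

    // The other side reported n entries as processed.
    void release(std::size_t n) {
        _free += n;
    }
};
```

With two callers that both reserved a slot, whichever one reaches
take_index() first gets the lower index, which is what the shared
produced / consumed counters require.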
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
We can't reach this place with a negative ref id, so let's assert to make sure
we're fine. Helps catch some bugs.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The index in the ring and the packet id tend to be the same. But they don't
have to be. There are some situations where the backend and the frontend get out
of sync with this, and this is totally valid.
One example is when the backend skb already has enough room to hold all of the
data being transmitted (netback.c, line 1611 @3.16). The netback will respond
immediately, even though there are other pending packets that are not yet fully
processed.
The ring index, then, must come from the rsp value, not from the req/rsp id.
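A simplified sketch of the distinction, under the assumption that the response
entry looks like the hypothetical tx_response below (real Xen netif responses
carry more than this): the slot to read is derived from the response counter,
while the id stored in that slot only says which packet was completed.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified response entry.
struct tx_response {
    uint16_t id;      // packet id chosen by the frontend, echoed by the backend
    int16_t status;
};

// Walks the responses in [rsp_cons, rsp_prod) and returns (slot, id) pairs.
// The slot comes from the response counter; the id is only read from the slot.
std::vector<std::pair<unsigned, uint16_t>>
consume_responses(const std::vector<tx_response>& ring,
                  unsigned& rsp_cons, unsigned rsp_prod) {
    std::vector<std::pair<unsigned, uint16_t>> done;
    for (; rsp_cons != rsp_prod; ++rsp_cons) {
        unsigned slot = static_cast<unsigned>(rsp_cons % ring.size());
        done.emplace_back(slot, ring[slot].id);
    }
    return done;
}
```

In this sketch slot 0 may well hold id 2; using the id as the ring index would
read the wrong entry.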
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
When a bulk of data is passed from the user application, the TCP layer calls
output only once to send data. This slows TX a lot, because a single
output call sends at most MSS bytes of data while we might have far more
than MSS to send. We send again only after the remote acks the data we
just sent. This slowness can be seen easily with TSO turned off.
To fix this, we should send as much as we are allowed to. This patch boosts
TX bandwidth from 0.N MiB/Sec to hundreds of MiB/Sec.
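The "send as much as we are allowed to" loop can be sketched roughly as
follows. This is an illustration only: tx_state and output_all() are
hypothetical names, and a single "window" field stands in for whatever limit
the stack applies to data in flight.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sender state, tracking sizes only.
struct tx_state {
    std::size_t unsent;   // bytes queued by the application
    std::size_t window;   // bytes we are still allowed to put in flight
    std::size_t mss;      // maximum segment size
};

// Keeps emitting segments until we run out of data or out of window,
// instead of sending one MSS and waiting for the ACK.
// Returns the sizes of the segments that were emitted.
std::vector<std::size_t> output_all(tx_state& s) {
    std::vector<std::size_t> segments;
    while (s.unsent > 0 && s.window > 0) {
        std::size_t len = std::min({s.unsent, s.window, s.mss});
        segments.push_back(len);
        s.unsent -= len;
        s.window -= len;
    }
    return segments;
}
```

With 4000 bytes queued, a 3000 byte window and an MSS of 1460, one pass emits
segments of 1460, 1460 and 80 bytes and leaves 1000 bytes for later.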
Before:
[asias@hjpc pingpong]$ go run client-txtx.go
Server: 192.168.66.123:10000
Connections: 1
Bytes Received(MiB): 10
Total Time(Secs): 76.217338072
Bandwidth(MiB/Sec): 0.13120374252054473
After:
[asias@hjpc pingpong]$ go run client-txtx.go
Server: 192.168.66.123:10000
Connections: 1
Bytes Received(MiB): 100
Total Time(Secs): 0.5105951040000001
Bandwidth(MiB/Sec): 195.84989988466475
This patch adds congestion control to our TCP according to RFC5681.
These four algorithms: slow start, congestion avoidance, fast
retransmit, and fast recovery, are added.
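For reference, a compact sketch of how the window arithmetic of those four
algorithms fits together, following the rules in RFC 5681. The struct and
method names are hypothetical and this is not the code added by the patch; it
tracks only cwnd and ssthresh and leaves out everything else a TCP needs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical congestion state, in bytes.
struct congestion {
    uint32_t smss;
    uint32_t cwnd;
    uint32_t ssthresh;
    uint32_t dupacks = 0;
    bool in_recovery = false;

    congestion(uint32_t smss_, uint32_t cwnd_, uint32_t ssthresh_)
        : smss(smss_), cwnd(cwnd_), ssthresh(ssthresh_) {}

    // An ACK that acknowledges `acked` bytes of new data.
    void on_new_ack(uint32_t acked) {
        dupacks = 0;
        if (in_recovery) {
            cwnd = ssthresh;          // leave fast recovery: deflate the window
            in_recovery = false;
        } else if (cwnd < ssthresh) {
            cwnd += std::min(acked, smss);                      // slow start
        } else {
            cwnd += std::max<uint32_t>(1, smss * smss / cwnd);  // congestion avoidance
        }
    }

    // A duplicate ACK. Returns true when the lost segment should be
    // retransmitted right away (fast retransmit).
    bool on_dup_ack(uint32_t flight_size) {
        ++dupacks;
        if (dupacks == 3) {
            ssthresh = std::max(flight_size / 2, 2 * smss);
            cwnd = ssthresh + 3 * smss;   // enter fast recovery
            in_recovery = true;
            return true;
        }
        if (dupacks > 3) {
            cwnd += smss;                 // inflate for each extra duplicate
        }
        return false;
    }

    // Retransmission timeout: fall back to a one-segment window.
    void on_timeout(uint32_t flight_size) {
        ssthresh = std::max(flight_size / 2, 2 * smss);
        cwnd = smss;
        dupacks = 0;
        in_recovery = false;
    }
};
```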
Reviewed-by: Pekka Enberg <penberg@cloudius-systems.com>
After the latest reactor rework from Nadav, it is no longer allowed to use eventfds
in the reactor for OSv. Change the code to use the reactor notifier instead.
We could just use that instead of semaphores altogether. But because the semaphore is
per listener, we need a translation anyway. So let's keep this one doing the interrupt
processing, and the semaphores doing the rest of the work.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
We don't really need to copy keys as the parser is not reused until
we're done.
Also, in case of a single key we don't use map_reduce() which
saves us one allocation (reducer).
Recursion takes up space on stack which takes up space in caches which
means less room for useful data.
In addition to that, a limit on iteration count can be larger than the
limit on recursion, because we're not limited by stack size here.
Also, recursion makes flame-graphs really hard to analyze because
keep_doing() frames appear at different levels of nesting in the
profile leading to many short "towers" instead of one big tower.
This change reuses the same counter for limiting iterations as is used
to limit the number of tasks executed by the reactor before polling.
There was a run-time parameter added for controlling task quota.
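The shape of the change, reduced to a synchronous sketch: task_quota and
keep_doing_iterative() are hypothetical names, and the real keep_doing() works
on futures rather than on a plain callable. The point is only that the loop
runs in a single frame and stops when a shared budget is used up.

```cpp
#include <cassert>
#include <functional>

// Hypothetical shared budget; in the commit this is the same counter that
// limits how many tasks the reactor runs before polling.
struct task_quota {
    unsigned remaining;
};

// Runs action() in a flat loop until it reports that it is done (returns
// false) or the quota is exhausted. Returns true if the loop finished, false
// if the caller has to reschedule it. The stack depth stays constant.
bool keep_doing_iterative(const std::function<bool()>& action, task_quota& q) {
    while (q.remaining > 0) {
        --q.remaining;
        if (!action()) {
            return true;
        }
    }
    return false;
}
```

A caller that gets false back would queue the loop again once the reactor has
polled and refilled the quota.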
Set RTO (retransmission timeout) according to RFC6298. Now, we have a
dynamic RTO instead of the hard-coded 3 seconds, and an exponential
back-off timer for retransmission.
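A sketch of the RFC 6298 computation, with alpha = 1/8 and beta = 1/4 done in
integer milliseconds. The class name and the constants below are assumptions
for the sake of the example (the 1 second floor and 60 second cap are the
values the RFC recommends); the patch itself may clamp differently.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using ms = std::chrono::milliseconds;

// Assumed bounds and clock granularity for this sketch.
constexpr ms min_rto{1000};
constexpr ms max_rto{60000};
constexpr ms granularity{1};

// Hypothetical estimator holding SRTT, RTTVAR and the current RTO.
class rto_estimator {
    ms _srtt{0};
    ms _rttvar{0};
    ms _rto{1000};       // initial RTO before any measurement
    bool _first = true;
public:
    // Feed one round-trip time measurement.
    void on_rtt_sample(ms r) {
        if (_first) {
            _srtt = r;
            _rttvar = r / 2;
            _first = false;
        } else {
            ms delta = _srtt > r ? _srtt - r : r - _srtt;
            _rttvar = (_rttvar * 3 + delta) / 4;   // (1 - beta) * RTTVAR + beta * |SRTT - R|
            _srtt = (_srtt * 7 + r) / 8;           // (1 - alpha) * SRTT + alpha * R
        }
        ms rto = _srtt + std::max(granularity, 4 * _rttvar);
        _rto = std::min(std::max(rto, min_rto), max_rto);
    }

    // Retransmission timer expired: back off exponentially.
    void on_timeout() {
        _rto = std::min(_rto * 2, max_rto);
    }

    ms rto() const { return _rto; }
};
```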
Tell host to interrupt less. This is useful for tx queue completion
since we do not care much when the tx is completed exactly.
Passed test with memcached and tcp_server.
When doing tcp rx testing, I saw a lot of retransmission because of the
delayed ACK. Our current delayed ACK algorithm does not comply with
what RFC 1122 suggests.
As described in RFC 1122, a host may delay sending an ACK response by up
to 500 ms. Additionally, with a stream of full-sized incoming segments,
ACK responses must be sent for every second segment.
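The rule can be sketched as a small state machine. The class below is
hypothetical and only decides when an ACK is due; arming and firing the actual
timer (which has to stay below 500 ms) is left to the caller.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical delayed-ACK decision logic.
class delayed_ack {
    unsigned _full_pending = 0;   // full-sized segments received but not yet ACKed
    bool _timer_armed = false;
public:
    // A data segment of `len` bytes arrived. Returns true when an ACK
    // should be sent right away.
    bool on_segment(std::size_t len, std::size_t mss) {
        if (len >= mss && ++_full_pending >= 2) {
            _full_pending = 0;
            _timer_armed = false;   // the immediate ACK covers everything so far
            return true;
        }
        _timer_armed = true;        // otherwise wait for more data or the timer
        return false;
    }

    // The delayed-ACK timer fired. Returns true if an ACK is still owed.
    bool on_timer() {
        bool owed = _timer_armed;
        _timer_armed = false;
        _full_pending = 0;
        return owed;
    }

    bool timer_armed() const { return _timer_armed; }
};
```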
=== Before ===
[asias@hjpc pingpong]$ go run client-rxrx.go
Bytes Sent(MiB): 100
Total Time(Secs): 322.620879376
Bandwidth(MiB/Sec): 0.30996133974160595
78 2.412385 192.168.66.100 -> 192.168.66.123 TCP 32174 37672 > 10000
[ACK] Seq=2149425323 Ack=1000001 Win=229 Len=32120
79 2.612985 192.168.66.100 -> 192.168.66.123 TCP 1514 [TCP Retransmission]
37672 > 10000 [ACK] Seq=2149425323 Ack=1000001 Win=229 Len=1460
80 2.613131 192.168.66.123 -> 192.168.66.100 TCP 54 10000 > 37672
[ACK] Seq=1000001 Ack=2149457443 Win=29200 Len=0
=== After ===
[asias@hjpc pingpong]$ go run client-rxrx.go
Bytes Sent(MiB): 100
Total Time(Secs): 0.244951095
Bandwidth(MiB/Sec): 408.2447559583271
No retransmission is seen.
Assuming the output_stream size is set to 8K, a sequence of writes of
lengths: 128B, 8K, 128B would yield three fragments of exactly those
sizes. This is not optimal as one could fit those in just 2 fragments
of up to 8K size. This change makes the output_stream yield 8K and
256B fragments for this case.
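The resulting behaviour can be modelled with a sketch that tracks fragment
sizes only. The fragmenter class is hypothetical and much simpler than the
real output_stream, which moves actual buffers and returns futures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the buffering policy: fill the current buffer to its
// full size before emitting it, and carry the remainder into the next one.
class fragmenter {
    std::size_t _size;                 // configured output_stream size
    std::size_t _fill = 0;             // bytes currently buffered
    std::vector<std::size_t> _out;     // sizes of the fragments emitted so far
public:
    explicit fragmenter(std::size_t size) : _size(size) {}

    void write(std::size_t n) {
        while (n > 0) {
            std::size_t take = std::min(n, _size - _fill);
            _fill += take;
            n -= take;
            if (_fill == _size) {      // buffer is full: emit one fragment
                _out.push_back(_fill);
                _fill = 0;
            }
        }
    }

    void flush() {
        if (_fill > 0) {
            _out.push_back(_fill);
            _fill = 0;
        }
    }

    const std::vector<std::size_t>& fragments() const { return _out; }
};
```

Writing 128, 8192 and 128 bytes into a fragmenter of size 8192 and then
flushing gives two fragments, 8192 and 256 bytes.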
output_stream can be used by only one fiber at a time, so from a
correctness point of view it doesn't matter if we set _end before or
after put(), but setting it before allows us to have one future
less, which is a win.
Commit 405f3ea8c3 changed the reactor so
that _network_stack is no longer default initialized to POSIX but to
nullptr. This caused tests to segfault, because they are not using the
application template, which takes care of configuration.
The fix is to call configure() so that the network stack will be set to
POSIX.