Set RTO (retransmission timeout) according to RFC 6298. We now have a
dynamic RTO instead of the hard-coded 3 seconds, and an exponential
back-off timer for retransmission.
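A minimal sketch of the RFC 6298 computation, for illustration only. The
class name rto_estimator, the 1 ms clock granularity and the 60 s cap are
assumptions, not taken from the patch itself.

```cpp
#include <algorithm>
#include <chrono>

using ms = std::chrono::milliseconds;

// Values suggested by RFC 6298; the clock granularity G is an assumption.
constexpr ms clock_granularity{1};
constexpr ms min_rto{1000};
constexpr ms max_rto{60000};

class rto_estimator {  // hypothetical name
    ms _srtt{0};
    ms _rttvar{0};
    ms _rto{1000};     // initial RTO before any RTT sample is available
    bool _have_sample = false;
public:
    // Feed one round-trip time measurement R.
    void on_rtt_sample(ms r) {
        if (!_have_sample) {
            _srtt = r;
            _rttvar = r / 2;
            _have_sample = true;
        } else {
            ms err = _srtt > r ? _srtt - r : r - _srtt;
            _rttvar = (3 * _rttvar + err) / 4;  // beta = 1/4, uses the old SRTT
            _srtt = (7 * _srtt + r) / 8;        // alpha = 1/8
        }
        ms rto = _srtt + std::max<ms>(clock_granularity, 4 * _rttvar);
        _rto = std::min<ms>(std::max<ms>(rto, min_rto), max_rto);
    }

    // Exponential back-off: double the RTO each time the timer fires.
    void on_timeout() { _rto = std::min<ms>(_rto * 2, max_rto); }

    ms rto() const { return _rto; }
};
```

A fresh RTT sample after a timeout recomputes the RTO from SRTT and RTTVAR,
which undoes the back-off.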
Tell host to interrupt less. This is useful for tx queue completion
since we do not care much when the tx is completed exactly.
Passed test with memcached and tcp_server.
When doing tcp rx testing, I saw a lot of retransmissions because of the
delayed ACK. Our current delayed ACK algorithm does not comply with
what RFC 1122 suggests.
As described in RFC 1122, a host may delay sending an ACK response by up
to 500 ms. Additionally, with a stream of full-sized incoming segments,
ACK responses must be sent for every second segment.
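The rule above can be sketched as a small policy object. This is a sketch
under assumptions: the name delayed_ack_policy and the split into two
callbacks are hypothetical, and the caller is assumed to arm a timer shorter
than 500 ms whenever on_full_segment() returns false.

```cpp
#include <chrono>

// Upper bound on the ACK delay from RFC 1122; real timers are usually shorter.
constexpr std::chrono::milliseconds max_ack_delay{500};

class delayed_ack_policy {  // hypothetical name
    unsigned _unacked_segments = 0;
public:
    // Called for each full-sized segment received.
    // Returns true if an ACK must be sent right away.
    bool on_full_segment() {
        if (++_unacked_segments >= 2) {
            _unacked_segments = 0;
            return true;   // ACK every second full-sized segment
        }
        return false;      // caller arms the delayed-ACK timer
    }

    // Called when the delayed-ACK timer fires.
    // Returns true if there is still a segment waiting to be acknowledged.
    bool on_delay_timer() {
        if (_unacked_segments == 0) {
            return false;
        }
        _unacked_segments = 0;
        return true;
    }
};
```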
=== Before ===
[asias@hjpc pingpong]$ go run client-rxrx.go
Bytes Sent(MiB): 100
Total Time(Secs): 322.620879376
Bandwidth(MiB/Sec): 0.30996133974160595
78 2.412385 192.168.66.100 -> 192.168.66.123 TCP 32174 37672 > 10000
[ACK] Seq=2149425323 Ack=1000001 Win=229 Len=32120
79 2.612985 192.168.66.100 -> 192.168.66.123 TCP 1514 [TCP Retransmission]
37672 > 10000 [ACK] Seq=2149425323 Ack=1000001 Win=229 Len=1460
80 2.613131 192.168.66.123 -> 192.168.66.100 TCP 54 10000 > 37672
[ACK] Seq=1000001 Ack=2149457443 Win=29200 Len=0
=== After ===
[asias@hjpc pingpong]$ go run client-rxrx.go
Bytes Sent(MiB): 100
Total Time(Secs): 0.244951095
Bandwidth(MiB/Sec): 408.2447559583271
No retransmission is seen.
Now that our reactor supports non-file-descriptor notification
mechanisms, switch to using one instead of eventfd when notifying
of virtio interrupts.
This will allow us to change the OSv enable_interrupt() code to
run the handler directly, not in a separate thread, because it
no longer needs to do a sleepable write() to an eventfd file descriptor.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
From Tomasz:
"There will be now a separate DB per core, each serving a subset of the key
space (sharding). From the outside it appears to behave as one DB."
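One common way to split a key space across cores is to hash the key and take
it modulo the core count. The sketch below assumes that scheme; the function
name shard_of is hypothetical and the actual mapping used by the DB may
differ.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Maps a key to the core (shard) that owns it. shard_count must be > 0.
inline unsigned shard_of(const std::string& key, unsigned shard_count) {
    std::size_t h = std::hash<std::string>{}(key);
    return static_cast<unsigned>(h % shard_count);
}
```

Every request for a given key is routed to the same core, so each per-core DB
can be accessed without locking.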
The POSIX stack does not allow one to bind more than one socket to a given
port. The native stack, on the other hand, does. The way services are set up
depends on that. For instance, on the native stack one might want to start
the service on all cores, but on the POSIX stack only on one of them.
If we don't, we start the system before we have an IP address, and when
we actually do get the IP address, we fail an assert on the _config promise,
which was already fulfilled.
Representing an event channel as an integer poses a problem: waiting on an
integer port doesn't work well when the same event channel is assigned for
both tx and rx. The future will be ready for one of the sides, but we won't
process the other.
One alternative is to have conditions in the future processing, and in case the
event channels are bound to the same port, process both events. But a better
solution is to use a class to represent the bound ports, and instances of that
class will have their own pending methods.
Infrastructure will be written in a follow-up patch to make sure that all
listeners to the same port will be made ready when an interrupt kicks in.
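A rough sketch of the idea, with plain flags standing in for futures. The
names evtchn_port and port_registry are hypothetical and do not reflect the
classes in the actual patch.

```cpp
#include <unordered_map>
#include <vector>

// One instance per listener (e.g. one for tx, one for rx), even when both
// are bound to the same event channel number.
class evtchn_port {  // hypothetical name
    int _number;
    bool _pending = false;
public:
    explicit evtchn_port(int number) : _number(number) {}
    int number() const { return _number; }
    void mark_pending() { _pending = true; }
    // Returns whether an event was pending and clears the flag.
    bool consume_pending() {
        bool was_pending = _pending;
        _pending = false;
        return was_pending;
    }
};

class port_registry {  // hypothetical name
    std::unordered_map<int, std::vector<evtchn_port*>> _listeners;
public:
    void bind(evtchn_port& p) { _listeners[p.number()].push_back(&p); }

    // Wake every listener bound to the port that raised the interrupt.
    void on_interrupt(int number) {
        auto it = _listeners.find(number);
        if (it == _listeners.end()) {
            return;
        }
        for (evtchn_port* p : it->second) {
            p->mark_pending();
        }
    }
};
```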
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The backend may be completely silent about the existence of the split channels feature.
In that case, trying to read through the template directly would cause an exception,
since we can't convert the empty string.
The backend-id, OTOH, is guaranteed to exist and wasn't using the template signature.
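For illustration, one defensive way to read an optional key is to fall back
to a default when the string is empty or unparsable. The helper parse_or is
hypothetical; the patch itself may solve the problem differently.

```cpp
#include <sstream>
#include <string>

// Converts raw to T, returning fallback if raw is empty or not convertible.
template <typename T>
T parse_or(const std::string& raw, T fallback) {
    if (raw.empty()) {
        return fallback;   // key absent: the backend said nothing about it
    }
    std::istringstream in(raw);
    T value;
    if (!(in >> value)) {
        return fallback;
    }
    return value;
}
```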
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
We copy our grant reference into a temporary, so free_ref() does not
clear the real entry, causing an assert() to trigger later on.
Fix by capturing the grant reference entry by reference.
With this, the xen network driver survives multiple trips around the ring.
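The bug class can be reproduced in a few lines. The types below are
stand-ins: grant_entry and the -1 "free" marker are assumptions, and only
free_ref() is a name taken from the message above.

```cpp
#include <cstddef>
#include <vector>

struct grant_entry {
    int ref = -1;  // -1 marks a free entry (assumed convention)
};

void free_ref(grant_entry& e) { e.ref = -1; }

// Buggy: operates on a temporary copy, the table entry is left untouched.
void release_by_copy(std::vector<grant_entry>& table, std::size_t i) {
    grant_entry entry = table[i];
    free_ref(entry);
}

// Fixed: binds a reference, so free_ref() clears the real entry.
void release_by_reference(std::vector<grant_entry>& table, std::size_t i) {
    grant_entry& entry = table[i];
    free_ref(entry);
}
```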
Keep recycling free ring entries back into the receive ring so we can
receive more than 256 packets.
The code is a little lame at the moment since it writes the index and
notifies the host for every frame, but that can be adjusted later.
There is no reason to wait when pushing back a free id - there is nothing
that could possibly block there.
Switch from a queue<> to an std::queue<> and use a semaphore to guard
popping from the queue.
Running the tcp stream test with --smp > 1, sometimes the server sends TSO
frames, sometimes it does not. If we set --smp = 1, the server always
sends TSO frames. This is because the proxy device does not parse all the
features in the opts. We should copy the _hw_features from the real
device but it is not easy. For now, we simply duplicate the parse code.
Fix tcp_server tx test. We still have more to do.
Native stack:
$ go run client-txtx.go
Bytes Received(MiB): 1000
Total Time(Secs): 1.567927562
Bandwidth(MiB/Sec): 637.7845662234746
Posix stack:
$ go run client-txtx.go
Bytes Received(MiB): 1000
Total Time(Secs): 1.014354958
Bandwidth(MiB/Sec): 985.8481906291427
Note: client-txtx uses 100 concurrent connections.
With TSO enabled, we can see an Ethernet frame larger than 64K on the tap
device. Wireshark is unable to handle this. It complains:
The capture file appears to be damaged or corrupt.
(pcapng_read_packet_block: cap_len 65549 is larger than
WTAP_MAX_PACKET_SIZE 65535.)
Handle buffer recycling. Right now it is very simple: allocate a new receive
buffer after a successful receive, and mark the tx spot free when we get the tx
event notification.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Instead of returning a reference to a grant that is already present in an
array, defer the initialization. This is how the OSv driver handles it, and I
honestly am not sure if this is really needed: it seems to me we should be able
to just reuse the old grants. I need to check in the backend code if we can be
any smarter than this.
However, right now we need to do something to recycle the buffers, and just
re-doing the refs would lead to inconsistencies. So the best option for now is
to close and reopen the grants, and then later on rework this in a way that
works for both the initial setup and the recycle.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Right now, we allocate the whole index, and notify the backend that we have
produced nr_ents indexes. If we do that, we cannot increment the producer index
when we receive a new packet. This would make the index overflow, and
it is responsible for the biggest part of the slowdown we are
seeing.
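A sketch of one way to keep a free-running producer index from overrunning
the ring: only advance it when a slot is actually posted, and refuse to post
while the ring is full. The class rx_ring_index and its members are
hypothetical and simplified compared to Xen's shared ring structures.

```cpp
#include <cstdint>

class rx_ring_index {  // hypothetical name
    uint32_t _size;          // number of ring entries, must be a power of two
    uint32_t _req_prod = 0;  // free-running count of requests posted
    uint32_t _rsp_cons = 0;  // free-running count of responses consumed
public:
    explicit rx_ring_index(uint32_t size) : _size(size) {}

    bool full() const { return _req_prod - _rsp_cons == _size; }

    // Claims the next slot and advances the producer index by one.
    // Returns false, leaving slot untouched, when no slot is free.
    bool post(uint32_t& slot) {
        if (full()) {
            return false;
        }
        slot = _req_prod & (_size - 1);
        ++_req_prod;
        return true;
    }

    // Called after a received packet was processed; frees one slot.
    void consume() {
        if (_rsp_cons != _req_prod) {
            ++_rsp_cons;
        }
    }

    uint32_t req_prod() const { return _req_prod; }
};
```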
Before this patch, we were seeing 2s RTT for pings. After the patch:
64 bytes from 192.168.100.79: icmp_seq=1 ttl=64 time=0.437 ms
64 bytes from 192.168.100.79: icmp_seq=2 ttl=64 time=0.431 ms
64 bytes from 192.168.100.79: icmp_seq=3 ttl=64 time=0.475 ms
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Aside from managing the grant references, we also need to manage the positional
indexes in the array. We need to keep track of which indexes are free, and
which are used. Because we need the actual position number to fill xen's data
structures, I figured we could use a queue and then fill it up with all the
integers in our range. The queue is already futurized, so that's easy.
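A synchronous sketch of that free-index queue. The real queue is futurized;
this version uses std::queue and a try_get() that fails instead of waiting,
and the name free_id_pool is hypothetical.

```cpp
#include <queue>

class free_id_pool {  // hypothetical name
    std::queue<unsigned> _ids;
public:
    // Start with every index in [0, count) marked free.
    explicit free_id_pool(unsigned count) {
        for (unsigned i = 0; i < count; ++i) {
            _ids.push(i);
        }
    }

    // Takes a free index; returns false if none is available.
    bool try_get(unsigned& id) {
        if (_ids.empty()) {
            return false;
        }
        id = _ids.front();
        _ids.pop();
        return true;
    }

    // Returns an index to the pool once its slot is no longer in use.
    void put(unsigned id) { _ids.push(id); }
};
```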
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Some packets, like arp replies, are broadcast to all cpus for handling,
but only the packet structure is copied for each cpu; the actual packet data
is the same for all of them. Currently the networking stack mangles
packet data during its travel up the stack while doing ntoh()
translations, which obviously cannot work for broadcast packets. This
patch fixes the code to not modify packet data while doing ntoh(), but
do it in a stack-allocated copy of the data instead.
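The approach can be sketched with a UDP-style header. The struct and function
names are hypothetical; the point is only that the shared buffer is read
through a const pointer and the byte swapping happens on a local copy.

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <cstring>

struct udp_header {  // illustrative layout, fields in network byte order
    uint16_t src_port;
    uint16_t dst_port;
    uint16_t len;
    uint16_t cksum;
};

// Returns a host-byte-order copy; the argument is taken by value.
inline udp_header to_host(udp_header h) {
    h.src_port = ntohs(h.src_port);
    h.dst_port = ntohs(h.dst_port);
    h.len = ntohs(h.len);
    h.cksum = ntohs(h.cksum);
    return h;
}

// Parses the header out of packet data that other cpus may also be reading.
inline udp_header parse_udp(const uint8_t* shared_data) {
    udp_header h;
    std::memcpy(&h, shared_data, sizeof(h));  // stack-allocated copy
    return to_host(h);
}
```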
Fixes the following link errors when Xen support is disabled:
build/release/net/native-stack.o: In function `net::add_native_net_options_description(boost::program_options::options_description&)':
/seastar/net/native-stack.cc:101: undefined reference to `get_xenfront_net_options_description()'
build/release/net/native-stack.o: In function `net::create_native_net_device(boost::program_options::variables_map)':
/seastar/net/native-stack.cc:93: undefined reference to `create_xenfront_net_device(boost::program_options::variables_map, bool)'
collect2: error: ld returned 1 exit status
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
Checksum offload cannot be disabled in Xen (or at least, I haven't figured
out how). Advertise it as enabled, so that tcp doesn't drop packets as
failing their checksum.
Still need to flesh out the transmit path.
With this, seastar sends SYN/ACK packets in response to connection requests.