This patch introduces logic to divide cpus between the available hw queue
pairs. Each cpu that owns a hw qp gets a set of cpus to distribute traffic
to. The algorithm doesn't take topology into account yet.
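The distribution itself is described but not shown; a minimal sketch of one possible (topology-unaware) scheme is below. The function name `distribute_cpus` and the round-robin assignment are illustrative assumptions, not the patch's actual code.

```cpp
#include <vector>

// Hypothetical sketch: map each cpu to one of the available hw queue
// pairs, round-robin, ignoring topology (as the patch currently does).
// Returns, for each queue, the list of cpus whose traffic it serves.
std::vector<std::vector<unsigned>> distribute_cpus(unsigned ncpus, unsigned nqueues) {
    std::vector<std::vector<unsigned>> cpus_per_queue(nqueues);
    for (unsigned cpu = 0; cpu < ncpus; ++cpu) {
        cpus_per_queue[cpu % nqueues].push_back(cpu);
    }
    return cpus_per_queue;
}
```

A topology-aware version would group cpus sharing a cache or NUMA node onto the same queue instead of striding blindly.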
This patch uses the NIC's capability to calculate in hardware the IP, TCP
and UDP checksums on outgoing packets, instead of doing this on the
sending CPU. This saves quite a bit of computation (especially for
the TCP/UDP checksum of full-sized packets), and avoids cache pollution on
the CPU when sending cold data.
On my setup this patch improves the performance of a single-cpu memcached
by 6%. Together with the recent patch for receive-side checksum offloading,
the total improvement is 10%.
This patch is somewhat complicated by the fact that we have so many different
combinations of checksum-offloading capabilities: while virtio can only
offload the layer-4 checksum (TCP/UDP), dpdk lets us offload both the IP and
layer-4 checksums. Moreover, some packets are IP but not TCP/UDP
(e.g., ICMP), and some packets are not even IP (e.g., ARP), so this
patch modifies a few of the hardware-feature flags and the per-packet
offload-information flags to fit our new needs.
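The decision matrix those flags encode can be sketched as follows. The names `offload_info`, `proto` and `choose_offloads` are hypothetical; this only illustrates how device capability and packet type combine, not the driver's real flag layout.

```cpp
// Hypothetical sketch of per-packet offload decisions.
struct offload_info {
    bool needs_ip_csum = false;   // dpdk-style hw can offload; virtio cannot
    bool needs_l4_csum = false;   // both virtio and dpdk can offload
};

enum class proto { arp, icmp, tcp, udp };

// Ask the hardware only for checksums it supports (hw_ip_csum,
// hw_l4_csum) and that the packet actually carries: ARP has neither,
// ICMP has only an IP header checksum, TCP/UDP have both.
offload_info choose_offloads(proto p, bool hw_ip_csum, bool hw_l4_csum) {
    offload_info oi;
    bool is_ip = (p != proto::arp);
    bool is_l4 = (p == proto::tcp || p == proto::udp);
    oi.needs_ip_csum = is_ip && hw_ip_csum;
    oi.needs_l4_csum = is_l4 && hw_l4_csum;
    return oi;
}
```

Packets for which the hardware cannot help would still be checksummed in software on the sending CPU.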
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch adds a new class, distributed_device, which is responsible for
initializing the HW device and is shared between all cpus. The old device
class's responsibility becomes managing an rx/tx queue pair, and it is local
to each cpu. Each cpu has to call distributed_device::init_local_queue() to
create its own device. The logic to distribute cpus between the available
queues (in case there are not enough queues for all cpus) currently lives in
distributed_device but is not really implemented yet, so only the one-queue
and queues == cpus scenarios are supported for now; this can be fixed later.
The plan is to rename "distributed_device" to "device" and "device"
to "queue_pair" in later patches.
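The shape of that split can be sketched as below. Everything here is an assumption for illustration (the real classes carry hw state, rings, etc.); only the division of responsibility and the two supported queue/cpu combinations come from the patch description.

```cpp
#include <memory>
#include <stdexcept>

// Per-cpu rx/tx queue pair (the old "device" class's new role).
struct queue_pair {
    unsigned qid;
};

// Shared across all cpus; initializes the HW device once.
class distributed_device {
    unsigned _nqueues;
public:
    explicit distributed_device(unsigned nqueues) : _nqueues(nqueues) {}

    // Each cpu calls this to create its local queue pair.  Only the
    // one-queue and queues == cpus cases are supported, as in the patch.
    std::unique_ptr<queue_pair> init_local_queue(unsigned cpu_id, unsigned ncpus) {
        if (_nqueues != 1 && _nqueues != ncpus) {
            throw std::runtime_error("unsupported queue/cpu combination");
        }
        unsigned qid = (_nqueues == 1) ? 0 : cpu_id;
        return std::make_unique<queue_pair>(queue_pair{qid});
    }
};
```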
Currently each cpu creates its network device as part of native networking
stack creation, and all cpus create the native networking stack independently,
which makes it impossible to use data initialized by one cpu in another
cpu's networking device initialization. For multiqueue devices, some
parts of the initialization often have to be handled by one cpu while all
other cpus wait for it before creating their own network devices.
Even without multiqueue, proxy devices should be created after the master
device, so that a proxy device can get a pointer to the master
at creation time (the existing code uses a global per-cpu device pointer and
assumes that the master device is created on cpu 0 to compensate for the lack
of ordering).
This patch makes it possible to delay native networking stack creation
until the network device is created. It allows one cpu to be responsible
for creating the network devices on multiple cpus. A single-queue device
initializes the master device on one cpu and calls the other cpus with a
pointer to the master device and its cpu id, which are used in proxy device
creation. This removes the per-cpu device pointer and the "master on cpu 0"
assumption from the code, since the master and slave devices now know
about each other and can communicate directly.
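A stripped-down sketch of that ordering is below. The structs and `create_devices` are hypothetical stand-ins (the real code hands the pointer across cpus via the scheduler); the point shown is that every proxy receives the master pointer and master cpu id explicitly, so no global per-cpu pointer or cpu-0 assumption is needed.

```cpp
#include <memory>
#include <vector>

struct master_device { unsigned cpu_id; };            // created first, on any cpu
struct proxy_device  { master_device* master;         // handed in at creation
                       unsigned master_cpu; };

// One cpu builds the master, then creates a proxy for every other cpu,
// passing each one a pointer to the master and its cpu id directly.
std::vector<std::unique_ptr<proxy_device>>
create_devices(master_device& master, unsigned ncpus) {
    std::vector<std::unique_ptr<proxy_device>> proxies;
    for (unsigned cpu = 0; cpu < ncpus; ++cpu) {
        if (cpu == master.cpu_id) continue;           // master cpu keeps the master
        proxies.push_back(std::make_unique<proxy_device>(
            proxy_device{&master, master.cpu_id}));
    }
    return proxies;
}
```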
If we don't have split channels, we need to delete the relevant property.
Because xs_rm() returns true if the feature does not exist, simply deleting
all of them won't affect the transaction, so we don't need any
conditional test.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
We are adding everything we read into the features array. Because the
destructor removes everything in the features list, we end up
removing more than we should. Things like the mac address, handle, etc. should
never be deleted.
This is not a problem for OSv because usually, after the destructor is called,
the whole guest is down. But for userspace the network card is left there,
and will cease to work if we delete too much.
After we do that with the _features array (restoring its original intent), it
becomes redundant with the features nack.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
This is not required for OSv, but is required for userspace operation.
It won't work without it.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
When the backend advertises "feature-rx-copy", the frontend should register
"request-rx-copy". The local hypervisor seems to be forgiving about this, but
the one in AWS is not, and doubly so:
first, it doesn't recognize these as the same; and second, it refuses to
connect the backend if this feature is not advertised by the frontend.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The ring processing is almost the same for both rx and tx, with the exception
of the core of the action. We can actually unify them nicely with some
template programming.
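A minimal sketch of the idea, assuming a `process_ring` helper that is not the driver's actual code: the walk over response entries is shared, and the rx/tx-specific work is passed in as a callable template parameter.

```cpp
#include <vector>

// Hypothetical sketch: rx and tx walk their rings identically; only the
// per-entry action differs, so it becomes a template parameter.
// rx would deliver the received packet; tx would free the completed slot.
template <typename Entry, typename Action>
unsigned process_ring(std::vector<Entry>& ring, unsigned cons, unsigned prod,
                      Action action) {
    while (cons != prod) {
        action(ring[cons % ring.size()]);
        ++cons;
    }
    return cons;   // new consumer index
}
```

A lambda (or functor) per direction then instantiates the shared loop, so fixes to the ring walk apply to both paths at once.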
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
There are two things we can do that will lead to fewer interrupts being sent.
The first is to re-read the rsp_cons value at the end of every iteration:
if the backend produces more frames in the meantime, we can process them
in the same round, without getting another interrupt.
The other is to set rsp_event only after all the frames are processed.
As a matter of fact, the tx and rx rings each did one of these, but not the
same one. The next patch will unify the ring code to avoid problems like that
in the future.
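Both techniques can be sketched together as below. This is not the driver's code; the names follow the general Xen shared-ring convention (backend writes `rsp_prod`, frontend publishes `rsp_event` to request its next interrupt), and the loop structure is an illustrative assumption.

```cpp
// Hypothetical sketch of both interrupt-avoidance techniques.
struct shared_ring {
    volatile unsigned rsp_prod; // written by the backend
    unsigned rsp_event;         // backend interrupts when prod passes this
};

// Drain everything visible, then re-check the producer index in case
// more frames arrived meanwhile; publish rsp_event only once, at the end.
template <typename Consume>
unsigned drain_ring(shared_ring& r, unsigned cons, Consume consume) {
    unsigned prod = r.rsp_prod;
    do {
        while (cons != prod) {
            consume(cons++);           // handle one response entry
        }
        prod = r.rsp_prod;             // catch frames produced meanwhile
    } while (cons != prod);
    r.rsp_event = cons + 1;            // only interrupt us for new work
    return cons;
}
```

(Real code would also need memory barriers between reading the producer index and touching the entries; they are omitted here.)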
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The xen protocol works by filling positions in a circular ring. The
indexes become free to be used again when they are processed by the other side.
There is a problem, however: those indexes must be sequential, because all the
two sides share is a produced/consumed index. But there are situations in which
we call get_index(), which produces an index X, and the .then() clause
schedules some other caller of send() to run in our place. That one, in turn,
can call get_index() and then create a packet with index X + 1 that is put in
the ring before the packet with index X.
If the other end processes this packet very fast, it will respond saying "I
have processed packets up to X + 1". We will then mark X as
processed as well, since it comes before X + 1, and when X is really
processed, chaos will ensue.
The solution is to have the semaphore count how many spaces we
have in the ring. Once we guarantee that the current caller has space, we
compute get_index() inside the .then() clause. This works well because the
indexes are all sequential anyway.
For the same reason, we are actually able to remove the queue and resort to a
simple counter. Once we know there is room, we just take the next index,
whatever it may be.
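A synchronous sketch of the fixed ordering, with the semaphore reduced to a plain counter since no futures are involved; the class and method names are hypothetical. The key point is that the index is taken only after the space reservation succeeds, mirroring moving get_index() inside the .then() clause.

```cpp
// Hypothetical sketch: reserve a ring slot first, pick the index second.
struct tx_ring {
    unsigned slots;         // semaphore-like count of free ring slots
    unsigned next_idx = 0;  // simple counter replacing the old index queue
    explicit tx_ring(unsigned n) : slots(n) {}

    // In the futurized version the caller waits on the semaphore and the
    // index is computed inside .then(); here try_send() fails instead of
    // suspending.  Because the index is taken only after the reservation,
    // indexes are handed out strictly in ring order.
    bool try_send(unsigned& idx) {
        if (slots == 0) return false;   // ring full: caller would wait
        --slots;
        idx = next_idx++;
        return true;
    }
    void complete(unsigned /*idx*/) { ++slots; }  // backend consumed a slot
};
```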
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
We can't reach this place with a negative ref id, so let's assert to make sure
we're fine. This helps catch some bugs.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The index in the ring and the packet id tend to be the same, but they don't
have to be. There are situations where the backend and the frontend get out
of sync on this, and that is totally valid.
One example is when the backend skb already has enough room to hold all of the
data being transmitted (netback.c, line 1611 @3.16). The netback will respond
immediately, even though there are other pending packets that are not yet fully
processed.
The ring index, then, must come from the rsp value, not from the req/rsp id.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Representing an event channel as an integer poses a problem: waiting on an
integer port doesn't work well when the same event channel is
assigned to both tx and rx. The future will be made ready for one of the
sides, but we won't process the other.
One alternative is to add conditions to the future processing and, in case the
event channels are bound to the same port, process both events. But a better
solution is to use a class to represent the bound ports, with each instance
keeping its own pending state.
Infrastructure will be written in a following patch to make sure that all
listeners on the same port are made ready when an interrupt kicks in.
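A minimal sketch of the idea, with hypothetical names (`port`, `kick`): each bound port is an object with its own pending state, and an interrupt wakes every listener bound to the same port number instead of completing a single integer-keyed wait.

```cpp
#include <vector>

// Hypothetical sketch: a bound event-channel port as an object with its
// own pending flag, so tx and rx sharing one port number are woken
// independently instead of racing for a single future.
class port {
    bool _pending = false;
public:
    void notify()  { _pending = true; }
    bool consume() { bool p = _pending; _pending = false; return p; }
};

// The interrupt path kicks every listener bound to the same port number.
void kick(std::vector<port*>& listeners) {
    for (auto* p : listeners) {
        p->notify();
    }
}
```

In the futurized driver, `consume()` would correspond to a per-instance `pending()` future becoming ready.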
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The backend may be completely silent about the existence of the split channels feature.
In that case, trying to read through the template directly would cause an exception,
since we can't convert the empty string.
The backend-id, OTOH, is guaranteed to exist and wasn't using the template signature.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
We copy our grant reference into a temporary, so free_ref() does not
clear the real entry, causing an assert() to trigger later on.
Fix by capturing the grant reference entry by reference.
With this, the xen network driver survives multiple trips around the ring.
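The copy-vs-reference bug can be shown in miniature. The table, `free_ref()` and `release()` below are hypothetical stand-ins for the driver's grant-table handling; the point is only that binding the entry by value would leave the real slot untouched.

```cpp
#include <array>

// Hypothetical grant table; -1 marks a free entry.
std::array<int, 4> grant_table = {10, 11, 12, 13};

void free_ref(int& entry) { entry = -1; }

// Buggy version: 'int g = grant_table[i];' copies the entry, so
// free_ref(g) clears only the temporary and the table keeps a stale
// reference (tripping an assert on the next trip around the ring).
// Fixed version: bind the entry by reference so the real slot is cleared.
void release(int i) {
    int& g = grant_table[i];   // reference, not a copy
    free_ref(g);
}
```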
Keep recycling free ring entries back into the receive ring so we can
receive more than 256 packets.
The code is a little lame at the moment since it writes the index and
notifies the host for every frame, but that can be adjusted later.
There is no reason to wait when pushing back a free id - there is nothing
that could possibly block there.
Switch from a queue<> to an std::queue<> and use a semaphore to guard
popping from the queue.
Handle buffer recycling. Right now it is very simple: allocate a new receive
buffer after a successful reception, and mark the tx spot free when we get the
tx event notification.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Instead of returning a reference to a grant that is already present in an
array, defer the initialization. This is how the OSv driver handles it, and I
honestly am not sure if this is really needed: it seems to me we should be able
to just reuse the old grants. I need to check in the backend code if we can be
any smarter than this.
However, right now we need to do something to recycle the buffers, and just
re-doing the refs would lead to inconsistencies. So the best option for now is
to close and reopen the grants, and then later rework this in a way that works
for both the initial setup and the recycle.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Right now, we allocate the whole index space and notify the backend that we
have produced nr_ents indexes. If we do that, we cannot increment the producer
index when we receive a new packet: the index would overflow. This is
responsible for the biggest part of the slowdown we are seeing.
Before this patch, we were seeing 2s RTT for pings. After the patch:
64 bytes from 192.168.100.79: icmp_seq=1 ttl=64 time=0.437 ms
64 bytes from 192.168.100.79: icmp_seq=2 ttl=64 time=0.431 ms
64 bytes from 192.168.100.79: icmp_seq=3 ttl=64 time=0.475 ms
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Aside from managing the grant references, we also need to manage the positional
indexes in the array: we need to keep track of which indexes are free and
which are used. Because we need the actual position number to fill xen's data
structures, I figured we could use a queue and fill it up with all the
integers in our range. The queue is already futurized, so that's easy.
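Stripped of the futurization, the idea looks like this; `index_pool` is a hypothetical name, and the real driver's queue would suspend on `get()` when empty rather than requiring a prior `empty()` check.

```cpp
#include <queue>

// Hypothetical sketch: pre-fill a queue with every position in the
// ring's range; pop to allocate a slot, push to recycle it.
class index_pool {
    std::queue<unsigned> _free;
public:
    explicit index_pool(unsigned n) {
        for (unsigned i = 0; i < n; ++i) {
            _free.push(i);
        }
    }
    unsigned get() { unsigned i = _free.front(); _free.pop(); return i; }
    void put(unsigned i) { _free.push(i); }
    bool empty() const { return _free.empty(); }
};
```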
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Checksum offload cannot be disabled in Xen (or at least, I haven't figured
out how). Advertise it as enabled, so that tcp doesn't drop packets as
failing their checksum.
Still need to flesh out the transmit path.
With this, seastar sends SYN/ACK packets in response to connection requests.
We prepared N buffers, but only told the host about one. This meant the host
stopped forwarding received packets almost immediately.
Fix by writing the Xen-visible ring index correctly.
This is the basic support for xenfront. It can be used in domU, provided there
is a network interface to be hijacked.
The code that follows is just the mechanics of managing the grants, event
channels, etc.
However, it does not yet work: I can't see netback injecting any data into it.
I am still debugging the protocol, but I wanted to flush the current state.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>