This patch introduces logic to divide cpus between the available hw queue
pairs. Each cpu with a hw qp gets a set of cpus to distribute traffic
to. The algorithm doesn't take any topology considerations into account yet.
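A minimal sketch of one such division, assuming a plain round-robin
policy (the message above does not spell out the policy). The name
distribute_cpus() and its parameters are hypothetical, not Seastar API,
and the sketch assumes the first nr_queues cpus are the ones that own a
hw queue pair:

```cpp
#include <vector>

// Hypothetical helper: cpus 0..nr_queues-1 each own a hw queue pair.
// Returns, for each queue-owning cpu, the set of cpus it distributes
// traffic to (including itself). Plain round-robin, no topology awareness.
std::vector<std::vector<unsigned>> distribute_cpus(unsigned nr_cpus, unsigned nr_queues) {
    std::vector<std::vector<unsigned>> sets(nr_queues);
    if (nr_queues == 0) {
        return sets;  // nothing to distribute to
    }
    for (unsigned cpu = 0; cpu < nr_cpus; ++cpu) {
        sets[cpu % nr_queues].push_back(cpu);
    }
    return sets;
}
```

With 5 cpus and 2 queues, cpu 0 serves cpus {0, 2, 4} and cpu 1 serves
cpus {1, 3}.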
Move completion handling (destroy packet, adjust descriptor count) to
a completion function rather than a future. This reduces allocations and
the number of tasks executed.
Currently vring request completions are handled by fulfilling a promise
contained in the request. While promises are very flexible, this comes
at a cost (allocating and executing a task), and this flexibility is unneeded
when request handling is very regular (such as in virtio-net rx and tx
completion handling).
Make vring more flexible by allowing the completion function to be specified
as a template parameter. No changes to the actual users - they now specify
the completion function as fulfilling the same promise as vring previously
did.
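A minimal sketch of the idea, not the actual vring code: simple_vring,
post(), complete_one() and count_bytes are hypothetical names, and the
real class deals with descriptors and buffers that are omitted here.

```cpp
#include <cstddef>
#include <deque>
#include <utility>

// Hypothetical, heavily simplified ring. Completion is any callable that
// takes the number of bytes the host reported for a finished request.
template <typename Completion>
class simple_vring {
    std::deque<Completion> _pending;
public:
    void post(Completion c) { _pending.push_back(std::move(c)); }
    // Called when the host marks the oldest request as used.
    void complete_one(std::size_t len) {
        if (_pending.empty()) {
            return;
        }
        Completion c = std::move(_pending.front());
        _pending.pop_front();
        c(len);  // direct call: no promise, no task allocation
    }
    std::size_t pending() const { return _pending.size(); }
};

// A very regular completion, e.g. for tx: just account for the bytes.
struct count_bytes {
    std::size_t* total;
    void operator()(std::size_t len) const { *total += len; }
};
```

A user that still wants the old behaviour can pass a completion type
whose call operator fulfils a promise.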
Instead of placing packets directly into the virtio ring, add them to
a temporary queue, and flush it when we are polled. This reduces
cross-cpu writes and kicks.
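A minimal sketch of the staging scheme, with hypothetical names
(packet, batching_tx, send(), poll()) and a std::vector standing in for
the virtio ring:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a network packet.
struct packet { int id; };

class batching_tx {
    std::vector<packet> _staged;  // temporary queue, private to the sender
    std::vector<packet> _ring;    // stands in for the shared virtio ring
    unsigned _kicks = 0;          // number of host notifications issued
public:
    // Sending only touches the local staging queue.
    void send(packet p) { _staged.push_back(std::move(p)); }

    // Called from the poller: move everything staged into the ring and
    // kick the host once for the whole batch.
    bool poll() {
        if (_staged.empty()) {
            return false;
        }
        for (auto& p : _staged) {
            _ring.push_back(std::move(p));
        }
        _staged.clear();
        ++_kicks;
        return true;
    }

    unsigned kicks() const { return _kicks; }
    std::size_t in_ring() const { return _ring.size(); }
};
```

Three send() calls followed by one poll() put three packets in the ring
with a single kick.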
This patch uses the NIC's capability to calculate the IP, TCP and UDP
checksums of outgoing packets in hardware, instead of us doing this on the
sending CPU. This can save us quite a bit of calculation (especially for
the TCP/UDP checksum of full-sized packets), and avoids cache pollution on
the CPU when sending cold data.
On my setup this patch improves the performance of a single-cpu memcached
by 6%. Together with the recent patch for receive-side checksum offloading,
the total improvement is 10%.
This patch is somewhat complicated by the fact that we have so many different
combinations of checksum-offloading capabilities. While virtio can only
offload layer-4 checksumming (tcp/udp), dpdk lets us offload both ip and
layer-4 checksum. Moreover, some packets are just IP but not TCP/UDP
(e.g., ICMP), and some packets are not even IP (e.g., ARP), so this
patch modifies a few of the hardware-features flags and the per-packet
offload-information flags to fit our new needs.
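A minimal sketch of how these combinations can be resolved per packet.
All names here (hw_features, l3_proto, l4_proto, sw_csum_work,
software_checksums) are hypothetical; the actual flags in the patch may
be named and structured differently.

```cpp
// Hypothetical hardware-feature flags.
struct hw_features {
    bool tx_csum_ip_offload = false;  // device can compute the IP header checksum
    bool tx_csum_l4_offload = false;  // device can compute the TCP/UDP checksum
};

enum class l3_proto { ip, other };        // other: e.g. ARP
enum class l4_proto { tcp, udp, other };  // other: e.g. ICMP

// Which TCP/UDP and IP header checksums are left for the sending CPU.
struct sw_csum_work {
    bool ip;
    bool l4;
};

sw_csum_work software_checksums(hw_features hw, l3_proto l3, l4_proto l4) {
    sw_csum_work w{false, false};
    if (l3 != l3_proto::ip) {
        return w;  // not an IP packet: neither checksum applies
    }
    w.ip = !hw.tx_csum_ip_offload;
    const bool is_tcp_or_udp = (l4 == l4_proto::tcp || l4 == l4_proto::udp);
    w.l4 = is_tcp_or_udp && !hw.tx_csum_l4_offload;
    return w;
}
```

For a device that offloads only layer 4, a TCP packet still needs its IP
header checksum computed in software, while an ARP packet needs neither.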
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch adds a new class, distributed_device, which is responsible for
initializing the HW device and is shared between all cpus. The old device
class becomes responsible for managing an rx/tx queue pair and is local
to each cpu. Each cpu has to call distributed_device::init_local_queue() to
create its own device. The logic to distribute cpus between available
queues (in case there are not enough queues for each cpu) currently lives
in distributed_device and is not really implemented yet, so only the
single-queue and queues == cpus scenarios are supported for now, but this
can be fixed later.
The plan is to rename "distributed_device" to "device" and "device"
to "queue_pair" in later patches.
Currently each cpu creates a network device as part of native networking
stack creation, and all cpus create their native networking stacks
independently, which makes it impossible to use data initialized by one cpu
in another cpu's network device initialization. For multiqueue devices, some
parts of the initialization often have to be handled by one cpu, and all
other cpus should wait for it before creating their network devices.
Even without multiqueue, proxy devices should be created after the master
device is created so that a proxy device can get a pointer to the master
at creation time (the existing code uses a global per-cpu device pointer and
assumes that the master device is created on cpu 0 to compensate for the
lack of ordering).
This patch makes it possible to delay native networking stack creation
until the network device is created. It allows one cpu to be responsible
for creating network devices on multiple cpus. A single-queue device
initializes the master device on one cpu and calls the other cpus with a
pointer to the master device and its cpu id, which are used in proxy device
creation. This removes the need for the per-cpu device pointer and the
"master on cpu 0" assumption, since the master and slave devices now know
about each other and can communicate directly.
Tell host to interrupt less. This is useful for tx queue completion
since we do not care much when the tx is completed exactly.
Passed test with memcached and tcp_server.
Now that our reactor supports non-file-descriptor notification
mechanisms, switch to using one instead of eventfd when notifying
of virtio interrupts.
This will allow us to change the OSv enable_interrupt() code to
run the handler directly, not in a separate thread, because it
no longer needs to do sleepable write() to an eventfd file descriptor.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Currently there is an implicit unbounded queue between the virtio driver
and the networking stack, where packets may accumulate if they are received
faster than the networking stack can handle them. The queuing happens because
virtio buffer availability is signaled immediately after the received buffer's
promise is fulfilled, but promise fulfilment does not mean that the buffer has
been processed, only that the task that will process it has been placed on a
task queue. The patch fixes the problem by making a virtio buffer available
only after the previous buffer's completion task has executed. This bounds
the aforementioned implicit queue between the virtio driver and the
networking stack by the virtio ring size.
Instead of providing back pressure towards the NIC, which will cause the NIC
to slow down and drop packets, the network stack should itself drop packets
it cannot handle. Otherwise one slow receiver may cause drops for all
others. Our native network stack correctly drops packets instead of
providing feedback, so it is safe to just remove feedback from the API.
As a second option beyond running on Linux with vhost, this patch
allows Seastar to run in OSv with the virtio network device "assigned"
to the application (i.e., we use the virtio rings directly, with no OSv
involvement beyond the initial setup).
To use this feature, one needs to compile Seastar with the "HAVE_OSV"
flag, the osv::assigned_virtio::get() symbol needs to be available
(which means we run under OSv), and it should return a non-null object
(which means OSv was run with --assign-net).
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
The wake_wait() method is only available for the notifier. Expose it
from the vring holding this notifier, and from the rx or tx queue holding
this vring.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Make virtio_net_device an abstract class, and move the vhost-specific
code to a subclass, virtio_net_device_vhost.
In a subsequent patch, we'll have a second subclass, for a virtio
device assigned from OSv.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
In the existing code, virt_to_phys() was a fixed do-nothing function.
This is good for vhost, but not good enough in OSv, where converting
virtual addresses to physical ones requires an actual calculation.
The solution in this patch, using a virtual function, is not optimal
and should probably be replaced with a template later.
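A minimal sketch of the virtual-function approach. The class names are
hypothetical, and the fixed-offset translation is purely illustrative; it
is not how OSv actually maps virtual to physical addresses.

```cpp
#include <cstdint>

// Hypothetical interface: one virtual call per address translation.
class address_translator {
public:
    virtual ~address_translator() = default;
    virtual std::uint64_t virt_to_phys(const void* p) const = 0;
};

// vhost case: the "physical" address is just the virtual address.
class identity_translator final : public address_translator {
public:
    std::uint64_t virt_to_phys(const void* p) const override {
        return reinterpret_cast<std::uintptr_t>(p);
    }
};

// Stand-in for a platform that needs a real calculation. A fixed offset
// is an assumption made only to keep the example short.
class offset_translator final : public address_translator {
    std::uint64_t _offset;
public:
    explicit offset_translator(std::uint64_t offset) : _offset(offset) {}
    std::uint64_t virt_to_phys(const void* p) const override {
        return reinterpret_cast<std::uintptr_t>(p) - _offset;
    }
};
```

Making the translator a template parameter instead would let the compiler
inline the identity case away entirely.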
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Currently, the "vring" class is hardcoded to do guest-host notifications
via eventfd. This patch switches to a general "notification object" with
two virtual functions - host_notify(), which unconditionally notifies the
host, and host_wait() which returns a future<> on which one can wait for
the host to notify us.
This patch provides one implementation of this notification object, using
eventfd as before, as needed when using vhost. We'll later provide a
different implementation for running under OSv.
This patch uses pointers and virtual functions. This adds a bit of
overhead to every notification, but it is small compared to the other
costs of these notifications. Nevertheless, we can change it in the
future to make the notification object a template parameter instead of
an abstract class.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Based on the observation that, with packets comprised of multiple fragments,
vhost_get_vq_desc() goes higher in the CPU profile. Avi suggested that the
current LIFO handling of free descriptors causes contention on cache
lines between seastar and vhost.
Gives a 6-10% boost, depending on hardware.
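A minimal sketch of the alternative to LIFO, assuming the change is to
recycle free descriptors in FIFO order so that a just-released descriptor
is not handed out again immediately. The fifo_free_list class is
hypothetical and is not the driver's actual data structure.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical free list of descriptor indices, recycled in FIFO order.
class fifo_free_list {
    std::deque<std::uint16_t> _free;
public:
    explicit fifo_free_list(std::uint16_t nr_descriptors) {
        for (std::uint16_t i = 0; i < nr_descriptors; ++i) {
            _free.push_back(i);
        }
    }
    bool empty() const { return _free.empty(); }
    // Precondition: !empty(). Hands out the least recently released index.
    std::uint16_t allocate() {
        std::uint16_t d = _free.front();
        _free.pop_front();
        return d;
    }
    // Released descriptors go to the back, so they are reused last.
    void release(std::uint16_t d) { _free.push_back(d); }
};
```

With a LIFO list, allocate, release, allocate returns the same index
twice. Here the second allocation returns a different descriptor.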
Instead of allocating a vector to store the buffers to be destroyed, in the
case of a single buffer, use an ordinary free deleter.
This doesn't currently help much because the packet is share()d later on,
but we may be able to eliminate the sharing one day.