This patch adds a new class, distributed_device, which is responsible for
initializing the HW device and is shared between all cpus. The old device
class's responsibility becomes managing an rx/tx queue pair, and it is
local to each cpu. Each cpu has to call distributed_device::init_local_queue()
to create its own device. The logic to distribute cpus between available
queues (in case there are not enough queues for all cpus) lives in
distributed_device, but it is not really implemented yet, so only the
one-queue and queues == cpus scenarios are supported currently; this can
be fixed later.
The plan is to rename "distributed_device" to "device" and "device"
to "queue_pair" in later patches.
Currently each cpu creates its network device as part of native networking
stack creation, and all cpus create the native networking stack
independently, which makes it impossible to use data initialized by one cpu
in another cpu's networking device initialization. For multiqueue devices,
some parts of the initialization often have to be handled by one cpu, and
all other cpus should wait for the first one before creating their network
devices. Even without multiqueue, proxy devices should be created after the
master device is created, so that a proxy device can get a pointer to the
master at creation time (the existing code uses a global per-cpu device
pointer and assumes that the master device is created on cpu 0 to
compensate for the lack of ordering).
This patch makes it possible to delay native networking stack creation
until the network device is created. It allows one cpu to be responsible
for creating network devices on multiple cpus. A single queue device
initializes the master device on one cpu and calls the other cpus with a
pointer to the master device and its cpu id, which are used in proxy device
creation. This removes the need for the per-cpu device pointer and the
"master on cpu 0" assumption from the code, since now the master device and
the slave devices know about each other and can communicate directly.
When running a tcp stream test with --smp > 1, the server sometimes sends a
TSO frame and sometimes does not; with --smp = 1, the server always sends a
TSO frame. This is because the proxy device does not parse all the features
in the opts. We should copy _hw_features from the real device, but that is
not easy, so for now we simply duplicate the parsing code.
Currently net::proxy::send() waits for the previous packet to be sent to
the other cpu before accepting the next packet, which slows down the sender
too much. This patch implements a virtual queue that allows many packets to
be sent simultaneously, but starts to drop packets when the number of
outstanding packets goes over the limit: if we tried to queue them instead,
we would run out of memory whenever a sender generates packets faster than
they can be sent. It also tries to generate back pressure by returning a
future that becomes ready when the queue goes under the limit, but it
returns one only to the first sender that goes over it. Doing this for all
callers may be an option too, but it is not clear which ones, and how many,
we should wake when the queue goes under the limit again.
Only the tx path is added for now. The rx path will be enabled when the
mergeable receive buffer feature is supported in virtio-net.
UDP fragmentation offload is not hooked up; that will be done in a separate patch.
Only one cpu can talk to the virtio NIC directly (at least without
multiqueue), so the other cpus receive packets from the cpu that drives
virtio and forward packets that need to be sent back through it. This is
done by the proxy net interface.