Unfortunately at_exit() cannot be used to delete objects since when
it runs the reactor is still active and deleted object may still been used.
We need another API that runs its task after reactor is already stopped.
at_destroy() will be such api.
Provide a function that maps packet's rss hash to a cpu that should handle
it. This function is needed to find appropriate src port for outgoing
tcp/udp connection. Use this function to forward de-fragmented ip packet
to avoid one extra hop too.
Some (all?) RSS capable HW provides us with a hash that was used to
select rx queue the packet was delivered to. If such hash is available
it is better to use it to forward packet instead of calculating hash
ourself and suffering cache missed.
This patch introduce a logic to divide cpus between available hw queue
pairs. Each cpu with hw qp gets a set of cpus to distribute traffic
to. The algorithm doesn't take any topology considerations into account yet.
Instead of forward() deciding packet destination make it collect input
for RSS hash function depending on packet type. After data is collected
use toeplitz hash function to calculate packet's destination.
This patch uses the NIC's capability to calculate in hardware the IP, TCP
and UDP checksums on outgoing packets, instead of us doing this on the
sending CPU. This can save us quite a bit of calculations (especially for
the TCP/UDP checksum of full-sized packets), and avoid cache-polution on
the CPU when sending cold data.
On my setup this patch improves the performance of a single-cpu memcached
by 6%. Together with the recent patch for receive-side checksum offloading,
the total improvement is 10%.
This patch is somewhat complicated by the fact we have so many different
combinations of checksum-offloading capabilities; While virtio can only
offload layer-4 checksumming (tcp/udp), dpdk lets us offload both ip and
layer-4 checksum. Moreover, some packets are just IP but not TCP/UDP
(e.g., ICMP), and some packets are not even IP (e.g., ARP), so this
patch modifies a few of the hardware-features flags and the per-packet
offload-information flags to fit our new needs.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch adds new class distributed_device which is responsible for
initializing HW device and it is shared between all cpus. Old device
class responsibility becomes managing rx/tx queue pair and it is local
per cpu. Each cpu have to call distributed_device::init_local_queue() to
create its own device. The logic to distribute cpus between available
queues (in case there is no enough queues for each cpu) is in the
distributed_device currently and not really implemented yet, so only one
queue or queues == cpus scenarios are supported currently, but this can
be fixed later.
The plan is to rename "distributed_device" to "device" and "device"
to "queue_pair" in later patches.
Currently each cpu creates network device as part of native networking
stack creation and all cpus create native networking stack independently,
which makes it impossible to use data initialized by one cpu in another
cpu's networking device initialization. For multiqueue devices often some
parts of an initialization have to be handled by one cpu and all other
cpus should wait for the first one before creating their network devices.
Even without multiqueue proxy devices should be created after master
device is created so that proxy device may get a pointer to the master
at creation time (existing code uses global per cpu device pointer and
assume that master device is created on cpu 0 to compensate for the lack
of ordering).
This patch makes it possible to delay native networking stack creation
until network device is created. It allows one cpu to be responsible
for creation of network devices on multiple cpus. Single queue device
initialize master device on one cpu and call other cpus with a pointer
to master device and its cpu id which are used in proxy device creation.
This removes the need for per cpu device pointer and "master on cpu 0"
assumption from the code since now master device and slave devices know
about each other and can communicate directly.
With TSO enabled, we can see a Ethernet frame larger than 64K on tap
device. This makes wireshark unable to handle. It complains:
The capture file appears to be damaged or corrupt.
(pcapng_read_packet_block: cap_len 65549 is larger than
WTAP_MAX_PACKET_SIZE 65535.)
Only tx path is added for now. Will enable rx path when merge receive
buffer feature is supported in virtio-net.
UDP fragmentation offload is not hooked up, will do in a separate patch.
Classifier returns what cpu a packets should be processed on. It may
return special broadcast identifier. The patch includes classifier for
tcp, udp and arp. Arp classifier broadcasts arp reply to all cpus. Default
classifier does not forward packet.
Will need it later to handle forwarded packets. Also save net::device
pointer in thread local variable to get to device instance easily. When
we ill have more then one device per cpu we will have to change to
something more sophisticated.
If an L3 packet receiver is not able to register itself as a packet receiver
after processing a packet, or if it is simply not dispatched quickly enough,
then we will drop packets.
Add a queue at the protocol layer to buffer those packets.
Add a share() method that enables reference counting for the packet and
returns a clone. The packet's deleter will only be invoked after all clones
are destroyed.
This is useful for tcp, which keeps a packet in the unacknowledged transmit
queue while sending it lower down the stack.
When trimming the tcp header from a tcp packet without any data (such
as a SYN), nothing remains. trim_front() did not account for this,
and crashed.
Fix by checking for the condition.