Some (all?) RSS-capable HW provides us with the hash that was used to
select the rx queue the packet was delivered to. If such a hash is
available, it is better to use it to forward the packet instead of
calculating the hash ourselves and suffering cache misses.
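A minimal sketch of the idea, assuming hypothetical names (packet_meta, rx_hash, forwarding_hash are illustrative, not the actual OSv API): prefer the NIC-provided hash and only fall back to hashing in software.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical per-packet metadata: rx_hash is set only when the
// RSS-capable NIC delivered a hash along with the packet.
struct packet_meta {
    std::optional<uint32_t> rx_hash;
};

// Fallback: compute a hash in software (touches packet headers,
// causing the cache misses the commit message mentions).
uint32_t compute_hash_in_software(uint32_t src, uint32_t dst) {
    return src ^ (dst * 0x9e3779b1u);
}

// Prefer the hardware hash; hash in software only when it is absent.
uint32_t forwarding_hash(const packet_meta& m, uint32_t src, uint32_t dst) {
    if (m.rx_hash) {
        return *m.rx_hash;   // free: already computed by the NIC
    }
    return compute_hash_in_software(src, dst);
}
```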
This patch uses the NIC's capability to calculate the IP, TCP
and UDP checksums of outgoing packets in hardware, instead of us doing
this on the sending CPU. This can save us quite a bit of calculation
(especially for the TCP/UDP checksum of full-sized packets), and avoids
cache pollution on the CPU when sending cold data.
On my setup this patch improves the performance of a single-cpu memcached
by 6%. Together with the recent patch for receive-side checksum offloading,
the total improvement is 10%.
This patch is somewhat complicated by the fact that we have so many
different combinations of checksum-offloading capabilities: while virtio
can only offload the layer-4 checksum (TCP/UDP), dpdk lets us offload both
the IP and layer-4 checksums. Moreover, some packets are IP but not
TCP/UDP (e.g., ICMP), and some packets are not even IP (e.g., ARP), so
this patch modifies a few of the hardware-features flags and the
per-packet offload-information flags to fit our new needs.
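A sketch of how such per-packet flags might be derived; the flag names and the per_packet_flags() helper are illustrative assumptions, not OSv's actual hw_features interface.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical offload flags, one bit per capability.
enum offload_flags : uint32_t {
    OFFLOAD_NONE    = 0,
    OFFLOAD_IP_CSUM = 1u << 0,  // layer-3 checksum (dpdk can do this)
    OFFLOAD_L4_CSUM = 1u << 1,  // TCP/UDP checksum (virtio and dpdk)
};

// Decide the per-packet flags from what the packet is and what the
// hardware can actually do.
uint32_t per_packet_flags(bool is_ip, bool is_tcp_or_udp, uint32_t hw_caps) {
    uint32_t flags = OFFLOAD_NONE;
    if (is_ip) {
        flags |= OFFLOAD_IP_CSUM;       // ICMP etc.: IP header only
        if (is_tcp_or_udp) {
            flags |= OFFLOAD_L4_CSUM;
        }
    }
    // ARP and other non-IP packets get no checksum offload at all.
    return flags & hw_caps;             // request only what the HW supports
}
```

On a virtio-like device (hw_caps == OFFLOAD_L4_CSUM), a TCP packet would still only request the layer-4 checksum, while on dpdk it could request both.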
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
The hand-rolled deleter chaining in packet::append was invalidated
by the make_free_deleter() optimization, since deleter->_next is no longer
guaranteed to be valid (and deleter::operator->() is still exposed, despite
that).
Switch to deleter::append(), which does the right thing.
Fixes a memory leak in tcp_server.
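A simplified stand-in for the real deleter class, just to illustrate why chaining should go through an append() method rather than poking at internal pointers that an optimization may leave unset (the vector-based representation here is an assumption for the sketch, not the actual implementation):

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy deleter: runs its recorded actions on destruction, move-only.
class deleter {
    std::vector<std::function<void()>> _actions;
public:
    deleter() = default;
    explicit deleter(std::function<void()> f) { _actions.push_back(std::move(f)); }
    deleter(deleter&&) = default;
    deleter& operator=(deleter&&) = default;
    deleter(const deleter&) = delete;
    ~deleter() { for (auto& a : _actions) a(); }
    // append(): chain d's work onto this deleter, regardless of either
    // deleter's internal representation -- "the right thing".
    void append(deleter d) {
        for (auto& a : d._actions) _actions.push_back(std::move(a));
        d._actions.clear();
    }
};

// Demonstration: both chained actions run exactly once, so nothing leaks.
int count_chained_frees() {
    int freed = 0;
    {
        deleter d1([&freed] { ++freed; });
        deleter d2([&freed] { ++freed; });
        d1.append(std::move(d2));
    }   // d1 destroyed here: both actions run
    return freed;
}
```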
If a TCP or UDP IP datagram is fragmented, only the first fragment
contains the port information. When a fragment without port information
is received, we have no idea which "stream" the fragment belongs to,
and thus no idea how to forward it.
To solve this problem, we use a "forward twice" method. When a fragment
of an IP datagram is received, we forward it using the
frag_id(src_ip, dst_ip, identification, protocol) hash. When all the
fragments have been received, we forward the reassembled datagram using
the connection_id(src_ip, src_port, dst_ip, dst_port) hash.
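A sketch of the two hash stages (the hash arithmetic here is an arbitrary placeholder, not the actual code); the key property is that every fragment of one datagram hashes identically even before the ports are known:

```cpp
#include <cassert>
#include <cstdint>

// While fragments are still arriving, only (src, dst, id, proto) is
// known, so all fragments of one datagram map to the same value.
uint32_t frag_id_hash(uint32_t src_ip, uint32_t dst_ip,
                      uint16_t identification, uint8_t protocol) {
    uint32_t h = src_ip;
    h = h * 31 + dst_ip;
    h = h * 31 + identification;
    h = h * 31 + protocol;
    return h;
}

// Once the datagram is reassembled, the ports are available too, and
// the packet can be forwarded on its connection's hash.
uint32_t connection_id_hash(uint32_t src_ip, uint16_t src_port,
                            uint32_t dst_ip, uint16_t dst_port) {
    uint32_t h = src_ip;
    h = h * 31 + src_port;
    h = h * 31 + dst_ip;
    h = h * 31 + dst_port;
    return h;
}
```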
Add packet(Iterator, Iterator, deleter).
(Unfortunately we have both a template version with a template parameter
named Deleter, and a non-template version with a parameter called deleter.
We need to sort the naming out.)
Some packets are processed by a CPU other than the one that allocated
them and their fragments. The free_on_cpu() function should be called on
the CPU that does the processing; it returns a packet that is deletable
by the current CPU. This is done by copying the packet/packet::impl to a
locally allocated one and adding a new deleter that runs the old deleter
on the original CPU.
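A toy model of the mechanism described above; run_on_cpu(), the per-cpu queues, and this packet struct are all simplified assumptions, not the real OSv primitives:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Toy per-cpu deferred-work queues (2 "cpus").
static std::vector<std::queue<std::function<void()>>> cpu_queues(2);

void run_on_cpu(int cpu, std::function<void()> f) {
    cpu_queues[cpu].push(std::move(f));   // deferred to that cpu's loop
}

struct packet {
    int owner_cpu;                        // cpu that allocated the data
    std::function<void()> del;            // frees the data; must run on owner_cpu
};

// free_on_cpu(): rewrap the packet so the *current* cpu may delete it;
// the original deleter is bounced back to the owner cpu.
packet free_on_cpu(packet p, int /*current_cpu*/) {
    int owner = p.owner_cpu;
    auto old_del = std::move(p.del);
    p.del = [owner, old_del] { run_on_cpu(owner, old_del); };
    return p;
}

// Demonstration: deleting on cpu 1 defers the real free to cpu 0's queue.
bool demo_deferred_free() {
    bool freed = false;
    packet p{0, [&freed] { freed = true; }};
    p = free_on_cpu(std::move(p), 1);
    p.del();                         // "deleted" on cpu 1
    bool not_yet = !freed;           // real free has not run yet
    cpu_queues[0].front()();         // cpu 0's loop runs the deferred work
    cpu_queues[0].pop();
    return not_yet && freed;
}
```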
Move reference counting into the deleter core, instead of relegating it
to a shared_deleter (which has to be allocated) and an external reference
count (also allocated). This dramatically reduces dynamic allocations.
Instead of using internal_deleter, which is unwieldy, store the
header data inside packet::impl which we're allocating anyway.
This adds some complication when we need to reallocate impl (if
the number of fragments overflows), but usually saves two allocations:
one for the internal_deleter and one for the data itself.
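A minimal sketch of keeping header room inside the impl allocation itself (sizes and names here are assumptions for illustration): prepending a protocol header just carves space out of a preallocated buffer, with no extra malloc until the room runs out.

```cpp
#include <cassert>
#include <cstddef>

// Sketch: reserve header room inside the impl struct we allocate anyway.
struct impl {
    static constexpr std::size_t header_room = 128;
    char headers[header_room];     // headers grow backwards from the end
    std::size_t header_used = 0;

    // Returns a pointer to 'size' fresh bytes just before the previous
    // headers, or nullptr when the caller must fall back to reallocating.
    char* prepend(std::size_t size) {
        if (header_used + size > header_room) {
            return nullptr;
        }
        header_used += size;
        return headers + header_room - header_used;
    }
};

// Demonstration: TCP then IP headers stack up contiguously, and an
// oversized prepend signals the overflow/reallocation case.
bool demo_prepend() {
    impl i;
    char* tcp = i.prepend(20);     // e.g. a TCP header
    char* ip  = i.prepend(20);     // then an IP header, just before it
    return ip != nullptr && ip + 20 == tcp && i.prepend(200) == nullptr;
}
```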
deleter::share() is causing massive amounts of allocation. First,
since usually a packet's deleter is not a shared_deleter, we need to
allocate that shared_deleter. Second, we need an external reference
count, which requires yet another allocation.
Making reference counting part of the deleter class would solve both of
these problems, but we cannot easily do that, since users hold
std::unique_ptr<deleter> which is clearly not sharable.
We could do a massive s/unique_ptr/shared_ptr/ here, but that would have
the side effect of making sharing "too easy" - you simply copy the pointer.
We'd like to keep it explicit.
So to make the change easier, rename the existing unique_ptr<deleter> as
plain "deleter", whereas the old "deleter" becomes deleter::impl:
old name                 new name
--------                 --------
deleter                  deleter::impl
unique_ptr<deleter>      deleter
with exactly the same semantics. A later patch can then add explicit sharing.
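A simplified sketch of the renamed shape (the flag_impl example type is made up for demonstration): the old polymorphic class becomes deleter::impl, and the new value-type deleter wraps a unique_ptr to it with exactly the old unique_ptr<deleter> semantics.

```cpp
#include <cassert>
#include <memory>
#include <utility>

class deleter {
public:
    // What used to be called 'deleter': destruction does the real work.
    struct impl {
        virtual ~impl() = default;
    };
private:
    std::unique_ptr<impl> _impl;
public:
    deleter() = default;
    explicit deleter(impl* i) : _impl(i) {}
    deleter(deleter&&) = default;            // move-only, like unique_ptr
    deleter& operator=(deleter&&) = default;
    explicit operator bool() const { return bool(_impl); }
};

// Example impl that flips a flag when it runs, for demonstration only.
struct flag_impl final : deleter::impl {
    bool* flag;
    explicit flag_impl(bool* f) : flag(f) {}
    ~flag_impl() override { *flag = true; }
};

// Demonstration: moving transfers ownership; the impl runs exactly once.
bool demo_deleter_runs_once() {
    bool ran = false;
    {
        deleter d{new flag_impl(&ran)};
        deleter d2 = std::move(d);           // cheap pointer move, no copy
        (void)d2;
    }                                        // d2 destroyed: impl runs
    return ran;
}
```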
Instead of having an std::vector<> manage the fragment array,
allocate it at the end of the impl struct and manage it manually.
The result isn't pretty but it does remove an allocation.
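A sketch of the layout trick, under simplified assumptions (the struct members and helper names are illustrative): the fragment array lives in the same malloc() block as the impl struct, immediately past it, so no separate vector allocation is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

struct fragment { char* base; std::size_t size; };

struct impl {
    unsigned nr_frags = 0;
    unsigned max_frags = 0;

    // The fragment array is placed just past the struct in the same
    // allocation; compute its address manually.
    fragment* frags() {
        return reinterpret_cast<fragment*>(this + 1);
    }
    static impl* allocate(unsigned max) {
        void* mem = std::malloc(sizeof(impl) + max * sizeof(fragment));
        impl* i = new (mem) impl;
        i->max_frags = max;
        return i;
    }
    static void destroy(impl* i) {
        i->~impl();
        std::free(i);
    }
};

// Demonstration: the array really is contiguous with the struct.
bool demo_inline_frag_array() {
    impl* i = impl::allocate(4);
    i->frags()[i->nr_frags++] = fragment{nullptr, 100};
    bool ok = i->nr_frags == 1 && i->frags()[0].size == 100
              && reinterpret_cast<char*>(i->frags())
                 == reinterpret_cast<char*>(i) + sizeof(impl);
    impl::destroy(i);
    return ok;
}
```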
Move all data fields into an 'impl' struct (pimpl idiom) so that move()ing
a packet becomes very cheap. The downside is that we need an extra
allocation, but we can later recover that by placing the fragment array
in the same structure.
Even with the extra allocation, performance is up ~10%.
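A minimal sketch of the pimpl move described above (the impl contents are a stand-in): moving a packet transfers a single pointer, regardless of how large the implementation struct is.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct packet {
    // All data fields live behind one pointer.
    struct impl {
        std::vector<char> data;          // stands in for fragments, headers...
    };
    std::unique_ptr<impl> _impl;

    explicit packet(std::vector<char> d)
        : _impl(std::make_unique<impl>(impl{std::move(d)})) {}
    packet(packet&&) = default;          // cheap: moves one pointer
    packet& operator=(packet&&) = default;
};

// Demonstration: moving keeps the same impl object, copies nothing.
bool demo_cheap_move() {
    packet p(std::vector<char>(1000, 'x'));
    auto* before = p._impl.get();
    packet q = std::move(p);             // ownership transfer only
    return q._impl.get() == before && p._impl == nullptr
           && q._impl->data.size() == 1000;
}
```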