This patch change tcp to register a poller so that l3 can poll tcp for
a packet instead of pushing packets from tcp to ipv4. This pushes
networking tx path inversion a little bit closer to an application.
L4 will provide the callback to be called by L3 after the packet is
handled to lower layers for transmission. L4 will know that it can queue
more data from user at this point. The patch also change send function
that can no longer block to return void instead of future<>.
The current shared_ptr implementation is efficient, but does not support
polymorphic types.
Rename it in order to make room for a polymorphic shared_ptr.
Provide a function that maps packet's rss hash to a cpu that should handle
it. This function is needed to find appropriate src port for outgoing
tcp/udp connection. Use this function to forward de-fragmented ip packet
to avoid one extra hop too.
Instead of forward() deciding packet destination make it collect input
for RSS hash function depending on packet type. After data is collected
use toeplitz hash function to calculate packet's destination.
Currently dhcp assumes that cpu 0 gets all the packets and redistributes
them by itself. With multiqueue this is not necessary the case, so the
current trick to disable forwarding by installing special dhcp forward()
function will not work. Rework it by installing packet filter on all
cpus before running dhcp and forward all dhcp packets to cpu 0.
This patch uses the NIC's capability to calculate in hardware the IP, TCP
and UDP checksums on outgoing packets, instead of us doing this on the
sending CPU. This can save us quite a bit of calculations (especially for
the TCP/UDP checksum of full-sized packets), and avoid cache-polution on
the CPU when sending cold data.
On my setup this patch improves the performance of a single-cpu memcached
by 6%. Together with the recent patch for receive-side checksum offloading,
the total improvement is 10%.
This patch is somewhat complicated by the fact we have so many different
combinations of checksum-offloading capabilities; While virtio can only
offload layer-4 checksumming (tcp/udp), dpdk lets us offload both ip and
layer-4 checksum. Moreover, some packets are just IP but not TCP/UDP
(e.g., ICMP), and some packets are not even IP (e.g., ARP), so this
patch modifies a few of the hardware-features flags and the per-packet
offload-information flags to fit our new needs.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
ipv4::handle_on_cpu() did not properly convert from network byte order, so
it saw any packets with DF=1 as fragmented.
Fix by applying the proper conversion.
If a TCP or UDP IP datagram is fragmented, only the first fragment will
contain the port information. When a fragment without port information
is received, we have no idea which "stream" this fragment belongs to,
thus we no idea how to forward this packet.
To solve this problem, we use "forward twice" method. When IP datagram
which needs fragmentation is received, we forward it using the
frag_id(src_ip, dst_ip, identification, protocol) hash. When all the
fragments are received, we forward it using the connection_id(src_ip,
src_port, dst_ip, dst_port) hash.
Some packets, like arp replies, are broadcast to all cpus for handling,
but only packet structure is copied for each cpu, the actual packet data
is the same for all of them. Currently networking stack mangles a
packet data during its travel up the stack while doing ntoh()
translations which cannot obviously work for broadcaster packets. This
patches fixes the code to not modify packet data while doing ntoh(), but
do it in a stack allocated copy of a data instead.
Perhaps not the best way to enable "hijacking" the ip stack (for DHCP
querying), but considering the options seems the least intrusive.
Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
The Ethernet frame might contain extra bytes after the IP packet for
padding. Trim the extra bytes in order not to confuse TCP.
E.g. When doing TCP connection:
1) <SYN>
2) <SYN,ACK>
3) <ACK>
Packet 3) should be 14 + 20 + 20 = 54 bytes, the sender might send a
packet of size 60 bytes, containing 6 extra bytes for padding.
Fix httpd on ran/sif.
Classifier returns what cpu a packets should be processed on. It may
return special broadcast identifier. The patch includes classifier for
tcp, udp and arp. Arp classifier broadcasts arp reply to all cpus. Default
classifier does not forward packet.
Currently the send path buffers packets in (unbounded) _tx_queue and
in virtio ring. On queue would suffice though.
This change also prpagates the back pressure resulting from queue-full
condition up the send path. This is needed, becasue otherwise if
senders are faster than the network we will eventually run out of
memory. This would also cause a "buffer bloat" effect, which hurts
latency-sensitive workloads.
When reducing the checksum from a 32-bit or 64-bit intermediate,
we can get an overflow after the first overflow handling step:
0000_8000_8000_ffff
-> 10_ffff
-> 1_000f
-> 0010
Since we lacked the second step, we got an off-by-one in the checksum.