Commit Graph

465 Commits

Author SHA1 Message Date
Gleb Natapov
d4e3cafd10 net: start rx polling only after upper layer is ready to receive 2015-02-12 17:03:22 +02:00
Vlad Zolotarov
21f4c88c85 DPDK: zero_copy_tx - initial attempt
Send packets without copying fragment data:
   - Poll all the Tx descriptors and place them into a circular_buffer.
     We will take them from there when we need to send new packets.
   - The PMD will return the completed buffer descriptors to the Tx mempool.
     This way we know when a buffer may be released.
   - "Move" the packet object into the last segment's descriptor's private data.
     When this fragment completes, the whole packet has been sent and its
     memory may be released, so we do that by calling the packet's
     destructor.

Exceptions:
   - Copy if the hugepages backend is not enabled.
   - Copy when sending in a zero-copy flow failed (e.g. when we failed
     to translate a buffer's virtual address).
   - Copy if the first frag would have to be split below the 128-byte level - this is
     in order to avoid splitting headers.
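The copy-vs-zero-copy rules above can be sketched as a small predicate. This is a minimal illustration only: `frag`, `phys_addr`, and the threshold name are assumptions, not seastar's actual types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical sketch of the copy-vs-zero-copy rules listed above.
// All names here are illustrative, not seastar's actual API.
struct frag {
    const char* base;
    size_t size;
};

using phys_addr = uint64_t;

// Don't split the first fragment below this level, to keep headers whole.
constexpr size_t frag0_no_split_threshold = 128;

bool must_copy(const frag& f, bool is_first_frag, bool hugepages_enabled,
               std::optional<phys_addr> translation) {
    if (!hugepages_enabled) {
        return true;            // no hugepages backend: always copy
    }
    if (!translation) {
        return true;            // virt2phys translation failed
    }
    if (is_first_frag && f.size < frag0_no_split_threshold) {
        return true;            // avoid splitting headers
    }
    return false;               // safe to send zero-copy
}
```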

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

New in v5:
   - NULL -> nullptr across the board.
   - Removed unused macros: MBUF_ZC_PRIVATE() and max_frags_zc.
   - Narrowed the scope of local variables according to Nadav's remarks.
   - tx_buf class:
      - Don't fall back to copy-send for the whole packet if a single fragment failed
        to be sent in a zero-copy manner (e.g. its data failed the virt2phys translation).
        Send only that fragment in a copy way and try to send the rest of the
        fragments in a zero-copy way.
      - Make set_packet() receive packet&&.
      - Fixed the comments in check_frag0(): we check the first 128 bytes, not the
        first 2KB, starting from v2.
      - Use assert() instead of rte_exit() in do_one_frag().
      - Rename in set_one_data_buf() and in copy_one_data_buf(): l -> buf_len
      - Improve the assert about the size of private data in the tx_buf class:
         - Added two MARKER fields at the beginning and at the end of the private fields section
           which are going to be allocated in the mbuf's private data section.
         - Assert on the distance between these two markers.
      - Replace the sanity_check() (which checks that the packet doesn't have a
        zero length) in the copy flow by an assert() in a general function, since
        this check is relevant for both the copy and the zero-copy flows.
      - Rename sanity_check() to the more explicit frag0_check().
      - Make from_packet() receive packet&&.
      - In case frag0_check() fails - copy only the first fragment and
        not the whole packet.
   - tx_buf_factory class:
      - Change the interface to work with tx_buf* instead of tx_buf&.
      - Better utilize for-loop facilities in gc().
      - Kill the extra if() in the init_factory().
      - Use std::deque instead of circular_buffer for storing elements in tx_buf_factory.
      - Optimize the tx_buf_factory::get():
         - First take the completed buffers from the mempool and only if there
           aren't any - take from the factory's cache.
      - Make Tx mempools use a cache: this significantly improves performance despite
        not being the right mempool configuration for a single-producer/single-consumer mode.
      - Remove empty() and size() methods.
   - Add comments near the assert()s in the fast-path.
   - Removed unneeded "inline" qualifiers:
      - There is no need to specify the "inline" qualifier for methods defined
        in-class, including static methods.
      - Defining process_packets() and poll_rx_once() as inline degraded the
        performance by about 1.5%.
   - Added a _tx_gc_poller: it will call tx_buf_factory::gc().
   - Don't check a pointer before calling free().
   - alloc_mempool_xmem(): Use posix_memalign() instead of memalign().
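The marker-based size assert from the tx_buf notes above can be sketched like this. It is only an illustration: `marker` stands in for DPDK's MARKER type, and the private-area size is an assumed value, not the real mbuf configuration.

```cpp
#include <cassert>
#include <cstddef>

// Zero-payload stand-in for DPDK's MARKER type (illustrative).
struct marker {};

// Hypothetical private-data layout: markers bracket the real fields.
struct tx_buf_private {
    marker head;    // start-of-private-fields marker
    void*  pkt;     // e.g. the packet object stored in the mbuf
    marker tail;    // end-of-private-fields marker
};

// Assumed size of the mbuf private data area (illustrative value).
constexpr size_t mbuf_priv_size = 16;

// The assert described in the changelog: the distance between the two
// markers must fit in the mbuf's private data section.
static_assert(offsetof(tx_buf_private, tail) - offsetof(tx_buf_private, head)
                  <= mbuf_priv_size,
              "tx_buf private fields don't fit in the mbuf private area");
```

The advantage of measuring the marker distance, rather than sizeof the whole struct, is that the check keeps working as fields are added or removed between the markers.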

New in v4:
   - Improve the info messages.
   - Simplified the mempool name creation code.
   - configure.py: Opt-out the invalid-offsetof compilation warning.

New in v3:
   - Add missing macros definitions dropped in v2 by mistake.

New in v2:
   - Use Tx mbufs in a LIFO way for better cache utilization.
   - Lower the frag0 non-split thresh to 128 bytes.
   - Use new (iterators) semantics in circular_buffer.
   - Use optional<packet> for storing the packet in the mbuf.
   - Use rte_pktmbuf_alloc() instead of __rte_mbuf_raw_alloc().
   - Introduce tx_buf class:
      - Hide the private rte_mbuf area handling.
      - Hide packet to rte_mbuf cluster translation handling.
   - Introduce a "Tx buffers factory" class:
      - Hide the rte_mbuf flow details:
            mempool->circular_buffer->(PMD->)mempool
   - Templatization:
      - Make huge_pages_mem_backend a dpdk_qp class template parameter.
      - Unite the from_packet_xxx() code into a single template function.
      - Unite the translate_one_frag() and copy_one_frag() into a single
        template function.
2015-02-12 11:04:07 +02:00
Vlad Zolotarov
46b6644c35 DPDK: add a function that returns a number of bytes needed for each QP's mempool objects
This function is needed when we want to estimate the amount of memory to give to DPDK
when we provide a mempool with an external memory buffer.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-11 19:27:12 +02:00
Vlad Zolotarov
82e20564b0 DPDK: Initialize mempools to work with external memory
If seastar is configured to use hugetlbfs, initialize mempools
with an external memory buffer. This way we can better control the overall
memory consumption.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

New in v2:
   - Use char* instead of void* for pointer's arithmetics.
2015-02-11 19:27:12 +02:00
Vlad Zolotarov
d4cddbc3d0 DPDK: Use separate pools for Rx and Tx queues and adjust their sizes
There is no reason for Rx and Tx pools to be of the same size:

Rx pool is 3 times the ring size to give the upper layers some time
to free the Rx buffers before the ring stalls with no buffers.

Tx has entirely different constraints: since it applies back pressure
to the upper layers if the HW doesn't keep up, there is no need to allow more buffers
in flight than the amount we may send in a single rte_eth_tx_burst() call.
Therefore we need 2 times the HW ring size in buffers, since HW may release a whole
ring of buffers in a single rte_eth_tx_burst() call and thus we may be able to
place another whole ring of buffers in the same call.
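The sizing rules above reduce to simple multiples of the ring size. The ring size here is an assumed illustrative value; the real one comes from the device configuration.

```cpp
#include <cassert>

// Illustrative HW ring size; the real value comes from the device config.
constexpr unsigned hw_ring_size = 512;

// Rx: 3x the ring, so upper layers have time to free buffers before the
// ring stalls with no buffers.
constexpr unsigned rx_pool_size = 3 * hw_ring_size;

// Tx: 2x the ring, since one rte_eth_tx_burst() call may complete a whole
// ring of buffers and let us queue another whole ring in the same call.
constexpr unsigned tx_pool_size = 2 * hw_ring_size;
```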

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

New in v4:
   - Fixed the info message.
2015-02-11 19:27:12 +02:00
Avi Kivity
29366cb076 net: add byteorder (ntoh/hton) variants for signed types 2015-02-09 17:07:21 +02:00
Asias He
f0c1bcdb33 tcp: Switch to debug print for persist timer
It is a leftover from development.
2015-02-09 10:58:16 +02:00
Asias He
a192391ac6 tcp: Init timer callback using constructor 2015-02-09 10:58:15 +02:00
Asias He
0ac0e06d32 packet: Linearize after merge
The packet will be merged with the old packet anyway. Linearize after
the merge.
2015-02-09 10:58:15 +02:00
Asias He
abd5a24354 tcp: Implement persist timer
It is used to recover from a race where the sender is waiting for a
window update and the receiver is waiting for the sender to send more,
because somehow the window update carried in the ACK packet is not seen
by the sender.
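A minimal sketch of the persist-timer idea: while the peer advertises a zero window, probe periodically with exponential backoff, so a lost window update cannot deadlock the connection. The struct, field names, and timeout values below are assumptions for illustration, not seastar's actual code.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative persist timer: fires while the peer advertises a zero
// window, sends a 1-byte window probe, and backs off exponentially.
struct persist_timer {
    unsigned timeout_ms = 100;                      // assumed initial value
    static constexpr unsigned max_timeout_ms = 60000;

    // Called on expiry; returns the next timeout to arm.
    unsigned on_fire() {
        // A real implementation would send a window probe here: transmit
        // one byte past the advertised window to force a fresh ACK that
        // carries the receiver's current window size.
        timeout_ms = std::min(timeout_ms * 2, max_timeout_ms);
        return timeout_ms;
    }
};
```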
2015-02-05 17:52:32 +08:00
Asias He
6a468dfd3d packet: Linearize more in packet_merger::merge
This fixes the tcp_server rxrx test on DPDK. The problem is that when we
receive out-of-order packets, we hold the packet in the ooo queue.
We linearize the incoming packet, which copies the packet and
thus frees the old one. However, we missed one case where we need to
linearize. As a result, the original packet was held in the ooo
queue. In DPDK, we have a fixed number of buffers in the rx pool. When all
the dpdk buffers are in the ooo queue, we can no longer receive packets.
So rx hangs, and even ping does not work.
2015-02-05 17:52:32 +08:00
Asias He
4f21d500cb tcp: Do nothing if already in CLOSED state when close
This fixes the following:

Server side:
$ tcp_server

Client side:
$ go run client.go -host 192.168.66.123 -conn 10 -test txtx
$ control-c

At this point, the connection in tcp_server will be in the CLOSED state (reset by
the remote), then tcp_server will call tcp::tcb::close() and wait on
wait_for_all_data_acked(), but no one will signal it. Thus we end up with tons
of leaked connections in the CLOSED state.
2015-02-05 17:52:32 +08:00
Asias He
dd741d11b8 tcp: Fix FIN is not sent in some cases
We call output_one to make sure a packet with FIN is actually generated
and then sent out. If we only call output() and _packetq is not empty,
in tcp::tcb::get_packet() a packet with FIN will not be generated, and thus
we will not send out a FIN.

This can happen when retransmit packets have been queued into _packetq,
then ACK comes which ACK all of the unacked data, then the application
call close() to close the connection.
2015-02-05 17:52:32 +08:00
Asias He
f600e3c902 tcp: Add queued_len
Take the amount of queued data into account when checking whether all
the data has been sent.
2015-02-05 17:52:32 +08:00
Asias He
fca74f9563 tcp: Implement RFC6582 NewReno
We currently have RFC5681, a.k.a Reno TCP, as the congestion control
algorithms: slow start, congestion avoidance, fast retransmit, and fast
recovery. RFC6582 describes a specific algorithm for responding to
partial acknowledgments, referred to as NewReno, to improve Reno.
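NewReno's partial-ACK rule can be compressed into a short sketch: an ACK that advances snd_una but does not cover `recover` (the highest sequence sent when loss was detected) means another segment was lost, so we retransmit the next hole and stay in fast recovery. Field names and the cwnd-deflation details below follow RFC 6582's description but are illustrative, not seastar's implementation.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of NewReno's response to partial acknowledgments (RFC 6582).
// Names are illustrative, not seastar's.
struct newreno_state {
    uint32_t snd_una;   // oldest unacknowledged sequence number
    uint32_t recover;   // highest seq sent when loss was detected
    uint32_t cwnd;      // congestion window, in bytes
    uint32_t smss;      // sender maximum segment size
    bool in_fast_recovery;

    // Returns true if the ACK was a partial ACK handled in-recovery.
    bool on_ack(uint32_t ack, uint32_t newly_acked) {
        snd_una = ack;
        if (!in_fast_recovery) {
            return false;
        }
        if (ack > recover) {
            in_fast_recovery = false;   // full ACK: leave fast recovery
            return false;
        }
        // Partial ACK: a real stack would retransmit the first unacked
        // segment here; then deflate cwnd by the amount newly acked and
        // add back one SMSS, per RFC 6582.
        cwnd = cwnd > newly_acked ? cwnd - newly_acked : 0;
        cwnd += smss;
        return true;
    }
};
```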
2015-02-05 17:45:48 +08:00
Asias He
426938f4ed tcp: Add Limited Transfer per RFC3042 and RFC5681
When RFC3042 is in use, additional data sent in limited transmit MUST
NOT be included in this calculation to update _snd.ssthresh.
2015-02-05 17:05:00 +08:00
Avi Kivity
42bc73a25d dpdk: initialize _tx_burst_idx
Should fix random segfault.
2015-01-29 11:18:54 +02:00
Asias He
0ab01d06ac tcp: Rework segment arrival handling
Follow RFC793 section "SEGMENT ARRIVES".

There are 4 major cases:

1) If the state is CLOSED
2) If the state is LISTEN
3) If the state is SYN-SENT
4) If the state is other state

Note:

- This change is significant (more than 10 pages in RFC793 describing
  this segment arrival handling).
- More testing is needed. The good news is that, so far, the tcp_server
  (ping/txtx/rxrx) tests and httpd work fine.
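The four top-level cases above can be sketched as a dispatch on the connection state; the enum values and handler names here are assumptions for illustration, not the actual seastar code.

```cpp
#include <cassert>

// The four top-level cases of RFC 793 "SEGMENT ARRIVES" (illustrative;
// only a few states are listed).
enum class tcp_state { CLOSED, LISTEN, SYN_SENT, ESTABLISHED, FIN_WAIT_1 };

enum class arrival_case { closed, listen, syn_sent, other };

arrival_case classify_segment_arrival(tcp_state s) {
    switch (s) {
    case tcp_state::CLOSED:   return arrival_case::closed;   // case 1
    case tcp_state::LISTEN:   return arrival_case::listen;   // case 2
    case tcp_state::SYN_SENT: return arrival_case::syn_sent; // case 3
    default:                  return arrival_case::other;    // case 4
    }
}
```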
2015-01-29 10:59:31 +02:00
Gleb Natapov
ada48a5213 net: use iterators to iterate over circular_buffer in dpdk 2015-01-28 13:49:09 +02:00
Gleb Natapov
7a92efe8d1 core: add local engine accessor function
Do not use thread local engine variable directly, but use accessor
instead.
2015-01-27 14:46:49 +02:00
Avi Kivity
d0ec99317d net: move some device and qp methods out-of-line 2015-01-22 09:44:44 +02:00
Avi Kivity
5678a0995e net: use a redirection table to forward packets to proxy queues
Build a 128-entry redirection table to select which cpu services which
packet, when we have more cores than queues (and thus need to dispatch
internally).

Add a --hw-queue-weight to control the relative weight of the hardware queue.
With a weight of 0, the core that services the hardware queue will not
process any packets; with a weight of 1 (default) it will process an equal
share of packets, compared to proxy queues.
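The weighted table build can be sketched as follows: the hardware-queue cpu gets `hw_weight` shares against one share per proxy cpu, and the 128 entries cycle through those shares. Function and parameter names are illustrative, not the actual seastar code, and at least one target (hw share or proxy cpu) is assumed to exist.

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch of a 128-entry redirection table mapping a packet's
// RSS bucket to a servicing cpu, weighted as described above. The caller
// must supply at least one target (hw_weight > 0 or a non-empty proxy list).
std::vector<unsigned> build_redir_table(unsigned hw_cpu,
                                        const std::vector<unsigned>& proxy_cpus,
                                        unsigned hw_weight) {
    std::vector<unsigned> slots;
    for (unsigned i = 0; i < hw_weight; ++i) {
        slots.push_back(hw_cpu);        // hw queue cpu: hw_weight shares
    }
    for (auto c : proxy_cpus) {
        slots.push_back(c);             // each proxy cpu: one share
    }
    std::vector<unsigned> table(128);
    for (unsigned i = 0; i < 128; ++i) {
        table[i] = slots[i % slots.size()];
    }
    return table;
}
```

With a weight of 0 the hw-queue cpu never appears in the table; with the default weight of 1 it takes the same share as each proxy cpu, matching the commit message.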
2015-01-22 09:36:04 +02:00
Asias He
71ac2b5b24 tcp: Rename tcp::send()
Unlike tcp::tcb::send() and tcp::connection::send(), which send tcp
packets associated with a tcb, tcp::send() only sends packets not associated
with any tcb. We have a bunch of send() functions; rename this one to make the
code more readable.
2015-01-21 13:22:40 +02:00
Asias He
917247455c tcp: Use set_exception instead of set_value to notify user on rst 2015-01-21 11:20:06 +02:00
Asias He
8ce7cfd64b tcp: Fix listener port
It is supposed to zero the origin's port.
2015-01-21 11:20:05 +02:00
Asias He
0c09a6bd7a tcp: Return a future for tcp::connect() 2015-01-21 16:20:39 +08:00
Asias He
d6d7e6cb47 tcp: Support syn fin retransmit and timeout
Tested with tcp_server + client.go using iptables dropping <SYN,ACK> or
<FIN,ACK> on client side.

I verified that the SYN or FIN packet is retransmitted and the
connection is closed after N (currently 5) retries.
2015-01-21 16:20:39 +08:00
Asias He
602c7c9c98 tcp: Free packets to be sent on RST 2015-01-21 16:20:39 +08:00
Asias He
56f8ba3303 tcp: Clear more when tear down a connection 2015-01-19 13:32:53 +02:00
Gleb Natapov
23b9605fc4 net: fix virtio bulk sending
Zero net_hdr_mrg for each packet, otherwise wrong flags may be used.
2015-01-18 16:23:01 +02:00
Gleb Natapov
4920f309e8 net: fix dpdk bulk sending
_tx_burst_idx may be zero, but _tx_burst still holds packets from the top
of circular_buffer, so avoid refilling in this case.
2015-01-18 16:23:00 +02:00
Vlad Zolotarov
b8185d5afe DPDK: Set rss_bits to correct value if reta_size is not available
rss_bits should equal the number of bits the HW uses in its RSS calculation.
Use dev_info.max_rx_queues if dev_info.reta_size is not available.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-01-18 12:37:19 +02:00
Gleb Natapov
1ef2808750 reactor: return proper value from pollers
Currently most of them return true regardless of whether any work was
actually done. Fix them to return true only if the poll did any work.
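The contract being fixed can be sketched in a few lines: each poller reports whether it actually did work, so the reactor can idle only when every poller returns false. This is an illustration of the contract, not seastar's reactor API.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Illustrative poll loop: returns true only if at least one poller
// actually performed work, so the caller knows whether it may sleep.
bool poll_once(const std::vector<std::function<bool()>>& pollers) {
    bool did_work = false;
    for (auto& p : pollers) {
        did_work |= p();    // each poller: true only if it processed something
    }
    return did_work;
}
```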
2015-01-18 12:37:05 +02:00
Takuya ASADA
bbe4d3b7d6 net: implemented SO_REUSEPORT support on UDP
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-01-15 17:38:06 +02:00
Asias He
f004db89cf tcp: Make tcp_option get_size and fill more readable 2015-01-15 10:31:46 +02:00
Takuya ASADA
16705be1f4 Distribute incoming connections via the kernel using SO_REUSEPORT
With SO_REUSEPORT, we can bind() & accept() on each thread; the kernel will dispatch incoming connections.
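A minimal sketch of the per-thread listener pattern (Linux >= 3.9; the function name and backlog value are illustrative): every thread creates its own socket, sets SO_REUSEPORT before bind(), and binds to the same port.

```cpp
#include <cstdint>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Create a listening TCP socket on loopback with SO_REUSEPORT, so several
// threads can each bind to the same port and let the kernel balance
// incoming connections among them. Returns -1 on failure.
int make_reuseport_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    int one = 1;
    // Must be set before bind() on every socket sharing the port.
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    listen(fd, 128);
    return fd;
}
```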

Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-01-13 20:04:32 +09:00
Vlad Zolotarov
585f6452b9 DPDK: Properly handle the case when RSS info is not available (e.g. VF case)
- Adjust the asserts.
- Add an assert in the place we should not reach if RSS info is not provided.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
2015-01-12 18:45:48 +02:00
Gleb Natapov
f11e1f31d6 net: cleanup ipv4 code that is no longer used 2015-01-12 17:40:29 +02:00
Gleb Natapov
19ced3da4c net: fix dhcp to use udp socket to send packets
No need for ad-hoc code to create udp packets.
2015-01-12 17:39:07 +02:00
Gleb Natapov
9c229d449a net: remove unused ipv4_l4::send() function
After previous patches this one is no longer used.
2015-01-12 17:38:44 +02:00
Gleb Natapov
d10575aea5 net: add tcp packet queue for non tcb packets
Some packets generated by tcp do not belong to any connection. Currently
such packets are pushed to ipv4 directly. This patch adds a packet queue
for ipv4 to poll them from and limits the amount of memory those packets
can consume.
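A memory-bounded queue of this kind might look like the following sketch; the class, its names, and the byte budget are illustrative assumptions, not seastar's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Illustrative packet stand-in: only the length matters for accounting.
struct packet {
    size_t len;
};

// Bounded queue of non-tcb packets for ipv4 to poll from: enqueue fails
// once the memory budget would be exceeded.
class bounded_packet_queue {
    std::deque<packet> _q;
    size_t _bytes = 0;
    size_t _limit;
public:
    explicit bounded_packet_queue(size_t limit) : _limit(limit) {}

    bool push(packet p) {
        if (_bytes + p.len > _limit) {
            return false;       // over budget: caller drops the packet
        }
        _bytes += p.len;
        _q.push_back(p);
        return true;
    }

    bool pop(packet& out) {
        if (_q.empty()) {
            return false;
        }
        out = _q.front();
        _q.pop_front();
        _bytes -= out.len;      // budget is reclaimed on poll
        return true;
    }
};
```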
2015-01-12 17:35:35 +02:00
Gleb Natapov
fda06cd81f net: continue network stack inversion into icmp
Poll icmp from ipv4 instead of pushing packets from icmp to ipv4. Also
limit how much memory outstanding icmp packets can consume.
2015-01-12 17:34:19 +02:00
Gleb Natapov
6da00ab956 net: continue network stack inversion into udp
Poll udp channels from ipv4 instead of pushing packets from udp to ipv4
2015-01-12 17:34:10 +02:00
Avi Kivity
94ffb2c948 net: add missing includes to byteorder.hh 2015-01-12 14:11:56 +02:00
Gleb Natapov
bef054f8c8 net: rename udp_v4 to ipv4_udp for consistency with other l4 protocols 2015-01-11 12:29:05 +02:00
Gleb Natapov
32b42af49f net: register l3 poller for tcp connections
This patch change tcp to register a poller so that l3 can poll tcp for
a packet instead of pushing packets from tcp to ipv4. This pushes
networking tx path inversion a little bit closer to an application.
2015-01-11 10:48:32 +02:00
Gleb Natapov
d5c309c74e net: provide poller registration API between l3 and l4
Both push and pull methods will be supported between l3 and l4 after
this patch.
2015-01-11 10:17:48 +02:00
Gleb Natapov
2b340b80ce net: unfuturize packet fragmentation
Since sending a single packet no longer involves futures, we can
simplify this code.
2015-01-11 10:17:48 +02:00
Avi Kivity
4c59fe6436 xen: ensure _xenstore member is initialized early enough
Thanks clang.
2015-01-08 18:44:23 +02:00
Avi Kivity
062f621aa0 net: wrap toeplitz hash initializer in more braces
Nagging by clang.
2015-01-08 18:43:49 +02:00