scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 13:37:04 +00:00

Author	SHA1	Message	Date
Vlad Zolotarov	e7e58d446c	DPDK: Explicitly set Rx mempool mbuf_data_room_size This value is passed as an opaque parameter of the rte_pktmbuf_pool_init(). It should equal to a buffer size + RTE_PKTMBUF_HEADROOM. The default value is 2K + RTE_PKTMBUF_HEADROOM. PMD is using this value minus RTE_PKTMBUF_HEADROOM for configuring the Rx data buffers' size when it configures the Rx HW ring. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-03-05 18:39:17 +02:00
Tomasz Grabiec	83963b23d3	Replace rescue() usages with then_wrapped() They are pretty much the same. This change removes rescue().	2015-03-04 17:34:59 +01:00
Asias He	55a43eb5f9	tcp: Improve retransmit Do not store the tcp header in unacked queue. When partial ack of a segment happens, trim off the acked part of a segment. When retransmits, recalculate the tcp header and retransmit only the unacked part.	2015-03-02 12:36:47 +02:00
Avi Kivity	3aaac55a7a	tcp: fix bad checksum on RST packets We neglected to set offload_info::needs_csum on reset packets, resuling in them being ignored by the recipient. Symptoms include connection attempts to closed ports (seastar = passive end) hanging instead the active end understanding the port is closed.	2015-03-01 18:45:35 +02:00
Avi Kivity	27913013b1	virtio: tighten rx packets for debug mode Allocate exactly the available fragment size in order to catch buffer overflows. We get similar behaviour in dpdk, since without huge pages, it must copy the packet into a newly allocated buffer.	2015-03-01 16:42:07 +02:00
Avi Kivity	cefe6a9b3e	tcp: fix option parsing Two bugs: 1. get_header<type>(offset) was used with size given as the offset 2. opt_end failed to account for the mandatory tcp header, and thus was 20 bytes to large, resulting in overflow.	2015-03-01 15:26:46 +02:00
Tomasz Grabiec	1fadd7d608	net: Move ipv4_addr constructor to the source file boost::join() provided by boost/algorithm/string.hpp conflicts with boost::join() from boost/range/join.hpp. It looks like a boost issue but let's not pollute the namespace unnecesssarily. Regarding the change in configure.py, it looks like scollectd.cc is part of the 'core' package, but it needs 'net/api.hh', so I added 'net/net.cc' to core.	2015-02-26 17:34:27 +02:00
Vlad Zolotarov	e752975b08	DPDK: Copy the Rx data into the allocated buffer in a non-decopuled case If data buffers decoupling from the rte_mbuf is not available (hugetlbfs is not available)copy the newly received data into the memory buffer we allocate and build the "packet" object from this buffer. This will allow us returning the rte_mbuf immediately which would solve the same issue the "decoupling" is solving when hugetlbfs is available. The implementation is simplistic (no preallocation, packet data cache alignment, etc.). Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-19 16:59:48 +02:00
Vlad Zolotarov	06565c80d5	DPDK: Decouple Rx data buffers - Allocate the data buffers instead of using the default inline rte_mbuf layout. - Implement an rx_gc() and add an _rx_gc_poller to call it: we will refill the rx mbuf's when at least 64 free buffers. This threshold has been chosen as a sane enough number. - Introduce the mbuf_data_size == 4K. Allocate 4K buffers for a detached flow. We are still going to allocate 2K data buffers for an inline case since 4K buffers would require 2 pages per mbuf due to "mbuf_overhead". Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-19 16:58:54 +02:00
Vlad Zolotarov	a4345fa9bf	DPDK: Rename: mbuf_size -> inline_mbuf_size and mbuf_data_size -> inline_mbuf_data_size Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-19 16:58:52 +02:00
Vlad Zolotarov	e04cb93d2f	DPDK: Use std::vector instead of std::deque for a tx_buf_factory internal cache std::vector is promised to be continuous storage while std::deque is not. In addition std::vector's semantics yields a simpler code than those of deque. Therefore std::vector should deliver a better performance for a stack semantics we need here. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-19 16:58:52 +02:00
Vlad Zolotarov	d22af60057	DPDK: Fix the mempool external buffer size calculation Take into an account the alignment, header and trailer that mempool is adding to the elements. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-19 16:58:51 +02:00
Gleb Natapov	bebefe2afe	net: return reference to hw_feature instead of copying the structure I noticed that tcp::hw_features() is not inlined and copies the structure to a caller. The function takes ~1.5% in httpd profiling.	2015-02-19 16:58:50 +02:00
Avi Kivity	7f8d88371a	Add LICENSE, NOTICE, and copyright headers to all source files. The two files imported from the OSv project retain their original licenses.	2015-02-19 16:52:34 +02:00
Gleb Natapov	c4c5899f89	net: handle arp resolution errors in tcp Pass timeouts up the calling chain and schedule retry if waiter list is too long.	2015-02-18 20:12:08 +02:00
Gleb Natapov	1cfaa7eefe	net: populate dpdk redirection table even if there is only one queue tcp::connect() uses redirection table to figure out what queue will handle a connection.	2015-02-18 16:52:56 +02:00
Vlad Zolotarov	1934160549	DPDK: Add TSO support - tcp.hh: Properly calculate the pseudo-header in the TSO case: it should be calculated as if ip_len is zero. - Enable TSO in the DPDK network backend. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-17 12:47:13 +02:00
Tomasz Grabiec	c11de0476e	net: Add overload of ntoh()/hton() for int8_t/uint8_t They're no-op but make templating easier.	2015-02-16 20:26:36 +02:00
Vlad Zolotarov	b8cc243b17	tcp: Pass the correct value of TSO segment size downstream Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-15 19:11:06 +02:00
Vlad Zolotarov	d82efca3a8	DPDK: Use std::unique_ptr for storring _xmem blobs Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-12 18:54:46 +02:00
Vlad Zolotarov	95bf98977d	DPDK: Recover the DPDK 1.7.x support - Define MARKER type if not defined. - Adjust the Tx zero-copy to the rte_mbuf layout in DPDK 1.7.x. - README.md: - Bump up the DPDK latest version to 1.8.0. - Add a new DPDK configuration description. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-12 18:54:05 +02:00
Gleb Natapov	d4e3cafd10	net: start rx polling only after upper layer is ready to receive	2015-02-12 17:03:22 +02:00
Vlad Zolotarov	21f4c88c85	DPDK: zero_copy_tx - initial attempt Send packets without copying fragments data: - Poll all the Tx descriptors and place them into a circular_buffer. We will take them from there when we need to send new packets. - PMD will return the completed buffers descriptors to the Tx mempool. This way we are going to know that we may release the buffer. - "move" the packet object into the last segment's descriptor's private data. When this fragment is completed means the whole packet has been sent and its memory may be released. So, we will do it by calling the packet's destructor. Exceptions: - Copy if hugepages backend is not enabled. - Copy when we failed to send in a zero-copy flow (e.g. when we failed to translate a buffer virtual address). - Copy if first frag requires fragmentation below 128 bytes level - this is in order to avoid headers splitting. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> New in v5: - NULL -> nullptr across the board. - Removed unused macros: MBUF_ZC_PRIVATE() and max_frags_zc. - Improved the local variables localization according to Nadav's remarks. - tx_buf class: - Don't regress the whole packet to the copy-send if a single fragment failed to be sent in a zero-copy manner (e.g. its data failed the virt2phys translation). Send only such a fragment in a copy way and try to send the rest of the fragments in a zero-copy way. - Make set_packet() receive packet&&. - Fixed the comments in check_frag0(): we check first 128 bytes and not first 2KB. starting from v2. - Use assert() instead of rte_exit() in do_one_frag(). - Rename in set_one_data_buf() and in copy_one_data_buf(): l -> buf_len - Improve the assert about the size of private data in the tx_buf class: - Added two MARKER fields at the beginning and at the end of the private fields section which are going to be allocated on the mbuf's private data section. - Assert on the distance between these two markers. - Replace the sanity_check() (checks that packet doesn't have a zero-length) in a copy-flow by an assert() in a general function since this check is relevant both for a copy and for a zero-copy flows. - Make a sanity_check to be explicitly called frag0_check. - Make from_packet() receive packet&&. - In case frag0_check() fails - copy only the first fragment and not the whole packet. - tx_buf_factory class: - Change the interface to work with tx_buf* instead of tx_buf&. - Better utilize for-loop facilities in gc(). - Kill the extra if() in the init_factory(). - Use std::deque instead of circular_buffer for storing elements in tx_buf_factory. - Optimize the tx_buf_factory::get(): - First take the completed buffers from the mempool and only if there aren't any - take from the factory's cache. - Make Tx mempools using cache: this significantly improves the performance despite the fact that it's not the right mempool configuration for a single-producer+single-consumer mode. - Remove empty() and size() methods. - Add comments near the assert()s in the fast-path. - Removed the not-needed "inline" qualifiers: - There is no need to specify "inline" qualifier for in-class defined methods INCLUDING static methods. - Defining process_packets() and poll_rx_once() as inline degraded the performance by about 1.5%. - Added a _tx_gc_poller: it will call tx_buf_factory::gc(). - Don't check a pointer before calling free(). - alloc_mempool_xmem(): Use posix_memalign() instead of memalign(). New in v4: - Improve the info messages. - Simplified the mempool name creation code. - configure.py: Opt-out the invalid-offsetof compilation warning. New in v3: - Add missing macros definitions dropped in v2 by mistake. New in v2: - Use Tx mbufs in a LIFO way for better cache utilization. - Lower the frag0 non-split thresh to 128 bytes. - Use new (iterators) semantics in circular_buffer. - Use optional<packet> for storing the packing in the mbuf. - Use rte_pktmbuf_alloc() instead of __rte_mbuf_raw_alloc(). - Introduce tx_buf class: - Hide the private rte_mbuf area handling. - Hide packet to rte_mbuf cluster translation handling. - Introduce a "Tx buffers factory" class: - Hide the rte_mbuf flow details: mempool->circular_buffer->(PMD->)mempool - Templatization: - Make huge_pages_mem_backend a dpdk_qp class template parameter. - Unite the from_packet_xxx() code into a single template function. - Unite the translate_one_frag() and copy_one_frag() into a single template function.	2015-02-12 11:04:07 +02:00
Vlad Zolotarov	46b6644c35	DPDK: add a function that returns a number of bytes needed for each QP's mempool objects This function is needed when we want to estimate a number of memory we want to give to DPDK when we can provide a mempool an external memory buffer. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-11 19:27:12 +02:00
Vlad Zolotarov	82e20564b0	DPDK: Initialize mempools to work with external memory If seastar is configured to use hugetlbfs initialize mempools with external memory buffer. This way we are going to better control the overall memory consumption. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> New in v2: - Use char* instead of void* for pointer's arithmetics.	2015-02-11 19:27:12 +02:00
Vlad Zolotarov	d4cddbc3d0	DPDK: Use separate pools for Rx and Tx queues and adjust their sizes There is no reason for Rx and Tx pools to be of the same size: Rx pool is 3 times the ring size to give the upper layers some time to free the Rx buffers before the ring stalls with no buffers. Tx has absolutely different constraints: since it provides a back pressure to the upper layers if HW doesn't keep up there is no need to allow more buffers in the air than the amount we may send in a single rte_eth_tx_burst() call. Therefore we need 2 times HW ring size buffers since HW may release the whole ring of buffers in a single rte_eth_tx_burst() call and thus we may be able to place another whole ring of buffers in the same call. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> New in v4: - Fixed the info message.	2015-02-11 19:27:12 +02:00
Avi Kivity	29366cb076	net: add byteorder (ntoh/hton) variants for signed types	2015-02-09 17:07:21 +02:00
Asias He	f0c1bcdb33	tcp: Switch to debug print for persist timer It is a left over during development.	2015-02-09 10:58:16 +02:00
Asias He	a192391ac6	tcp: Init timer callback using constructor	2015-02-09 10:58:15 +02:00
Asias He	0ac0e06d32	packet: Linearize after merge The packet will be merged with the old packet anyway. Linearize after the merge.	2015-02-09 10:58:15 +02:00
Asias He	abd5a24354	tcp: Implement persist timer It is used to recover from a race where the sender is waiting for a window update and the receiver is waiting for the sender to send more, because somehow the window update carried in the ACK packet is not seen by the sender.	2015-02-05 17:52:32 +08:00
Asias He	6a468dfd3d	packet: Linearize more in packet_merger::merge This fix tcp_server rxrx test on DPDK. The problem is that when we receive out of order packets, we will hold the packet in the ooo queue. We do linearize on the incoming packet which will copy the packet and thus free the old packet. However, we missed one case where we need to linearize. As a result, the original packet will be held in the ooo queue. In DPDK, we have fixed buffer in the rx pool. When all the dpdk buffer are in ooo queue, we will not be able to receive further packets. So rx hangs, even ping will not work.	2015-02-05 17:52:32 +08:00
Asias He	4f21d500cb	tcp: Do nothing if already in CLOSED state when close This fix the following: Server side: $ tcp_server Client side: $ go run client.go -host 192.168.66.123 -conn 10 -test txtx $ control-c At this time, connection in tcp_server will be in CLOSED state (reset by the remote), then tcp_server will call tcp::tcb::close() and wait for wait_for_all_data_acked(), but no one will signal it. Thus we have tons of leaked connection in CLOSED state.	2015-02-05 17:52:32 +08:00
Asias He	dd741d11b8	tcp: Fix FIN is not sent in some cases We call output_one to make sure a packet with FIN is actually generated and then sent out. If we only call output() and _packetq is not empty, in tcp::tcb::get_packet(), packet with FIN will not be generated, thus we will not send out a FIN. This can happen when retransmit packets have been queued into _packetq, then ACK comes which ACK all of the unacked data, then the application call close() to close the connection.	2015-02-05 17:52:32 +08:00
Asias He	f600e3c902	tcp: Add queued_len Take the number of queued data into account when checking if all the data is sent.	2015-02-05 17:52:32 +08:00
Asias He	fca74f9563	tcp: Implement RFC6582 NewReno We currently have RFC5681, a.k.a Reno TCP, as the congestion control algorithms: slow start, congestion avoidance, fast retransmit, and fast recovery. RFC6582 describes a specific algorithm for responding to partial acknowledgments, referred to as NewReno, to improve Reno.	2015-02-05 17:45:48 +08:00
Asias He	426938f4ed	tcp: Add Limited Transfer per RFC3042 and RFC5681 When RFC3042 is in use, additional data sent in limited transmit MUST NOT be included in this calculation to update _snd.ssthresh.	2015-02-05 17:05:00 +08:00
Avi Kivity	42bc73a25d	dpdk: initialize _tx_burst_idx Should fix random segfault.	2015-01-29 11:18:54 +02:00
Asias He	0ab01d06ac	tcp: Rework segment arrival handling Follow RFC793 section "SEGMENT ARRIVES". There are 4 major cases: 1) If the state is CLOSED 2) If the state is LISTEN 3) If the state is SYN-SENT 4) If the state is other state Note: - This change is significant (more than 10 pages in RFC793 describing this segment arrival handling). - More test is needed. Good news is, so far, tcp_server(ping/txtx/rxrx) tests and httpd work fine.	2015-01-29 10:59:31 +02:00
Gleb Natapov	ada48a5213	net: use iterators to iterate over circular_buffer in dpdk	2015-01-28 13:49:09 +02:00
Gleb Natapov	7a92efe8d1	core: add local engine accessor function Do not use thread local engine variable directly, but use accessor instead.	2015-01-27 14:46:49 +02:00
Avi Kivity	d0ec99317d	net: move some device and qp methods out-of-line	2015-01-22 09:44:44 +02:00
Avi Kivity	5678a0995e	net: use a redirection table to forward packets to proxy queues Build a 128-entry redirection table to select which cpu services which packet, when we have more cores than queues (and thus need to dispatch internally). Add a --hw-queue-weight to control the relative weight of the hardware queue. With a weight of 0, the core that services the hardware queue will not process any packets; with a weight of 1 (default) it will process an equal share of packets, compared to proxy queues.	2015-01-22 09:36:04 +02:00
Asias He	71ac2b5b24	tcp: Rename tcp::send() Unlike tcp::tcb::send() and tcp::connection::send() which send tcp packets associated with tcb, tcp::send() only send packets associated without tcb. We have a bunch of send() functions, rename it to make the code more readable.	2015-01-21 13:22:40 +02:00
Asias He	917247455c	tcp: Use set_exception instead of set_value to notify user on rst	2015-01-21 11:20:06 +02:00
Asias He	8ce7cfd64b	tcp: Fix listener port It is supposed to zero the origin's port.	2015-01-21 11:20:05 +02:00
Asias He	0c09a6bd7a	tcp: Return a future for tcp::connect()	2015-01-21 16:20:39 +08:00
Asias He	d6d7e6cb47	tcp: Support syn fin retransmit and timeout Tested with tcp_server + client.go using iptables dropping <SYN,ACK> or <FIN,ACK> on client side. I verified that the SYN or FIN packet is retransmitted and the connection is closed after N (currently 5) retries.	2015-01-21 16:20:39 +08:00
Asias He	602c7c9c98	tcp: Free packets to be sent on RST	2015-01-21 16:20:39 +08:00
Asias He	56f8ba3303	tcp: Clear more when tear down a connection	2015-01-19 13:32:53 +02:00

1 2 3 4 5 ...

486 Commits