scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-19 16:15:07 +00:00

Author	SHA1	Message	Date
Gleb Natapov	d4e3cafd10	net: start rx polling only after upper layer is ready to receive	2015-02-12 17:03:22 +02:00
Avi Kivity	a258f290b5	seawreck: fix include	2015-02-12 14:43:12 +02:00
Avi Kivity	ebc2ebbf12	Upgrade http_client to an application, not a test and rename it to 'seawreck', after wrk.	2015-02-12 14:21:44 +02:00
Avi Kivity	9f87d5bc34	Merge branch 'zero-copy-tx-20' of github.com:cloudius-systems/seastar-dev dpdk zero-copy tx, from Vlad: "This patch series introduces zero-copy Tx with DPDK networking backend: - Split the dpdk_qp mempool into separate pools for Rx and Tx queues. - Configure the dpdk_qp mempools to use external memory buffer when we can ensure pinning and virt2phys translation (currently only when running on top of hugetlbfs). - Properly divide the memory between seastar and DPDK when running on top of hugetlbfs. - Tx zero-copy itself. See more details in the PATCH7 description."	2015-02-12 11:56:46 +02:00
Vlad Zolotarov	21f4c88c85	DPDK: zero_copy_tx - initial attempt Send packets without copying fragments data: - Poll all the Tx descriptors and place them into a circular_buffer. We will take them from there when we need to send new packets. - PMD will return the completed buffers descriptors to the Tx mempool. This way we are going to know that we may release the buffer. - "move" the packet object into the last segment's descriptor's private data. When this fragment is completed means the whole packet has been sent and its memory may be released. So, we will do it by calling the packet's destructor. Exceptions: - Copy if hugepages backend is not enabled. - Copy when we failed to send in a zero-copy flow (e.g. when we failed to translate a buffer virtual address). - Copy if first frag requires fragmentation below 128 bytes level - this is in order to avoid headers splitting. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> New in v5: - NULL -> nullptr across the board. - Removed unused macros: MBUF_ZC_PRIVATE() and max_frags_zc. - Improved the local variables localization according to Nadav's remarks. - tx_buf class: - Don't regress the whole packet to the copy-send if a single fragment failed to be sent in a zero-copy manner (e.g. its data failed the virt2phys translation). Send only such a fragment in a copy way and try to send the rest of the fragments in a zero-copy way. - Make set_packet() receive packet&&. - Fixed the comments in check_frag0(): we check first 128 bytes and not first 2KB. starting from v2. - Use assert() instead of rte_exit() in do_one_frag(). - Rename in set_one_data_buf() and in copy_one_data_buf(): l -> buf_len - Improve the assert about the size of private data in the tx_buf class: - Added two MARKER fields at the beginning and at the end of the private fields section which are going to be allocated on the mbuf's private data section. - Assert on the distance between these two markers. 
- Replace the sanity_check() (checks that packet doesn't have a zero-length) in a copy-flow by an assert() in a general function since this check is relevant both for a copy and for a zero-copy flows. - Make a sanity_check to be explicitly called frag0_check. - Make from_packet() receive packet&&. - In case frag0_check() fails - copy only the first fragment and not the whole packet. - tx_buf_factory class: - Change the interface to work with tx_buf* instead of tx_buf&. - Better utilize for-loop facilities in gc(). - Kill the extra if() in the init_factory(). - Use std::deque instead of circular_buffer for storing elements in tx_buf_factory. - Optimize the tx_buf_factory::get(): - First take the completed buffers from the mempool and only if there aren't any - take from the factory's cache. - Make Tx mempools using cache: this significantly improves the performance despite the fact that it's not the right mempool configuration for a single-producer+single-consumer mode. - Remove empty() and size() methods. - Add comments near the assert()s in the fast-path. - Removed the not-needed "inline" qualifiers: - There is no need to specify "inline" qualifier for in-class defined methods INCLUDING static methods. - Defining process_packets() and poll_rx_once() as inline degraded the performance by about 1.5%. - Added a _tx_gc_poller: it will call tx_buf_factory::gc(). - Don't check a pointer before calling free(). - alloc_mempool_xmem(): Use posix_memalign() instead of memalign(). New in v4: - Improve the info messages. - Simplified the mempool name creation code. - configure.py: Opt-out the invalid-offsetof compilation warning. New in v3: - Add missing macros definitions dropped in v2 by mistake. New in v2: - Use Tx mbufs in a LIFO way for better cache utilization. - Lower the frag0 non-split thresh to 128 bytes. - Use new (iterators) semantics in circular_buffer. - Use optional<packet> for storing the packet in the mbuf. 
- Use rte_pktmbuf_alloc() instead of __rte_mbuf_raw_alloc(). - Introduce tx_buf class: - Hide the private rte_mbuf area handling. - Hide packet to rte_mbuf cluster translation handling. - Introduce a "Tx buffers factory" class: - Hide the rte_mbuf flow details: mempool->circular_buffer->(PMD->)mempool - Templatization: - Make huge_pages_mem_backend a dpdk_qp class template parameter. - Unite the from_packet_xxx() code into a single template function. - Unite the translate_one_frag() and copy_one_frag() into a single template function.	2015-02-12 11:04:07 +02:00
Asias He	51adb20bda	tests: Add http_client It is based on tcp_client and works with our httpd server. 1) timer based, to run the test for 10 seconds $ http_client --server 192.168.66.100:10000 --conn 100 --duration 10 --smp 2 ========== http_client ============ Server: 192.168.66.100:10000 Connections: 100 Requests/connection: dynamic (timer based) Requests on cpu 0: 33400 Requests on cpu 1: 33368 Total cpus: 2 Total requests: 66768 Total time: 10.011478 Requests/sec: 6669.145442 ========== done ============ 2) nr of reqs per connection based, to run the test with 100 connections each has to run 1000 reqs $ http_client --server 192.168.66.100:10000 --conn 100 --reqs 1000 --smp 2 ========== http_client ============ Server: 192.168.66.100:10000 Connections: 100 Requests/connection: 1000 Requests on cpu 0: 50000 Requests on cpu 1: 50000 Total cpus: 2 Total requests: 100000 Total time: 15.002731 Requests/sec: 6665.453192 ========== done ============ This patch is based on Shlomi's initial version. Signed-off-by: Shlomi Livne <shlomi@cloudius-systems.com> Signed-off-by: Asias He <asias@cloudius-systems.com>	2015-02-12 10:02:48 +02:00
Raphael S. Carvalho	20151b7b2a	memcached: capture port by value Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-02-12 10:00:37 +02:00
Vlad Zolotarov	4d0f2d3e4c	DPDK_RTE: Give rte_eal_init() -m parameter when we use hugetlbfs When we use hugetlbfs we will give mempools external buffer for allocations but the mempool internals still need memory. We will assume that each CPU core is going to have a HW QP ("worst" case) and provide the DPDK with enough memory to be able to allocate them all. The memory above is subtracted from the total amount of memory given to the application (with -m seastar application parameter). Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-11 19:27:12 +02:00
Vlad Zolotarov	46b6644c35	DPDK: add a function that returns a number of bytes needed for each QP's mempool objects This function is needed when we want to estimate the amount of memory to give to DPDK when we can provide a mempool with an external memory buffer. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-11 19:27:12 +02:00
Vlad Zolotarov	82e20564b0	DPDK: Initialize mempools to work with external memory If seastar is configured to use hugetlbfs initialize mempools with external memory buffer. This way we are going to better control the overall memory consumption. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> New in v2: - Use char* instead of void* for pointer's arithmetics.	2015-02-11 19:27:12 +02:00
Vlad Zolotarov	d4cddbc3d0	DPDK: Use separate pools for Rx and Tx queues and adjust their sizes There is no reason for Rx and Tx pools to be of the same size: Rx pool is 3 times the ring size to give the upper layers some time to free the Rx buffers before the ring stalls with no buffers. Tx has absolutely different constraints: since it provides a back pressure to the upper layers if HW doesn't keep up there is no need to allow more buffers in the air than the amount we may send in a single rte_eth_tx_burst() call. Therefore we need 2 times HW ring size buffers since HW may release the whole ring of buffers in a single rte_eth_tx_burst() call and thus we may be able to place another whole ring of buffers in the same call. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com> New in v4: - Fixed the info message.	2015-02-11 19:27:12 +02:00
Vlad Zolotarov	18f35236db	memory: Move page_size, page_bits and huge page size definitions to header They are going to be used in more places (not just in memory.cc). Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-02-11 19:27:12 +02:00
Avi Kivity	3f848c5714	Merge branch 'file' Add an adapter from our block-based files to our character stream interface, input_stream, and a test program demonstrating their use.	2015-02-11 17:45:13 +02:00
Avi Kivity	64930bc610	tests: add linecount tests Demonstrates and tests file_input_stream.	2015-02-11 15:38:51 +02:00
Avi Kivity	d7eb4e96fb	app-template: add support for positional options Example: app_template app; namespace bpo = boost::program_options; app.add_positional_options({ { "file", bpo::value<std::string>(), "File to process", 1 }, });	2015-02-11 15:38:51 +02:00
Avi Kivity	af0bf06836	core: add file_data_source, file_input_stream Implement a character stream backed by a file.	2015-02-11 15:38:51 +02:00
Avi Kivity	d31de31aac	core: add input_stream::reset() Useful for seekable streams, to drop existing buffered data.	2015-02-11 15:38:49 +02:00
Raphael S. Carvalho	c725014614	memcached: add option to listen on a different port useful when testing multiple memcached servers. Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-02-10 19:27:43 +02:00
Avi Kivity	2dadcdc5e7	core: make some data_source internals available to derived classes Useful for adding functionality such as seekable streams.	2015-02-10 19:00:45 +02:00
Avi Kivity	381814aeaf	stream.hh: add missing include	2015-02-10 18:59:38 +02:00
Avi Kivity	951a93a534	file.hh: add missing include	2015-02-10 18:59:16 +02:00
Tomasz Grabiec	10e58e0cda	tests: Make test runner catch and forward exceptions thrown directly from task	2015-02-10 14:47:42 +02:00
Tomasz Grabiec	85c67001dd	tests: Add test for exceptions thrown from do_until()	2015-02-10 14:47:42 +02:00
Tomasz Grabiec	331d5e1569	core: Fail do_until() future when the callback throws Otherwise we will abandon the result promise, which results in abort.	2015-02-10 14:47:42 +02:00
Avi Kivity	ee58c77008	httpd: fix unbounded memory use in error handling httpd uses recursion for its read loop: future<> read() { _read_buf.consume().then([] { ... if more work: return read(); }); } However, after error handling was added, it looks like this: future<> read() { _read_buf.consume().then([] { ... if more work: return read(); }).rescue(...); } The problem is that rescue() is called for every iteration of the loop, instead of for the loop in its entirety. This means that a rescue continuation is allocated for every processed request, but they will only be called after the entire loop terminates. This results in tons of allocated memory. Fix by moving error handling to the end of the loop (and incidentally using do_until() instead of recursion).	2015-02-10 12:00:32 +02:00
Avi Kivity	29366cb076	net: add byteorder (ntoh/hton) variants for signed types	2015-02-09 17:07:21 +02:00
Asias He	f0c1bcdb33	tcp: Switch to debug print for persist timer It is a leftover from development.	2015-02-09 10:58:16 +02:00
Asias He	a192391ac6	tcp: Init timer callback using constructor	2015-02-09 10:58:15 +02:00
Asias He	0ac0e06d32	packet: Linearize after merge The packet will be merged with the old packet anyway. Linearize after the merge.	2015-02-09 10:58:15 +02:00
Raphael S. Carvalho	bf41da8974	core: small optimization when constructing std::vector<cpu> Size of std::vector<cpu> can be pre-determined, then let's reserve memory ahead of time so that push back calls would be optimized. Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-02-08 19:05:45 +02:00
Avi Kivity	7a704f7a40	sstring: fix truncation in compare() If the difference between the sizes of the two strings is larger than can be represented by an int, truncation will occur and the sign of the result is undefined. Fix by using explicit tests and return values.	2015-02-08 11:41:22 +02:00
Pekka Enberg	9a55e9fd22	sstring: Add 'compare' and 'operator<' Add string comparison functions to basic_sstring that are required for C++ containers such as std::map and std::multimap. Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>	2015-02-08 11:12:31 +02:00
Pekka Enberg	fc7cb5ab5e	shared_ptr: Fix assignment of polymorphic types Fix the assignment operator to work with polymorphic types. Suggested by Nadav. Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>	2015-02-08 10:24:21 +02:00
Tomasz Grabiec	f948ee79bd	test.py: Add --name filtering option	2015-02-08 10:09:29 +02:00
Tomasz Grabiec	ead03f1b08	test.py: Add --mode parameter for filtering tests	2015-02-08 10:09:29 +02:00
Avi Kivity	4b28eb638f	Merge branch 'asias/tcp_v1' of github.com:cloudius-systems/seastar-dev tcp queue from Asias: "Contains both fixes and improvements".	2015-02-07 20:20:57 +02:00
Raphael S. Carvalho	2195f77879	memcached: stats: rename evicted to evictions Change for compliance with stock memcached. Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-02-07 19:57:19 +02:00
Nadav Har'El	a9ef189a54	core: add support for enum types as hash-table keys This patch adds a header file, "core/enum.hh". Code which includes this header file will be able to use an enumerated type as the key in a hash table. The header file implements a hash function for all enumerated types, by using the standard hash function of the underlying integer type.	2015-02-07 12:33:48 +02:00
Asias He	abd5a24354	tcp: Implement persist timer It is used to recover from a race where the sender is waiting for a window update and the receiver is waiting for the sender to send more, because somehow the window update carried in the ACK packet is not seen by the sender.	2015-02-05 17:52:32 +08:00
Asias He	6a468dfd3d	packet: Linearize more in packet_merger::merge This fixes the tcp_server rxrx test on DPDK. The problem is that when we receive out-of-order packets, we hold them in the ooo queue. We linearize the incoming packet, which copies the packet and thus frees the old one. However, we missed one case where we need to linearize. As a result, the original packet is held in the ooo queue. In DPDK, we have a fixed number of buffers in the rx pool. When all the DPDK buffers are in the ooo queue, we cannot receive further packets. So rx hangs, and even ping does not work.	2015-02-05 17:52:32 +08:00
Asias He	4f21d500cb	tcp: Do nothing if already in CLOSED state when close This fixes the following: Server side: $ tcp_server Client side: $ go run client.go -host 192.168.66.123 -conn 10 -test txtx $ control-c At this time, the connection in tcp_server will be in CLOSED state (reset by the remote), then tcp_server will call tcp::tcb::close() and wait for wait_for_all_data_acked(), but no one will signal it. Thus we have tons of leaked connections in CLOSED state.	2015-02-05 17:52:32 +08:00
Asias He	dd741d11b8	tcp: Fix FIN is not sent in some cases We call output_one to make sure a packet with FIN is actually generated and then sent out. If we only call output() and _packetq is not empty, in tcp::tcb::get_packet(), a packet with FIN will not be generated, thus we will not send out a FIN. This can happen when retransmitted packets have been queued into _packetq, then an ACK arrives that acknowledges all of the unacked data, and then the application calls close() to close the connection.	2015-02-05 17:52:32 +08:00
Asias He	f600e3c902	tcp: Add queued_len Take the amount of queued data into account when checking if all the data is sent.	2015-02-05 17:52:32 +08:00
Asias He	fca74f9563	tcp: Implement RFC6582 NewReno We currently have RFC5681, a.k.a Reno TCP, as the congestion control algorithms: slow start, congestion avoidance, fast retransmit, and fast recovery. RFC6582 describes a specific algorithm for responding to partial acknowledgments, referred to as NewReno, to improve Reno.	2015-02-05 17:45:48 +08:00
Asias He	426938f4ed	tcp: Add Limited Transmit per RFC3042 and RFC5681 When RFC3042 is in use, additional data sent in limited transmit MUST NOT be included in this calculation to update _snd.ssthresh.	2015-02-05 17:05:00 +08:00
Asias He	2289b03354	httpd: Fix RST handling I found wrk sometimes sends RST instead of a FIN to close a connection. In this case, we will reset the connection and go to CLOSED state. However httpd will not delete this, so we will have leaked connections in CLOSED state. Fix by handling the exception and sending an empty response as we do in EOF case. Here we do not pass the exception to upper layer again, otherwise httpd will be very noisy.	2015-02-05 16:57:58 +08:00
Gleb Natapov	89763c95c9	core: optimise timer completions vs periodic timers The way periodic timers are rearmed during timer completion causes timer_settime() to be called twice for each periodic timer completion: once during rearm and second time by enable_fn(). Fix it by providing another function that only re-adds the timer into the timers container, but does not call timer_settime().	2015-01-29 12:43:28 +02:00
Avi Kivity	94e01e6d0e	tests: exit after timertest ends	2015-01-29 12:24:03 +02:00
Avi Kivity	070eb7d496	tests: serialize timer tests Otherwise the output gets interspersed.	2015-01-29 12:20:39 +02:00
Avi Kivity	59c0d7e893	smp: fix work item deletion Delete it after completion, not after responding.	2015-01-29 12:14:05 +02:00
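
The sstring compare() fix listed above (7a704f7a40) replaces a size subtraction with explicit tests. A minimal sketch of that idea follows, assuming plain std::string_view inputs; compare_sketch is a hypothetical name used for illustration and is not the actual basic_sstring::compare() code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical helper for illustration only; not the real sstring code.
// Returns a negative value, zero, or a positive value.
inline int compare_sketch(std::string_view a, std::string_view b) {
    const std::size_t n = std::min(a.size(), b.size());
    // Compare the common prefix first (skip memcmp for empty views,
    // whose data() pointer may be null).
    if (n != 0) {
        const int r = std::memcmp(a.data(), b.data(), n);
        if (r != 0) {
            return r;
        }
    }
    // Prefixes are equal: decide by length with explicit tests instead of
    // returning int(a.size() - b.size()), which can truncate for large
    // differences and produce the wrong sign.
    if (a.size() < b.size()) {
        return -1;
    }
    if (a.size() > b.size()) {
        return 1;
    }
    return 0;
}
```

With this shape, compare_sketch("ab", "abc") is negative and compare_sketch("abc", "ab") is positive regardless of how large the size difference becomes.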


1289 Commits
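
The enum hash-table key commit listed above (a9ef189a54) describes hashing an enumerated type through the standard hash of its underlying integer type. A minimal sketch of that approach, assuming a standalone functor: enum_hash_sketch, color, and lookup_demo are hypothetical names, and the real core/enum.hh may be structured differently.

```cpp
#include <cstddef>
#include <functional>
#include <type_traits>
#include <unordered_map>

// Hypothetical functor for illustration; not the contents of core/enum.hh.
template <typename Enum>
struct enum_hash_sketch {
    static_assert(std::is_enum<Enum>::value,
                  "enum_hash_sketch requires an enumeration type");

    std::size_t operator()(Enum e) const noexcept {
        // Delegate to the standard hash of the underlying integer type.
        using underlying = typename std::underlying_type<Enum>::type;
        return std::hash<underlying>()(static_cast<underlying>(e));
    }
};

// Hypothetical enum used only to exercise the functor.
enum class color { red, green, blue };

// Uses the enum as a hash-table key and returns the sum of two entries.
inline int lookup_demo() {
    std::unordered_map<color, int, enum_hash_sketch<color>> counts;
    counts[color::green] = 2;
    counts[color::blue] += 5;
    return counts[color::green] + counts[color::blue];
}
```

Passing the functor as the third template argument of std::unordered_map keeps the sketch independent of whether the standard library in use already provides std::hash for enumeration types.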