Commit Graph

1308 Commits

Author SHA1 Message Date
Gleb Natapov
bebefe2afe net: return reference to hw_feature instead of copying the structure
I noticed that tcp::hw_features() is not inlined and copies the
structure to a caller. The function takes ~1.5% in httpd profiling.
2015-02-19 16:58:50 +02:00
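A minimal sketch of the difference described in bebefe2afe. The `device` class, the field names, and both accessor names are hypothetical and are not taken from the seastar source; the point is only that the second accessor hands out a reference to the stored structure instead of copying it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical feature structure; the fields are assumptions for illustration.
struct hw_features {
    bool tx_csum_offload = false;
    bool tso = false;
    std::uint16_t mtu = 1500;
};

class device {
    hw_features _hw_features;
public:
    // Returns by value: every call copies the whole structure.
    hw_features hw_features_copy() const { return _hw_features; }

    // Returns a const reference: no copy, and trivially inlinable
    // when defined in the class body.
    const hw_features& hw_features_ref() const { return _hw_features; }
};
```

Callers that only read a field, e.g. `dev.hw_features_ref().mtu`, then touch the stored structure directly.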
Avi Kivity
7f8d88371a Add LICENSE, NOTICE, and copyright headers to all source files.
The two files imported from the OSv project retain their original licenses.
2015-02-19 16:52:34 +02:00
Avi Kivity
a8698fa17c core: demangle stdout
When using print() to debug on smp, it is very annoying to get interleaved
output.

Fix by wrapping stdout with a fake stream that has a line buffer for each
thread.
2015-02-19 09:26:17 +02:00
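The per-thread line buffer idea in a8698fa17c might look roughly like the sketch below. It is not the seastar implementation: `line_buffered_writer` is a hypothetical name, it wraps any `std::ostream` rather than stdout, and it uses a mutex where the real code may do something else.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical wrapper: text is accumulated per thread and only complete
// lines are forwarded, so lines from different threads do not interleave.
class line_buffered_writer {
    std::ostream& _out;
    std::mutex _mutex;
public:
    explicit line_buffered_writer(std::ostream& out) : _out(out) {}

    void write(const std::string& text) {
        // One buffer per thread. Note: being a function-local thread_local,
        // it is shared by all writer instances used on the same thread.
        thread_local std::string buffer;
        buffer += text;
        std::size_t pos;
        while ((pos = buffer.find('\n')) != std::string::npos) {
            std::lock_guard<std::mutex> guard(_mutex);
            _out.write(buffer.data(), static_cast<std::streamsize>(pos + 1));
            buffer.erase(0, pos + 1);
        }
    }
};
```

A partial line such as `w.write("hel")` produces no output until a later call supplies the terminating newline.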
Glauber Costa
861d2625b2 file_stream: proper seek support.
Our file_stream interface supports seek, but when we try to seek to arbitrary
locations that are smaller than an aio-boundary (say, for instance, f->seek(4)),
we will end up not being able to perform the read.

We need to guarantee the reads are aligned, and will then present to the caller
the buffer properly offset.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-02-18 22:56:07 +02:00
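The alignment arithmetic behind 861d2625b2 can be sketched as follows. The `aligned_read` struct and `align_read()` helper are hypothetical names, and the 512-byte boundary in the usage note is an assumed value, not one stated in the commit.

```cpp
#include <cassert>
#include <cstdint>

// Describes the aligned read that covers a caller's unaligned request.
struct aligned_read {
    std::uint64_t start;   // aligned file offset to read from
    std::uint64_t length;  // aligned number of bytes to read
    std::uint64_t skip;    // bytes to drop from the front of the buffer
};

// Hypothetical helper: widen [pos, pos + len) to alignment boundaries.
inline aligned_read align_read(std::uint64_t pos, std::uint64_t len,
                               std::uint64_t alignment) {
    const std::uint64_t start = pos - pos % alignment;  // round down
    const std::uint64_t end = pos + len;
    const std::uint64_t aligned_end =
        (end + alignment - 1) / alignment * alignment;  // round up
    return aligned_read{start, aligned_end - start, pos - start};
}
```

With an assumed 512-byte boundary, `align_read(4, 100, 512)` reads 512 bytes from offset 0 and presents the buffer to the caller starting at byte 4.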
Gleb Natapov
c4c5899f89 net: handle arp resolution errors in tcp
Pass timeouts up the calling chain and schedule retry if waiter list is
too long.
2015-02-18 20:12:08 +02:00
Avi Kivity
84234b5b9a memory: implement the C11 aligned_alloc() function 2015-02-18 19:48:18 +02:00
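For reference, a small usage sketch of the standard C11 `aligned_alloc()` interface that 84234b5b9a implements. It calls whatever `aligned_alloc()` the C library provides, not seastar's allocator; the helper names and the 64-byte line size are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>

// True if p is a multiple of alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: allocate `count` 64-byte-aligned slots.
// C11 requires the size to be a multiple of the alignment, which holds here.
inline void* alloc_cache_lines(std::size_t count) {
    const std::size_t line = 64;  // assumed cache line size
    return aligned_alloc(line, count * line);
}
```

Memory obtained this way is released with `free()`, as for `malloc()`.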
Avi Kivity
b4098dac2f core: add distributed::invoke_on() variants not requiring a pointer to member
Current variants of distributed<T>::invoke_on() require member function to
invoke, which may be tedious to implement for some cases.  Add a variant
that supports invoking a functor, accepting the local instance by reference.
2015-02-18 16:52:56 +02:00
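To illustrate the two calling styles in b4098dac2f, here is a toy, synchronous stand-in. `toy_distributed` and `counter` are hypothetical; the real `distributed<T>` runs the call on another core and returns a future, which this sketch deliberately omits.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model: one instance of T per "core", invoked synchronously.
template <typename T>
class toy_distributed {
    std::vector<T> _instances;
public:
    explicit toy_distributed(std::size_t n) : _instances(n) {}

    // Existing style: the caller names a member function to invoke.
    template <typename Ret, typename... Args>
    Ret invoke_on(std::size_t id, Ret (T::*func)(Args...), Args... args) {
        return (_instances[id].*func)(std::move(args)...);
    }

    // Added style: any functor that accepts the local instance by reference.
    template <typename Func>
    auto invoke_on(std::size_t id, Func&& func)
            -> decltype(func(std::declval<T&>())) {
        return func(_instances[id]);
    }
};

// Hypothetical service type used only for the example.
struct counter {
    int value = 0;
    int add(int n) { value += n; return value; }
};
```

The functor form avoids adding a dedicated member function for one-off operations, e.g. `d.invoke_on(1, [] (counter& c) { return c.value; })`.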
Avi Kivity
17914a80cd future: add a utility to promote a type to its own future
Some of the core functions accept functions returning either an immediate
type, or a future, and return a future in either case (e.g. smp::submit_to()).

To make it easier to metaprogram with these functions, provide a utility
that computes the return type, futurize<T>:

   futurize_t<bar>          => future<bar>

   futurize_t<void>         => future<>

   futurize_t<future<bar>>  =>  future<bar>
2015-02-18 16:52:56 +02:00
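The three mappings listed in 17914a80cd can be reproduced with a small trait. This is a sketch under assumptions: `future` here is an empty placeholder template, not seastar's future, and the specializations are one plausible way to obtain the listed results.

```cpp
#include <cassert>
#include <type_traits>

// Placeholder for the real future type; only its template shape matters here.
template <typename... T>
struct future {};

// Primary case: an immediate type T is promoted to future<T>.
template <typename T>
struct futurize { using type = future<T>; };

// void is promoted to the value-less future<>.
template <>
struct futurize<void> { using type = future<>; };

// A future is already a future and is left unchanged.
template <typename... Args>
struct futurize<future<Args...>> { using type = future<Args...>; };

template <typename T>
using futurize_t = typename futurize<T>::type;

struct bar {};
static_assert(std::is_same<futurize_t<bar>, future<bar>>::value, "immediate");
static_assert(std::is_same<futurize_t<void>, future<>>::value, "void");
static_assert(std::is_same<futurize_t<future<bar>>, future<bar>>::value, "future");
```

A wrapper that accepts a callable returning either `T` or `future<T>` can then declare its own return type as `futurize_t<...>` of the callable's result.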
Gleb Natapov
f7cade107b seawreck: abort on a connection error 2015-02-18 16:52:56 +02:00
Gleb Natapov
1cfaa7eefe net: populate dpdk redirection table even if there is only one queue
tcp::connect() uses redirection table to figure out what queue will
handle a connection.
2015-02-18 16:52:56 +02:00
Avi Kivity
c1abe0e573 smp: remove gratuitous cache miss when no responses are pending
boost::lockfree::spsc_queue::push() writes the producer index even when no
data is pushed, so check whether we need to do any work beforehand.
2015-02-17 18:00:12 +02:00
Vlad Zolotarov
1934160549 DPDK: Add TSO support
   - tcp.hh: Properly calculate the pseudo-header in the TSO case: it should be
     calculated as if ip_len is zero.
   - Enable TSO in the DPDK network backend.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-17 12:47:13 +02:00
Tomasz Grabiec
c11de0476e net: Add overload of ntoh()/hton() for int8_t/uint8_t
They're no-op but make templating easier.
2015-02-16 20:26:36 +02:00
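A sketch of why the no-op overloads in c11de0476e help templates. The overload set below is built on the POSIX `ntohs()`/`ntohl()` functions, and `read_be()` is a hypothetical helper; neither is copied from the seastar headers.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <cstring>

// Single-byte values have no byte order, so these overloads are no-ops.
inline std::uint8_t ntoh(std::uint8_t x) { return x; }
inline std::int8_t ntoh(std::int8_t x) { return x; }
// Wider types delegate to the POSIX conversions.
inline std::uint16_t ntoh(std::uint16_t x) { return ntohs(x); }
inline std::uint32_t ntoh(std::uint32_t x) { return ntohl(x); }

// Hypothetical generic reader: compiles for every T that has an ntoh()
// overload, including the one-byte types, with no special casing.
template <typename T>
T read_be(const unsigned char* p) {
    T raw;
    std::memcpy(&raw, p, sizeof(T));
    return ntoh(raw);
}
```

Without the 8-bit overloads, `read_be<std::uint8_t>` would need its own specialization or would silently pick a wider overload through integral promotion.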
Gleb Natapov
9ee05fdddc seawreck: exit after test is done 2015-02-16 09:54:08 +02:00
Vlad Zolotarov
b8cc243b17 tcp: Pass the correct value of TSO segment size downstream
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-15 19:11:06 +02:00
Avi Kivity
8ca0f21ae6 posix: add missing include 2015-02-15 15:55:35 +02:00
Vlad Zolotarov
5daa8478f4 DPDK_RTE: Add a weak definition for dpdk::qp_mempool_obj_size()
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-12 19:20:15 +02:00
Vlad Zolotarov
d82efca3a8 DPDK: Use std::unique_ptr for storing _xmem blobs
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-12 18:54:46 +02:00
Vlad Zolotarov
95bf98977d DPDK: Recover the DPDK 1.7.x support
   - Define MARKER type if not defined.
   - Adjust the Tx zero-copy to the rte_mbuf layout in DPDK 1.7.x.
   - README.md:
      - Bump up the DPDK latest version to 1.8.0.
      - Add a new DPDK configuration description.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-12 18:54:05 +02:00
Gleb Natapov
d4e3cafd10 net: start rx polling only after upper layer is ready to receive 2015-02-12 17:03:22 +02:00
Avi Kivity
a258f290b5 seawreck: fix include 2015-02-12 14:43:12 +02:00
Avi Kivity
ebc2ebbf12 Upgrade http_client to an application, not a test
and rename it to 'seawreck', after wrk.
2015-02-12 14:21:44 +02:00
Avi Kivity
9f87d5bc34 Merge branch 'zero-copy-tx-20' of github.com:cloudius-systems/seastar-dev
dpdk zero-copy tx, from Vlad:

"This patch series introduces zero-copy Tx with DPDK networking backend:
 - Split the dpdk_qp mempool into separate pools for Rx and Tx queues.
 - Configure the dpdk_qp mempools to use external memory buffer when we
   can ensure pinning and virt2phys translation (currently only when
   running on top of hugetlbfs).
 - Properly divide the memory between seastar and DPDK when running on
   top of hugetlbfs.
 - Tx zero-copy itself. See more details in the PATCH7 description."
2015-02-12 11:56:46 +02:00
Vlad Zolotarov
21f4c88c85 DPDK: zero_copy_tx - initial attempt
Send packets without copying fragments data:
   - Poll all the Tx descriptors and place them into a circular_buffer.
     We will take them from there when we need to send new packets.
   - PMD will return the completed buffers descriptors to the Tx mempool.
     This way we are going to know that we may release the buffer.
   - "move" the packet object into the last segment's descriptor's private data.
     When this fragment completes, the whole packet has been sent and its
     memory may be released, which we do by calling the packet's destructor.

Exceptions:
   - Copy if hugepages backend is not enabled.
   - Copy when we failed to send in a zero-copy flow (e.g. when we failed
     to translate a buffer virtual address).
   - Copy if first frag requires fragmentation below 128 bytes level - this is
     in order to avoid headers splitting.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

New in v5:
   - NULL -> nullptr across the board.
   - Removed unused macros: MBUF_ZC_PRIVATE() and max_frags_zc.
   - Improved the local variables localization according to Nadav's remarks.
   - tx_buf class:
      - Don't regress the whole packet to the copy-send if a single fragment failed to be sent
        in a zero-copy manner (e.g. its data failed the virt2phys translation). Send only such a
        fragment in a copy way and try to send the rest of the fragments in a zero-copy way.
      - Make set_packet() receive packet&&.
      - Fixed the comments in check_frag0(): since v2 we check the first 128 bytes, not the first 2KB.
      - Use assert() instead of rte_exit() in do_one_frag().
      - Rename in set_one_data_buf() and in copy_one_data_buf(): l -> buf_len
      - Improve the assert about the size of private data in the tx_buf class:
         - Added two MARKER fields at the beginning and at the end of the private fields section
           which are going to be allocated on the mbuf's private data section.
         - Assert on the distance between these two markers.
      - Replace the sanity_check() (checks that packet doesn't have a zero-length) in a
        copy-flow by an assert() in a general function since this check
        is relevant both for a copy and for a zero-copy flows.
      - Make a sanity_check to be explicitly called frag0_check.
      - Make from_packet() receive packet&&.
      - In case frag0_check() fails - copy only the first fragment and
        not the whole packet.
   - tx_buf_factory class:
      - Change the interface to work with tx_buf* instead of tx_buf&.
      - Better utilize for-loop facilities in gc().
      - Kill the extra if() in the init_factory().
      - Use std::deque instead of circular_buffer for storing elements in tx_buf_factory.
      - Optimize the tx_buf_factory::get():
         - First take the completed buffers from the mempool and only if there
           aren't any - take from the factory's cache.
      - Make Tx mempools using cache: this significantly improves the performance despite the fact that it's
        not the right mempool configuration for a single-producer+single-consumer mode.
      - Remove empty() and size() methods.
   - Add comments near the assert()s in the fast-path.
   - Removed the not-needed "inline" qualifiers:
      - There is no need to specify "inline" qualifier for in-class defined
        methods INCLUDING static methods.
      - Defining process_packets() and poll_rx_once() as inline degraded the
        performance by about 1.5%.
   - Added a _tx_gc_poller: it will call tx_buf_factory::gc().
   - Don't check a pointer before calling free().
   - alloc_mempool_xmem(): Use posix_memalign() instead of memalign().

New in v4:
   - Improve the info messages.
   - Simplified the mempool name creation code.
   - configure.py: Opt-out the invalid-offsetof compilation warning.

New in v3:
   - Add missing macros definitions dropped in v2 by mistake.

New in v2:
   - Use Tx mbufs in a LIFO way for better cache utilization.
   - Lower the frag0 non-split thresh to 128 bytes.
   - Use new (iterators) semantics in circular_buffer.
   - Use optional<packet> for storing the packet in the mbuf.
   - Use rte_pktmbuf_alloc() instead of __rte_mbuf_raw_alloc().
   - Introduce tx_buf class:
      - Hide the private rte_mbuf area handling.
      - Hide packet to rte_mbuf cluster translation handling.
   - Introduce a "Tx buffers factory" class:
      - Hide the rte_mbuf flow details:
            mempool->circular_buffer->(PMD->)mempool
   - Templatization:
      - Make huge_pages_mem_backend a dpdk_qp class template parameter.
      - Unite the from_packet_xxx() code into a single template function.
      - Unite the translate_one_frag() and copy_one_frag() into a single
        template function.
2015-02-12 11:04:07 +02:00
Asias He
51adb20bda tests: Add http_client
It is based on tcp_client and works with our httpd server.

1) timer based, to run the test for 10 seconds
$ http_client --server 192.168.66.100:10000 --conn 100  --duration 10 --smp 2
========== http_client ============
Server: 192.168.66.100:10000
Connections: 100
Requests/connection: dynamic (timer based)
Requests on cpu 0: 33400
Requests on cpu 1: 33368
Total cpus: 2
Total requests: 66768
Total time: 10.011478
Requests/sec: 6669.145442
========== done ============

2) nr of reqs per connection based, to run the test with 100 connections
each has to run 1000 reqs
$ http_client --server 192.168.66.100:10000 --conn 100 --reqs 1000 --smp 2
========== http_client ============
Server: 192.168.66.100:10000
Connections: 100
Requests/connection: 1000
Requests on cpu 0: 50000
Requests on cpu 1: 50000
Total cpus: 2
Total requests: 100000
Total time: 15.002731
Requests/sec: 6665.453192
========== done ============

This patch is based on Shlomi's initial version.

Signed-off-by: Shlomi Livne <shlomi@cloudius-systems.com>
Signed-off-by: Asias He <asias@cloudius-systems.com>
2015-02-12 10:02:48 +02:00
Raphael S. Carvalho
20151b7b2a memcached: capture port by value
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-02-12 10:00:37 +02:00
Vlad Zolotarov
4d0f2d3e4c DPDK_RTE: Give rte_eal_init() -m parameter when we use hugetlbfs
When we use hugetlbfs we will give mempools external buffer for allocations
but the mempool internals still need memory.
We will assume that each CPU core is going to have a HW QP ("worst" case) and
provide the DPDK with enough memory to be able to allocate them all.

The memory above is subtracted from the total amount of memory given to the application
(with -m seastar application parameter).

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-11 19:27:12 +02:00
Vlad Zolotarov
46b6644c35 DPDK: add a function that returns a number of bytes needed for each QP's mempool objects
This function is needed to estimate the amount of memory to give to DPDK
when we can provide a mempool with an external memory buffer.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-11 19:27:12 +02:00
Vlad Zolotarov
82e20564b0 DPDK: Initialize mempools to work with external memory
If seastar is configured to use hugetlbfs initialize mempools
with external memory buffer. This way we are going to better control the overall
memory consumption.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

New in v2:
   - Use char* instead of void* for pointer arithmetic.
2015-02-11 19:27:12 +02:00
Vlad Zolotarov
d4cddbc3d0 DPDK: Use separate pools for Rx and Tx queues and adjust their sizes
There is no reason for Rx and Tx pools to be of the same size:

Rx pool is 3 times the ring size to give the upper layers some time
to free the Rx buffers before the ring stalls with no buffers.

Tx has absolutely different constraints: since it provides a back pressure
to the upper layers if HW doesn't keep up there is no need to allow more buffers
in the air than the amount we may send in a single rte_eth_tx_burst() call.
Therefore we need 2 times HW ring size buffers since HW may release the whole
ring of buffers in a single rte_eth_tx_burst() call and thus we may be able to
place another whole ring of buffers in the same call.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

New in v4:
   - Fixed the info message.
2015-02-11 19:27:12 +02:00
Vlad Zolotarov
18f35236db memory: Move page_size, page_bits and huge page size definitions to header
They are going to be used in more places (not just in memory.cc).

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-11 19:27:12 +02:00
Avi Kivity
3f848c5714 Merge branch 'file'
Add an adapter from our block-based files to our character stream interface,
input_stream, and a test program demonstrating their use.
2015-02-11 17:45:13 +02:00
Avi Kivity
64930bc610 tests: add linecount tests
Demonstrates and tests file_input_stream.
2015-02-11 15:38:51 +02:00
Avi Kivity
d7eb4e96fb app-template: add support for positional options
Example:

    app_template app;
    namespace bpo = boost::program_options;
    app.add_positional_options({
        { "file", bpo::value<std::string>(), "File to process", 1 },
    });
2015-02-11 15:38:51 +02:00
Avi Kivity
af0bf06836 core: add file_data_source, file_input_stream
Implement a character stream backed by a file.
2015-02-11 15:38:51 +02:00
Avi Kivity
d31de31aac core: add input_stream::reset()
Useful for seekable streams, to drop existing buffered data.
2015-02-11 15:38:49 +02:00
Raphael S. Carvalho
c725014614 memcached: add option to listen on a different port
useful when testing multiple memcached servers.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-02-10 19:27:43 +02:00
Avi Kivity
2dadcdc5e7 core: make some data_source internals available to derived classes
Useful for adding functionality such as seekable streams.
2015-02-10 19:00:45 +02:00
Avi Kivity
381814aeaf stream.hh: add missing include 2015-02-10 18:59:38 +02:00
Avi Kivity
951a93a534 file.hh: add missing include 2015-02-10 18:59:16 +02:00
Tomasz Grabiec
10e58e0cda tests: Make test runner catch and forward exceptions thrown directly from task 2015-02-10 14:47:42 +02:00
Tomasz Grabiec
85c67001dd tests: Add test for exceptions thrown from do_until() 2015-02-10 14:47:42 +02:00
Tomasz Grabiec
331d5e1569 core: Fail do_until() future when the callback throws
Otherwise we will abandon the result promise, which results in an abort.
2015-02-10 14:47:42 +02:00
Avi Kivity
ee58c77008 httpd: fix unbounded memory use in error handling
httpd uses recursion for its read loop:

  future<> read() {
     _read_buf.consume().then([] {
        ...
        if more work:
           return read();
     });
  }

However, after error handling was added, it looks like this:

  future<> read() {
     _read_buf.consume().then([] {
        ...
        if more work:
           return read();
     }).rescue(...);
  }

The problem is that rescue() is called for every iteration of the loop,
instead of for the loop in its entirety.  This means that a rescue
continuation is allocated for every processed request, but they will only
be called after the entire loop terminates.  This results in tons of
allocated memory.

Fix by moving error handling to the end of the loop (and incidentally using
do_until() instead of recursion).
2015-02-10 12:00:32 +02:00
Avi Kivity
29366cb076 net: add byteorder (ntoh/hton) variants for signed types 2015-02-09 17:07:21 +02:00
Asias He
f0c1bcdb33 tcp: Switch to debug print for persist timer
It is a leftover from development.
2015-02-09 10:58:16 +02:00
Asias He
a192391ac6 tcp: Init timer callback using constructor 2015-02-09 10:58:15 +02:00
Asias He
0ac0e06d32 packet: Linearize after merge
The packet will be merged with the old packet anyway. Linearize after
the merge.
2015-02-09 10:58:15 +02:00
Raphael S. Carvalho
bf41da8974 core: small optimization when constructing std::vector<cpu>
The size of std::vector<cpu> is known in advance, so reserve memory ahead
of time to avoid reallocations during the push_back() calls.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-02-08 19:05:45 +02:00
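The optimization in bf41da8974 amounts to the pattern below. The `cpu` struct and `make_cpus()` are hypothetical stand-ins; only the `reserve()` call before the loop reflects the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical element type; the real cpu structure is not shown in the log.
struct cpu {
    unsigned id;
};

inline std::vector<cpu> make_cpus(std::size_t count) {
    std::vector<cpu> cpus;
    // The final size is known up front, so allocate once instead of
    // letting push_back() grow the storage repeatedly.
    cpus.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        cpus.push_back(cpu{static_cast<unsigned>(i)});
    }
    return cpus;
}
```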
Avi Kivity
7a704f7a40 sstring: fix truncation in compare()
If the difference between the sizes of the two strings is larger than can
be represented by an int, truncation will occur and the sign of the result
is undefined.

Fix by using explicit tests and return values.
2015-02-08 11:41:22 +02:00
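A sketch of the "explicit tests and return values" approach from 7a704f7a40, written against `std::string` because the sstring source is not part of this log. `compare_explicit()` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>

// Compares the common prefix first, then decides by size with explicit
// tests. Returning `a.size() - b.size()` cast to int could truncate when
// the difference does not fit in an int, giving the wrong sign.
inline int compare_explicit(const std::string& a, const std::string& b) {
    const std::size_t n = std::min(a.size(), b.size());
    const int r = std::memcmp(a.data(), b.data(), n);
    if (r != 0) {
        return r;
    }
    if (a.size() > b.size()) {
        return 1;
    }
    if (a.size() < b.size()) {
        return -1;
    }
    return 0;
}
```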