Commit Graph

336 Commits

Author SHA1 Message Date
Asias He
53f95abd96 virtio: Fix feature setup
This fixes a big tcp_server rx regression.

Before:
========== rxrx ============
Server:  192.168.66.123:10000
Connections:  100
Bytes Sent(MiB):  10000
Total Time(Secs):  85.074086675           --->> big regression!!!
Bandwidth(MiB/Sec):  117.54460601148733

After:
========== rxrx ============
Server:  192.168.66.123:10000
Connections:  100
Bytes Sent(MiB):  10000
Total Time(Secs):  9.905637754
Bandwidth(MiB/Sec):  1009.5261151622362
2014-12-10 11:01:54 +02:00
Avi Kivity
b87a76412c packet: avoid hand-rolled deleter chaining, use deleter::append instead
The hand-rolled deleter chaining in packet::append was invalidated
by the make_free_deleter() optimization, since deleter->_next is no longer
guaranteed to be valid (and deleter::operator->() is still exposed, despite
that).

Switch to deleter::append(), which does the right thing.

Fixes a memory leak in tcp_server.
2014-12-09 20:37:17 +02:00
Gleb Natapov
8bb82512a1 net: enable RSS for V4 IP/UDP/TCP 2014-12-09 18:55:19 +02:00
Gleb Natapov
73f6d943e1 net: separate device initialization from queues initialization
This patch adds a new class, distributed_device, which is responsible for
initializing the HW device and is shared between all cpus. The old device
class becomes responsible for managing an rx/tx queue pair and is local
to each cpu. Each cpu has to call distributed_device::init_local_queue() to
create its own device. The logic to distribute cpus between available
queues (in case there are not enough queues for each cpu) lives in
distributed_device and is not really implemented yet, so only the one-queue
and queues == cpus scenarios are supported currently, but this can
be fixed later.

The plan is to rename "distributed_device" to "device" and "device"
to "queue_pair" in later patches.
2014-12-09 18:55:14 +02:00
Gleb Natapov
2fb3dc03f6 net: remove unused opts parameter from proxy_net_device constructor 2014-12-09 18:55:05 +02:00
Asias He
9a9297c89d ip: Implement fragment timeout and memory usage limit 2014-12-09 09:59:44 +02:00
Asias He
89c8c6148f net: Add packet::memory
Add packet::memory() which estimates the memory load (by adding sizeof
packet::impl). Note it will only be accurate after linearize/compact.
2014-12-09 09:59:44 +02:00
Asias He
c03e356873 net: Improve packet::linearize
Free the original memory earlier if all of it has been copied.
2014-12-09 09:59:43 +02:00
Nadav Har'El
3f2ea82e6d dpdk: rx checksum offloading
If the card supports this (and usually, it does), enable rx checksum
offloading by the card, and avoid calculating the checksums ourselves.

With rx checksum offloading, the card verifies the IP header checksum and
the L4 (TCP or UDP) checksum of incoming packets, and gives us a
flag when one of them is wrong, meaning that we do not need to do these
calculations ourselves.

This patch improves memcached performance on my setup by almost 3%.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2014-12-08 20:41:31 +02:00
Avi Kivity
f4d7bd7e00 reactor: register pollers using a RAII class
Avoids leaking a poller.
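
The idea can be illustrated with a minimal sketch. It is not the reactor's actual code: poller_registry and the poller class below are hypothetical stand-ins, and the only point is that registration happens in the constructor and deregistration in the destructor, so a poller cannot outlive its registration.

```cpp
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

class poller;

// Hypothetical stand-in for the reactor's list of registered pollers.
struct poller_registry {
    std::vector<poller*> pollers;
};

// RAII handle: registers itself on construction, unregisters on destruction.
class poller {
    poller_registry& _registry;
    std::function<bool()> _poll;
public:
    poller(poller_registry& r, std::function<bool()> fn)
        : _registry(r), _poll(std::move(fn)) {
        _registry.pollers.push_back(this);
    }
    ~poller() {
        auto& v = _registry.pollers;
        v.erase(std::remove(v.begin(), v.end(), this), v.end());
    }
    // The registry stores the object's address, so copying is disallowed.
    poller(const poller&) = delete;
    poller& operator=(const poller&) = delete;

    bool poll() { return _poll(); }
};
```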
2014-12-07 17:36:44 +02:00
Vlad Zolotarov
5bc89b974a dpdk: First proper offload features initialization
- Query the port for its caps.
- Properly adjust the queue numbers according to the caps.
- Enable RSS only if the final queues number is greater than 1.
- Enable Rx VLAN stripping.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2014-12-07 17:32:36 +02:00
Vlad Zolotarov
5cc8785b96 packet: Added HW VLAN stripping option.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2014-12-07 17:32:36 +02:00
Vlad Zolotarov
2d10018870 dpdk: separate the EAL initialization from port initialization
- Create a new class dpdk_eal that initializes DPDK EAL.
- Get rid of portmask crap and provide a port index to a dpdk::net_device
  constructor.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2014-12-07 17:31:12 +02:00
Avi Kivity
a2016bc1dd ip: fix smp fragment reassembly
ipv4::handle_on_cpu() did not properly convert from network byte order, so
it saw any packets with DF=1 as fragmented.

Fix by applying the proper conversion.
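
A minimal sketch of the kind of check involved, not the actual ipv4::handle_on_cpu() code. The helper name is_fragment and the constant names are assumptions made for illustration. The 16-bit IPv4 "flags + fragment offset" field arrives in network byte order, so it has to be converted before masking. On a little-endian host, a raw DF=1 field (0x4000 on the wire) reads as 0x0040, which looks like a non-zero fragment offset.

```cpp
#include <arpa/inet.h>  // ntohs, htons
#include <cstdint>

// Masks for the IPv4 "flags + fragment offset" field, in host byte order.
constexpr uint16_t ip_flag_df     = 0x4000;  // Don't Fragment
constexpr uint16_t ip_flag_mf     = 0x2000;  // More Fragments
constexpr uint16_t ip_offset_mask = 0x1fff;  // fragment offset, in 8-byte units

// Hypothetical helper: frag_field_be is the field exactly as read from the wire.
inline bool is_fragment(uint16_t frag_field_be) {
    uint16_t f = ntohs(frag_field_be);  // convert before looking at any bits
    // A packet is a fragment if more fragments follow or its offset is non-zero.
    // DF alone does not make a packet a fragment.
    return (f & ip_flag_mf) != 0 || (f & ip_offset_mask) != 0;
}
```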
2014-12-07 12:01:31 +02:00
Avi Kivity
2ee0239a4a Merge branch 'tgrabiec/zero-copy-2' of github.com:cloudius-systems/seastar-dev
Zero-copy memcached get from Tomasz:

"I've measured memcached on muninn/huginn to be 7.5% better with this on vhost
stack."
2014-12-04 16:31:04 +02:00
Tomasz Grabiec
c4335c49f6 core: convert output APIs to work on packets
This way zero-copy supporting code can put data directly to packet
object and pass it through all layers efficiently.
2014-12-04 13:51:26 +01:00
Tomasz Grabiec
72b0794759 packet: add constructor for appending temporary_buffers 2014-12-04 13:37:35 +01:00
Tomasz Grabiec
3a2d74e3d3 packet: add reserve() method 2014-12-04 13:37:35 +01:00
Tomasz Grabiec
f3dada6f1d packet: add constructor for appending deleters
Deleters do not always come with fragments. When multiple fragments share
a deleter, the fragments are appended first and then one deleter for all
of them.
2014-12-04 13:37:35 +01:00
Tomasz Grabiec
8ffcdac455 packet: move lambdas rather than copy them
Some lambdas are not copyable.
2014-12-04 13:37:35 +01:00
Tomasz Grabiec
2650c68824 packet: add more constructor variants 2014-12-04 13:37:35 +01:00
Avi Kivity
3e4842a2a1 Merge branch 'asias/ip' of github.com:cloudius-systems/seastar-dev
IP fragment reassembly from Asias.
2014-12-03 16:03:18 +02:00
Asias He
59aa280f0d ip: Add IPv4 reassembly support
If a TCP or UDP IP datagram is fragmented, only the first fragment
contains the port information. When a fragment without port information
is received, we have no idea which "stream" this fragment belongs to,
and thus no idea how to forward the packet.

To solve this problem, we use a "forward twice" method. When a fragment of
an IP datagram is received, we forward it using the
frag_id(src_ip, dst_ip, identification, protocol) hash. When all the
fragments have been received, we forward the reassembled datagram using the
connection_id(src_ip, src_port, dst_ip, dst_port) hash.
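
The two hashing steps can be sketched as follows. This is an illustration under assumptions, not the patch's code: the struct layouts, the mix() helper, and cpu_for() are hypothetical, and std::hash stands in for whatever hash the stack really uses. What matters is that every fragment of one datagram yields the same frag_id and so lands on the same cpu for reassembly, after which the full datagram is forwarded again by connection_id.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Key available in every fragment (no ports needed).
struct frag_id {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t identification;
    uint8_t protocol;
};

// Key available only once the L4 header is visible.
struct connection_id {
    uint32_t src_ip;
    uint16_t src_port;
    uint32_t dst_ip;
    uint16_t dst_port;
};

// Hypothetical hash-combining helper.
inline uint64_t mix(uint64_t seed, uint64_t v) {
    return seed ^ (std::hash<uint64_t>{}(v) + 0x9e3779b97f4a7c15ULL
                   + (seed << 6) + (seed >> 2));
}

// First forward: pick the cpu that reassembles this datagram.
inline unsigned cpu_for(const frag_id& id, unsigned ncpus) {
    assert(ncpus > 0);
    uint64_t h = 0;
    h = mix(h, id.src_ip);
    h = mix(h, id.dst_ip);
    h = mix(h, id.identification);
    h = mix(h, id.protocol);
    return static_cast<unsigned>(h % ncpus);
}

// Second forward: pick the cpu that owns the connection.
inline unsigned cpu_for(const connection_id& id, unsigned ncpus) {
    assert(ncpus > 0);
    uint64_t h = 0;
    h = mix(h, id.src_ip);
    h = mix(h, id.src_port);
    h = mix(h, id.dst_ip);
    h = mix(h, id.dst_port);
    return static_cast<unsigned>(h % ncpus);
}
```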
2014-12-03 21:40:49 +08:00
Gleb Natapov
4d3b6497ea reactor: rework poll infrastructure
Move idle state management out of the smp poller and back into generic code.
Each poller returns whether it did any useful work, and generic code decides
whether it should go idle based on that. If a poller requires constant
polling, it should always return true.
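
A minimal sketch of that contract, assuming pollers are plain callables returning bool. The names poller_fn and poll_once are hypothetical and this is not the reactor's real loop.

```cpp
#include <functional>
#include <vector>

using poller_fn = std::function<bool()>;

// Runs every poller once. Returns true if at least one of them did useful
// work, in which case the caller should not go idle.
inline bool poll_once(const std::vector<poller_fn>& pollers) {
    bool did_work = false;
    for (const auto& p : pollers) {
        // Call p() unconditionally so that every poller runs on each pass.
        bool r = p();
        did_work = did_work || r;
    }
    return did_work;
}
```

Generic code would call poll_once() in a loop and consider going idle only when it returns false.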
2014-12-03 14:37:33 +02:00
Tomasz Grabiec
f556172619 temporary_buffer: make empty buffer not require malloc() 2014-12-03 13:15:09 +01:00
Tomasz Grabiec
76a8908b21 virtio: fix indentation 2014-12-03 13:15:09 +01:00
Asias He
2702af5e7d net: Add helper packet_merger
This can be used for both TCP out-of-order and IP fragmentation merging.
2014-12-03 17:47:30 +08:00
Asias He
8335787268 net: Expose interface::forward
This can be used with ipv4 fragmentation.
2014-12-03 17:47:29 +08:00
Asias He
7ca33fdd72 ip: Add helper for fragmentation 2014-12-03 17:47:29 +08:00
Gleb Natapov
7dbc333da6 core: Allow forwarding from/to any cpu 2014-12-03 17:47:29 +08:00
Gleb Natapov
bf46f9c948 net: Change how networking devices are created
Currently each cpu creates network device as part of native networking
stack creation and all cpus create native networking stack independently,
which makes it impossible to use data initialized by one cpu in another
cpu's networking device initialization. For multiqueue devices often some
parts of an initialization have to be handled by one cpu and all other
cpus should wait for the first one before creating their network devices.
Even without multiqueue, proxy devices should be created after the master
device is created so that a proxy device may get a pointer to the master
at creation time (existing code uses a global per cpu device pointer and
assumes that the master device is created on cpu 0 to compensate for the
lack of ordering).

This patch makes it possible to delay native networking stack creation
until network device is created. It allows one cpu to be responsible
for creation of network devices on multiple cpus. A single queue device
initializes the master device on one cpu and calls the other cpus with a
pointer to the master device and its cpu id, which are used in proxy device
creation.
This removes the need for per cpu device pointer and "master on cpu 0"
assumption from the code since now master device and slave devices know
about each other and can communicate directly.
2014-11-30 18:10:08 +02:00
Vlad Zolotarov
12caa3afe4 net: add option to use a dpdk PMD networking backend
- Added "dpdk-pmd" option:
    - Defaulted to FALSE.
    - When TRUE - use DPDK PMD drivers.
- Call for dpdk net_device creation function if dpdk-poll option is given
- Added DPDK networking backend options to all options list

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2014-11-30 12:14:56 +02:00
Vlad Zolotarov
5cd984b5cc dpdk: Initial commit
- Currently only a single port and a single queue are supported.
- All DPDK EAL configuration is hard-coded in the dpdk_net_device constructor instead
  of coming from the app parameters.
- No offload features are enabled.
- Tx: will spin in the dpdk_net_device::send() till there is a place in the HW ring to
  place a current packet.
- Tx: copy data from the `packet` frags into the rte_mbuf's data.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2014-11-30 12:13:52 +02:00
Asias He
88a1a37a88 ip: Support IP fragmentation in TX path
Tested with UDP sending large datagrams with ufo off.
2014-11-30 10:16:38 +02:00
Glauber Costa
b3c163e603 xen: fix typo in event channel detection
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-28 14:15:36 +01:00
Glauber Costa
c3ae30b760 xen: delete event channel as well
If we don't have split channels, we need to delete the relevant property.
Because xs_rm() returns true if the feature does not exist, it won't affect the
transaction if we just delete all of them. Therefore we don't need to do any
conditional test.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-27 18:00:35 +01:00
Glauber Costa
3848130f2f xen: only add features to feature array
We are adding everything we read into the features array. Because in the
destructor we will remove everything in the features list, we'll end up
removing more than we should. Things like the mac address, handle, etc, should
never be deleted.

This is not a problem for OSv because usually, after the destructor is called,
the whole guest is down. But for userspace, the network card is left there,
but will cease to work if we delete too much.

After we do that with the _features array (its original intent), it becomes
redundant with features nack.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-27 18:00:35 +01:00
Glauber Costa
bd8a18c178 xen: unmask event channels when setup is ready
This is not required for OSv, but is required for userspace operation.
It won't work without it.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-27 18:00:35 +01:00
Avi Kivity
861957e5ba Merge branch 'glommer/xen' of github.com:cloudius-systems/seastar-dev
Glauber says:

"This patch yields a small performance boost. It is not complete, since the rest
of the performance work is still missing since half of that is in OSv.

But more importantly, it now works on AWS."
2014-11-26 18:30:26 +02:00
Glauber Costa
b56a89d5c9 xen: translate feature name
When the backend advertises "feature-rx-copy", the frontend should register for
"request-rx-copy". The local hypervisor seems to be forgiving about it, but the
one in AWS is not, and doubly so.

First, it doesn't recognize these as the same. And second, it refuses to
connect the backend if this feature is not advertised by the frontend.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-26 17:22:58 +01:00
Glauber Costa
a9a79e3ba6 xen: ring unification
The ring processing is almost the same for both rx and tx, with the exception
of the core of the action. We can actually unify them nicely with some use of
template programming.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-26 17:21:09 +01:00
Glauber Costa
e7c9aeb8a5 xen: interrupt mitigation
There are two things we can do that will lead to fewer interrupts being sent.
The first is to read the new rsp_cons value at the end of every interaction.
If the backend produces more frames in the meantime, we'll be able to process
them in the same round, without getting another interrupt.

The other is to set the rsp_event only after all the frames are processed.

As a matter of fact, both the tx and rx rings did one of them, but not the same
one. The next patch will unify the ring code to avoid problems like that in the
future.
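
Both techniques can be sketched on a simplified, single-threaded model of a response ring. This is an illustration under assumptions, not the driver's code: response_ring and drain() are hypothetical, the sketch re-reads the producer index as its reading of "read the new value at the end of every round", and a real shared ring also needs memory barriers plus a final re-check of the producer index after rsp_event is written.

```cpp
#include <cstdint>

// Simplified model of a shared response ring (indices only, no payload).
struct response_ring {
    uint32_t rsp_prod = 0;   // written by the backend: responses produced
    uint32_t rsp_cons = 0;   // written by the frontend: responses consumed
    uint32_t rsp_event = 1;  // backend notifies when rsp_prod reaches this
};

// Hypothetical helper: drains the ring, returns how many responses were handled.
template <typename Handler>
unsigned drain(response_ring& ring, Handler handle) {
    unsigned handled = 0;
    uint32_t cons = ring.rsp_cons;
    for (;;) {
        // Re-read the producer index every round, so responses that arrived
        // while we were busy are handled without another interrupt.
        uint32_t prod = ring.rsp_prod;
        if (cons == prod) {
            break;
        }
        for (; cons != prod; ++cons) {
            handle(cons);
            ++handled;
        }
        ring.rsp_cons = cons;
    }
    // Only after everything is processed, ask for a notification on the
    // next response.
    ring.rsp_event = cons + 1;
    return handled;
}
```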

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-26 17:17:45 +01:00
Gleb Natapov
4f4731c37b net: delay network stack creation
Network device has to be available when network stack is created, but
sometimes network device creation should wait for device initialization
by another cpu. This patch makes it possible to delay network stack
creation until network device is available.
2014-11-26 16:46:04 +02:00
Avi Kivity
87fdf52205 Merge branch 'clang' 2014-11-26 15:01:14 +02:00
Avi Kivity
e8894227bc xen: declare nr_ents higher to satisfy clang 2014-11-26 15:00:13 +02:00
Avi Kivity
8ce9697401 dhcp: wrap initializers with braces to prevent ambiguity 2014-11-26 14:59:49 +02:00
Asias He
1a1ff2a22a tcp: Fix get_isn
It should be microseconds instead of milliseconds.

Signed-off-by: Asias He <asias@cloudius-systems.com>
2014-11-26 13:26:54 +02:00
Asias He
fecf47b50a tcp: Defending against sequence number attacks
This patch implements initial sequence number generation algorithm per
RFC6528.
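
For reference, RFC 6528 defines ISN = M + F(localip, localport, remoteip, remoteport, secretkey), where M is a timer that ticks every 4 microseconds. A minimal sketch of the timer part follows. The function name make_isn is hypothetical, and the result of F is passed in as an already computed value because the choice of hash is outside this sketch.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: combines the 4-microsecond timer M with the
// precomputed hash F of the connection 4-tuple and a secret key.
inline uint32_t make_isn(std::chrono::microseconds now, uint32_t f) {
    // M advances by one every 4 microseconds and wraps modulo 2^32.
    auto m = static_cast<uint32_t>(now.count() / 4);
    return m + f;  // unsigned arithmetic wraps, as sequence numbers do
}
```

Taking the clock argument as std::chrono::microseconds keeps the unit explicit in the type, so a millisecond value cannot be passed without being converted.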
2014-11-26 12:34:16 +02:00
Gleb Natapov
cee8eb3121 net: remove unused function from net/native-stack.hh 2014-11-26 12:19:47 +02:00
Avi Kivity
9eea1752b0 Merge branch 'asias/tcp' of github.com:cloudius-systems/seastar-dev
TCP improvements from Asias.
2014-11-25 11:58:47 +02:00