The hand-rolled deleter chaining in packet::append was invalidated
by the make_free_deleter() optimization, since deleter->_next is no longer
guaranteed to be valid (and deleter::operator->() is still exposed, despite
that).
Switch to deleter::append(), which does the right thing.
Fixes a memory leak in tcp_server.
This patch adds a new class, distributed_device, which is responsible for
initializing the HW device and is shared between all cpus. The old device
class's responsibility becomes managing an rx/tx queue pair, and it is
local to each cpu. Each cpu has to call distributed_device::init_local_queue()
to create its own device. The logic to distribute cpus between the
available queues (in case there are not enough queues for all cpus)
currently lives in distributed_device but is not really implemented yet,
so only the single-queue and queues == cpus scenarios are supported for
now; this can be fixed later.
The plan is to rename "distributed_device" to "device" and "device"
to "queue_pair" in later patches.
If the card supports it (and it usually does), enable rx checksum
offloading by the card, and avoid calculating the checksums ourselves.
With rx checksum offloading, the card checks the IP header checksum and
the L4 (TCP or UDP) checksum of incoming packets, and gives us a flag
when one of them is wrong, meaning that we do not need to do these
calculations ourselves.
This patch improves memcached performance on my setup by almost 3%.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
- Query the port for its caps.
- Properly adjust the queue numbers according to the caps.
- Enable RSS only if the final number of queues is greater than 1.
- Enable Rx VLAN stripping.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Create a new class dpdk_eal that initializes DPDK EAL.
- Get rid of the portmask logic and instead provide a port index to the
  dpdk::net_device constructor.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
ipv4::handle_on_cpu() did not properly convert from network byte order, so
it saw any packet with DF=1 as fragmented.
Fix by applying the proper conversion.
If a TCP or UDP IP datagram is fragmented, only the first fragment
contains the port information. When a fragment without port information
is received, we have no idea which "stream" the fragment belongs to, and
thus no idea how to forward the packet.
To solve this problem, we use a "forward twice" method. When a fragment
of an IP datagram is received, we forward it using the
frag_id(src_ip, dst_ip, identification, protocol) hash. When all the
fragments have been received, we forward the reassembled datagram using
the connection_id(src_ip, src_port, dst_ip, dst_port) hash.
Move idle state management out of the smp poller back into generic code.
Each poller returns whether it did any useful work, and the generic code
decides whether to go idle based on that. If a poller requires constant
polling, it should always return true.
Currently each cpu creates its network device as part of native networking
stack creation, and all cpus create the native networking stack
independently, which makes it impossible to use data initialized by one
cpu in another cpu's network device initialization. For multiqueue
devices, some parts of the initialization often have to be handled by one
cpu, and all other cpus should wait for it before creating their network
devices. Even without multiqueue, proxy devices should be created after
the master device, so that a proxy device can get a pointer to the master
at creation time (the existing code uses a global per-cpu device pointer
and assumes that the master device is created on cpu 0 to compensate for
the lack of ordering).
This patch makes it possible to delay native networking stack creation
until the network device is created. It allows one cpu to be responsible
for creating network devices on multiple cpus. A single-queue device
initializes the master device on one cpu and calls the other cpus with a
pointer to the master device and its cpu id, which are used in proxy
device creation. This removes the need for the per-cpu device pointer and
the "master on cpu 0" assumption, since the master and slave devices now
know about each other and can communicate directly.
- Added a "dpdk-pmd" option:
- Defaults to FALSE.
- When TRUE, use the DPDK PMD drivers.
- Call the DPDK net_device creation function if the "dpdk-pmd" option is given.
- Added the DPDK networking backend options to the list of all options.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Currently only a single port and a single queue are supported.
- All DPDK EAL configuration is hard-coded in the dpdk_net_device constructor instead
of coming from the app parameters.
- No offload features are enabled.
- Tx: will spin in dpdk_net_device::send() until there is room in the HW ring
  for the current packet.
- Tx: copies data from the `packet` frags into the rte_mbuf's data.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
If we don't have split channels, we need to delete the relevant property.
Because xs_rm() returns true if the feature does not exist, simply
deleting all of them won't affect the transaction, so we don't need any
conditional test.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
We are adding everything we read into the features array. Because the
destructor removes everything in the features list, we end up removing
more than we should. Things like the mac address, handle, etc. should
never be deleted.
This is not a problem for OSv because usually, after the destructor is
called, the whole guest is down. But for userspace, the network card is
left there, and it will cease to work if we delete too much.
After we do that, the _features array serves only its original intent,
and it becomes redundant with the features nack.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
This is not required for OSv, but is required for userspace operation.
It won't work without it.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Glauber says:
"This patch yields a small performance boost. It is not complete, since the rest
of the performance work is still missing since half of that is in OSv.
But more importantly, it now works on AWS."
When the backend advertises "feature-rx-copy", the frontend should
register "request-rx-copy". The local hypervisor seems to be forgiving
about it, but the one in AWS is not, and doubly so: first, it doesn't
recognize these as the same; second, it refuses to connect the backend if
this feature is not advertised by the frontend.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The ring processing is almost the same for both rx and tx, with the
exception of the core of the action. We can actually unify them nicely
with some template programming.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
There are two things we can do that will lead to fewer interrupts being sent.
The first is to read the new rsp_cons value at the end of every iteration.
If the backend produces more frames in the meantime, we'll be able to
process them in the same round, without getting another interrupt.
The other is to set rsp_event only after all the frames are processed.
As a matter of fact, the tx and rx rings each did one of these, but not
the same one. The next patch will unify the ring code to avoid problems
like that in the future.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The network device has to be available when the network stack is created,
but sometimes network device creation must wait for device initialization
by another cpu. This patch makes it possible to delay network stack
creation until the network device is available.