The command line linking with DPDK's libraries looked like a cross between
a random character generator and black magic. Reading a bit on the DPDK
mailing list, it turns out there is method in this madness (flawed method,
but method nonetheless):
1. Instead of using "-l..." they used "-Wl,-l..." everywhere. Turns out
they did this ugliness to "hide" this option from libtool.
We don't use libtool, and don't need to hide anything from it.
2. They used "--start-group ... --end-group" to avoid having to figure
out the right link order.
It was easy to figure out the right link order and avoid this option.
3. They used "--whole-archive" on all the DPDK libraries. Unfortunately,
this option *is* needed, because the way DPDK is written, it is not
suited to be compiled into a (non-shared) library: Each of the DPDK
drivers ("librte_pmd_*") has a constructor function which needs to
run to register itself. This works fine with shared libraries (whose
constructors are run on load) but with a ".a" library, the whole
library is left out because nothing from the outside refers to any
of its symbols.
So what we should do is to use --whole-archive only on the PMD drivers,
and all of them will be fully compiled into the generated program. The rest of
the DPDK libraries will be linked normally, and hopefully because we
don't use large parts of DPDK, big chunks will not be compiled in.
If we don't add this "--whole-archive", none of the drivers will be
compiled into the program, the initialization will not be able to
find any driver, and just complain there are no ethernet ports.
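The self-registration pattern described above can be sketched as follows.
This is only an illustration, not DPDK's actual code: driver_registry(),
driver_registrar and "example_pmd" are hypothetical names. In a static
library, the object file holding the registrar would be dropped by the
linker unless something references it or it is wrapped in
--whole-archive ... --no-whole-archive.

```cpp
#include <string>
#include <vector>

// Hypothetical global registry of driver names. A function-local static
// avoids static-initialization-order problems between object files.
std::vector<std::string>& driver_registry() {
    static std::vector<std::string> drivers;
    return drivers;
}

// Hypothetical helper: constructing one of these registers a driver.
struct driver_registrar {
    explicit driver_registrar(const char* name) {
        driver_registry().push_back(name);
    }
};

// In a real driver this object would live in the driver's own object file.
// Nothing outside refers to it, which is exactly why a ".a" archive member
// containing it is left out unless --whole-archive forces it in.
static driver_registrar register_example_pmd("example_pmd");

// Lookup used at initialization time to find available drivers.
bool driver_registered(const std::string& name) {
    for (const auto& d : driver_registry()) {
        if (d == name) {
            return true;
        }
    }
    return false;
}
```

When everything is in one translation unit, as here, the constructor always
runs before main(); the problem only shows up once the registrar sits in an
archive member that the linker has no reason to pull in.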
After this patch, Seastar with DPDK still compiles, and runs.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Reviewed-by: Vlad Zolotarov <vladz@cloudius-systems.com>
With -fvisibility=hidden, all executable symbols are hidden from shared
objects, allowing more optimizations (especially with -flto). However, hiding
the allocator symbols means that memory allocated in the executable cannot
be freed in a library, since they will use different allocators.
Fix by exposing these symbols with default visibility.
Fixes crash loading some dpdk libraries.
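A minimal sketch of exposing symbols with default visibility in a build that
uses -fvisibility=hidden. The function names example_alloc and example_free
are hypothetical stand-ins; the real patch applies this to the allocator
entry points, which are not reproduced here.

```cpp
#include <cstddef>
#include <cstdlib>

extern "C" {

// Hypothetical allocator entry point. The attribute overrides
// -fvisibility=hidden for this symbol only, so shared objects loaded
// later resolve to the same definition as the executable.
__attribute__((visibility("default")))
void* example_alloc(std::size_t size) {
    return std::malloc(size);
}

// Matching hypothetical free function, exported the same way so that
// memory allocated on one side can be released on the other.
__attribute__((visibility("default")))
void example_free(void* p) {
    std::free(p);
}

}  // extern "C"
```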
Current code assumes that memory is at node level, but on non-NUMA
machines there is no node level at all. Instead of assuming the memory
location in the topology, search for it dynamically.
With gcc 4.9.2, the build with DPDK enabled breaks with an error like:
../dpdk-1.7.1/x86_64-native-linuxapp-gcc/include/rte_pci.h:99:37:
warning: invalid suffix on literal; C++11 requires a space between literal
and string macro [-Wliteral-suffix]
#define PCI_SHORT_PRI_FMT "%.2"PRIx8":%.2"PRIx8".%"PRIx8
The problem is that C++11, breaking decades of proud C-preprocessor
tradition, outlawed using a macro stuck directly to the end of a string
literal. But this is used in DPDK's header files, so we need to turn this
error into a warning (let's keep the warning, hopefully it will disappear
in newer versions of DPDK).
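For illustration, this is what the C++11-friendly spelling of such a macro
looks like, with a space between each string literal and the PRIx8 macro.
PCI_SHORT_PRI_FMT_SPACED and format_pci_short() are hypothetical names used
only in this sketch; they are not part of DPDK.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Same format as DPDK's PCI_SHORT_PRI_FMT, but with spaces so that C++11
// does not parse `"%.2"PRIx8` as a string with a user-defined suffix.
#define PCI_SHORT_PRI_FMT_SPACED "%.2" PRIx8 ":%.2" PRIx8 ".%" PRIx8

// Hypothetical helper that formats a short PCI address (bus:devid.function).
std::string format_pci_short(std::uint8_t bus, std::uint8_t devid,
                             std::uint8_t function) {
    char buf[16];
    std::snprintf(buf, sizeof(buf), PCI_SHORT_PRI_FMT_SPACED,
                  bus, devid, function);
    return buf;
}
```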
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
We now also need the "libcrypto++-dev" package to compile Seastar
(libcrypto++ is used by the TCP sequence number randomization...).
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Currently each cpu creates a network device as part of native networking
stack creation, and all cpus create their native networking stacks
independently, which makes it impossible to use data initialized by one cpu
in another cpu's network device initialization. For multiqueue devices,
some parts of the initialization often have to be handled by one cpu, and
all other cpus should wait for the first one before creating their network
devices. Even without multiqueue, proxy devices should be created after the
master device is created so that a proxy device can get a pointer to the
master at creation time (the existing code uses a global per-cpu device
pointer and assumes that the master device is created on cpu 0 to compensate
for the lack of ordering).
This patch makes it possible to delay native networking stack creation
until the network device is created. It allows one cpu to be responsible
for the creation of network devices on multiple cpus. A single-queue device
initializes the master device on one cpu and calls the other cpus with a
pointer to the master device and its cpu id, which are used in proxy device
creation. This removes the need for the per-cpu device pointer and the
"master on cpu 0" assumption from the code, since now the master device and
the slave devices know about each other and can communicate directly.
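The ordering described above can be sketched roughly like this. It is a
single-threaded model, not Seastar's actual code: example_device and
create_devices() are hypothetical, and the real implementation passes the
master pointer between cpus asynchronously.

```cpp
#include <memory>
#include <vector>

// Hypothetical per-cpu device. A proxy holds a pointer to the master and
// the master's cpu id; the master itself has a null master pointer.
struct example_device {
    unsigned cpu_id;
    const example_device* master;
    unsigned master_cpu;
};

// Create the master first on `master_cpu` (assumed to be < ncpus), then
// create a proxy on every other cpu, handing each one the master pointer
// and cpu id. No global per-cpu pointer and no "master on cpu 0" assumption.
std::vector<std::unique_ptr<example_device>>
create_devices(unsigned ncpus, unsigned master_cpu) {
    std::vector<std::unique_ptr<example_device>> devs(ncpus);
    devs[master_cpu].reset(new example_device{master_cpu, nullptr, master_cpu});
    const example_device* master = devs[master_cpu].get();
    for (unsigned cpu = 0; cpu < ncpus; ++cpu) {
        if (cpu == master_cpu) {
            continue;  // the master already exists
        }
        devs[cpu].reset(new example_device{cpu, master, master_cpu});
    }
    return devs;
}
```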
Use hwloc_get_next_obj_by_type() instead of directly following the cousin
list, and handle list wrap-around. Also fixed the use of an uninitialized
variable (I wonder why the compiler did not complain).
Current code crashes on an assert while dividing memory between cpus if the
number of cpus seastar is configured to use is smaller than the number of
available numa nodes. The reason is that seastar tries to use all available
memory, but considers only one numa node while dividing it. This patch makes
memory division a two-phase process: first, each cpu tries to grab as much
memory from its local node as it can; second, all free memory that is left
is divided between all cpus. The algorithm works like that to prevent one
cpu from stealing local memory from another cpu.
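A rough sketch of such a two-phase division, under assumptions that are not
taken from the patch: every cpu aims for an equal share of the total, and
divide_memory() with its parameters is a hypothetical helper rather than
Seastar's real function.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// node_free[n] is the free memory on numa node n; cpu_node[c] is the node
// local to cpu c (cpu_node must be non-empty). Returns memory per cpu.
std::vector<std::size_t> divide_memory(std::vector<std::size_t> node_free,
                                       const std::vector<std::size_t>& cpu_node) {
    std::size_t total = 0;
    for (std::size_t m : node_free) {
        total += m;
    }
    // Assumption: each cpu targets an equal share of all memory.
    const std::size_t target = total / cpu_node.size();
    std::vector<std::size_t> got(cpu_node.size(), 0);

    // Phase 1: each cpu grabs as much as it can from its local node only.
    for (std::size_t cpu = 0; cpu < cpu_node.size(); ++cpu) {
        std::size_t& local = node_free[cpu_node[cpu]];
        const std::size_t take = std::min(target, local);
        local -= take;
        got[cpu] = take;
    }

    // Phase 2: whatever is still free on any node tops up cpus that are
    // below their target. Local memory was already claimed in phase 1, so
    // no cpu steals another cpu's local share here.
    for (std::size_t cpu = 0; cpu < cpu_node.size(); ++cpu) {
        for (std::size_t& remaining : node_free) {
            const std::size_t take = std::min(target - got[cpu], remaining);
            remaining -= take;
            got[cpu] += take;
        }
    }
    return got;
}
```

With one cpu and two nodes, phase 1 alone would leave the second node unused;
phase 2 is what lets the single cpu pick up the remaining memory.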
dpdk support from Vlad:
"- Currently only a single port and a single queue are supported.
- All DPDK EAL configuration is hard-coded in the dpdk_net_device constructor instead
of coming from the app parameters.
- No offload features are enabled.
- Tx: will spin in the dpdk_net_device::send() till there is a place in the HW ring to
place a current packet.
- Tx: copy data from the `packet` frags into the rte_mbuf's data."
- Fixed the IP addresses swapping.
- Added cmdline parameters to choose between virtio and DPDK tests.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Added "dpdk-pmd" option:
- Defaulted to FALSE.
- When TRUE - use DPDK PMD drivers.
- Call for dpdk net_device creation function if dpdk-poll option is given
- Added DPDK networking backend options to all options list
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Currently only a single port and a single queue are supported.
- All DPDK EAL configuration is hard-coded in the dpdk_net_device constructor instead
of coming from the app parameters.
- No offload features are enabled.
- Tx: will spin in the dpdk_net_device::send() till there is a place in the HW ring to
place a current packet.
- Tx: copy data from the `packet` frags into the rte_mbuf's data.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Each "poller" registers a non-blocking callback which is then called in
every iteration of the reactor's main loop.
Each "poller"'s callback returns a boolean: if TRUE, the main loop is allowed to block
(e.g. in epoll()).
If any of the registered "pollers" returns FALSE, the reactor's main loop is forbidden to block
in the current iteration.
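The rule above can be sketched as follows. example_reactor is a hypothetical
class for illustration only; it follows the semantics described in this
message, not the actual reactor code.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical reactor holding the registered poller callbacks.
class example_reactor {
    std::vector<std::function<bool()>> _pollers;
public:
    // Register a non-blocking callback to be run on every loop iteration.
    void register_poller(std::function<bool()> poll) {
        _pollers.push_back(std::move(poll));
    }

    // Run one iteration's worth of polling. Every poller is called, even
    // after one has already returned false. The result is true only if
    // all pollers allow the main loop to block (e.g. in epoll()).
    bool poll_once() {
        bool may_block = true;
        for (auto& poll : _pollers) {
            if (!poll()) {
                may_block = false;
            }
        }
        return may_block;
    }
};
```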
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
The local variable used to read the ports won't be valid after we return from
the function. Moving it to be an instance member is not ideal, but it works if
we don't unmask the ports until we're ready signaling them all.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
If we don't have split channels, we need to delete the relevant property.
Because xs_rm() returns true if the feature does not exist, it won't affect the
transaction if we just delete all of them. Therefore we don't need to do any
conditional test.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
If there is some error opening the xenstore (for instance, if we run
without privileges), we should bail out or we will segfault later.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
We are adding everything we read into the features array. Because in the
destructor we will remove everything in the features list, we'll end up
removing more than we should. Things like the mac address, handle, etc, should
never be deleted.
This is not a problem for OSv because usually, after the destructor is called,
the whole guest is down. But for userspace, the network card is left there,
but will cease to work if we delete too much.
After we do that with the _features array (its original intent), it becomes
redundant with features nack.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
This is not required for OSv, but is required for userspace operation.
It won't work without it.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Glauber says:
"This patch yields a small performance boost. It is not complete, since the rest
of the performance work is still missing since half of that is in OSv.
But more importantly, it now works on AWS."
When the backend advertises "feature-rx-copy", the frontend should register for
"request-rx-copy". The local hypervisor seems to be forgiving about it, but the
one in AWS is not, and doubly so.
First, it doesn't recognize these as the same. And second, it refuses to
connect the backend if this feature is not advertised by the frontend.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The ring processing is almost the same for both rx and tx, with the exception
of the core of the action. We can actually unify them nicely with some use of
template programming.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
There are two things we can do that will lead to fewer interrupts being sent.
The first is to read the new rsp_cons value at the end of every interaction.
If the backend produces more frames in the meantime, we'll be able to process
them in the same round, without getting another interrupt.
The other is to set the rsp_event only after all the frames are processed.
As a matter of fact, both the tx and rx rings did one of them, but not the same
one. The next patch will unify the ring code to avoid problems like that in the
future.
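A simplified model of the two techniques, written as one template so the
same loop could serve both rx and tx. Everything here is an assumption made
for illustration: example_ring and process_ring() are hypothetical, the
sketch re-reads a producer index named rsp_prod after each pass, and it
ignores memory barriers and the final race check a real ring needs.

```cpp
// Hypothetical response ring indices, loosely modeled on a Xen-style ring.
struct example_ring {
    unsigned rsp_prod = 0;   // advanced by the backend as it produces frames
    unsigned rsp_cons = 0;   // advanced by us as we consume frames
    unsigned rsp_event = 0;  // backend interrupts us when it reaches this
};

// Process all pending responses, calling handle(index) for each one.
// Returns how many responses were processed in this round.
template <typename Handler>
unsigned process_ring(example_ring& ring, Handler handle) {
    unsigned processed = 0;
    for (;;) {
        // Technique 1: re-read the producer index after each pass so that
        // frames produced in the meantime are handled in the same round.
        const unsigned prod = ring.rsp_prod;
        if (ring.rsp_cons == prod) {
            break;
        }
        while (ring.rsp_cons != prod) {
            handle(ring.rsp_cons);
            ++ring.rsp_cons;
            ++processed;
        }
    }
    // Technique 2: ask for the next interrupt only after everything that
    // was visible has been processed.
    ring.rsp_event = ring.rsp_cons + 1;
    return processed;
}
```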
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
A network device has to be available when the network stack is created, but
sometimes network device creation has to wait for device initialization
by another cpu. This patch makes it possible to delay network stack
creation until the network device is available.