Since we control the capacity, we can force it to be a power of two,
and use masking instead of tests to handle wraparound.
A side benefit is that we don't have to allocate an extra element.
We're currently using boost::lockfree::consume_all() to consume
smp requests, but this has two problems:
1. consume_all() calls consume_one() internally, which means it accesses
the ring index once per message
2. we interleave calling the request function with accessing the ring, which
allows the other side to access the ring again, bouncing ring cache lines.
Fix by copying all available items in one go, using pop(array), and then
processing them afterwards.
We're currently using boost::lockfree::consume_all() to consume
smp completions, but this has two problems:
1. consume_all() calls consume_one() internally, which means it accesses
the ring index once per message
2. we interleave calling the completion function with accessing the ring, which
allows the other side to access the ring again, bouncing ring cache lines.
Fix by copying all available items in one go, using pop(array), and then
processing them afterwards.
Instead of incurring the overhead of pushing a message down the queue (two
cache line misses), amortize it over 16 messages (3/4 cache line misses per
batch).
Batch size is limited by poll frequency, so we should adjust that
dynamically.
If it needs to be resized, it will cause a deallocation on the wrong cpu,
so initialize it on the sending cpu.
Does not break with circular_buffer<>, but it's not going to be a
circular_buffer<> for long.
Instead of placing packets directly into the virtio ring, add them to
a temporary queue, and flush it when we are polled. This reduces
cross-cpu writes and kicks.
This is a little tricky, since we only know we want hugetlbfs after memory
has been initialized, so we start up in anonymous memory, and later
switch to hugetlbfs by copying it to hugetlb-backed memory and mremap()ing
it back into place.
This patch uses the NIC's capability to calculate in hardware the IP, TCP
and UDP checksums on outgoing packets, instead of us doing this on the
sending CPU. This can save us quite a bit of computation (especially for
the TCP/UDP checksum of full-sized packets), and avoids cache pollution on
the CPU when sending cold data.
On my setup this patch improves the performance of a single-cpu memcached
by 6%. Together with the recent patch for receive-side checksum offloading,
the total improvement is 10%.
This patch is somewhat complicated by the fact we have so many different
combinations of checksum-offloading capabilities: while virtio can only
offload layer-4 checksumming (tcp/udp), dpdk lets us offload both ip and
layer-4 checksum. Moreover, some packets are just IP but not TCP/UDP
(e.g., ICMP), and some packets are not even IP (e.g., ARP), so this
patch modifies a few of the hardware-features flags and the per-packet
offload-information flags to fit our new needs.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Since we switched temporary_buffer to malloc(), it no longer throws
an exception after running out of memory, which leads to a segfault
when dereferencing a null buffer.
The hand-rolled deleter chaining in packet::append was invalidated
by the make_free_deleter() optimization, since deleter->_next is no longer
guaranteed to be valid (and deleter::operator->() is still exposed, despite
that).
Switch to deleter::append(), which does the right thing.
Fixes a memory leak in tcp_server.
While make_free_deleter(nullptr) will function correctly,
deleter::operator bool() on the result will not.
Fix by checking for null, and avoiding the free deleter optimization in
that case -- it doesn't help anyway.
This patch adds a new class, distributed_device, which is responsible for
initializing the HW device and is shared between all cpus. The old device
class's responsibility becomes managing an rx/tx queue pair, and it is local
to each cpu. Each cpu has to call distributed_device::init_local_queue() to
create its own device. The logic to distribute cpus between the available
queues (in case there are not enough queues for all cpus) lives in
distributed_device but is not really implemented yet, so only the one-queue
and queues == cpus scenarios are supported currently; this can
be fixed later.
The plan is to rename "distributed_device" to "device" and "device"
to "queue_pair" in later patches.
If the start promise on the initial cpu is signaled before the other cpus
have constructed their networking stacks, collectd initialization crashes,
since it tries to create a UDP socket on all available cpus as soon as
the initial one is ready.
A future that does not carry any data (future<>) and its sibling (promise<>)
are heavily used in the code. We can optimize them by overlaying the
future's payload, which in this case can only be an std::exception_ptr,
with the future state, as a pointer and an enum have disjoint values.
This of course depends on std::exception_ptr being implemented as a pointer,
but as it happens, it is.
With this, sizeof(future<>) is reduced from 24 bytes to 16 bytes.