When we have an object acting as a resource guard for memory, we can convert
it into a deleter using
    make_deleter([obj = std::move(obj)] {})
Introduce a simpler interface
    make_object_deleter(std::move(obj))
for doing the same thing.
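A minimal sketch of how the shorter interface can wrap the longer one. The
deleter class below is a simplified stand-in written for this example (an
assumption, not the real type): it owns a callable and invokes it when the
deleter is destroyed.

```cpp
#include <memory>
#include <utility>

// Simplified stand-in for the real deleter type (assumption for this
// sketch): owns a type-erased callable that runs on destruction.
class deleter {
public:
    struct impl_base { virtual ~impl_base() = default; };
    explicit deleter(std::unique_ptr<impl_base> i) : _impl(std::move(i)) {}
private:
    std::unique_ptr<impl_base> _impl;
};

// Stand-in for make_deleter(): func is called when the deleter dies.
template <typename Func>
deleter make_deleter(Func func) {
    struct impl final : deleter::impl_base {
        Func func;
        explicit impl(Func f) : func(std::move(f)) {}
        ~impl() override { func(); }
    };
    return deleter(std::make_unique<impl>(std::move(func)));
}

// The simpler interface: the object is captured by an empty lambda, so it
// is destroyed together with the deleter.
template <typename Object>
deleter make_object_deleter(Object obj) {
    return make_deleter([obj = std::move(obj)] {});
}
```

The lambda body is empty on purpose; the only effect is that the captured
object lives exactly as long as the deleter does.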
Some (all?) RSS-capable HW provides us with the hash that was used to
select the rx queue the packet was delivered to. If such a hash is available,
it is better to use it to forward the packet instead of calculating the hash
ourselves and suffering cache misses.
This patch introduces logic to divide cpus between the available hw queue
pairs. Each cpu with a hw qp gets a set of cpus to distribute traffic
to. The algorithm doesn't take any topology considerations into account yet.
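One possible topology-unaware division, as a sketch only. The helper name
divide_cpus() and the assumption that cpus 0..nr_queues-1 are the ones that
own a hw queue pair are both made up for this example; the actual patch may
assign cpus differently.

```cpp
#include <vector>

// Hypothetical helper: cpus 0..nr_queues-1 each own a hw queue pair.
// Every cpu (owners included) is assigned to exactly one owner in
// round-robin order. Returns, per owner, the cpus it distributes to.
std::vector<std::vector<unsigned>> divide_cpus(unsigned nr_cpus, unsigned nr_queues) {
    std::vector<std::vector<unsigned>> targets(nr_queues);
    if (nr_queues == 0) {
        return targets; // no hw queues, nothing to divide
    }
    for (unsigned cpu = 0; cpu < nr_cpus; cpu++) {
        targets[cpu % nr_queues].push_back(cpu);
    }
    return targets;
}
```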
Instead of forward() deciding the packet destination, make it collect input
for the RSS hash function depending on packet type. After the data is
collected, use the Toeplitz hash function to calculate the packet's destination.
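For reference, a bit-at-a-time sketch of the Toeplitz hash over an already
collected input buffer. The function name and the use of std::vector are
choices made for this example; the key is expected to be at least 4 bytes
long and normally at least 4 bytes longer than the input.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toeplitz hash: for every set bit of the input (most significant bit
// first), XOR in the 32-bit window of the key that starts at that bit.
std::uint32_t toeplitz_hash(const std::vector<std::uint8_t>& key,
                            const std::vector<std::uint8_t>& data) {
    std::uint32_t hash = 0;
    // Window starts as the first 32 bits of the key.
    std::uint32_t window = (std::uint32_t(key[0]) << 24) | (std::uint32_t(key[1]) << 16)
                         | (std::uint32_t(key[2]) << 8) | std::uint32_t(key[3]);
    for (std::size_t i = 0; i < data.size(); i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit)) {
                hash ^= window;
            }
            // Slide the window by one bit, pulling in the next key bit
            // (zero once the key is exhausted).
            window <<= 1;
            if (i + 4 < key.size() && (key[i + 4] & (1u << bit))) {
                window |= 1;
            }
        }
    }
    return hash;
}
```

How the resulting 32-bit value is mapped to a cpu (for example through an
indirection table or a modulo) is left out here.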
Instead of returning a special value from forward() to broadcast an arp reply,
call arp.learn() on all cpus at the arp protocol level. The ability of
forward() to return a special value will be removed by later patches.
Currently dhcp assumes that cpu 0 gets all the packets and redistributes
them by itself. With multiqueue this is not necessarily the case, so the
current trick of disabling forwarding by installing a special dhcp forward()
function will not work. Rework it by installing a packet filter on all
cpus before running dhcp and forwarding all dhcp packets to cpu 0.
From Asias:
"Add a low resolution clock source in addition to what std::chrono provides.
With it we can reduce the expensive std::chrono::high_resolution_clock::now()
calls."
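A rough sketch of what such a clock could look like. The class name, the
millisecond granularity and the update() entry point are assumptions made for
this example; something else (for instance a periodic timer) is expected to
call update().

```cpp
#include <atomic>
#include <chrono>

// Hypothetical low resolution clock: now() only reads a cached value and
// never queries the system clock itself.
class lowres_clock {
public:
    using duration = std::chrono::milliseconds;
    using rep = duration::rep;
    using period = duration::period;
    using time_point = std::chrono::time_point<lowres_clock, duration>;

    // Cheap: a single relaxed atomic load.
    static time_point now() {
        return time_point(duration(cached().load(std::memory_order_relaxed)));
    }

    // Expensive part, done once per tick instead of once per caller.
    static void update() {
        auto t = std::chrono::steady_clock::now().time_since_epoch();
        cached().store(std::chrono::duration_cast<duration>(t).count(),
                       std::memory_order_relaxed);
    }

private:
    static std::atomic<rep>& cached() {
        static std::atomic<rep> value{0};
        return value;
    }
};
```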
We look at _poll mode in another cpu's cache accidentally, as part of
the peer->idle() call.
Fix by looking at our own _poll variable first; they should all be the same.
Futures are great for complicated asynchronous operations, but for a
synchronous operation like destroying a packet after transmit, or
converting a buffer to a packet during receive, they're overkill.
This patchset fixes those two cases in virtio, in which futures
are used as an abstraction layer between vring and the transmit/receive
queues, by converting vring into a template, so that the completion function
can be adjusted for the transmit or receive case during compile time instead
of at run time.
10% improvement on httpd with --smp 1, >20% with --smp 3.
Move completion handling (destroy packet, adjust descriptors count) to
a completion function rather than a future. Reduces allocations and tasks
executed.
Currently vring request completions are handled by fulfilling a promise
contained in the request. While promises are very flexible, this comes
at a cost (allocating and executing a task), and this flexibility is unneeded
when request handling is very regular (such as in virtio-net rx and tx
completion handling).
Make vring more flexible by allowing the completion function to be specified
as a template parameter. No changes to the actual users - they now specify
the completion function as fulfilling the same promise as vring previously
did.
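The idea can be sketched with a toy ring, shown below. This is not the real
vring: the class, its members and the use of plain request ids are stand-ins
invented for the example. The point is only that the completion is a
template parameter, so the call is resolved at compile time.

```cpp
#include <utility>
#include <vector>

// Toy stand-in for a ring (assumption): tracks outstanding request ids
// and reports each completion through a compile-time Completion type.
template <typename Completion>
class ring {
    Completion _complete;
    std::vector<unsigned> _in_flight;
public:
    explicit ring(Completion c) : _complete(std::move(c)) {}

    void post(unsigned id) { _in_flight.push_back(id); }

    // Invokes the completion directly; no promise, no task allocation.
    void process_completions() {
        for (unsigned id : _in_flight) {
            _complete(id);
        }
        _in_flight.clear();
    }
};
```

A user that still wants the old behaviour can pass a completion that
fulfils a promise, while a regular user such as a tx queue can pass a
function that just releases the packet and bumps a counter.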
wait_and_process() expects an std::function<>, but we pass it a lambda,
forcing it to allocate.
Prepare the std::function<> in advance, so it can be passed by reference.
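The pattern, sketched with made-up names: the free function below only
imitates the shape of wait_and_process() (its real signature is not shown
here), and poller is a hypothetical caller.

```cpp
#include <functional>

// Stand-in with an assumed signature: takes the callback by const
// reference to an std::function, so passing a lambda would build a
// temporary std::function on every call.
int wait_and_process(const std::function<int(int)>& fn) {
    return fn(1);
}

class poller {
    int _processed = 0;
    // Built once in the constructor and reused for every poll.
    std::function<int(int)> _process;
public:
    poller() : _process([this](int n) { _processed += n; return _processed; }) {}
    poller(const poller&) = delete; // _process captures this

    // Binds the reference parameter directly; no conversion per call.
    int poll() { return wait_and_process(_process); }
};
```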
Since we control the capacity, we can force it to be a power of two,
and use masking instead of tests to handle wraparound.
A side benefit is that we don't have to allocate an extra element.
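A small sketch of the masking technique, using a hypothetical fixed-size
ring (not the actual container touched by this patch). With free-running
indices, full and empty are distinguished by the difference between the two
counters, so every slot is usable.

```cpp
#include <array>
#include <cstddef>

template <typename T, std::size_t Capacity>
class pow2_ring {
    static_assert(Capacity != 0 && (Capacity & (Capacity - 1)) == 0,
                  "capacity must be a power of two");
    std::array<T, Capacity> _buf{};
    std::size_t _head = 0; // total number of items pushed
    std::size_t _tail = 0; // total number of items popped
public:
    bool empty() const { return _head == _tail; }
    bool full() const { return _head - _tail == Capacity; }

    bool push(const T& v) {
        if (full()) {
            return false;
        }
        _buf[_head++ & (Capacity - 1)] = v; // mask instead of "if (idx == cap)"
        return true;
    }

    bool pop(T& out) {
        if (empty()) {
            return false;
        }
        out = _buf[_tail++ & (Capacity - 1)];
        return true;
    }
};
```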
We're currently using boost::lockfree::consume_all() to consume
smp requests, but this has two problems:
1. consume_all() calls consume_one() internally, which means it accesses
the ring index once per message
2. we interleave calling the request function with accessing the ring, which
allows the other side to access the ring again, bouncing ring cache lines.
Fix by copying all available items in one shot, using pop(array), and then
processing them afterwards.
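The difference can be illustrated with a hand-written ring. Everything below
is a stand-in for this example (it is not the boost queue): spsc_ring::pop()
plays the role of pop(array), and process_batch() with its batch size of 16
is a hypothetical consumer.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical single-producer/single-consumer ring; N must be a power
// of two. Used only to illustrate batched consumption.
template <typename T, std::size_t N>
class spsc_ring {
    T _buf[N];
    std::atomic<std::size_t> _head{0}; // written by the producer
    std::atomic<std::size_t> _tail{0}; // written by the consumer
public:
    bool push(const T& v) {
        std::size_t h = _head.load(std::memory_order_relaxed);
        if (h - _tail.load(std::memory_order_acquire) == N) {
            return false; // full
        }
        _buf[h & (N - 1)] = v;
        _head.store(h + 1, std::memory_order_release);
        return true;
    }

    // Copies out up to M items; each shared index is touched once per
    // batch rather than once per message.
    template <std::size_t M>
    std::size_t pop(T (&out)[M]) {
        std::size_t t = _tail.load(std::memory_order_relaxed);
        std::size_t avail = _head.load(std::memory_order_acquire) - t;
        std::size_t n = avail < M ? avail : M;
        for (std::size_t i = 0; i < n; i++) {
            out[i] = _buf[(t + i) & (N - 1)];
        }
        _tail.store(t + n, std::memory_order_release);
        return n;
    }
};

// Drain a batch first, then run the handlers without touching the ring.
template <typename T, std::size_t N, typename Func>
std::size_t process_batch(spsc_ring<T, N>& ring, Func handle) {
    T items[16];
    std::size_t n = ring.pop(items);
    for (std::size_t i = 0; i < n; i++) {
        handle(items[i]);
    }
    return n;
}
```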
We're currently using boost::lockfree::consume_all() to consume
smp completions, but this has two problems:
1. consume_all() calls consume_one() internally, which means it accesses
the ring index once per message
2. we interleave calling the request function with accessing the ring, which
allows the other side to access the ring again, bouncing ring cache lines.
Fix by copying all available items in one shot, using pop(array), and then
processing them afterwards.
Instead of incurring the overhead of pushing each message down the queue (two
cache line misses), amortize it over 16 messages (3/4 cache line misses per
batch).
Batch size is limited by poll frequency, so we should adjust that
dynamically.
If it needs to be resized, it will cause a deallocation on the wrong cpu,
so initialize it on the sending cpu.
Does not break with circular_buffer<>, but it's not going to be a
circular_buffer<> for long.
Instead of placing packets directly into the virtio ring, add them to
a temporary queue, and flush it when we are polled. This reduces
cross-cpu writes and kicks.
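A sketch of the queue-then-flush pattern. The class name and the Ring
interface (post() and kick()) are assumptions for this example and do not
reflect the actual virtio code.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical wrapper: Ring is any type with post(Packet) and kick().
template <typename Packet, typename Ring>
class batching_txq {
    Ring& _ring;
    std::vector<Packet> _pending; // temporary queue, local to the sender
public:
    explicit batching_txq(Ring& ring) : _ring(ring) {}

    // Cheap: only touches local memory.
    void send(Packet p) { _pending.push_back(std::move(p)); }

    // Called from the poller: post everything, then kick once per batch.
    std::size_t flush() {
        std::size_t n = _pending.size();
        if (n == 0) {
            return 0;
        }
        for (auto& p : _pending) {
            _ring.post(std::move(p));
        }
        _pending.clear();
        _ring.kick();
        return n;
    }
};
```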