Some packets are processed by a cpu other than the one that allocated
them and their fragments. free_on_cpu() should be called on the cpu that
does the processing; it returns a packet that can be deleted by the
current cpu. This is done by copying the packet's packet::impl to a locally
allocated one and adding a new deleter that runs the old deleter on the
original cpu.
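
The cross-cpu hand-off described above can be sketched roughly as follows.
This is an illustration only, not the actual implementation: cpu_queue is a
hypothetical stand-in for whatever mechanism submits work to another cpu,
and the fragment and impl layouts are simplified assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for "run this task on the original cpu".
struct cpu_queue {
    std::vector<std::function<void()>> tasks;
    void submit(std::function<void()> t) { tasks.push_back(std::move(t)); }
    void run_pending() {
        auto pending = std::move(tasks);
        tasks.clear();
        for (auto& t : pending) t();
    }
};

struct fragment { char* base; std::size_t size; };

struct packet {
    // Simplified impl (assumption): fragment descriptors plus one deleter.
    struct impl {
        std::vector<fragment> frags;
        std::function<void()> deleter;
        ~impl() { if (deleter) deleter(); }
    };
    std::unique_ptr<impl> _impl;
};

// Called on the processing cpu. Returns a packet whose impl is allocated
// locally; destroying it ships the old impl back to `origin`, where the
// old deleter finally runs.
packet free_on_cpu(packet p, cpu_queue& origin) {
    auto local = std::make_unique<packet::impl>();
    local->frags = p._impl->frags;  // copy descriptors, not payload bytes
    // shared_ptr only because std::function needs a copyable callable.
    std::shared_ptr<packet::impl> old(std::move(p._impl));
    local->deleter = [old, &origin]() mutable {
        origin.submit([old = std::move(old)]() mutable { old.reset(); });
    };
    return packet{std::move(local)};
}
```

Destroying the returned packet only queues a task; the original memory is
released when the original cpu drains its queue.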

Instead of using internal_deleter, which is unwieldy, store the
header data inside packet::impl, which we're allocating anyway.
This adds some complication when we need to reallocate impl (if
the number of fragments overflows), but usually saves two allocations:
one for the internal_deleter and one for the data itself.
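
A minimal sketch of the idea, under stated assumptions: the buffer size
(internal_data_size = 64), the fixed four-entry fragment array, and the
helper names with_header() and reallocate() are all made up for this
example. It shows why reallocation gets more complicated: a fragment that
points into the old impl's inline buffer has to be re-pointed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

struct fragment { char* base; std::size_t size; };

struct impl {
    static constexpr std::size_t internal_data_size = 64;  // assumed size
    fragment frags[4] = {};       // fixed size to keep the sketch short
    unsigned nr_frags = 0;
    bool header_inline = false;   // true if frags[0] points into data[]
    char data[internal_data_size];

    // Copy the header into the inline buffer: no separate allocation for
    // the bytes, and no deleter object needed to free them.
    static std::unique_ptr<impl> with_header(const char* hdr, std::size_t len) {
        assert(len <= internal_data_size);
        auto n = std::make_unique<impl>();
        std::memcpy(n->data, hdr, len);
        n->frags[n->nr_frags++] = {n->data, len};
        n->header_inline = true;
        return n;
    }

    // Moving to a fresh impl must copy the inline bytes and fix up the
    // fragment that referred to the old buffer.
    static std::unique_ptr<impl> reallocate(std::unique_ptr<impl> old) {
        auto n = std::make_unique<impl>();
        n->nr_frags = old->nr_frags;
        for (unsigned i = 0; i < old->nr_frags; ++i) n->frags[i] = old->frags[i];
        n->header_inline = old->header_inline;
        if (n->header_inline) {
            std::memcpy(n->data, old->data, n->frags[0].size);
            n->frags[0].base = n->data;
        }
        return n;
    }
};
```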

Instead of having a std::vector<> manage the fragment array,
allocate it at the end of the impl struct and manage it manually.
The result isn't pretty but it does remove an allocation.
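
One way to place the fragment array at the end of the struct is shown
below. It is a sketch under assumptions, not the real layout: the field
names, the doubling growth policy, and the allocate()/append()/destroy()
helpers are illustrative. The struct is over-aligned so that the array
starting at `this + 1` is suitably aligned for fragment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

struct fragment { char* base; std::size_t size; };

// alignas makes sizeof(impl) a multiple of alignof(fragment), so the
// trailing array begins at a correctly aligned address.
struct alignas(fragment) impl {
    std::uint16_t nr_frags = 0;
    std::uint16_t allocated_frags;

    explicit impl(std::uint16_t capacity) : allocated_frags(capacity) {}

    // The fragment array lives directly behind the struct.
    fragment* frags() { return reinterpret_cast<fragment*>(this + 1); }

    // One allocation covers both the struct and `capacity` fragments.
    static impl* allocate(std::uint16_t capacity) {
        void* mem = ::operator new(sizeof(impl) + capacity * sizeof(fragment));
        return new (mem) impl(capacity);
    }

    static void destroy(impl* p) {
        p->~impl();
        ::operator delete(p);
    }

    // Appends a fragment, reallocating the whole impl when the array is
    // full. Returns the (possibly new) impl pointer.
    static impl* append(impl* p, fragment f) {
        if (p->nr_frags == p->allocated_frags) {
            auto cap = static_cast<std::uint16_t>(
                p->allocated_frags ? p->allocated_frags * 2 : 1);
            impl* n = allocate(cap);
            for (std::uint16_t i = 0; i < p->nr_frags; ++i) {
                new (&n->frags()[i]) fragment(p->frags()[i]);
            }
            n->nr_frags = p->nr_frags;
            destroy(p);
            p = n;
        }
        new (&p->frags()[p->nr_frags++]) fragment(f);
        return p;
    }
};
```

The manual bookkeeping is the "not pretty" part: growth means moving the
entire impl, not just the array.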

Move all data fields into an 'impl' struct (pimpl idiom) so that move()ing
a packet becomes very cheap. The downside is that we need an extra
allocation, but we can later recover that by placing the fragment array
in the same structure.
Even with the extra allocation, performance is up ~10%.
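
For illustration, a minimal pimpl-style packet might look like the sketch
below. The member names and the use of std::unique_ptr and std::vector are
assumptions made for this example. With all fields behind one pointer,
moving a packet is just a pointer transfer, regardless of how many
fragments it holds.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct fragment { char* base; std::size_t size; };

class packet {
    // All data fields live here; the packet itself is only a pointer.
    struct impl {
        std::size_t len = 0;
        std::vector<fragment> frags;
    };
    std::unique_ptr<impl> _impl;
public:
    packet() : _impl(std::make_unique<impl>()) {}  // the extra allocation
    packet(packet&&) noexcept = default;           // moves one pointer
    packet& operator=(packet&&) noexcept = default;

    void append(char* base, std::size_t size) {
        _impl->frags.push_back({base, size});
        _impl->len += size;
    }
    std::size_t len() const { return _impl->len; }
    std::size_t nr_frags() const { return _impl->frags.size(); }
    // Exposed only so the example can show that a move keeps the same impl.
    const void* impl_address() const { return _impl.get(); }
};
```

After `packet q = std::move(p);`, q owns the same impl object that p did
and p is left empty; no fragment data is copied.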