Instead of using inefficient std::ostream, use our own 'bytes' iterator class.
Compute the length of the byte buffer ahead of time, then serialize
the objects directly into it.
Gives a ~5x speedup over previous results (which sometimes didn't even
finish in a reasonable time).
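
For illustration only, a minimal sketch of the two-pass idea (size first,
then write through an iterator). It is not the code in this patch: the
std::vector<std::uint8_t> stand-in for 'bytes', the length-prefixed layout,
and the names serialized_size, write_value and serialize_all are all
assumptions made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the 'bytes' type: a contiguous byte buffer.
using bytes = std::vector<std::uint8_t>;

// Pass 1 helper: size of one value in the assumed wire format,
// a 4-byte length prefix followed by the payload.
inline std::size_t serialized_size(const std::string& v) {
    return sizeof(std::uint32_t) + v.size();
}

// Pass 2 helper: write one value through the iterator and return the
// advanced iterator. No stream state, no reallocation.
inline bytes::iterator write_value(bytes::iterator out, const std::string& v) {
    const auto len = static_cast<std::uint32_t>(v.size());
    // Big-endian length prefix (an assumption of this sketch).
    for (int shift = 24; shift >= 0; shift -= 8) {
        *out++ = static_cast<std::uint8_t>(len >> shift);
    }
    return std::copy(v.begin(), v.end(), out);
}

inline bytes serialize_all(const std::vector<std::string>& values) {
    // Pass 1: compute the exact buffer length ahead of time.
    std::size_t total = 0;
    for (const auto& v : values) {
        total += serialized_size(v);
    }

    // Single allocation of exactly the right size.
    bytes buf(total);

    // Pass 2: serialize every value into the preallocated buffer.
    auto out = buf.begin();
    for (const auto& v : values) {
        out = write_value(out, v);
    }
    assert(out == buf.end());  // the two passes must agree on the size
    return buf;
}
```

Calling serialize_all on {"ab", "c"} under these assumptions yields an
11-byte buffer: two 4-byte prefixes plus 3 payload bytes. The point of the
split is that the buffer is allocated once and never grows during writing.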
[avi: add missing include]