Recursion takes up space on the stack, which takes up space in caches,
leaving less room for useful data.
In addition, the limit on iteration count can be larger than the limit
on recursion depth, because iteration is not constrained by stack size.
Also, recursion makes flame graphs really hard to analyze, because
keep_doing() frames appear at different levels of nesting in the
profile, leading to many short "towers" instead of one big tower.
This change reuses the same counter for limiting iterations as is used
to limit the number of tasks executed by the reactor before polling.
A run-time parameter was added for controlling the task quota.
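A minimal sketch of the idea, with hypothetical names (not the actual reactor code): run steps in a loop bounded by the same quota counter the reactor uses for task execution, instead of recursing once per step.

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch: a loop keeps stack depth constant, so the
// iteration limit (the shared task quota) can far exceed any safe
// recursion depth. Returns the number of steps actually executed.
int run_until_quota(std::function<bool()> step, int& tasks_remaining) {
    int executed = 0;
    // Each completed step consumes one unit of the shared quota.
    while (tasks_remaining > 0 && step()) {
        --tasks_remaining;
        ++executed;
    }
    return executed;
}
```

When the quota is exhausted the loop returns to the caller, which in the real reactor would be the point where polling happens.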
Assuming the output_stream size is set to 8K, a sequence of writes of
lengths: 128B, 8K, 128B would yield three fragments of exactly those
sizes. This is suboptimal, as the same data could fit in just two
fragments of up to 8K each. This change makes the output_stream yield
8K and 256B fragments for this case.
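A minimal model of the new splitting behavior (names hypothetical, not the actual output_stream implementation): a write first tops up the current 8K buffer, and only a full buffer is emitted as a fragment, so 128B + 8K + 128B yields one 8192B fragment plus a 256B trailer on close.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of fragment coalescing in an 8K-buffered stream.
struct frag_model {
    size_t buf_size;               // e.g. 8192
    size_t buffered = 0;           // bytes in the current partial buffer
    std::vector<size_t> fragments; // sizes of emitted fragments

    void write(size_t len) {
        while (len) {
            size_t take = std::min(len, buf_size - buffered);
            buffered += take;
            len -= take;
            if (buffered == buf_size) {   // buffer full: emit a fragment
                fragments.push_back(buffered);
                buffered = 0;
            }
        }
    }
    void close() {                        // flush the partial trailer
        if (buffered) {
            fragments.push_back(buffered);
            buffered = 0;
        }
    }
};
```

With the old behavior each write became its own fragment; here the 8K write is split so its first 8064 bytes complete the buffer started by the 128B write.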
output_stream can be used by only one fiber at a time, so from a
correctness point of view it doesn't matter whether we set _end before
or after put(). Setting it before, however, lets us create one fewer
future, which is a win.
We store spans in freelist i if the span's size >= 2^i. However, when
picking a span to satisfy an allocation, we must use the next larger list
if the size is not a power of two, so that we can be sure that all spans on
that list can satisfy that request.
The current code doesn't do that, so it under-allocates, leading to memory
corruption.
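The index rule can be sketched as follows (hypothetical helper names, not the allocator's actual code): a span of `size` pages is stored on list floor(log2(size)), but an allocation whose size is not a power of two must search the next list up, where every span is guaranteed large enough.

```cpp
#include <cassert>
#include <cstddef>

// floor(log2(size)) for size >= 1.
inline unsigned log2floor(size_t size) {
    unsigned i = 0;
    while (size >>= 1) ++i;
    return i;
}

// Storing: list i holds spans with size >= 2^i, so use the floor.
inline unsigned index_for_store(size_t size) {
    return log2floor(size);
}

// Allocating: a list may hold spans as small as 2^i, so a non-power-of-
// two request must round up to the next list, whose smallest span
// (2^(i+1) pages) is still large enough.
inline unsigned index_for_alloc(size_t size) {
    unsigned i = log2floor(size);
    if (size & (size - 1)) ++i;   // not a power of two: round up
    return i;
}
```

The bug described above corresponds to using `index_for_store` on the allocation path: list 2 may contain a 6-page span and a 4-page span, and a 6-page request served from it can land on the 4-page span.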
It concatenates multiple string-like entities in one go and returns an
sstring. It performs at most one allocation for the final sstring and
one copy per input string. It works with heterogeneous arguments: both
sstrings and constant strings are supported, and string_views are
planned.
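A sketch of the technique using std::string in place of sstring (the names here are illustrative, not the actual API): sum the lengths first, reserve once, then copy each piece exactly once.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Length of each supported piece type.
inline size_t piece_len(const std::string& s) { return s.size(); }
inline size_t piece_len(const char* s) { return std::strlen(s); }

// One allocation (the reserve), one copy per argument, heterogeneous
// argument types. Requires C++17 fold expressions.
template <typename... Parts>
std::string concat(const Parts&... parts) {
    std::string out;
    out.reserve((piece_len(parts) + ... + 0));  // single allocation
    (out.append(parts), ...);                   // one copy per piece
    return out;
}
```

Because the total length is known before any copying starts, `append` never triggers a reallocation.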
The reactor is currently designed around the concept of file descriptors
and polling them. Every source of events is a file descriptor, and those
which are not, like timers, signals and inter-thread notifications, are
"converted" to file-descriptor events using timerfd, signalfd and eventfd
respectively.
But when running on OSv with a directly assigned virtio device, we
don't want to use file descriptors for notifications: having each
interrupt signal an eventfd is slow, and also problematic because file
descriptors involve locks, so we can't signal an eventfd at interrupt
time; the existing code works around this with an extra thread.
So this patch refactors the reactor to allow the main loop to be based
not just on file descriptors, but on a different type of abstraction.
We have a reactor_backend (with epoll and OSv implementations), to
which we don't add "file descriptors" but rather more abstract notions
like a timer, a signal, or a "notifier" (similar to eventfd). The Linux
epoll
implementation indeed uses file descriptors internally (with timer
using a timerfd, signal using signalfd and notifier using eventfd)
but the OSv implementation does not use file descriptors.
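A minimal sketch of the shape of this abstraction (names illustrative, not Seastar's actual interface): the main loop calls into a virtual backend, and each platform supplies its own implementation.

```cpp
#include <cassert>

// Hypothetical backend interface: the reactor loop knows only this.
struct reactor_backend {
    virtual ~reactor_backend() = default;
    // Block until events arrive, dispatch them, return the count.
    virtual int wait_and_process() = 0;
};

// A toy implementation standing in for epoll/OSv backends: it "has"
// a fixed number of pending events and drains them all in one call.
struct fake_backend : reactor_backend {
    int pending;
    explicit fake_backend(int n) : pending(n) {}
    int wait_and_process() override {
        int n = pending;
        pending = 0;
        return n;
    }
};

// The main loop is written against the interface only.
int run_loop(reactor_backend& b) {
    int total = 0, n;
    while ((n = b.wait_and_process()) > 0) total += n;
    return total;
}
```

The point is that nothing in `run_loop` mentions file descriptors; timerfd/signalfd/eventfd become private details of the epoll implementation.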
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
From Tomasz:
"There will be now a separate DB per core, each serving a subset of the key
space (sharding). From the outside it appears to behave as one DB."
Use like this:
engine.at_exit([] {
    std::cout << "so long!\n";
    return make_ready_future<>();
});
All lambdas will be executed when the reactor is stopped, in order, on
the same CPU on which they were registered.
The POSIX stack does not allow binding more than one socket to a given
port. The native stack, on the other hand, does. The way services are
set up depends on this: for instance, on the native stack one might
want to start the service on all cores, but on the POSIX stack only on
one of them.
Fixes assert failure during ^C:
#0 0x0000003e134348c7 in raise () from /lib64/libc.so.6
#1 0x0000003e1343652a in abort () from /lib64/libc.so.6
#2 0x0000003e1342d46d in __assert_fail_base () from /lib64/libc.so.6
#3 0x0000003e1342d522 in __assert_fail () from /lib64/libc.so.6
#4 0x0000000000409a7c in boost::intrusive::list_impl<boost::intrusive::mhtraits<timer, boost::intrusive::list_
at /usr/include/boost/intrusive/list.hpp:1263
#5 0x00000000004881cc in iterator_to (this=<optimized out>, value=...) at core/timer-set.hh:71
#6 reactor::del_timer (this=<optimized out>, tmr=tmr@entry=0x60000005cda8) at core/reactor.cc:287
#7 0x00000000004682a5 in ~timer (this=0x60000005cda8, __in_chrg=<optimized out>) at ./core/reactor.hh:974
#8 ~resolution (this=0x60000005cd90, __in_chrg=<optimized out>) at net/arp.hh:86
#9 ~pair (this=0x60000005cd88, __in_chrg=<optimized out>) at /usr/include/c++/4.9.2/bits/stl_pair.h:96
Currently a semaphore is used to keep track of free space in the smp
queue, but our semaphore does not guarantee that the order in which
tasks call wait() is the order in which they gain access to the
resource. This may cause packet reordering in smp, which is undesirable
for TCP performance. This patch replaces the semaphore with a simple
counter and another queue to hold items that cannot be placed into the
smp queue due to lack of space.
The current code (this will change soon with my reactor patches)
constructs a default (Posix) network stack before reactor::configure()
reassigns it to the requested network stack.
It turns out there is one place where we use the network stack before
calling reactor::configure(), which ends up using the Posix stack even
though we want the native stack - this is both silly and plainly
doesn't work on the OSv setup.
The problem is that app_template.hh tries to configure scollectd before
the engine is started. This calls scollectd::impl::start() which calls
engine.net().make_udp_channel(). When this happens this early, it creates
a Posix socket...
This patch moves the scollectd configuration to after the engine is
started. It makes sense to me: As far as I understand, scollectd is all
about sending (diagnostic) packets, and it's kind of silly to start
sending packets before starting the machinery that allows us to send
packets.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
[avi: use customary indentation, remove unneeded make_ready_future()]
Use it when you don't care about the result and just want to return a
future<>.
The current implementation may not be optimal, but it can be improved
later if the need arises.
It's more convenient for users that way. If someone wants to pass a
reference, we use a reference. If they pass an r-value, we accept it
and use the parameter as an l-value instead.
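This is the common by-value-parameter idiom; a minimal illustration with hypothetical names:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical example: accepting by value means an lvalue argument is
// copied and an rvalue argument is moved; inside the function, the
// parameter is an ordinary lvalue either way.
struct holder {
    std::string stored;
    void set(std::string name) {      // by-value parameter
        stored = std::move(name);     // move from our own lvalue copy
    }
};
```

Callers who still own their string pass it as-is and keep it; callers done with theirs can `std::move` it in and pay no copy.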
This patch adds "smp queue polling before going idle" to the reactor.
It avoids the signalfd overhead when the receiver thread is not idle at
the time a message is sent. With this patch on top of two other patches
from me that are still waiting to be committed, I see 450120
Requests/sec with wrk and "httpd -c 2 --network-stack native" on the
native stack. With one cpu the result is 316002, so we have around 40%
scaling. The bottleneck in this test is cpu 0, which takes 100% cpu
time.
This is useful for features that are provided incrementally, so may not
be present on all hypervisors. If the value is not present, return a
user-provided default, which also has a system-provided default (0).
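A small sketch of the lookup-with-default pattern described (hypothetical names and storage; the real code reads hypervisor-provided values):

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical: read an optionally-present value; absent keys fall
// back to a caller-supplied default, which itself defaults to 0.
inline int read_value(const std::map<std::string, int>& store,
                      const std::string& key, int def = 0) {
    auto it = store.find(key);
    return it != store.end() ? it->second : def;
}
```

Callers on older hypervisors that lack the key get `def` (or 0) instead of an error, which is what makes incremental feature rollout painless.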
We currently have one port per event channel. We need to have a list of
semaphores that will all be made ready when an interrupt kicks in. This is
useful in the case where both tx and rx are bound to the same event channel.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
If we do that, plus make it an instance method, we should be able to use
make_ready_port. This is consistent with the userspace implementation and
from that point any changes there will be propagated to both.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The representation of an event channel as an integer poses a problem, in which
waiting on an integer port doesn't work well when the same event channel is
assigned for both tx and rx. The future will be ready for one of the sides, but
we won't process the other.
One alternative is to have conditions in the future processing, and in case the
event channels are bound to the same port, process both events. But a better
solution is to use a class to represent the bound ports, and instances of those
classes will have their own pending methods.
Infrastructure will be written in a following patch to make sure that all
listeners to the same port will be made ready when an interrupt kicks
in.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Instead of returning a reference to a grant that is already present in an
array, defer the initialization. This is how the OSv driver handles it, and I
honestly am not sure if this is really needed: it seems to me we should be able
to just reuse the old grants. I need to check in the backend code if we can be
any smarter than this.
However, right now we need to do something to recycle the buffers, and just
re-doing the refs would lead to inconsistencies. So the best option for
now is to close and reopen the grants, and then later rework this in a
way that works for both the initial setup and the recycle.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Enhance gntref with some useful operations. Also provide a default object that
represents an invalid grant.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Some packets, like arp replies, are broadcast to all cpus for handling,
but only the packet structure is copied for each cpu; the actual packet
data is shared by all of them. Currently the networking stack mangles
the packet data during its travel up the stack while doing ntoh()
translations, which obviously cannot work for broadcast packets. This
patch fixes the code to not modify the packet data while doing ntoh(),
but to do it on a stack-allocated copy of the data instead.
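The fix can be sketched like this (a simplified two-field header, not the real ARP layout): copy the header out of the shared buffer, then byte-swap only the stack copy.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified header for illustration only.
struct arp_hdr {
    uint16_t htype;
    uint16_t ptype;
};

// Convert on a stack-allocated copy: the shared packet data that other
// cpus will also parse is never modified.
inline arp_hdr read_arp_hdr(const char* shared_data) {
    arp_hdr h;
    std::memcpy(&h, shared_data, sizeof(h));  // shared buffer untouched
    h.htype = ntohs(h.htype);
    h.ptype = ntohs(h.ptype);
    return h;
}
```

The old, broken pattern was the in-place equivalent (`hdr->htype = ntohs(hdr->htype)` directly on the shared buffer), which corrupts the data for every cpu that parses it afterwards.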
The Xen code registers a function that calls semaphore::signal as an
interrupt handler. However, that function is not smp safe and may
crash, and the events it generates are likely to be ignored, since they
are just appended to the reactor queue without any real wakeup of the
reactor thread.
Switch to using an eventfd. That's still unsafe, but a little better, since
its signalling is smp safe, and will cause the reactor thread to wake up
in case it was asleep.
With this, we are able to receive multiple packets.
We used gnttab_grant_foreign_access() instead of
gnttab_grant_foreign_access_ref(). While the two functions have similar
enough signatures, they do very different things.
With the change, we are able to receive packets from Xen, though we crash
immediately.