Commit Graph

865 Commits

Author SHA1 Message Date
Vlad Zolotarov
47b3721ccf reactor: added a "pollers" abstraction
Each "poller" registers a non-blocking callback which is then called in
every iteration of a reactor's main loop.

Each "poller"'s callback returns a boolean: if TRUE, the main loop is allowed to block
(e.g. in epoll()).

If any of the registered "pollers" returns FALSE, the reactor's main loop is forbidden to
block in the current iteration.
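The scheme can be sketched as follows (illustrative names, not Seastar's actual API): the loop polls every callback each iteration and only blocks when all of them report idle.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Sketch of the poller idea: each registered callback returns true
// when it is idle (the loop may block) and false when it made
// progress (the loop must poll again immediately).
class reactor_sketch {
    std::vector<std::function<bool()>> _pollers;
public:
    void register_poller(std::function<bool()> cb) {
        _pollers.push_back(std::move(cb));
    }
    // Runs one iteration of the main loop; returns true when the loop
    // is allowed to block (e.g. in epoll()) because every poller
    // reported idle.
    bool run_once() {
        bool may_block = true;
        for (auto& p : _pollers) {
            if (!p()) {
                may_block = false;
            }
        }
        return may_block;
    }
};
```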

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2014-11-30 12:12:39 +02:00
Asias He
88a1a37a88 ip: Support IP fragmentation in TX path
Tested with UDP sending large datagrams with ufo off.
2014-11-30 10:16:38 +02:00
Avi Kivity
f4daca803d Merge branch 'glommer/xen' of github.com:cloudius-systems/seastar-dev
Xen fixes (userspace + osv) from Glauber.
2014-11-29 14:06:27 +02:00
Glauber Costa
2cf187590f xen: fix userspace interrupts
The local variable used to read the ports won't be valid after we return from
the function. Moving it to be an instance member is not ideal, but it works if
we don't unmask the ports until we're ready to signal them all.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-28 14:23:14 +01:00
Glauber Costa
b3c163e603 xen: fix typo in event channel detection
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-28 14:15:36 +01:00
Glauber Costa
c3ae30b760 xen: delete event channel as well
If we don't have split channels, we need to delete the relevant property.
Because xs_rm() returns true if the feature does not exist, deleting all of
them won't affect the transaction, so we don't need any conditional test.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-27 18:00:35 +01:00
Glauber Costa
a4667c48e6 xen: fix gntalloc for userspace
It broke when we changed things to accommodate OSv's functions. The following
code works.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-27 18:00:35 +01:00
Glauber Costa
f06233695c xenstore: bail on error
If there is an error opening the xenstore - for instance, if we run
without privileges - we should bail out, or we will segfault later.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-27 18:00:35 +01:00
Glauber Costa
3848130f2f xen: only add features to feature array
We are adding everything we read into the features array. Because the
destructor removes everything in the features list, we'll end up removing more
than we should. Things like the mac address, handle, etc., should never be
deleted.

This is not a problem for OSv because usually, after the destructor is called,
the whole guest is down. But for userspace, the network card is left there,
but will cease to work if we delete too much.

After we do that with the _features array - its original intent - it becomes
redundant with the features nack.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-27 18:00:35 +01:00
Glauber Costa
bd8a18c178 xen: unmask event channels when setup is ready
This is not required for OSv, but is required for userspace operation.
It won't work without it.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-27 18:00:35 +01:00
Avi Kivity
861957e5ba Merge branch 'glommer/xen' of github.com:cloudius-systems/seastar-dev
Glauber says:

"This patch yields a small performance boost. It is not complete, since the rest
of the performance work is still missing since half of that is in OSv.

But more importantly, it now works on AWS."
2014-11-26 18:30:26 +02:00
Glauber Costa
b56a89d5c9 xen: translate feature name
When the backend advertises "feature-rx-copy", the frontend should register for
"request-rx-copy". The local hypervisor seems to be forgiving about it, but the
one in AWS is not, and doubly so.

First, it doesn't recognize these as the same. And second, it refuses to
connect the backend if this feature is not advertised by the frontend.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-26 17:22:58 +01:00
Glauber Costa
a9a79e3ba6 xen: ring unification
The ring processing is almost the same for both rx and tx, except for the core
of the action. We can actually unify them nicely with some use of
template programming.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-26 17:21:09 +01:00
Glauber Costa
e7c9aeb8a5 xen: interrupt mitigation
There are two things we can do that will lead to fewer interrupts being sent.
The first is to read the new rsp_cons value at the end of every iteration.
If the backend produces more frames in the meantime, we'll be able to process
them in the same round, without getting another interrupt.

The other is to set rsp_event only after all the frames are processed.

As a matter of fact, both the tx and rx rings did one of them, but not the same
one. The next patch will unify the ring code to avoid problems like that in the
future.
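The two mitigations can be sketched against the standard Xen shared-ring layout (a simplified, single-threaded stand-in modeled on the common ring macros, not the actual driver code):

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the mitigation scheme: re-check the producer index after
// draining, and only re-arm the event (interrupt request) at the end.
struct ring_sketch {
    std::atomic<uint32_t> rsp_prod{0};  // written by the backend
    uint32_t rsp_cons = 0;              // our consumed index
    uint32_t rsp_event = 0;             // backend interrupts when prod passes this

    // Processes all available responses; returns how many were handled.
    uint32_t process() {
        uint32_t n = 0;
        uint32_t prod = rsp_prod.load(std::memory_order_acquire);
        do {
            while (rsp_cons != prod) {
                ++rsp_cons;  // consume one response here
                ++n;
            }
            // Mitigation 1: re-read the producer index. If the backend
            // produced more frames meanwhile, handle them in the same
            // round without another interrupt.
            prod = rsp_prod.load(std::memory_order_acquire);
        } while (rsp_cons != prod);
        // Mitigation 2: request the next interrupt only now, after
        // all frames are processed.
        rsp_event = rsp_cons + 1;
        return n;
    }
};
```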

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-26 17:17:45 +01:00
Gleb Natapov
4f4731c37b net: delay network stack creation
The network device has to be available when the network stack is created, but
sometimes network device creation must wait for device initialization by
another cpu. This patch makes it possible to delay network stack creation
until the network device is available.
2014-11-26 16:46:04 +02:00
Avi Kivity
87fdf52205 Merge branch 'clang' 2014-11-26 15:01:14 +02:00
Avi Kivity
e8894227bc xen: declare nr_ents higher to satisfy clang 2014-11-26 15:00:13 +02:00
Avi Kivity
8ce9697401 dhcp: wrap initializers with braces to prevent ambiguity 2014-11-26 14:59:49 +02:00
Avi Kivity
58487b55d4 smp: massage init captures to satisfy clang 2014-11-26 14:59:03 +02:00
Avi Kivity
44c3e9fc04 collectd: wrap initializers with braces
Helps prevent ambiguity with constructors that accept multiple parameters.
2014-11-26 14:57:58 +02:00
Avi Kivity
c30b3e93c2 reactor: massage collectd registrations to satisfy clang
Clang warns of an unused variable, even though the destructor has side effects.
2014-11-26 14:57:02 +02:00
Avi Kivity
9ab5dce5c4 memory: fix throw specifiers on sized delete
Noticed by clang.
2014-11-26 14:56:40 +02:00
Avi Kivity
9c7fc9d5d1 memcache: massage init capture to satisfy clang 2014-11-26 14:56:18 +02:00
Avi Kivity
05e8ee5e0c memcache: remove unneeded use of variable length array
Noticed by clang.
2014-11-26 14:55:30 +02:00
Avi Kivity
239f4a3bf5 memcache: remove unused subdevice::_length
Noticed by clang.
2014-11-26 14:55:01 +02:00
Asias He
1a1ff2a22a tcp: Fix get_isn
It should be microseconds instead of milliseconds.

Signed-off-by: Asias He <asias@cloudius-systems.com>
2014-11-26 13:26:54 +02:00
Asias He
fecf47b50a tcp: Defending against sequence number attacks
This patch implements the initial sequence number generation algorithm per
RFC6528.
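The RFC 6528 scheme is ISN = M + F(localip, localport, remoteip, remoteport, secretkey), where M is a clock and F a keyed hash of the connection 4-tuple. A hedged sketch (std::hash stands in for the cryptographic hash a real implementation must use; all names are illustrative):

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <string>

// F: keyed hash over the connection 4-tuple plus a secret.
uint32_t isn_hash(uint32_t saddr, uint16_t sport, uint32_t daddr,
                  uint16_t dport, const std::string& secret) {
    std::string input = std::to_string(saddr) + ':' + std::to_string(sport) +
                        '>' + std::to_string(daddr) + ':' +
                        std::to_string(dport) + '#' + secret;
    return uint32_t(std::hash<std::string>{}(input));
}

// ISN = M + F(...), with M a microsecond-granularity clock
// (see the get_isn microseconds fix above).
uint32_t generate_isn(uint32_t saddr, uint16_t sport, uint32_t daddr,
                      uint16_t dport, const std::string& secret) {
    using namespace std::chrono;
    uint32_t m = uint32_t(duration_cast<microseconds>(
        steady_clock::now().time_since_epoch()).count());
    return m + isn_hash(saddr, sport, daddr, dport, secret);
}
```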
2014-11-26 12:34:16 +02:00
Gleb Natapov
01e9410adc smp: move thread creation sync point after start_all_queues()
Configure all smp queues before calling engine.configure() so that
engine.configure() may use submit_to() api. Note that messages will
still be processed only after engine.run() is executed.
2014-11-26 12:20:04 +02:00
Gleb Natapov
cee8eb3121 net: remove unused function from net/native-stack.hh 2014-11-26 12:19:47 +02:00
Avi Kivity
33ed01d354 Merge branch 'flashcache' of github.com:cloudius-systems/seastar-dev
From Raphael:

"Flashcache is basically an extension of memcache where a flash device is used to achieve a considerably higher cache hit ratio (~130x better).

Flashcache major additions:
-----
* The flashcache device length is divided by the number of CPUs, and each portion is then assigned to a per-cpu cache.

* Let me readily mention that whole items aren't stored on disk, only their data. Keys always remain stored in memory.

* Each item now has a state field that describes its status.
* Each item can be in any of the following states:
- MEM (Item is stored only in memory)
- TO_MEM_DISK (Transition from MEM to MEM_DISK state)
- MEM_DISK (Item is stored both in memory and on disk)
- DISK (Item is stored only on disk)
- ERASED (Item was invalidated)

* Algorithm added to balance items between MEM and MEM_DISK state.
* Three LRU lists were added to keep track of MEM, MEM_DISK and DISK items.
* When an item is ERASED, it shouldn't be in any of the lists above.
* When the working set fits in memory, items should only be stored in the MEM and MEM_DISK lists.

* Upon a SET request, the ratio of MEM and MEM_DISK (MEM_DISK / (MEM + MEM_DISK)) is taken into account to decide whether or not an LRU item should be moved to the MEM_DISK state (this consists of scheduling an LRU item to be stored on disk; its data field remains intact).

* Before an item is scheduled to be moved to the MEM_DISK state, it's set to the transition state called TO_MEM_DISK. Why? Basically to handle client requests on transitioning items. Example: for GET requests, we only serve the data as long as it remains intact.

* Upon memory pressure, a specialized reclaiming function is called to do the following:
get an LRU item from the MEM_DISK list that has no readers (i.e. refcount is zero); remove it from the MEM_DISK list; erase the data; set its state to DISK.
The steps above are executed repeatedly until the requested amount of memory is reclaimed.

* Upon a GET request on a DISK item, a per-item semaphore is used to guarantee that the first request will proceed with the loading of the data from the flash device, while the others wait for the process to complete.

* The ERASED state is used to inform flashcache that an item was invalidated and thus shouldn't be moved to any list. E.g. an invalidation request could happen while an item's data is being loaded from disk.
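The reclaim transition above can be sketched with a small state machine (hypothetical names, mirroring this message rather than the actual flashcache code):

```cpp
// Item states as described above; names are illustrative.
enum class item_state { MEM, TO_MEM_DISK, MEM_DISK, DISK, ERASED };

// Reclaim step: a MEM_DISK item with no readers drops its in-memory
// data and becomes DISK-only; anything else is left untouched.
item_state reclaim(item_state s, int refcount) {
    if (s == item_state::MEM_DISK && refcount == 0) {
        return item_state::DISK;
    }
    return s;
}
```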

Result:
-----

Performance is worse (unfortunate but also expected, because of the time spent waiting for items to be loaded), but the hit ratio is considerably better, as also expected. I'm thinking of adding a new state for items called LOADED: when the data from an item is loaded from disk, mark the item as LOADED; insert it into the MEM list; and schedule an item from the MEM list to be moved to the MEM_DISK list. That may bring a good performance benefit, though I have no data to back up my claim. For the time being, a loaded item is moved directly to the MEM_DISK list (as its data is already stored on disk), where it could then be quickly evicted upon memory pressure.

$ sudo ./memcached --stats --device /dev/sdb --mem 600M (POSIX stack)

* MEMCACHE - TCP:
$ memaslap -T 4 -s 127.0.0.1 -t 60s -c 256
servers : 127.0.0.1
threads count: 4
concurrency: 256
run time: 60s
windows size: 10k
set proportion: set_prop=0.10
get proportion: get_prop=0.90
cmd_get: 6310281
cmd_set: 701266
get_misses: 1783262
written_bytes: 1216572735
read_bytes: 5039263122
object_bytes: 762977408

Run time: 60.0s Ops: 7011547 TPS: 116837 Net_rate: 99.4M/s

* FLASHCACHE - TCP:
$ memaslap -T 4 -s 127.0.0.1 -t 60s -c 256
servers : 127.0.0.1
threads count: 4
concurrency: 256
run time: 60s
windows size: 10k
set proportion: set_prop=0.10
get proportion: get_prop=0.90
cmd_get: 3067576
cmd_set: 340959
get_misses: 13576
written_bytes: 591452430
read_bytes: 3392472330
object_bytes: 370963392

Run time: 60.0s Ops: 3408535 TPS: 56804 Net_rate: 63.3M/s"
2014-11-25 13:42:14 +02:00
Raphael S. Carvalho
35f37a4235 memcache: generate flashcache
flashcached.cc and memcached.cc files were created to generate
flashcached and memcached respectively through a template parameter.
2014-11-25 09:10:33 -02:00
Raphael S. Carvalho
300b310a27 memcache: move ./memcached.cc to ./memcache.cc
Actual purpose is explained by the subsequent commit.
2014-11-25 09:10:33 -02:00
Raphael S. Carvalho
087038bd47 memcache: flashcache integration
flashcached isn't generated by the build process yet, please check
subsequent commits.
2014-11-25 09:05:13 -02:00
Avi Kivity
8b632ca2fb Revert bogus allocator/deleter commits
Reverts:
e0df395124
   "Add make_free_deleter"
cfd8a1f997
   "core: special-case deleter for raw memory"
5eaecc8805
   "Use default allocator"

Introduced accidentally.
2014-11-25 12:06:13 +02:00
Avi Kivity
9eea1752b0 Merge branch 'asias/tcp' of github.com:cloudius-systems/seastar-dev
TCP improvements from Asias.
2014-11-25 11:58:47 +02:00
Asias He
bd0849f40b tcp: Send ACK immediately when a segment that fills a gap arrives
See RFC5681: 3.2. Fast Retransmit/Fast Recovery for more details.

"""
In addition, a TCP receiver SHOULD send an immediate ACK when the
incoming segment fills in all or part of a gap in the sequence space.
"""
2014-11-25 16:31:42 +08:00
Asias He
e1f4499b28 tcp: Send ACK immediately when out of order segment arrives
See RFC5681: 3.2. Fast Retransmit/Fast Recovery for more details.
2014-11-25 15:59:48 +08:00
Avi Kivity
49c17db25e Merge branch 'glommer/xen' of github.com:cloudius-systems/seastar-dev
From Glauber:

"Before those patches, Xen was not surviving a full round of wrk. Now it
survives a 20min one. That doesn't mean it is devoid of bugs: I am still seeing
some warnings being generated, so there is definitely more work to do. But at
least it doesn't crash and is stable.

Performance-wise, Xen+OSv fares at 32k req/sec on my laptop, where lwan does
45k"
2014-11-25 09:55:12 +02:00
Gleb Natapov
8a754386c2 net: remove unused variable in native_network_stack 2014-11-25 09:54:44 +02:00
Asias He
e14674ff3c tcp: Improve merge_out_of_order
In case of seg_beg > _rcv.need, we can stop looking, since seg_beg can only
grow.
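The early exit can be sketched like this (illustrative types; real TCP sequence arithmetic must handle wrap-around, which is ignored here). The out-of-order segments are kept sorted by starting sequence number, so once one starts beyond what we still need, none of the later ones can merge either:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>

// Merge buffered out-of-order segments [beg, end) into the receive
// stream; `need` is the next sequence number we are waiting for.
// Returns the updated `need` after merging contiguous segments.
uint32_t merge_contiguous(std::map<uint32_t, uint32_t>& ooo, uint32_t need) {
    for (auto it = ooo.begin(); it != ooo.end(); ) {
        uint32_t seg_beg = it->first;
        uint32_t seg_end = it->second;
        if (seg_beg > need) {
            break;  // keys only grow from here: stop looking
        }
        need = std::max(need, seg_end);
        it = ooo.erase(it);
    }
    return need;
}
```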
2014-11-25 15:44:59 +08:00
Asias He
2a3ce92b19 tcp: Reduce maximum delayed ACK timer
The maximum delayed ACK timer allowed by RFC1122 is 500ms; most
implementations use 200ms by default, including Windows and Linux.
2014-11-25 15:39:33 +08:00
Glauber Costa
dd8c5a3521 xen: fix index calculation
The xen protocol works by filling positions in a circular ring. The
indexes become free to be used again when they are processed by the other side.

There is a problem, however: those indexes must be sequential, because all that
the two sides share is a produced / consumed index. But there are situations in
which we call get_index() - which produces an index X - but the .then() clause
schedules some other caller of send() to run in our place. That one, in turn,
can call get_index(), then create a packet with index X + 1 that will be put in
the ring before the packet with index X.

If the other end processes this packet very fast, it will respond saying "I
have processed packets up to X + 1". We will act on that by marking X as
processed as well - since it comes before X + 1 - and when X is really
processed, chaos will ensue.

The solution for that is to just have the semaphore count how many spaces we
have in the ring. Once we guarantee that the current caller has space, we then
compute get_index() inside the .then() clause. This works well because the
indexes are all sequential anyway.

For the same reason, we are actually able to remove the queue, and resort to a
simple counter. Once we know there is room, we just get the next index,
whatever it may be.
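The scheme can be approximated with a synchronous stand-in (Seastar's real code uses a futures-based semaphore; all names here are illustrative):

```cpp
#include <cstdint>

// A counting semaphore guards ring slots, and the index is computed
// only after a slot is granted, so indexes are handed out strictly
// in submission order.
struct ring_slots {
    uint32_t free_slots;     // semaphore counter: free entries in the ring
    uint32_t next_index = 0; // simple counter replacing the old queue

    explicit ring_slots(uint32_t size) : free_slots(size) {}

    // Stand-in for semaphore::wait(1): succeeds if a slot is free.
    bool try_acquire() {
        if (free_slots == 0) {
            return false;
        }
        --free_slots;
        return true;
    }
    // Called only inside the continuation, after acquisition, so the
    // produced indexes are sequential.
    uint32_t get_index() { return next_index++; }

    // The backend consumed an entry: signal the semaphore.
    void release() { ++free_slots; }
};
```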

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-24 17:01:14 +01:00
Glauber Costa
3f67c12925 xen: make idx method static
It does not depend on any instance member.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-24 16:44:05 +01:00
Glauber Costa
3c195d25e6 xen: useful assert
We can't reach this place with a negative ref id, so let's assert to make sure
we're fine. This helps catch some bugs.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-24 16:42:55 +01:00
Glauber Costa
fa252087c4 xen: use the right index
The index in the ring and the packet id tend to be the same. But they don't
have to be. There are some situations where the backend and the frontend get
out of sync with this, and that is totally valid.

One example is when the backend skb already has enough room to hold all of the
data being transmitted (netback.c, line 1611 @3.16). The netback will respond
immediately, even though there are other pending packets that are not yet fully
processed.

The ring index, then, must come from the rsp value, not from the req/rsp id.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2014-11-24 16:38:38 +01:00
Asias He
e0df395124 Add make_free_deleter 2014-11-24 18:16:25 +08:00
Asias He
cfd8a1f997 Revert "core: special-case deleter for raw memory"
This reverts commit f75d1822cc.
2014-11-24 18:16:25 +08:00
Asias He
5eaecc8805 Use default allocator 2014-11-24 18:16:25 +08:00
Asias He
35186f659a tcp: Fix transmission
When a bulk of data is passed from the user application, the TCP layer calls
output only once to send data. This slows TX a lot, because output will send
at most MSS bytes of data while we might have way more than MSS to send. We
will send again only after the remote acks the data we just sent. This
slowness can be seen easily with tso turned off.

To fix, we should send as much as we are allowed to. This patch boosts
TX bandwidth from 0.N MiB/Sec to hundreds of MiB/Sec.
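The fix can be sketched as a drain loop (numbers and names are illustrative, not the actual stack's): instead of emitting a single MSS-sized segment per output call, keep sending until the window or the buffered data is exhausted.

```cpp
#include <algorithm>
#include <cstdint>

struct tx_state {
    uint64_t unsent;      // bytes queued by the application
    uint64_t window;      // bytes the peer currently allows in flight
    uint32_t mss;         // maximum segment size
    int segments_sent = 0;

    // Old behavior: one segment per call. New behavior: drain.
    void output_all() {
        while (unsent > 0 && window > 0) {
            uint64_t seg = std::min<uint64_t>({unsent, window, mss});
            unsent -= seg;
            window -= seg;
            ++segments_sent;  // hand the segment to the NIC here
        }
    }
};
```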

Before:
[asias@hjpc pingpong]$ go run client-txtx.go
Server:  192.168.66.123:10000
Connections:  1
Bytes Received(MiB):  10
Total Time(Secs):  76.217338072
Bandwidth(MiB/Sec):  0.13120374252054473

After:
[asias@hjpc pingpong]$ go run client-txtx.go
Server:  192.168.66.123:10000
Connections:  1
Bytes Received(MiB):  100
Total Time(Secs):  0.5105951040000001
Bandwidth(MiB/Sec):  195.84989988466475
2014-11-24 11:54:33 +02:00
Asias He
18565277f3 tcp: Add congestion control support
This patch adds congestion control to our TCP according to RFC5681. Four
algorithms are added: slow start, congestion avoidance, fast retransmit, and
fast recovery.
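The RFC 5681 window update on a new ACK can be sketched as follows (a simplified stand-in with illustrative field names, not Seastar's actual code): exponential growth below ssthresh (slow start), roughly one MSS per RTT above it (congestion avoidance), and window reduction on triple duplicate ACK (fast retransmit / fast recovery entry).

```cpp
#include <algorithm>
#include <cstdint>

struct cwnd_state {
    uint32_t cwnd;      // congestion window, bytes
    uint32_t ssthresh;  // slow start threshold, bytes
    uint32_t mss;       // maximum segment size

    void on_ack(uint32_t bytes_acked) {
        if (cwnd < ssthresh) {
            // Slow start: grow by the amount acked, at most MSS per ACK.
            cwnd += std::min(bytes_acked, mss);
        } else {
            // Congestion avoidance: cwnd += MSS*MSS/cwnd per ACK,
            // i.e. about one MSS per round trip.
            cwnd += std::max<uint32_t>(
                1, uint32_t(uint64_t(mss) * mss / cwnd));
        }
    }
    void on_triple_dup_ack() {
        // Fast retransmit / fast recovery entry per RFC 5681.
        ssthresh = std::max(cwnd / 2, 2 * mss);
        cwnd = ssthresh + 3 * mss;
    }
};
```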

Reviewed-by: Pekka Enberg <penberg@cloudius-systems.com>
2014-11-24 11:54:19 +02:00