scylladb

Author	SHA1	Message	Date
Tomasz Grabiec	f458117b83	core: avoid recursion in keep_doing() Recursion takes up space on stack which takes up space in caches which means less room for useful data. In addition to that, a limit on iteration count can be larger than the limit on recursion, because we're not limited by stack size here. Also, recursion makes flame-graphs really hard to analyze because keep_doing() frames appear at different levels of nesting in the profile leading to many short "towers" instead of one big tower. This change reuses the same counter for limiting iterations as is used to limit the number of tasks executed by the reactor before polling. There was a run-time parameter added for controlling task quota.	2014-11-20 11:16:09 +02:00
Asias He	8cb9185cb6	tcp: Set retransmission timer dynamically Set RTO (retransmission timer) according to RFC 6298. Now, we have a dynamic RTO instead of the hard-coded 3 seconds, and an exponential back-off timer for retransmission.	2014-11-20 10:50:53 +02:00
Tomasz Grabiec	9c9f3d21bf	tests: make option defaults be effective in tests	2014-11-20 10:40:03 +02:00
Avi Kivity	f80b2a6554	hwloc: fix leaking topology object	2014-11-18 10:28:23 +02:00
Avi Kivity	d222ea6ceb	util: add defer(), a function that defers work until the end of scope	2014-11-18 10:27:51 +02:00
Asias He	817023f917	virtio: Lazy interrupts Tell host to interrupt less. This is useful for tx queue completion since we do not care much when the tx is completed exactly. Passed test with memcached and tcp_server.	2014-11-18 10:17:38 +02:00
Asias He	e386b72638	tcp: Fix ACK on closed channel In case of local: Send Data + FIN remote: Ack Data + FIN We should strip 1 byte off in data ACK only if we have sent out FIN. Otherwise, we will think there is 1 byte that remote hasn't acked and retransmit. This patch fixes the unnecessary retransmission of the last memcache get response. Found this issue when looking at TCP flow in memaslap testing. Before: 38811 1.000117000 192.168.66.100 -> 192.168.66.123 MEMCACHE 124 get 38812 1.000593000 192.168.66.123 -> 192.168.66.100 MEMCACHE 1164 VALUE 38813 1.000624000 192.168.66.100 -> 192.168.66.123 TCP 54 59708 > 11211 [FIN, ACK] Seq=2217067730 Ack=20399459 Win=185856 Len=0 38814 1.000769000 192.168.66.123 -> 192.168.66.100 TCP 54 11211 > 59708 [ACK] Seq=20399459 Ack=2217067731 Win=3737600 Len=0 38815 4.000883000 192.168.66.123 -> 192.168.66.100 MEMCACHE 1164 [TCP Retransmission] VALUE 38816 4.000934000 192.168.66.100 -> 192.168.66.123 TCP 54 [TCP Dup ACK 38813#1] 59708 > 11211 [ACK] Seq=2217067731 Ack=20399459 Win=185856 Len=0 38817 4.001054000 192.168.66.123 -> 192.168.66.100 TCP 54 11211 > 59708 [FIN, ACK] Seq=20399459 Ack=2217067731 Win=3737600 Len=0 38818 4.001094000 192.168.66.100 -> 192.168.66.123 TCP 54 59708 > 11211 [ACK] Seq=2217067731 Ack=20399460 Win=185856 Len=0 After: 38547 1.000224000 192.168.66.100 -> 192.168.66.123 MEMCACHE 124 get 38548 1.000264000 192.168.66.123 -> 192.168.66.100 MEMCACHE 1164 VALUE 38549 1.000292000 192.168.66.100 -> 192.168.66.123 TCP 54 59717 > 11211 [FIN, ACK] Seq=1862323816 Ack=20267265 Win=185856 Len=0 38550 1.000441000 192.168.66.123 -> 192.168.66.100 TCP 54 11211 > 59717 [ACK] Seq=20267265 Ack=1862323817 Win=3737600 Len=0 38551 1.000602000 192.168.66.123 -> 192.168.66.100 TCP 54 11211 > 59717 [FIN, ACK] Seq=20267265 Ack=1862323817 Win=3737600 Len=0 38552 1.000626000 192.168.66.100 -> 192.168.66.123 TCP 54 59717 > 11211 [ACK] Seq=1862323817 Ack=20267266 Win=185856 Len=0	2014-11-18 10:17:33 +02:00
Asias He	6e9521b86b	tests: Increase bytes transferred in tx test From 10MiB to 100MiB, stress more.	2014-11-18 10:16:38 +02:00
Asias He	ee023f4f84	tcp: Fix delayed ack When doing tcp rx testing, I saw a lot of retransmission because of the delayed ACK. Our current delayed ACK algorithm does not comply with what RFC 1122 suggests. As described in RFC 1122, a host may delay sending an ACK response by up to 500 ms. Additionally, with a stream of full-sized incoming segments, ACK responses must be sent for every second segment. === Before === [asias@hjpc pingpong]$ go run client-rxrx.go Bytes Sent(MiB): 100 Total Time(Secs): 322.620879376 Bandwidth(MiB/Sec): 0.30996133974160595 78 2.412385 192.168.66.100 -> 192.168.66.123 TCP 32174 37672 > 10000 [ACK] Seq=2149425323 Ack=1000001 Win=229 Len=32120 79 2.612985 192.168.66.100 -> 192.168.66.123 TCP 1514 [TCP Retransmission] 37672 > 10000 [ACK] Seq=2149425323 Ack=1000001 Win=229 Len=1460 80 2.613131 192.168.66.123 -> 192.168.66.100 TCP 54 10000 > 37672 [ACK] Seq=1000001 Ack=2149457443 Win=29200 Len=0 === After === [asias@hjpc pingpong]$ go run client-rxrx.go Bytes Sent(MiB): 100 Total Time(Secs): 0.244951095 Bandwidth(MiB/Sec): 408.2447559583271 No retransmission is seen.	2014-11-17 11:50:51 +02:00
Avi Kivity	8e47ed8b06	tests: whitelist allocator_test	2014-11-15 12:19:37 -08:00
Tomasz Grabiec	05d89f1ab9	tests: add output_stream_test	2014-11-15 12:11:11 -08:00
Tomasz Grabiec	b8344e31e0	output_stream: coalesce large buffers with data already in the buffer Assuming the output_stream size is set to 8K, a sequence of writes of lengths: 128B, 8K, 128B would yield three fragments of exactly those sizes. This is not optimal as one could fit those in just 2 fragments of up to 8K size. This change makes the output_stream yield 8K and 256B fragments for this case.	2014-11-15 11:58:10 -08:00
Tomasz Grabiec	b1208d6501	output_stream: simplify flush() output_stream can be used by only one fiber at a time, so from a correctness point of view it doesn't matter if we set _end before or after put(), but setting it before allows us to have one future less, which is a win.	2014-11-15 11:58:09 -08:00
Tomasz Grabiec	825b3608a4	tests: configure reactor for tests Commit `405f3ea8c3` changed reactor so that _network_stack is no longer default initialized to POSIX but to nullptr. This caused tests to segfault, because they are not using the application template, which takes care of configuration. The fix is to call configure() so that the network stack will be set to POSIX.	2014-11-15 11:58:07 -08:00
Avi Kivity	c52c56ce7b	tests: add memory allocation test	2014-11-15 11:56:16 -08:00
Avi Kivity	1a7fd983ac	memory: fix buffer overrun We store spans in freelist i if the span's size >= 2^i. However, when picking a span to satisfy an allocation, we must use the next larger list if the size is not a power of two, so that we can be sure that all spans on that list can satisfy that request. The current code doesn't do that, so it under-allocates, leading to memory corruption.	2014-11-15 11:52:39 -08:00
Nadav Har'El	5b24dd78e2	virtio: don't use file eventfd for OSv notifications Now that our reactor supports non-file-descriptor notification mechanisms, switch to using one instead of eventfd when notifying of virtio interrupts. This will allow us to change the OSv enable_interrupt() code to run the handler directly, not in a separate thread, because it no longer needs to do sleepable write() to an eventfd file descriptor. Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>	2014-11-13 22:24:38 +02:00
Tomasz Grabiec	c262060d92	memcache: avoid vprintf() Improves memaslap UDP posix throughput on my laptop by 40% (from 73k to 105k). When item is created we cache flags and size part of the response so that there's no need to call expensive string formatting in get(). The down side is that this pollutes "item" object with protocol-specific field, but since ASCII is the only protocol which is supported now and it's not like we can't fix it later, I think it's fine.	2014-11-13 22:22:07 +02:00
Tomasz Grabiec	627e14c2e4	sstring: introduce make_sstring() It concatenates multiple string-like entities in one go and gives away an sstring. It does at most one allocation for the final sstring and one copy per each string. Works with heterogeneous arguments, both sstrings and constant strings are supported, string_views are planned.	2014-11-13 22:22:05 +02:00
Tomasz Grabiec	42b20cdad1	test.py: print output from test on error	2014-11-13 22:22:01 +02:00
Nadav Har'El	405f3ea8c3	reactor: refactor main loop for epoll and OSv The reactor is currently designed around the concept of file descriptors and polling them. Every source of events is a file descriptor, and those which are not, like timers, signals and inter-thread notifications, are "converted" to file-descriptor events using timerfd, signalfd and eventfd respectively. But for running OSv with a directly assigned virtio device, we don't want to use file descriptors for notifications: When we need each interrupt to signal an eventfd, this is slow, and also problematic because file descriptors contain locks so we can't signal an eventfd at interrupt time, causing the existing code to use an extra thread to do this. So this patch refactors the reactor to allow the main loop to be based not just on file descriptors, but on a different type of abstraction. We have a reactor_backend (with epoll and osv implementations), to which we don't add "file descriptors" but rather more abstract notions like timer, signal or "notifier" (similar to eventfd). The Linux epoll implementation indeed uses file descriptors internally (with timer using a timerfd, signal using signalfd and notifier using eventfd) but the OSv implementation does not use file descriptors. Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>	2014-11-12 18:15:59 +02:00
Calle Wilund	bfbdbdf29c	dhcp: fix assert/crash in DHCP renew cycle. Must not signal "_config" promise on renew. Also not needed. Signed-off-by: Calle Wilund <calle@cloudius-systems.com>	2014-11-11 14:04:00 +02:00
Avi Kivity	067112a319	Merge branch 'tgrabiec/smp' From Tomasz: "There will now be a separate DB per core, each serving a subset of the key space (sharding). From the outside it appears to behave as one DB."	2014-11-11 13:52:59 +02:00
Tomasz Grabiec	6913079927	tests: memcache: do not constrain tests to 1 CPU	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	b0dd9e736c	memcached: SMP support There is a separate DB per core, each serving a subset of the key space. From the outside it appears to behave as one DB. item_key type was changed to include the hash so that we calculate the hash only once. The same hash is used for sharding and hashing. No need for store_hash<> option on unordered_set<> any more. Some seastar-specific and hashtable-specific stats were moved from the general "stats" command into "stats hash", which shows per-core statistics.	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	a82b2beb32	core: add shutdown hook registration facility Use like this: engine.at_exit([] { std::cout << "so long!\n"; return make_ready_future<>(); }); All lambdas will be executed when reactor is stopped, in order, on the same CPU on which they were registered.	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	95e09be799	net: add has_per_core_namespace() attribute to network stack POSIX stack does not allow one to bind more than one socket to given port. Native stack on the other hand does. The way services are set up depends on that. For instance, on native stack one might want to start the service on all cores, but on POSIX stack only on one of them.	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	b647bb5746	smp: introduce distributed::start_single() Services which create UDP sockets on the same port on POSIX stack can have only one instance. This decision needs to be made at run-time.	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	618cbd5729	smp: introduce foreign_ptr<> A smart pointer wrapper which deletes the pointer on the CPU on which it was wrapped.	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	0b4ee2ff60	core: advertise element type in shared_ptr<> Other smart pointers also do that. Will help foreign_ptr<>.	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	a77ecbeeef	smp: introduce distributed::invoke_on_all() overload for void-returning functions	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	c71f762f59	smp: introduce distributed::local()	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	79982a8545	smp: add distributed::invoke_on() overload for void-returning functions	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	8bbe285004	smp: improve forwarding of arguments in distributed::invoke_on() It is now capable of moving r-values rather than copying them.	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	1988748885	smp: introduce distributed::map_reduce()	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	7e25d70392	core: introduce map_reduce() utility It spawns async mapping action in parallel and reduces the results as they come.	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	6df3a03c0a	core: make submit_to() accept functions which return non-futures This adds an overload which will automatically wrap non-future non-void result in a ready future. Pro: less boiler plate code at call sites.	2014-11-11 13:52:23 +02:00
Tomasz Grabiec	c2fbfe8e84	core: destroy network stack before destroying timer lists. Fixes assert failure during ^C: #0 0x0000003e134348c7 in raise () from /lib64/libc.so.6 #1 0x0000003e1343652a in abort () from /lib64/libc.so.6 #2 0x0000003e1342d46d in __assert_fail_base () from /lib64/libc.so.6 #3 0x0000003e1342d522 in __assert_fail () from /lib64/libc.so.6 #4 0x0000000000409a7c in boost::intrusive::list_impl<boost::intrusive::mhtraits<timer, boost::intrusive::list_ at /usr/include/boost/intrusive/list.hpp:1263 #5 0x00000000004881cc in iterator_to (this=<optimized out>, value=...) at core/timer-set.hh:71 #6 reactor::del_timer (this=<optimized out>, tmr=tmr@entry=0x60000005cda8) at core/reactor.cc:287 #7 0x00000000004682a5 in ~timer (this=0x60000005cda8, __in_chrg=<optimized out>) at ./core/reactor.hh:974 #8 ~resolution (this=0x60000005cd90, __in_chrg=<optimized out>) at net/arp.hh:86 #9 ~pair (this=0x60000005cd88, __in_chrg=<optimized out>) at /usr/include/c++/4.9.2/bits/stl_pair.h:96	2014-11-11 13:52:23 +02:00
Calle Wilund	c3ba7a73bb	dhcp: actually ensure that packets are processed on cpu 0 Previous code (or lack thereof) hoped to achieve this. Not quite successfully. Signed-off-by: Calle Wilund <calle@cloudius-systems.com>	2014-11-10 17:09:27 +02:00
Nadav Har'El	63fb31a8be	README: another missing package We use "-lpciaccess", so need to install libpciaccess-dev on Ubuntu Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>	2014-11-10 16:39:15 +02:00
Nadav Har'El	4298ad2a3c	README: explain how to install missing pieces on Ubuntu 12.04 Say which prerequisites to install on Ubuntu 12.04, and how to set up gcc 4.9 side-by-side with the existing gcc 4.8 (without harming the existing gcc 4.8 installation). Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>	2014-11-10 16:06:46 +02:00
Gleb Natapov	c908d5508e	smp: do not reorder tasks submitted to smp queue Currently a semaphore is used to keep track of free space in the smp queue, but our semaphore does not guarantee that the order in which tasks call wait() will be the same order in which they get access to a resource. This may cause packet reordering in smp, which is not desirable for TCP performance. This patch replaces the semaphore with a simple counter and another queue to hold items that cannot be placed into the smp queue due to lack of space.	2014-11-10 15:58:48 +02:00
Asias He	e2b1186cca	net: Add more tcp and ip header const net::tcp_hdr_len_min net::ipv4_hdr_len_min net::ipv6_hdr_len_min InetTraits::ip_hdr_len_min is added to handle both ipv4 and ipv6.	2014-11-10 10:17:49 +02:00
Asias He	7260d7b9de	tcp: Out of order input support Tested with emulated packet reordering using tc and tcp_server rx test: sudo tc qdisc add dev tap0 root netem delay 100ms reorder 25% 50%	2014-11-10 10:01:06 +02:00
Asias He	ead391491d	net: Add rx test in tcp_server	2014-11-10 10:01:05 +02:00
Gleb Natapov	2a56c52fcb	net: distribute udp packets according to address pair	2014-11-09 18:17:54 +02:00
Gleb Natapov	c64e1e27fb	net: move connid out of tcp to be reused for udp	2014-11-09 18:17:44 +02:00
Gleb Natapov	25da340e07	net: remove rx feedback from proxy net device `99941f0c16` did that for virtio, do the same for proxy here.	2014-11-09 18:07:14 +02:00
Gleb Natapov	136a56859f	net: limit the number of packets that are waiting to be sent to another cpu If packets arrive faster than they can be forwarded we can run out of memory.	2014-11-09 18:06:22 +02:00
Nadav Har'El	fcce304908	collectd: Don't use the network stack before it is set up The current code (this will change soon with my reactor patches) constructs a default (Posix) network stack before reactor::configure() reassigns it to the requested network stack. It turns out there is one place we use the network stack before calling reactor::configure(), which ends up using the Posix stack even though we want the native stack - this is both silly and plainly doesn't work on the OSv setup. The problem is that app_template.hh tries to configure scollectd before the engine is started. This calls scollectd::impl::start() which calls engine.net().make_udp_channel(). When this happens this early, it creates a Posix socket... This patch moves the scollectd configuration to after the engine is started. It makes sense to me: As far as I understand, scollectd is all about sending packets (diagnostic packets), and it's kind of silly to start sending packets before starting the machinery which allows us to send packets. Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com> [avi: use customary indentation, remove unneeded make_ready_future()]	2014-11-09 17:46:09 +02:00
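
Commit `1a7fd983ac` (memory: fix buffer overrun) turns on the difference between rounding down when a free span is filed and rounding up when a request is served. The sketch below illustrates that reasoning only; it is not the allocator's actual code. The helper names (`floor_log2`, `index_for_free_span`, `index_for_allocation`) are hypothetical, and it assumes sizes are counted in pages and that a free span is filed on the highest list whose threshold it meets.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers illustrating the commit message; names are not from the repository.

// floor(log2(n)) for n >= 1.
inline unsigned floor_log2(std::uint32_t n) {
    assert(n >= 1);
    unsigned i = 0;
    while (n >>= 1) {
        ++i;
    }
    return i;
}

// Assumed filing rule: a free span of `pages` pages goes on list i,
// where i is the largest index with 2^i <= pages.
inline unsigned index_for_free_span(std::uint32_t pages) {
    return floor_log2(pages);
}

// List to search when allocating `pages` pages: the smallest i with 2^i >= pages.
// Every span on that list holds at least 2^i pages, so any of them fits the request.
inline unsigned index_for_allocation(std::uint32_t pages) {
    unsigned i = floor_log2(pages);
    // Not a power of two: round up to the next larger list.
    return (std::uint32_t(1) << i) == pages ? i : i + 1;
}
```

For a 5-page request, rounding down would pick list 2, which may hold a 4-page span and so under-allocate; rounding up picks list 3, where every span has at least 8 pages.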

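Commit `8cb9185cb6` (tcp: Set retransmission timer dynamically) replaces a fixed 3-second timer with the RTO computation from RFC 6298. The following is a minimal sketch of that computation, not the code from the commit: the class name `rto_estimator`, its methods, and the 1 ms clock granularity are assumptions made for illustration. The constants (alpha = 1/8, beta = 1/4, K = 4, the 1-second initial value and floor, a 60-second cap, and doubling on timeout) follow the RFC.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using ms = std::chrono::milliseconds;

// Hypothetical constants and class; names are illustrative, values follow RFC 6298.
const ms clock_granularity{1};   // G, assumed to be 1 ms here
const ms min_rto{1000};          // RTO is rounded up to at least 1 second
const ms max_rto{60000};         // an upper bound of at least 60 seconds is allowed

class rto_estimator {
    ms _srtt{0};
    ms _rttvar{0};
    ms _rto{1000};               // initial RTO before any RTT measurement
    bool _first = true;
public:
    // Feed one round-trip time measurement R.
    void on_rtt_sample(ms r) {
        assert(r.count() >= 0);
        if (_first) {
            _srtt = r;
            _rttvar = r / 2;
            _first = false;
        } else {
            // RTTVAR is updated first, using the old SRTT.
            ms delta = _srtt > r ? _srtt - r : r - _srtt;
            _rttvar = (3 * _rttvar + delta) / 4;   // beta = 1/4
            _srtt = (7 * _srtt + r) / 8;           // alpha = 1/8
        }
        ms var_term = 4 * _rttvar;                 // K = 4
        ms rto = _srtt + std::max(clock_granularity, var_term);
        _rto = std::min(std::max(rto, min_rto), max_rto);
    }

    // Exponential back-off when the retransmission timer expires.
    void on_timeout() {
        ms doubled = _rto * 2;
        _rto = std::min(doubled, max_rto);
    }

    ms rto() const { return _rto; }
};
```

With a first sample of 1000 ms the estimator yields SRTT = 1000 ms and RTTVAR = 500 ms, so the RTO becomes 3000 ms; each subsequent timeout doubles it until the 60-second cap is reached.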

792 Commits