scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 03:20:37 +00:00

Author	SHA1	Message	Date
Asias He	8cb9185cb6	tcp: Set retransmission timer dynamically Set RTO (retransmission timer) according to RFC6298. Now, we have a dynamic RTO instead of the hard-coded 3 seconds, and an exponential back-off timer for retransmission.	2014-11-20 10:50:53 +02:00
Asias He	817023f917	virtio: Lazy interrupts Tell host to interrupt less. This is useful for tx queue completion since we do not care much when the tx is completed exactly. Passed test with memcached and tcp_server.	2014-11-18 10:17:38 +02:00
Asias He	e386b72638	tcp: Fix ACK on closed channel In case of local: Send Data + FIN remote: Ack Data + FIN We should strip 1 byte off in data ACK only if we have sent out FIN. Otherwise, we will think there is 1 byte that the remote hasn't acked and retransmit it. This patch fixes the unnecessary retransmission of the last memcache get response. Found this issue when looking at TCP flow in memaslap testing. Before: 38811 1.000117000 192.168.66.100 -> 192.168.66.123 MEMCACHE 124 get 38812 1.000593000 192.168.66.123 -> 192.168.66.100 MEMCACHE 1164 VALUE 38813 1.000624000 192.168.66.100 -> 192.168.66.123 TCP 54 59708 > 11211 [FIN, ACK] Seq=2217067730 Ack=20399459 Win=185856 Len=0 38814 1.000769000 192.168.66.123 -> 192.168.66.100 TCP 54 11211 > 59708 [ACK] Seq=20399459 Ack=2217067731 Win=3737600 Len=0 38815 4.000883000 192.168.66.123 -> 192.168.66.100 MEMCACHE 1164 [TCP Retransmission] VALUE 38816 4.000934000 192.168.66.100 -> 192.168.66.123 TCP 54 [TCP Dup ACK 38813#1] 59708 > 11211 [ACK] Seq=2217067731 Ack=20399459 Win=185856 Len=0 38817 4.001054000 192.168.66.123 -> 192.168.66.100 TCP 54 11211 > 59708 [FIN, ACK] Seq=20399459 Ack=2217067731 Win=3737600 Len=0 38818 4.001094000 192.168.66.100 -> 192.168.66.123 TCP 54 59708 > 11211 [ACK] Seq=2217067731 Ack=20399460 Win=185856 Len=0 After: 38547 1.000224000 192.168.66.100 -> 192.168.66.123 MEMCACHE 124 get 38548 1.000264000 192.168.66.123 -> 192.168.66.100 MEMCACHE 1164 VALUE 38549 1.000292000 192.168.66.100 -> 192.168.66.123 TCP 54 59717 > 11211 [FIN, ACK] Seq=1862323816 Ack=20267265 Win=185856 Len=0 38550 1.000441000 192.168.66.123 -> 192.168.66.100 TCP 54 11211 > 59717 [ACK] Seq=20267265 Ack=1862323817 Win=3737600 Len=0 38551 1.000602000 192.168.66.123 -> 192.168.66.100 TCP 54 11211 > 59717 [FIN, ACK] Seq=20267265 Ack=1862323817 Win=3737600 Len=0 38552 1.000626000 192.168.66.100 -> 192.168.66.123 TCP 54 59717 > 11211 [ACK] Seq=1862323817 Ack=20267266 Win=185856 Len=0	2014-11-18 10:17:33 +02:00
Asias He	ee023f4f84	tcp: Fix delayed ack When doing tcp rx testing, I saw a lot of retransmission because of the delayed ACK. Our current delayed ACK algorithm does not comply with what RFC 1122 suggests. As described in RFC 1122, a host may delay sending an ACK response by up to 500 ms. Additionally, with a stream of full-sized incoming segments, ACK responses must be sent for every second segment. === Before === [asias@hjpc pingpong]$ go run client-rxrx.go Bytes Sent(MiB): 100 Total Time(Secs): 322.620879376 Bandwidth(MiB/Sec): 0.30996133974160595 78 2.412385 192.168.66.100 -> 192.168.66.123 TCP 32174 37672 > 10000 [ACK] Seq=2149425323 Ack=1000001 Win=229 Len=32120 79 2.612985 192.168.66.100 -> 192.168.66.123 TCP 1514 [TCP Retransmission] 37672 > 10000 [ACK] Seq=2149425323 Ack=1000001 Win=229 Len=1460 80 2.613131 192.168.66.123 -> 192.168.66.100 TCP 54 10000 > 37672 [ACK] Seq=1000001 Ack=2149457443 Win=29200 Len=0 === After === [asias@hjpc pingpong]$ go run client-rxrx.go Bytes Sent(MiB): 100 Total Time(Secs): 0.244951095 Bandwidth(MiB/Sec): 408.2447559583271 No retransmission is seen.	2014-11-17 11:50:51 +02:00
Nadav Har'El	5b24dd78e2	virtio: don't use file eventfd for OSv notifications Now that our reactor supports non-file-descriptor notification mechanisms, switch to using one instead of eventfd when notifying of virtio interrupts. This will allow us to change the OSv enable_interrupt() code to run the handler directly, not in a separate thread, because it no longer needs to do sleepable write() to an eventfd file descriptor. Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>	2014-11-13 22:24:38 +02:00
Calle Wilund	bfbdbdf29c	dhcp: fix assert/crash in DHCP renew cycle. Must not signal "_config" promise on renew. Also not needed. Signed-off-by: Calle Wilund <calle@cloudius-systems.com>	2014-11-11 14:04:00 +02:00
Avi Kivity	067112a319	Merge branch 'tgrabiec/smp' From Tomasz: "There will now be a separate DB per core, each serving a subset of the key space (sharding). From the outside it appears to behave as one DB."	2014-11-11 13:52:59 +02:00
Tomasz Grabiec	95e09be799	net: add has_per_core_namespace() attribute to network stack POSIX stack does not allow one to bind more than one socket to given port. Native stack on the other hand does. The way services are set up depends on that. For instance, on native stack one might want to start the service on all cores, but on POSIX stack only on one of them.	2014-11-11 13:52:23 +02:00
Calle Wilund	c3ba7a73bb	dhcp: actually ensure that packets are processed on cpu 0 Previous code (or lack thereof) hoped to achieve this. Not quite successfully. Signed-off-by: Calle Wilund <calle@cloudius-systems.com>	2014-11-10 17:09:27 +02:00
Asias He	e2b1186cca	net: Add more tcp and ip header const net::tcp_hdr_len_min net::ipv4_hdr_len_min net::ipv6_hdr_len_min InetTraits::ip_hdr_len_min is added to handle both ipv4 and ipv6.	2014-11-10 10:17:49 +02:00
Asias He	7260d7b9de	tcp: Out of order input support Tested with emulated packet reordering using tc and tcp_server rx test: sudo tc qdisc add dev tap0 root netem delay 100ms reorder 25% 50%	2014-11-10 10:01:06 +02:00
Gleb Natapov	2a56c52fcb	net: distribute udp packets according to address pair	2014-11-09 18:17:54 +02:00
Gleb Natapov	c64e1e27fb	net: move connid out of tcp to be reused for udp	2014-11-09 18:17:44 +02:00
Gleb Natapov	25da340e07	net: remove rx feedback from proxy net device `99941f0c16` did that for virtio, do the same for proxy here.	2014-11-09 18:07:14 +02:00
Gleb Natapov	136a56859f	net: limit the number of packets that are waiting to be sent to another cpu If packets arrive faster than they can be forwarded we can run out of memory.	2014-11-09 18:06:22 +02:00
Tomasz Grabiec	761d6119ef	posix: simplify uses of setsockopt	2014-11-09 16:33:33 +02:00
Avi Kivity	f265fe5ecd	xen: allow disabling the split-event-channel feature for debugging	2014-11-09 16:19:37 +02:00
Avi Kivity	59a7eeeea0	dhcp: retry Some bridges delay forwarding until some time has passed, which requires DHCP retries.	2014-11-09 16:13:25 +02:00
Avi Kivity	adc97c0162	dhcp: filter out DHCP failures If we don't, we start the system before we have an IP address, and when we actually do get the IP address, we fail an assert on the _config promise, which was already fulfilled.	2014-11-09 15:03:07 +02:00
Avi Kivity	5bb13601fe	xen: wrap in "xen" namespace Names like "port" are too generic for the global namespace.	2014-11-09 14:41:01 +02:00
Avi Kivity	14968812fe	xen: remove port::operator int() It's dangerous as it can be invoked in unexpected places.	2014-11-09 14:34:25 +02:00
Avi Kivity	46aac42704	xen: make 'port' a value object Makes it easier for users to manage its lifetime.	2014-11-09 13:30:52 +02:00
Avi Kivity	8857412365	xen: fix explicitly-disabled split event channel feature In case the hypervisor supports the split event channel feature, but advertises it as disabled, we must not assume it works.	2014-11-09 12:08:49 +02:00
Glauber Costa	ab3d02e347	xen: use a port class instead of an integer to represent an event channel The representation of an event channel as an integer poses a problem, in which waiting on an integer port doesn't work well when the same event channel is assigned for both tx and rx. The future will be ready for one of the sides, but we won't process the other. One alternative is to have conditions in the future processing, and in case the event channels are bound to the same port, process both events. But a better solution is to use a class to represent the bound ports, and instances of those classes will have their own pending methods. Infrastructure will be written in a following patch to make sure that all listeners to the same port will be made ready when an interrupt kicks in Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2014-11-09 11:54:08 +02:00
Glauber Costa	9cd9fda570	xen: don't take split feature for granted The backend may be completely silent about the existence of the split channels feature. In that case, trying to read through the template directly would cause an exception, since we can't convert the empty string. The backend-id, OTOH, is guaranteed to exist and wasn't using the template signature. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2014-11-09 11:54:06 +02:00
Avi Kivity	467ff9cbe2	xen: fix grant reference leaks We copy our grant reference into a temporary, so free_ref() does not clear the real entry, causing an assert() to trigger later on. Fix by capturing the grant reference entry by reference. With this, the xen network driver survives multiple trips around the ring.	2014-11-07 14:26:00 +02:00
Avi Kivity	ee7ec972eb	xen: request tx notification on tx completion Otherwise, we never learn that transmission has completed and never recycle ring entries. This is still a little lame as we don't do any batching.	2014-11-07 14:26:00 +02:00
Avi Kivity	8a29d4a78a	xen: replenish rx ring entries Keep recycling free ring entries back into the receive ring so we can receive more than 256 packets. The code is a little lame at the moment since it writes the index and notifies the host for every frame, but that can be adjusted later.	2014-11-07 14:25:58 +02:00
Avi Kivity	8dff32eea5	xen: simplify free grant table ref id management There is no reason to wait when pushing back a free id - there is nothing that could possibly block there. Switch from a queue<> to an std::queue<> and use a semaphore to guard popping from the queue.	2014-11-07 13:09:24 +02:00
Asias He	dbfd636a0b	net: Fix proxy_net_device option parse Running tcp stream test with --smp > 1, sometimes the server sends TSO frame, sometimes it does not. If we set --smp = 1, the server always sends TSO frame. This is because the proxy device does not parse all the features in the opts. We should copy the _hw_features from the real device but it is not easy. For now, we simply duplicate the parse code.	2014-11-07 11:17:33 +02:00
Asias He	ff674d3e0e	tcp: Avoid unnecessary ACK E.g. Avoid Dup ACK in packet #981 979 4.115432000 192.168.66.123 -> 192.168.66.100 TCP 20406 [TCP Window Full] 10000 > 50112 [ACK] Seq=10675905 Ack=801512443 Win=3737600 Len=20352 980 4.119002000 192.168.66.100 -> 192.168.66.123 TCP 54 [TCP ZeroWindow] 50112 > 10000 [ACK] Seq=801512443 Ack=10696257 Win=0 Len=0 981 4.119063000 192.168.66.123 -> 192.168.66.100 TCP 54 [TCP Dup ACK 979#1] 10000 > 50112 [ACK] Seq=10696257 Ack=801512443 Win=3737600 Len=0 982 4.137244000 192.168.66.100 -> 192.168.66.123 TCP 54 [TCP Window Update] 50112 > 10000 [ACK] Seq=801512443 Ack=10696257 Win=40704 Len=0	2014-11-07 11:17:33 +02:00
Asias He	5b994fb4f0	tcp: Fix _data_received_promise and _all_data_acked_promise We should clear it right after we set value, otherwise we might set value more than once.	2014-11-07 11:17:31 +02:00
Asias He	1e40660248	net: Switch to optional for _data_received	2014-11-06 14:50:16 +02:00
Asias He	14130ab1e8	net: Fix TCP sending of bulk data Fix tcp_server tx test. We still have more to do. Native stack: $ go run client-txtx.go Bytes Received(MiB): 1000 Total Time(Secs): 1.567927562 Bandwidth(MiB/Sec): 637.7845662234746 Posix stack: $ go run client-txtx.go Bytes Received(MiB): 1000 Total Time(Secs): 1.014354958 Bandwidth(MiB/Sec): 985.8481906291427 Note: client-txtx uses 100 concurrent connections.	2014-11-06 14:50:12 +02:00
Asias He	2a582fd1a6	net: fix tso maximum packet size With TSO enabled, we can see an Ethernet frame larger than 64K on the tap device, which Wireshark is unable to handle. It complains: The capture file appears to be damaged or corrupt. (pcapng_read_packet_block: cap_len 65549 is larger than WTAP_MAX_PACKET_SIZE 65535.)	2014-11-06 14:50:11 +02:00
Avi Kivity	4df81e0fba	Merge branch 'glommer/xen' of github.com:cloudius-systems/seastar-dev From Glauber: "This is all the xen work I have. There are still improvements to be made with the ring management, memory allocation, and other areas."	2014-11-06 12:45:30 +02:00
Glauber Costa	6c0aaa126c	xen: grant recycle handle buffer recycles. Right now it is very simple: allocate a new receive buffer after a successful receive, and mark the tx spot free when we get the tx event notification. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2014-11-06 11:22:15 +01:00
Glauber Costa	0a1f5f9e73	xen: defer grant table operations Instead of returning a reference to a grant that is already present in an array, defer the initialization. This is how the OSv driver handles it, and I honestly am not sure if this is really needed: it seems to me we should be able to just reuse the old grants. I need to check in the backend code if we can be any smarter than this. However, right now we need to do something to recycle the buffers, and just re-doing the refs would lead to inconsistencies. So the best by now is to close and reopen the grants, and then later on rework this in a way that works for both the initial setup and the recycle. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2014-11-06 11:21:30 +01:00
Glauber Costa	722926d545	xen: factor out allocation of a single rx entry I'll need this code later to refill the buffer, so factor this out Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2014-11-06 11:21:30 +01:00
Glauber Costa	01c861fba4	xen: don't increment producer index in receive path Right now, we allocate the whole index, and notify the backend that we have produced nr_ents indexes. If we do that, we cannot increment the producer index when we receive a new packet. This would make the index overflow, and it is responsible for the biggest part of the slowdown we are seeing. Before this patch, we're seeing 2s RTT for pings. After the patch: 64 bytes from 192.168.100.79: icmp_seq=1 ttl=64 time=0.437 ms 64 bytes from 192.168.100.79: icmp_seq=2 ttl=64 time=0.431 ms 64 bytes from 192.168.100.79: icmp_seq=3 ttl=64 time=0.475 ms Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2014-11-06 11:21:29 +01:00
Glauber Costa	ae1122bfc8	xen: manage index list Aside from managing the grant references, we also need to manage the positional indexes in the array. We need to keep track of which indexes are free, and which are used. Because we need the actual position number to fill xen's data structures, I figured we could use a queue and then fill it up with all the integers in our range. The queue is already futurized, so that's easy. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2014-11-06 11:21:29 +01:00
Gleb Natapov	d698811bdd	fix smp broadcast packet handling Some packets, like arp replies, are broadcast to all cpus for handling, but only the packet structure is copied for each cpu; the actual packet data is the same for all of them. Currently the networking stack mangles the packet data during its travel up the stack while doing ntoh() translations, which obviously cannot work for broadcast packets. This patch fixes the code to not modify packet data while doing ntoh(), but to do it in a stack-allocated copy of the data instead.	2014-11-06 10:30:30 +02:00
Pekka Enberg	86aa399482	net: Fix build when Xen support is disabled Fixes the following link errors when Xen support is disabled: build/release/net/native-stack.o: In function `net::add_native_net_options_description(boost::program_options::options_description&)': /seastar/net/native-stack.cc:101: undefined reference to `get_xenfront_net_options_description()' build/release/net/native-stack.o: In function `net::create_native_net_device(boost::program_options::variables_map)': /seastar/net/native-stack.cc:93: undefined reference to `create_xenfront_net_device(boost::program_options::variables_map, bool)' collect2: error: ld returned 1 exit status Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>	2014-11-06 10:24:03 +02:00
Glauber Costa	63c8db870f	xen: remove debug printfs As packet flow is working reasonably now, most of the prints can go. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2014-11-05 22:30:25 +01:00
Glauber Costa	73b8f98318	xen: use nr_ents instead of numeric constant in netfront header Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2014-11-05 21:41:52 +01:00
Avi Kivity	5052d34d23	Merge branch 'xen' Partial Xen support.	2014-11-05 15:31:23 +02:00
Avi Kivity	369f31d4c5	xen: simplify front_ring constructor	2014-11-05 15:09:04 +02:00
Avi Kivity	2d14053e6e	xen: make gntref more readable Convert it from std::pair with meaningless .first and .second fields to a proper struct.	2014-11-05 15:09:04 +02:00
Avi Kivity	0a0dc6eb90	xen: provide correct checksum offload flags to the host Tell Xen when we've computed the checksum ourselves, and when we have a partial checksum filled.	2014-11-05 15:09:04 +02:00
Avi Kivity	c52b4fdc47	xen: partial support for checksum offload Checksum offload cannot be disabled in Xen (or at least, I haven't figured out how). Advertise it as enabled, so that tcp doesn't drop packets as failing their checksum. Still need to flesh out the transmit path. With this, seastar sends SYN/ACK packets in response to connection requests.	2014-11-05 15:09:04 +02:00


266 Commits
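
Commit 8cb9185cb6 above replaces a fixed 3-second retransmission timeout with one computed according to RFC 6298, plus exponential back-off. The following is a minimal sketch of that calculation, not the code from the commit: the class name `rto_estimator`, its method names, and the 1 ms clock granularity are assumptions made for illustration. The 1-second initial and minimum RTO and the smoothing factors (1/8 and 1/4) come from the RFC; the 60-second cap is the smallest upper bound the RFC permits.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical helper illustrating RFC 6298; not the Seastar implementation.
class rto_estimator {
public:
    using ms = std::chrono::milliseconds;

    // Feed one round-trip-time measurement R (RFC 6298, sections 2.2 and 2.3).
    void on_rtt_sample(ms r) {
        if (_first) {
            // First measurement: SRTT = R, RTTVAR = R / 2.
            _srtt = r;
            _rttvar = r / 2;
            _first = false;
        } else {
            // RTTVAR must be updated before SRTT, using the old SRTT.
            ms diff = _srtt > r ? _srtt - r : r - _srtt;
            _rttvar = (_rttvar * 3 + diff) / 4;   // beta = 1/4
            _srtt = (_srtt * 7 + r) / 8;          // alpha = 1/8
        }
        // RTO = SRTT + max(G, 4 * RTTVAR), clamped to [min, max].
        ms var_term = std::max(_granularity, ms(_rttvar * 4));
        _rto = std::min(std::max(_srtt + var_term, _rto_min), _rto_max);
    }

    // Retransmission timer expired: double the RTO (exponential back-off).
    void on_timeout() {
        _rto = std::min(ms(_rto * 2), _rto_max);
    }

    ms rto() const { return _rto; }

private:
    ms _granularity{1};    // assumed clock granularity G
    ms _rto_min{1000};     // RFC 6298 rule 2.4
    ms _rto_max{60000};    // RFC 6298 rule 2.5 allows any cap >= 60 s
    ms _srtt{0};
    ms _rttvar{0};
    ms _rto{1000};         // initial RTO before any sample (rule 2.1)
    bool _first = true;
};
```

A sender would call `on_rtt_sample()` for each ACK that yields a valid RTT measurement, call `on_timeout()` when the retransmission timer fires, and re-arm the timer with `rto()`.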