This patch introduces logic to divide cpus between the available hw queue
pairs. Each cpu that owns a hw qp gets a set of cpus to distribute traffic
to. The algorithm doesn't take topology into account yet.
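The distribution itself is described but not shown; a minimal sketch of one possible (topology-unaware) scheme is below. The function name `distribute_cpus` and the round-robin assignment are illustrative assumptions, not the patch's actual code.

```cpp
#include <vector>

// Hypothetical sketch: map each cpu to one of the available hw queue
// pairs, round-robin, ignoring topology (as the patch currently does).
// Returns, for each queue, the list of cpus whose traffic it serves.
std::vector<std::vector<unsigned>> distribute_cpus(unsigned ncpus, unsigned nqueues) {
    std::vector<std::vector<unsigned>> cpus_per_queue(nqueues);
    for (unsigned cpu = 0; cpu < ncpus; ++cpu) {
        cpus_per_queue[cpu % nqueues].push_back(cpu);
    }
    return cpus_per_queue;
}
```

A topology-aware version would group cpus sharing a cache or NUMA node onto the same queue instead of striding blindly.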
This patch uses the NIC's capability to calculate in hardware the IP, TCP
and UDP checksums on outgoing packets, instead of doing this on the
sending CPU. This saves quite a bit of computation (especially for
the TCP/UDP checksum of full-sized packets), and avoids cache pollution on
the CPU when sending cold data.
On my setup this patch improves the performance of a single-cpu memcached
by 6%. Together with the recent patch for receive-side checksum offloading,
the total improvement is 10%.
This patch is somewhat complicated by the fact that we have so many different
combinations of checksum-offloading capabilities: while virtio can only
offload the layer-4 checksum (TCP/UDP), dpdk lets us offload both the IP and
layer-4 checksums. Moreover, some packets are IP but not TCP/UDP
(e.g., ICMP), and some packets are not even IP (e.g., ARP), so this
patch modifies a few of the hardware-feature flags and the per-packet
offload-information flags to fit our new needs.
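The decision matrix those flags encode can be sketched as follows. The names `offload_info`, `proto` and `choose_offloads` are hypothetical; this only illustrates how device capability and packet type combine, not the driver's real flag layout.

```cpp
// Hypothetical sketch of per-packet offload decisions.
struct offload_info {
    bool needs_ip_csum = false;   // dpdk-style hw can offload; virtio cannot
    bool needs_l4_csum = false;   // both virtio and dpdk can offload
};

enum class proto { arp, icmp, tcp, udp };

// Ask the hardware only for checksums it supports (hw_ip_csum,
// hw_l4_csum) and that the packet actually carries: ARP has neither,
// ICMP has only an IP header checksum, TCP/UDP have both.
offload_info choose_offloads(proto p, bool hw_ip_csum, bool hw_l4_csum) {
    offload_info oi;
    bool is_ip = (p != proto::arp);
    bool is_l4 = (p == proto::tcp || p == proto::udp);
    oi.needs_ip_csum = is_ip && hw_ip_csum;
    oi.needs_l4_csum = is_l4 && hw_l4_csum;
    return oi;
}
```

Packets for which the hardware cannot help would still be checksummed in software on the sending CPU.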
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch adds a new class, distributed_device, which is responsible for
initializing the HW device and is shared between all cpus. The old device
class's responsibility becomes managing an rx/tx queue pair, and it is local
to each cpu. Each cpu has to call distributed_device::init_local_queue() to
create its own device. The logic to distribute cpus between the available
queues (in case there are not enough queues for all cpus) currently lives in
distributed_device but is not really implemented yet, so only the one-queue
and queues == cpus scenarios are supported for now; this can be fixed later.
The plan is to rename "distributed_device" to "device" and "device"
to "queue_pair" in later patches.
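The shape of that split can be sketched as below. Everything here is an assumption for illustration (the real classes carry hw state, rings, etc.); only the division of responsibility and the two supported queue/cpu combinations come from the patch description.

```cpp
#include <memory>
#include <stdexcept>

// Per-cpu rx/tx queue pair (the old "device" class's new role).
struct queue_pair {
    unsigned qid;
};

// Shared across all cpus; initializes the HW device once.
class distributed_device {
    unsigned _nqueues;
public:
    explicit distributed_device(unsigned nqueues) : _nqueues(nqueues) {}

    // Each cpu calls this to create its local queue pair.  Only the
    // one-queue and queues == cpus cases are supported, as in the patch.
    std::unique_ptr<queue_pair> init_local_queue(unsigned cpu_id, unsigned ncpus) {
        if (_nqueues != 1 && _nqueues != ncpus) {
            throw std::runtime_error("unsupported queue/cpu combination");
        }
        unsigned qid = (_nqueues == 1) ? 0 : cpu_id;
        return std::make_unique<queue_pair>(queue_pair{qid});
    }
};
```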
Currently each cpu creates its network device as part of native networking
stack creation, and all cpus create the native networking stack independently,
which makes it impossible to use data initialized by one cpu in another
cpu's networking device initialization. For multiqueue devices, some
parts of the initialization often have to be handled by one cpu while all
other cpus wait for it before creating their own network devices.
Even without multiqueue, proxy devices should be created after the master
device, so that a proxy device can get a pointer to the master
at creation time (the existing code uses a global per-cpu device pointer and
assumes that the master device is created on cpu 0 to compensate for the lack
of ordering).
This patch makes it possible to delay native networking stack creation
until the network device is created. It allows one cpu to be responsible
for creating the network devices on multiple cpus. A single-queue device
initializes the master device on one cpu and calls the other cpus with a
pointer to the master device and its cpu id, which are used in proxy device
creation. This removes the per-cpu device pointer and the "master on cpu 0"
assumption from the code, since the master and slave devices now know
about each other and can communicate directly.
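A stripped-down sketch of that ordering is below. The structs and `create_devices` are hypothetical stand-ins (the real code hands the pointer across cpus via the scheduler); the point shown is that every proxy receives the master pointer and master cpu id explicitly, so no global per-cpu pointer or cpu-0 assumption is needed.

```cpp
#include <memory>
#include <vector>

struct master_device { unsigned cpu_id; };            // created first, on any cpu
struct proxy_device  { master_device* master;         // handed in at creation
                       unsigned master_cpu; };

// One cpu builds the master, then creates a proxy for every other cpu,
// passing each one a pointer to the master and its cpu id directly.
std::vector<std::unique_ptr<proxy_device>>
create_devices(master_device& master, unsigned ncpus) {
    std::vector<std::unique_ptr<proxy_device>> proxies;
    for (unsigned cpu = 0; cpu < ncpus; ++cpu) {
        if (cpu == master.cpu_id) continue;           // master cpu keeps the master
        proxies.push_back(std::make_unique<proxy_device>(
            proxy_device{&master, master.cpu_id}));
    }
    return proxies;
}
```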
If we don't have split channels, we need to delete the relevant property.
Because xs_rm() returns true if the feature does not exist, simply deleting
all of them won't affect the transaction, so we don't need any
conditional test.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
We are adding everything we read into the features array. Because the
destructor removes everything in the features list, we end up
removing more than we should. Things like the mac address, handle, etc. should
never be deleted.
This is not a problem for OSv because usually, after the destructor is called,
the whole guest is down. But for userspace the network card is left there,
and will cease to work if we delete too much.
After we do that with the _features array (restoring its original intent), it
becomes redundant with the features nack.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
This is not required for OSv, but is required for userspace operation.
It won't work without it.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
When the backend advertises "feature-rx-copy", the frontend should register
"request-rx-copy". The local hypervisor seems to be forgiving about this, but
the one in AWS is not, and doubly so:
first, it doesn't recognize these as the same; and second, it refuses to
connect the backend if this feature is not advertised by the frontend.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The ring processing is almost the same for both rx and tx, with the exception
of the core of the action. We can actually unify them nicely with some
template programming.
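A minimal sketch of the idea, assuming a `process_ring` helper that is not the driver's actual code: the walk over response entries is shared, and the rx/tx-specific work is passed in as a callable template parameter.

```cpp
#include <vector>

// Hypothetical sketch: rx and tx walk their rings identically; only the
// per-entry action differs, so it becomes a template parameter.
// rx would deliver the received packet; tx would free the completed slot.
template <typename Entry, typename Action>
unsigned process_ring(std::vector<Entry>& ring, unsigned cons, unsigned prod,
                      Action action) {
    while (cons != prod) {
        action(ring[cons % ring.size()]);
        ++cons;
    }
    return cons;   // new consumer index
}
```

A lambda (or functor) per direction then instantiates the shared loop, so fixes to the ring walk apply to both paths at once.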
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
There are two things we can do that will lead to fewer interrupts being sent.
The first is to re-read the rsp_cons value at the end of every iteration:
if the backend produces more frames in the meantime, we can process them
in the same round, without getting another interrupt.
The other is to set rsp_event only after all the frames are processed.
As a matter of fact, the tx and rx rings each did one of these, but not the
same one. The next patch will unify the ring code to avoid problems like that
in the future.
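Both techniques can be sketched together as below. This is not the driver's code; the names follow the general Xen shared-ring convention (backend writes `rsp_prod`, frontend publishes `rsp_event` to request its next interrupt), and the loop structure is an illustrative assumption.

```cpp
// Hypothetical sketch of both interrupt-avoidance techniques.
struct shared_ring {
    volatile unsigned rsp_prod; // written by the backend
    unsigned rsp_event;         // backend interrupts when prod passes this
};

// Drain everything visible, then re-check the producer index in case
// more frames arrived meanwhile; publish rsp_event only once, at the end.
template <typename Consume>
unsigned drain_ring(shared_ring& r, unsigned cons, Consume consume) {
    unsigned prod = r.rsp_prod;
    do {
        while (cons != prod) {
            consume(cons++);           // handle one response entry
        }
        prod = r.rsp_prod;             // catch frames produced meanwhile
    } while (cons != prod);
    r.rsp_event = cons + 1;            // only interrupt us for new work
    return cons;
}
```

(Real code would also need memory barriers between reading the producer index and touching the entries; they are omitted here.)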
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The xen protocol works by filling positions in a circular ring. The
indexes become free to be used again when they are processed by the other side.
There is a problem, however: those indexes must be sequential, because all the
two sides share is a produced/consumed index. But there are situations in which
we call get_index(), which produces an index X, and the .then() clause
schedules some other caller of send() to run in our place. That one, in turn,
can call get_index() and then create a packet with index X + 1 that is put in
the ring before the packet with index X.
If the other end processes this packet very fast, it will respond saying "I
have processed packets up to X + 1". We will then mark X as
processed as well, since it comes before X + 1, and when X is really
processed, chaos will ensue.
The solution is to have the semaphore count how many spaces we
have in the ring. Once we guarantee that the current caller has space, we
compute get_index() inside the .then() clause. This works well because the
indexes are all sequential anyway.
For the same reason, we are actually able to remove the queue and resort to a
simple counter. Once we know there is room, we just take the next index,
whatever it may be.
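A synchronous sketch of the fixed ordering, with the semaphore reduced to a plain counter since no futures are involved; the class and method names are hypothetical. The key point is that the index is taken only after the space reservation succeeds, mirroring moving get_index() inside the .then() clause.

```cpp
// Hypothetical sketch: reserve a ring slot first, pick the index second.
struct tx_ring {
    unsigned slots;         // semaphore-like count of free ring slots
    unsigned next_idx = 0;  // simple counter replacing the old index queue
    explicit tx_ring(unsigned n) : slots(n) {}

    // In the futurized version the caller waits on the semaphore and the
    // index is computed inside .then(); here try_send() fails instead of
    // suspending.  Because the index is taken only after the reservation,
    // indexes are handed out strictly in ring order.
    bool try_send(unsigned& idx) {
        if (slots == 0) return false;   // ring full: caller would wait
        --slots;
        idx = next_idx++;
        return true;
    }
    void complete(unsigned /*idx*/) { ++slots; }  // backend consumed a slot
};
```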
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
We can't reach this place with a negative ref id, so let's assert to make sure
we're fine. This helps catch some bugs.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The index in the ring and the packet id tend to be the same, but they don't
have to be. There are situations where the backend and the frontend get out
of sync on this, and that is totally valid.
One example is when the backend skb already has enough room to hold all of the
data being transmitted (netback.c, line 1611 @3.16). The netback will respond
immediately, even though there are other pending packets that are not yet fully
processed.
The ring index, then, must come from the rsp value, not from the req/rsp id.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Representing an event channel as an integer poses a problem: waiting on an
integer port doesn't work well when the same event channel is
assigned to both tx and rx. The future will be made ready for one of the
sides, but we won't process the other.
One alternative is to add conditions to the future processing and, in case the
event channels are bound to the same port, process both events. But a better
solution is to use a class to represent the bound ports, with each instance
keeping its own pending state.
Infrastructure will be written in a following patch to make sure that all
listeners on the same port are made ready when an interrupt kicks in.
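A minimal sketch of the idea, with hypothetical names (`port`, `kick`): each bound port is an object with its own pending state, and an interrupt wakes every listener bound to the same port number instead of completing a single integer-keyed wait.

```cpp
#include <vector>

// Hypothetical sketch: a bound event-channel port as an object with its
// own pending flag, so tx and rx sharing one port number are woken
// independently instead of racing for a single future.
class port {
    bool _pending = false;
public:
    void notify()  { _pending = true; }
    bool consume() { bool p = _pending; _pending = false; return p; }
};

// The interrupt path kicks every listener bound to the same port number.
void kick(std::vector<port*>& listeners) {
    for (auto* p : listeners) {
        p->notify();
    }
}
```

In the futurized driver, `consume()` would correspond to a per-instance `pending()` future becoming ready.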
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
The backend may be completely silent about the existence of the split channels feature.
In that case, trying to read through the template directly would cause an exception,
since we can't convert the empty string.
The backend-id, OTOH, is guaranteed to exist and wasn't using the template signature.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
We copy our grant reference into a temporary, so free_ref() does not
clear the real entry, causing an assert() to trigger later on.
Fix by capturing the grant reference entry by reference.
With this, the xen network driver survives multiple trips around the ring.
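The copy-vs-reference bug can be shown in miniature. The table, `free_ref()` and `release()` below are hypothetical stand-ins for the driver's grant-table handling; the point is only that binding the entry by value would leave the real slot untouched.

```cpp
#include <array>

// Hypothetical grant table; -1 marks a free entry.
std::array<int, 4> grant_table = {10, 11, 12, 13};

void free_ref(int& entry) { entry = -1; }

// Buggy version: 'int g = grant_table[i];' copies the entry, so
// free_ref(g) clears only the temporary and the table keeps a stale
// reference (tripping an assert on the next trip around the ring).
// Fixed version: bind the entry by reference so the real slot is cleared.
void release(int i) {
    int& g = grant_table[i];   // reference, not a copy
    free_ref(g);
}
```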
Keep recycling free ring entries back into the receive ring so we can
receive more than 256 packets.
The code is a little lame at the moment since it writes the index and
notifies the host for every frame, but that can be adjusted later.
There is no reason to wait when pushing back a free id - there is nothing
that could possibly block there.
Switch from a queue<> to an std::queue<> and use a semaphore to guard
popping from the queue.
Handle buffer recycling. Right now it is very simple: allocate a new receive
buffer after a successful reception, and mark the tx spot free when we get the
tx event notification.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Instead of returning a reference to a grant that is already present in an
array, defer the initialization. This is how the OSv driver handles it, and I
honestly am not sure if this is really needed: it seems to me we should be able
to just reuse the old grants. I need to check in the backend code if we can be
any smarter than this.
However, right now we need to do something to recycle the buffers, and just
re-doing the refs would lead to inconsistencies. So the best option for now is
to close and reopen the grants, and then later rework this in a way that works
for both the initial setup and the recycle.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Right now, we allocate the whole index space and notify the backend that we
have produced nr_ents indexes. If we do that, we cannot increment the producer
index when we receive a new packet: the index would overflow. This is
responsible for the biggest part of the slowdown we are seeing.
Before this patch, we were seeing 2s RTT for pings. After the patch:
64 bytes from 192.168.100.79: icmp_seq=1 ttl=64 time=0.437 ms
64 bytes from 192.168.100.79: icmp_seq=2 ttl=64 time=0.431 ms
64 bytes from 192.168.100.79: icmp_seq=3 ttl=64 time=0.475 ms
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Aside from managing the grant references, we also need to manage the positional
indexes in the array: we need to keep track of which indexes are free and
which are used. Because we need the actual position number to fill xen's data
structures, I figured we could use a queue and fill it up with all the
integers in our range. The queue is already futurized, so that's easy.
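Stripped of the futurization, the idea looks like this; `index_pool` is a hypothetical name, and the real driver's queue would suspend on `get()` when empty rather than requiring a prior `empty()` check.

```cpp
#include <queue>

// Hypothetical sketch: pre-fill a queue with every position in the
// ring's range; pop to allocate a slot, push to recycle it.
class index_pool {
    std::queue<unsigned> _free;
public:
    explicit index_pool(unsigned n) {
        for (unsigned i = 0; i < n; ++i) {
            _free.push(i);
        }
    }
    unsigned get() { unsigned i = _free.front(); _free.pop(); return i; }
    void put(unsigned i) { _free.push(i); }
    bool empty() const { return _free.empty(); }
};
```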
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Checksum offload cannot be disabled in Xen (or at least, I haven't figured
out how). Advertise it as enabled, so that tcp doesn't drop packets as
failing their checksum.
Still need to flesh out the transmit path.
With this, seastar sends SYN/ACK packets in response to connection requests.
We prepared N buffers, but only told the host about one. This meant the host
stopped forwarding received packets almost immediately.
Fix by writing the Xen-visible ring index correctly.
This is the basic support for xenfront. It can be used in domU, provided there
is a network interface to be hijacked.
The code that follows is just the mechanics of managing the grants, event
channels, etc.
However, it does not yet work: I can't see netback injecting any data into it.
I am still debugging the protocol, but I wanted to flush the current state.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>