1853 Commits

Author SHA1 Message Date
Pekka Enberg
86355a54a5 db: Remove column_family stub
There's a proper column_family in database.hh now. Remove a stub that
was introduced during the initial conversion.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-02-23 14:31:36 +02:00
Tomasz Grabiec
e214c0192a Merge branch seastar-dev.git avi/smp
From Avi:

Rather rough sharding of the database.

DecoratedKey/Token were de-abstracted (they're concrete types now).  This
means that some of their type information can be held in the partitioner,
so anyone playing with tokens needs access to the partitioner.

Sharding is simplistic, using the first byte of the hash as a key to
select the shard (with a modulo operation).
2015-02-23 10:53:25 +01:00
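The sharding rule described above uses the first byte of the hash, modulo the shard count. It can be sketched as follows.

This is a minimal illustration, not the actual database::shard_of(). The following are all assumptions:
- the free function `shard_of`;
- the `token_bytes` alias;
- the choice to send an empty token to shard 0.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a token: the raw bytes of the partition key's
// hash (for murmur3, 8 bytes).
using token_bytes = std::string;

// Pick the shard that owns a partition: take the first byte of the hash
// and reduce it modulo the number of shards.
unsigned shard_of(const token_bytes& token, unsigned shard_count) {
    assert(shard_count > 0);
    if (token.empty()) {
        return 0;  // assumption: a token with no bytes maps to shard 0
    }
    auto first = static_cast<std::uint8_t>(token[0]);
    return first % shard_count;
}
```

There are only 256 possible first-byte values. With a shard count that does not divide 256 (say 3), some shards own one more byte value than others, which is part of what makes the scheme simplistic.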
Avi Kivity
2720ba34bf db: shard data
Add database::shard_of() to compute the shard hosting the partition
(with a simplistic algorithm, but perhaps not too bad).

Convert non-metadata invoke_on_all() and local calls on the database
to use shard_of().
2015-02-23 11:37:12 +02:00
Avi Kivity
0db67ff121 thrift: add foreign_ptr<> variant to complete()
Some calls will return complex types, so allow them to return a foreign_ptr<>
to ensure cleanup will happen in the correct place.
2015-02-23 11:37:12 +02:00
Avi Kivity
6f7dff825c dht: add global partitioner
Skip configuration for now and go for murmur3 instead.
2015-02-23 11:37:12 +02:00
Avi Kivity
d0dae3f938 dht: add murmur3_partitioner
We don't follow origin precisely in normalizing the token (converting a
zero to something else).  We probably should, to allow direct import of
a database.
2015-02-23 11:37:12 +02:00
Avi Kivity
fd9f90f45a dht: implement token minimum and midpoint
Rather than converting to unsigned longs for the fractional computations,
do them in bytes. The overhead of allocating longs will be larger than
the computation, given that tokens are usually short (8 bytes), and
our bytes type stores them inline.
2015-02-23 10:20:24 +02:00
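Below is a sketch of the byte-wise midpoint computation. It assumes both tokens are equal-length, unsigned, big-endian byte strings. The `bytes` alias and the `midpoint` name are hypothetical, and Origin's handling of wrapping ranges (left > right) is not modeled. The final carry keeps the sum one bit wider than the inputs, so no precision is lost before halving.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using bytes = std::vector<std::uint8_t>;

// Midpoint of two equal-length big-endian unsigned integers, computed
// byte by byte: (a + b) / 2 without converting to a machine word.
bytes midpoint(const bytes& a, const bytes& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    bytes sum(n);
    unsigned carry = 0;
    // Add from the least significant (last) byte to the most significant.
    for (std::size_t i = n; i-- > 0;) {
        unsigned s = unsigned(a[i]) + unsigned(b[i]) + carry;
        sum[i] = static_cast<std::uint8_t>(s & 0xff);
        carry = s >> 8;
    }
    // Shift the (n * 8 + 1)-bit sum right by one; the final carry becomes
    // the top bit of the result.
    bytes mid(n);
    for (std::size_t i = 0; i < n; ++i) {
        mid[i] = static_cast<std::uint8_t>((sum[i] >> 1) | (carry << 7));
        carry = sum[i] & 1;
    }
    return mid;
}
```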
Avi Kivity
edc4ac4231 dht: de-virtualize token
Origin uses abstract types for Token, for two reasons:

 1. To create a distinction between tokens for keys and tokens
    that represent the end of the range
 2. To use different implementations for tokens belonging to different
    partitioners.

Using abstract types carries penalties in indirection, memory management
complexity, and performance. We can eliminate them by using
a concrete type, and defer any differences in the implementation
to the partitioner.  End-of-range token representation is folded into
the token class.
2015-02-23 10:20:24 +02:00
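Below is a minimal sketch of what a de-virtualized token might look like, with the end-of-range markers folded into the same class. The `token_kind` enum, the `data` member, and the unsigned lexicographic byte ordering are assumptions for illustration. The commit itself defers such details to the partitioner.

```cpp
#include <cassert>
#include <string>

// Hypothetical concrete token: one class instead of an abstract base with
// per-partitioner subclasses. The range markers become enum values.
enum class token_kind { before_all_keys, key, after_all_keys };

struct token {
    token_kind kind;
    std::string data;  // hash bytes; meaningful only when kind == key
};

// Order: before_all_keys < every key token < after_all_keys.
// Key tokens compare by their bytes; std::string compares chars as
// unsigned char, giving unsigned lexicographic order (assumption: a real
// partitioner may define its own ordering).
bool operator<(const token& a, const token& b) {
    if (a.kind != b.kind) {
        return a.kind < b.kind;
    }
    if (a.kind != token_kind::key) {
        return false;  // two identical markers compare equal
    }
    return a.data < b.data;
}
```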
Avi Kivity
41ba192590 dht: make i_partitioner's methods public, and add a destructor
2015-02-23 10:20:22 +02:00
Avi Kivity
a166f74a54 dht: rename DecoratedKey and Token to fit with naming conventions
2015-02-22 17:53:32 +02:00
Avi Kivity
83430355c2 Merge branch 'master' of github.com:cloudius-systems/seastar into db
LICENSE moved to LICENSE.seastar, since (at least for now) urchin is
not open source.

Conflicts:
	apps/seastar/main.cc
2015-02-22 16:23:59 +02:00
Avi Kivity
c35b61bda5 distributed: fix return type of lambda-friendly invoke_on()
result_of<> is meant to be used only with its 'type' member; use result_of_t
instead.
2015-02-22 16:20:36 +02:00
Takuya ASADA
2d4b46d8e7 README.md: add Fedora 21 document
Since Fedora 21, we don't need to use the rawhide package for g++ 4.9.

Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-02-20 12:13:22 +02:00
Avi Kivity
f116ef8fb6 file: skip . and .. in directory listings
2015-02-20 10:13:15 +02:00
Tomasz Grabiec
c36aab3fe2 Merge seastar-dev avi/smp
This initial attempt at sharding broadcasts all writes while directing
reads to the local shards.

Further refinements will keep schema updates broadcast, but will unicast
row mutations to their shard (and conversely convert row reads from reading
the local shard to unicast as well).

Of particular interest is the change to the thrift handler, where a
sequential application of mutations is converted to parallel application,
in order to hide SMP latency and improve batching.
2015-02-19 18:04:57 +01:00
Avi Kivity
842a0f70ca Revert "core: demangle stdout"
This reverts commit a8698fa17c62542a98e928d395f38e66f6ff148c; causes
problems with dpdk.

Conflicts:
	core/stdio.cc
	core/stdio.hh
2015-02-19 18:55:43 +02:00
Avi Kivity
caa83858f0 reactor: handle SIGTERM and SIGINT once
stop() is not prepared to be called twice.
2015-02-19 18:45:50 +02:00
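One way to make termination handling idempotent is an atomic flag checked before calling stop(). This sketch assumes signals are dispatched from the event loop, not from an async-signal handler, where calling arbitrary code is unsafe. `shutdown_guard`, `on_signal()`, and the `stop_calls` counter are hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shutdown guard: stop() runs at most once, even if both
// SIGINT and SIGTERM (or the same signal twice) are delivered.
struct shutdown_guard {
    std::atomic<bool> stopping{false};
    int stop_calls = 0;

    void stop() { ++stop_calls; }  // stand-in for the real reactor stop()

    // Called when a termination signal is processed; only the first call
    // triggers stop(), later ones are ignored.
    bool on_signal() {
        if (stopping.exchange(true)) {
            return false;  // already stopping
        }
        stop();
        return true;
    }
};
```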
Avi Kivity
cb63d16b40 thrift: get rid of useless try/catch
Exceptions are now handled with then_wrapped(), nothing is left to catch.
2015-02-19 18:00:03 +02:00
Avi Kivity
70381a6da5 db: distribute database object
s/database/distributed<database>/ everywhere.

Use simple distribution rules: writes are broadcast, reads are local.
This causes tremendous data duplication, but will change soon.
2015-02-19 17:53:13 +02:00
Gleb Natapov
3928092053 smp: put a cache line between smp statistics structures
A CPU may automatically prefetch the next cache line, so if statistics
collected on different CPUs reside on adjacent cache lines, a CPU may
needlessly prefetch a cache line that is guaranteed to be accessed by
another CPU. Fix it by putting a cache line between the two structures.
2015-02-19 17:00:18 +02:00
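Below is a sketch of the padding idea, assuming 64-byte cache lines; `smp_stats` and `padded_stats` are hypothetical names. Each per-CPU block is aligned to a cache line and given one full line of padding. That keeps at least one unused line between the hot lines of neighboring CPUs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines (typical for x86-64).
constexpr std::size_t cache_line_size = 64;

// Hypothetical per-CPU statistics block.
struct smp_stats {
    std::uint64_t sent = 0;
    std::uint64_t received = 0;
};

// Aligned to a cache line and padded with one extra line, so that a
// neighbor-line prefetch on one CPU does not pull in the line another
// CPU is updating. sizeof(padded_stats) == round_up(sizeof(smp_stats), 64) + 64.
struct alignas(cache_line_size) padded_stats {
    smp_stats stats;
    char pad[cache_line_size];  // the cache line between the structures
};
```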
Vlad Zolotarov
e752975b08 DPDK: Copy the Rx data into the allocated buffer in a non-decoupled case
If decoupling data buffers from the rte_mbuf is not possible (hugetlbfs is not
available), copy the newly received data into a memory buffer we allocate and
build the "packet" object from this buffer. This allows us to return the
rte_mbuf immediately, which solves the same issue that "decoupling" solves
when hugetlbfs is available.

The implementation is simplistic (no preallocation, packet data cache alignment, etc.).

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-19 16:59:48 +02:00
Vlad Zolotarov
06565c80d5 DPDK: Decouple Rx data buffers
- Allocate the data buffers instead of using the default inline rte_mbuf
  layout.
- Implement an rx_gc() and add an _rx_gc_poller to call it: we refill the
  Rx mbufs once at least 64 buffers are free.
  This threshold has been chosen as a sane enough number.
- Introduce mbuf_data_size == 4K. Allocate 4K buffers for a detached flow.
  We still allocate 2K data buffers for the inline case, since 4K
  buffers would require 2 pages per mbuf due to "mbuf_overhead".

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-19 16:58:54 +02:00
Vlad Zolotarov
a4345fa9bf DPDK: Rename: mbuf_size -> inline_mbuf_size and mbuf_data_size -> inline_mbuf_data_size
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-19 16:58:52 +02:00
Vlad Zolotarov
e04cb93d2f DPDK: Use std::vector instead of std::deque for a tx_buf_factory internal cache
std::vector is guaranteed to use contiguous storage, while std::deque is not.
In addition, std::vector's semantics yield simpler code than deque's.
Therefore std::vector should deliver better performance for the stack semantics
we need here.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-19 16:58:52 +02:00
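Below is a minimal LIFO buffer cache built on std::vector, illustrating the stack semantics the commit relies on. `buffer_cache` and its methods are hypothetical, not the actual tx_buf_factory interface. Reserving capacity up front avoids reallocation on the fast path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical LIFO cache of reusable buffers. push_back()/pop_back() at
// the end of a contiguous vector are all a stack needs, and the most
// recently freed (likely cache-hot) buffer is handed out first.
template <typename Buf>
class buffer_cache {
    std::vector<Buf*> _free;
public:
    explicit buffer_cache(std::size_t capacity) { _free.reserve(capacity); }

    void put(Buf* b) { _free.push_back(b); }

    // Returns nullptr when the cache is empty.
    Buf* get() {
        if (_free.empty()) {
            return nullptr;
        }
        Buf* b = _free.back();
        _free.pop_back();
        return b;
    }

    std::size_t size() const { return _free.size(); }
};
```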
Vlad Zolotarov
d22af60057 DPDK: Fix the mempool external buffer size calculation
Take into account the alignment, header and trailer that mempool adds
to the elements.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-19 16:58:51 +02:00
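Below is a generic sketch of the size calculation. It assumes each pool object is wrapped in a fixed header and trailer, and that the whole slot is rounded up to the pool alignment. The `pool_layout` struct and `pool_memory_size` function are hypothetical, and DPDK's exact layout rules may differ.

```cpp
#include <cassert>
#include <cstddef>

// Round x up to a multiple of align (align must be a power of two).
constexpr std::size_t align_up(std::size_t x, std::size_t align) {
    return (x + align - 1) & ~(align - 1);
}

// Hypothetical per-object layout of a memory pool.
struct pool_layout {
    std::size_t header;   // bytes the pool places before each element
    std::size_t trailer;  // bytes the pool places after each element
    std::size_t align;    // alignment of each element slot
};

// Bytes of external memory needed for n elements of elt_size bytes each.
// Counting only n * elt_size would underestimate the requirement.
constexpr std::size_t pool_memory_size(std::size_t n, std::size_t elt_size,
                                       const pool_layout& l) {
    return n * align_up(l.header + elt_size + l.trailer, l.align);
}
```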
Gleb Natapov
bebefe2afe net: return reference to hw_feature instead of copying the structure
I noticed that tcp::hw_features() is not inlined and copies the
structure to a caller. The function takes ~1.5% in httpd profiling.
2015-02-19 16:58:50 +02:00
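The fix amounts to returning a reference to the stored structure rather than a copy. In this sketch, `offload_features` and `device` are hypothetical stand-ins for the real types. Defining the accessor inside the class body makes it implicitly inline.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical feature flags; copying a larger real structure on every
// call is what showed up in the profile.
struct offload_features {
    bool tx_csum_offload = false;
    bool rx_csum_offload = false;
    std::uint16_t mtu = 1500;
};

class device {
    offload_features _hw_features;
public:
    // Return references to the stored structure instead of copies.
    const offload_features& hw_features() const { return _hw_features; }
    offload_features& hw_features() { return _hw_features; }
};
```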
Avi Kivity
7f8d88371a Add LICENSE, NOTICE, and copyright headers to all source files.
The two files imported from the OSv project retain their original licenses.
2015-02-19 16:52:34 +02:00
Avi Kivity
e8096ff2bb db: add database::stop()
Required by distributed<>'s contract.
2015-02-19 16:33:27 +02:00
Avi Kivity
8f9f794a73 db: make column_family::apply(mutation) not steal the contents
With replication, we want the contents of the mutation to be available
to multiple replicas.

(In this context, we will replicate the mutation to all shards in the same
node, as a temporary step in sharding a node; but the issue also occurs
when replicating to other nodes).
2015-02-19 16:23:09 +02:00
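Below is a sketch of a non-stealing apply(). It takes the mutation by const reference and copies the cells out of it, so the same object can be applied to several replicas. The `mutation` and `column_family` types here are simplified, hypothetical stand-ins that keep rows and cells in std::map with string keys.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical mutation: cell updates for one partition key.
struct mutation {
    std::string key;
    std::map<std::string, std::string> cells;
};

struct column_family {
    std::map<std::string, std::map<std::string, std::string>> rows;

    // Copies the cells and leaves the caller's mutation intact, so the
    // same object can be applied to other replicas (shards or nodes).
    void apply(const mutation& m) {
        auto& row = rows[m.key];
        for (const auto& cell : m.cells) {
            row[cell.first] = cell.second;
        }
    }
};
```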
Avi Kivity
a2519926a6 db: add some iostream output operators
Helps debugging
2015-02-19 15:56:26 +02:00
Avi Kivity
3ec83658f3 thrift: store the keyspace name in set_keyspace()
The keyspace pointer is only valid for the local shard.
2015-02-19 15:55:17 +02:00
Avi Kivity
93818692e1 thrift: add adapter from futures to thrift completion objects
Futures hold either a value or an exception; thrift uses two separate
function objects to signal completion, one for success, the other for
an exception.

Add a helper to pass the result of a future to either of these.
2015-02-19 09:32:18 +02:00
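Below is a sketch of the adapter. A hypothetical `ready_result<T>`, holding a value or a std::exception_ptr, stands in for a resolved seastar future. The callback names `cob` and `exn_cob` are assumptions modeled on thrift's callback style. The point is that exactly one of the two callbacks runs.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Stand-in for a resolved future: holds either a value or an exception.
template <typename T>
struct ready_result {
    T value{};
    std::exception_ptr error;  // non-null means the operation failed
};

// Forward the outcome to exactly one of the two completion callbacks:
// cob on success, exn_cob on failure.
template <typename T, typename SuccessCob, typename ErrorCob>
void complete(const ready_result<T>& r, SuccessCob&& cob, ErrorCob&& exn_cob) {
    if (r.error) {
        exn_cob(r.error);
    } else {
        cob(r.value);
    }
}
```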
Avi Kivity
96a93a2d8c thrift: add workaround for compile breakage due to thrift code generator
2015-02-19 09:32:18 +02:00
Avi Kivity
b795e9375f Merge branch 'master' of github.com:cloudius-systems/seastar into db
2015-02-19 09:28:04 +02:00
Avi Kivity
a8698fa17c core: demangle stdout
When using print() to debug on smp, it is very annoying to get interleaved
output.

Fix by wrapping stdout with a fake stream that has a line buffer for each
thread.
2015-02-19 09:26:17 +02:00
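Below is a sketch of per-thread line buffering, assuming output goes through a single helper; `line_buffered_writer` is hypothetical. Each thread appends to its own thread_local buffer and writes only complete lines, each with a single fwrite() call. C stdio locks a stream for the duration of each such call, so whole lines from different threads should not interleave.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical per-thread line buffer: text accumulates per thread and is
// emitted one complete line at a time.
class line_buffered_writer {
    static thread_local std::string _pending;
public:
    static void write(const std::string& text, std::FILE* out = stdout) {
        for (char c : text) {
            _pending.push_back(c);
            if (c == '\n') {
                std::fwrite(_pending.data(), 1, _pending.size(), out);
                _pending.clear();
            }
        }
    }
    static std::size_t pending_size() { return _pending.size(); }
};

thread_local std::string line_buffered_writer::_pending;
```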
Glauber Costa
861d2625b2 file_stream: proper seek support.
Our file_stream interface supports seek, but when we try to seek to arbitrary
locations that are not aligned to an aio boundary (for instance, f->seek(4)),
we end up being unable to perform the read.

We need to guarantee that reads are aligned, and then present the buffer to
the caller properly offset.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-02-18 22:56:07 +02:00
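Below is a sketch of the alignment arithmetic, assuming a 4096-byte AIO alignment; the real requirement depends on the device and filesystem. `plan_read` and `aligned_read` are hypothetical. The read is issued at the aligned offset below the requested position, and the first `skip` bytes of the result are dropped before the buffer reaches the caller. The read length would similarly need rounding up, which is omitted here.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 4096-byte alignment for direct/AIO reads.
constexpr std::uint64_t aio_alignment = 4096;

// Hypothetical read plan for an arbitrary file position.
struct aligned_read {
    std::uint64_t offset;  // aligned file offset to issue the read at
    std::uint64_t skip;    // bytes to drop from the front of the buffer
};

aligned_read plan_read(std::uint64_t pos) {
    std::uint64_t aligned = pos & ~(aio_alignment - 1);  // round down
    return {aligned, pos - aligned};
}
```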
Tomasz Grabiec
e8d22e5598 tuple: Fix component iterator
The iterator prematurely compared equal with end().
2015-02-18 21:59:35 +02:00
Gleb Natapov
c4c5899f89 net: handle arp resolution errors in tcp
Pass timeouts up the calling chain and schedule a retry if the waiter list is
too long.
2015-02-18 20:12:08 +02:00
Avi Kivity
07bb7cceb1 tests: fix mutation_test debug build
2015-02-18 20:09:41 +02:00
Avi Kivity
7c5755e76c Merge branch 'master' of github.com:cloudius-systems/seastar into db
2015-02-18 20:01:12 +02:00
Avi Kivity
84234b5b9a memory: implement the C11 aligned_alloc() function
2015-02-18 19:48:18 +02:00
Avi Kivity
b4098dac2f core: add distributed::invoke_on() variants not requiring a pointer to member
The current variants of distributed<T>::invoke_on() require a member function
to invoke, which may be tedious to implement in some cases. Add a variant
that supports invoking a functor, accepting the local instance by reference.
2015-02-18 16:52:56 +02:00
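Below is a single-threaded sketch of the two invoke_on() flavors. `local_distributed<T>` is a hypothetical stand-in that keeps one instance per shard in a vector and calls into it directly. The real distributed<T> runs the call on the owning core and returns a future.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical single-threaded stand-in for distributed<T>.
template <typename T>
class local_distributed {
    std::vector<T> _instances;  // one instance per "shard"
public:
    explicit local_distributed(std::size_t shards) : _instances(shards) {}

    // Member-function-pointer variant.
    template <typename Ret, typename... Args>
    Ret invoke_on(std::size_t shard, Ret (T::*func)(Args...), Args... args) {
        return (_instances.at(shard).*func)(std::move(args)...);
    }

    // Functor variant: any callable accepting the local instance by reference.
    template <typename Func>
    auto invoke_on(std::size_t shard, Func&& func)
        -> decltype(func(std::declval<T&>())) {
        return func(_instances.at(shard));
    }
};
```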
Avi Kivity
17914a80cd future: add a utility to promote a type to its own future
Some of the core functions accept functions returning either an immediate
type, or a future, and return a future in either case (e.g. smp::submit_to()).

To make it easier to metaprogram with these functions, provide a utility
that computes the return type, futurize<T>:

   futurize_t<bar>          => future<bar>

   futurize_t<void>         => future<>

   futurize_t<future<bar>>  =>  future<bar>
2015-02-18 16:52:56 +02:00
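The three mappings listed above can be expressed as a primary template plus two specializations. This sketch uses an empty stand-in `future<T...>` class and covers only the type computation, not any helpers that wrap a function call.

```cpp
#include <type_traits>

// Empty stand-in for a variadic future<T...>; only the type matters here.
template <typename... T>
class future {};

// Plain value: wrap it.
template <typename T>
struct futurize {
    using type = future<T>;
};

// void: an empty future.
template <>
struct futurize<void> {
    using type = future<>;
};

// Already a future: unchanged.
template <typename... Args>
struct futurize<future<Args...>> {
    using type = future<Args...>;
};

template <typename T>
using futurize_t = typename futurize<T>::type;
```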
Gleb Natapov
f7cade107b seawreck: abort on a connection error
2015-02-18 16:52:56 +02:00
Gleb Natapov
1cfaa7eefe net: populate dpdk redirection table even if there is only one queue
tcp::connect() uses the redirection table to figure out which queue will
handle a connection.
2015-02-18 16:52:56 +02:00
Avi Kivity
c1abe0e573 smp: remove gratuitous cache miss when no responses are pending
boost::lockfree::spsc_queue::push() writes the producer index even when no
data is pushed, so check whether we need to do any work beforehand.
2015-02-17 18:00:12 +02:00
Vlad Zolotarov
1934160549 DPDK: Add TSO support
- tcp.hh: Properly calculate the pseudo-header in the TSO case: it should be
  calculated as if ip_len is zero.
- Enable TSO in the DPDK network backend.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-02-17 12:47:13 +02:00
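Below is a sketch of the IPv4 TCP pseudo-header sum with the TSO rule from the first bullet. It is computed in host byte order for readability; real code must use network byte order. `pseudo_header_sum` and `fold` are hypothetical. The result is the folded, uncomplemented sum that checksum-offload NICs typically expect in the TCP checksum field. With TSO, the NIC inserts each segment's length itself.

```cpp
#include <cassert>
#include <cstdint>

// Fold a 32-bit one's-complement accumulator down to 16 bits.
static std::uint16_t fold(std::uint32_t sum) {
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return static_cast<std::uint16_t>(sum);
}

// Pseudo-header sum over source IP, destination IP, protocol and TCP
// length; with TSO the length term is taken as zero.
std::uint16_t pseudo_header_sum(std::uint32_t src_ip, std::uint32_t dst_ip,
                                std::uint16_t tcp_len, bool tso) {
    const std::uint32_t protocol_tcp = 6;
    std::uint32_t sum = 0;
    sum += src_ip >> 16;
    sum += src_ip & 0xffff;
    sum += dst_ip >> 16;
    sum += dst_ip & 0xffff;
    sum += protocol_tcp;
    sum += tso ? 0 : tcp_len;
    return fold(sum);
}
```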
Avi Kivity
112277e4e7 Merge branch 'types' into db
Fix obvious bottlenecks in mutations, from Tomasz:

"These changes improve throughput in perf_mutation test on my laptop 20 times,
 from ~120K to 2.4M tps."
2015-02-17 12:43:22 +02:00
Tomasz Grabiec
245f4dfe00 types: Avoid allocation in simple_type_impl::less()
2015-02-17 12:43:15 +02:00
Tomasz Grabiec
ad3ffd2e96 tuple: Remove internal deserialize_value() usages
deserialize_value() is slow because it involves multiple allocations
and copies. Internal operations such as compare() or hash() don't need
such heavy transformations; now that those functions work on
bytes_view, we can iterate over component values in place.
2015-02-17 12:43:15 +02:00
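Below is a sketch of in-place component iteration over a serialized tuple. It assumes a hypothetical format in which each component is a 2-byte big-endian length followed by its bytes; the real encoding may differ. Each component is exposed as a std::string_view into the original buffer, so visiting the components costs no allocations or copies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Visit each component as a view into buf. Returns false if the buffer
// is truncated (hypothetical format: u16 big-endian length + bytes).
template <typename Visitor>
bool for_each_component(std::string_view buf, Visitor&& visit) {
    std::size_t pos = 0;
    while (pos < buf.size()) {
        if (buf.size() - pos < 2) {
            return false;  // truncated length prefix
        }
        std::size_t len = (std::size_t(std::uint8_t(buf[pos])) << 8) |
                          std::uint8_t(buf[pos + 1]);
        pos += 2;
        if (buf.size() - pos < len) {
            return false;  // truncated component
        }
        visit(buf.substr(pos, len));
        pos += len;
    }
    return true;
}
```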