There's a proper column_family in database.hh now. Remove a stub that
was introduced during the initial conversion.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
From Avi:
Rather rough sharding of the database.
DecoratedKey/Token were de-abstracted (they're concrete types now). This
means that some of their type information can be held in the partitioner,
so anyone playing with tokens needs access to the partitioner.
Sharding is simplistic, using the first byte of the hash as a key to
select the shard (with a modulo operation).
Add database::shard_of() to compute the shard hosting the partition
(with a simplistic algorithm, but perhaps not too bad).
Convert non-metadata invoke_on_all() and local calls on the database
to use shard_of().
We don't follow origin precisely in normalizing the token (converting a
zero to something else). We probably should, to allow direct import of
a database.
Rather than converting to unsigned longs for the fractional computations,
do them in bytes. The overhead of allocating longs will be larger than
the computation, given that tokens are usually short (8 bytes), and
our bytes type stores them inline.
Origin uses abstract types for Token, for two reasons:
1. To create a distinction between tokens for keys and tokens
that represent the end of the range
2. To use different implementations for tokens belonging to different
partitioners.
Using abstract types carries a penalty of indirection, more complex
memory management, and performance. We can eliminate it by using
a concrete type, and defer any differences in the implementation
to the partitioner. End-of-range token representation is folded into
the token class.
This initial attempt at sharding broadcasts all writes while directing
reads to the local shards.
Further refinements will keep schema updates broadcast, but will unicast
row mutations to their shard (and conversely convert row reads from reading
the local shard to unicast as well).
Of particular interest is the change to the thrift handler, where a
sequential application of mutations is converted to parallel application,
in order to hide SMP latency and improve batching.
s/database/distributed<database>/ everywhere.
Use simple distribution rules: writes are broadcast, reads are local.
This causes tremendous data duplication, but will change soon.
A CPU may automatically prefetch the next cache line, so if statistics
collected on different CPUs reside on adjacent cache lines, a CPU may
erroneously prefetch a cache line that is guaranteed to be accessed by
another CPU. Fix this by putting a cache line of padding between the two
structures.
If data buffer decoupling from the rte_mbuf is not available (because
hugetlbfs is not available), copy the newly received data into a memory
buffer we allocate and build the "packet" object from this buffer. This
lets us return the rte_mbuf immediately, which solves the same issue that
"decoupling" solves when hugetlbfs is available.
The implementation is simplistic (no preallocation, packet data cache alignment, etc.).
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Allocate the data buffers instead of using the default inline rte_mbuf
layout.
- Implement an rx_gc() and add an _rx_gc_poller to call it: we refill the
rx mbufs when there are at least 64 free buffers.
This threshold was chosen as a sane enough number.
- Introduce the mbuf_data_size == 4K. Allocate 4K buffers for a detached flow.
We are still going to allocate 2K data buffers for an inline case since 4K
buffers would require 2 pages per mbuf due to "mbuf_overhead".
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
std::vector promises contiguous storage while std::deque does not.
In addition, std::vector's semantics yield simpler code than deque's.
Therefore std::vector should deliver better performance for the stack
semantics we need here.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Take into account the alignment, header, and trailer that the mempool adds
to the elements.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
With replication, we want the contents of the mutation to be available
to multiple replicas.
(In this context, we will replicate the mutation to all shards in the same
node, as a temporary step in sharding a node; but the issue also occurs
when replicating to other nodes).
Futures hold either a value or an exception; thrift uses two separate
function objects to signal completion, one for success, the other for
an exception.
Add a helper to pass the result of a future to either of these.
When using print() to debug on smp, it is very annoying to get interleaved
output.
Fix by wrapping stdout with a fake stream that has a line buffer for each
thread.
Our file_stream interface supports seek, but when we try to seek to
arbitrary locations that are smaller than an aio boundary (for instance,
f->seek(4)), we end up unable to perform the read.
We need to guarantee the reads are aligned, and will then present to the caller
the buffer properly offset.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Current variants of distributed<T>::invoke_on() require a member function to
invoke, which may be tedious to implement for some cases. Add a variant
that supports invoking a functor, accepting the local instance by reference.
Some of the core functions accept functions returning either an immediate
type, or a future, and return a future in either case (e.g. smp::submit_to()).
To make it easier to metaprogram with these functions, provide a utility
that computes the return type, futurize<T>:
futurize_t<bar> => future<bar>
futurize_t<void> => future<>
futurize_t<future<bar>> => future<bar>
- tcp.hh: Properly calculate the pseudo-header in the TSO case: it should be
calculated as if ip_len is zero.
- Enable TSO in the DPDK network backend.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Fix obvious bottlenecks in mutations, from Tomasz:
"These changes improve throughput in perf_mutation test on my laptop 20 times,
from ~120K to 2.4M tps."
deserialize_value() is slow because it involves multiple allocations
and copies. Internal operations such as compare() or hash() don't need
such heavy transformations; now that those functions work on
bytes_view we can iterate over component values in place.