mirror of https://github.com/scylladb/scylladb.git synced 2026-06-06 23:13:15 +00:00

Nadav Har'El d42c05b6ad sstable: Pull-style read interface

This patch changes the sstable read APIs from "push" style to
"pull" style.

The sstable read code has two APIs:
 1. An API for sequentially consuming low-level sstable items - sstable
    row's beginning and end, cells, tombstones, etc.
 2. An API for sequentially consuming entire sstable rows in our "mutation"
    format.

Before this patch, both APIs were in "push style": The user supplies
callback functions, and the sstable read code "pushes" to these functions
the desired items (low-level sstable parts, or whole mutations).
However, a push API is very inconvenient for users, like the query
processing code, or the compaction code, which both iterate over mutations.
Such code wants to control its own progression through the iteration -
the user prefers to "pull" the next mutation when it wants it; Moreover,
the user wants to *stop* pulling more mutations if it wants, without
worrying about various continuations that are still scheduled in the
background (the latter concern was especially problematic in the "push"
design).

The modified APIs are:

1. The functions for iterating over mutations, sstable::read_rows() et al.,
   now return a "mutation_reader" object which can be used for iterating
   over the mutations: mutation_reader::read() asks for the next mutation,
   and returns a future to it (or an unassigned value on EOF).
   You can see an example on how it is used in sstable_mutation_test.cc.

2. The functions for consuming low-level sstable items (row begin, cell,
   etc.) are still partially push-style - the items are still fed into
   the consume object - but consumption now *stops* (instead of deferring
   and continuing later, as in the old code) when the consumer asks to.
   The caller can resume the consumption later when it wishes to (in
   this sense, this is a "pull" API, because the user asks for more
   input when it wants to).
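
The pull-style iteration described in item 1 can be sketched roughly as follows. This is only an illustration under simplifying assumptions: toy_reader, toy_mutation and take() are hypothetical names, and read() here is synchronous and returns a std::optional, whereas the real mutation_reader::read() returns a future.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mutation; the real type is far richer.
using toy_mutation = std::string;

// Toy pull-style reader: each read() call hands back the next item,
// or an empty optional once the input is exhausted (EOF).
class toy_reader {
    std::vector<toy_mutation> _items;
    std::size_t _pos = 0;
public:
    explicit toy_reader(std::vector<toy_mutation> items)
        : _items(std::move(items)) {}

    std::optional<toy_mutation> read() {
        if (_pos == _items.size()) {
            return std::nullopt;
        }
        return _items[_pos++];
    }
};

// The caller drives the iteration and may stop early; nothing is left
// running in the background once it stops pulling.
std::vector<toy_mutation> take(toy_reader& reader, std::size_t limit) {
    std::vector<toy_mutation> out;
    while (out.size() < limit) {
        auto m = reader.read();
        if (!m) {
            break;  // EOF
        }
        out.push_back(std::move(*m));
    }
    return out;
}
```

Calling take() twice on the same reader resumes where the first call stopped, which is the property the push-style API made awkward.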

This patch does *not* remove input_stream's feature of a consumer
function returning a non-ready future. However, this feature is no longer
used anywhere in our code - the new sstable reader code stops the
consumption where the old sstable reader code paused it temporarily with
a non-ready future.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>

2015-06-03 10:55:34 +03:00

api

api: clean up the gossiper API impl

2015-06-02 11:13:15 +03:00

apps

Merge branch 'master' of github.com:cloudius-systems/seastar into db

2015-04-26 13:16:35 +03:00

core

distributed: document foreign_ptr<>

2015-06-02 22:37:43 +03:00

cql3

Unify column_definition::column_kind and ::column_kind enums

2015-06-02 11:22:41 +02:00

map_difference: Simplify difference value

2015-06-03 09:19:00 +03:00

dht

sstables: collect validation metadata

2015-06-02 10:32:12 +03:00

docker/dev

build: add cryptopp, libpciaccess, zlib, and libxml2 to docker installation

2015-01-19 20:13:15 +02:00

exceptions

exceptions: allow for message in unsupported operation

2015-05-19 11:22:41 -04:00

gms

gossip: Implement versioned_value for tokens

2015-06-01 11:24:38 +08:00

http

http: Support file transformers

2015-06-02 11:22:40 +02:00

interface

cassandra.thrift: add copyright and change notices

2014-12-23 19:51:56 +02:00

io: Convert io/ISerializer.java

2015-01-05 14:13:31 +08:00

json

json: float and double support

2015-06-03 10:01:00 +03:00

kvm

VM image build script to running SeaStar on Linux guest

2015-05-27 15:39:30 +03:00

licenses

licenses: add Apache License 2.0 for code copied from upstream Cassandra

2014-12-23 19:51:18 +02:00

locator

token_metadata: Move update_host_id to source file

2015-06-01 11:33:11 +08:00

message

message: drop tuple<int, long> serializer

2015-05-31 17:53:51 +03:00

net

Remove redundant const in static constexpr const

2015-05-25 11:57:19 +03:00

rpc

rpc: fix get_stats not returning the updated stats structure

2015-05-27 15:06:01 +03:00

scripts

Add LICENSE, NOTICE, and copyright headers to all source files.

2015-02-19 16:52:34 +02:00

service

storage_proxy: fix shared_ptr misuse in query_local()

2015-06-01 17:18:35 +02:00

sstables

sstable: Pull-style read interface

2015-06-03 10:55:34 +03:00

tests

sstable: Pull-style read interface

2015-06-03 10:55:34 +03:00

thrift

thrift: implement describe_keyspace[s]

2015-06-02 14:11:34 +02:00

transport

transport: Make CQL server less noisy

2015-05-20 11:51:35 +03:00

util

util: Add bytes version of serialize_string

2015-03-16 12:46:37 +01:00

utils

class_registrator: check whether the class exists

2015-06-02 14:11:34 +02:00

.gitignore

Ignore cscope files

2014-10-23 10:46:55 +03:00

.gitorderfile

gitorderfile: make changes into *.py files appear first

2015-05-12 10:13:25 +03:00

atomic_cell.hh

db: Add atomic_cell::deletion_time()

2015-05-10 12:03:26 +03:00

bytes_ostream.hh

bytes_ostream.hh: bytes_ostream::empty()

2015-04-28 15:49:34 +03:00

bytes.cc

db: Extract bytes related stuff from database.cc to bytes.cc

2015-04-29 15:50:16 +03:00

bytes.hh

db: make bytes even more distinct from sstring

2015-04-07 10:56:19 +03:00

cartesian_product.hh

Introduce cartesian product calculating helper

2015-03-11 14:56:10 +01:00

combine.hh

Add combine() template

2015-03-05 18:11:37 +02:00

compound_compat.hh

compound_compat: Remove leftover code

2015-04-30 15:40:02 +02:00

compound.hh

compound: make compound_type::type() const

2015-06-02 14:11:34 +02:00

configure.py

Merge seastar upstream

2015-06-02 15:13:42 +03:00

database_fwd.hh

db: move memtable definition to its own file

2015-05-17 12:38:32 +03:00

database.cc

db: update keyspace_metadata when column family is added

2015-06-02 14:11:34 +02:00

database.hh

db: add getter for database::_keyspaces

2015-06-02 14:11:34 +02:00

db_clock.hh

db_clock: Add now_in_usecs() helper function

2015-03-26 12:25:00 +02:00

dns.hh

stub dns resolver

2015-05-21 15:17:34 +03:00

Doxyfile

doc: don't document internal classes

2015-04-15 17:14:58 +03:00

enum_set.hh

enum_set: Introduce enum_set::of<>()

2015-04-15 20:33:49 +02:00

frozen_mutation.cc

db: Introduce frozen_mutation

2015-05-08 09:19:01 +02:00

frozen_mutation.hh

db: Introduce frozen_mutation

2015-05-08 09:19:01 +02:00

gc_clock.hh

db: Store ttl in atomic_cell

2015-05-06 19:42:38 +02:00

keys.cc

keys: Introduce view wrappers

2015-05-06 15:52:56 +02:00

keys.hh

keys: Fix make_empty()

2015-05-13 08:56:53 +02:00

LICENSE.seastar

Merge branch 'master' of github.com:cloudius-systems/seastar into db

2015-02-22 16:23:59 +02:00

log.cc

log: add slf4j-compatible logger class

2014-12-29 17:09:41 +02:00

log.hh

log: add slf4j-compatible logger class

2014-12-29 17:09:41 +02:00

main.cc

main: Set a default seed ip address

2015-05-27 13:06:00 +03:00

map_difference.hh

map_difference: Simplify difference value

2015-06-03 09:19:00 +03:00

memtable.hh

db: abstract memtable empty test

2015-05-21 15:48:51 +03:00

mutation_partition_applier.hh

db: Encapsulate deletable_row fields

2015-05-13 08:56:54 +02:00

mutation_partition_serializer.cc

db: Encapsulate deletable_row fields

2015-05-13 08:56:54 +02:00

mutation_partition_serializer.hh

db: Introduce frozen_mutation

2015-05-08 09:19:01 +02:00

mutation_partition_view.cc

db: Introduce frozen_mutation

2015-05-08 09:19:01 +02:00

mutation_partition_view.hh

db: Introduce frozen_mutation

2015-05-08 09:19:01 +02:00

mutation_partition_visitor.hh

db: Introduce frozen_mutation

2015-05-08 09:19:01 +02:00

mutation_partition.cc

db: Encapsulate deletable_row fields

2015-05-13 08:56:54 +02:00

mutation_partition.hh

db: Encapsulate deletable_row fields

2015-05-13 08:56:54 +02:00

mutation.cc

db: Encapsulate deletable_row fields

2015-05-13 08:56:54 +02:00

mutation.hh

mutation: add a move assignment operator

2015-05-21 16:27:35 -04:00

NOTICE

Add LICENSE, NOTICE, and copyright headers to all source files.

2015-02-19 16:52:34 +02:00

NOTICE.txt

Add NOTICE file as required by the Apache license.

2014-12-24 09:47:18 +02:00

nway_merger.hh

nway_merger: allow for comparators without default constructors

2015-05-05 19:37:21 +03:00

ORIGIN

Add ORIGIN file to remind us which sources (exact revision) we are converting

2014-12-24 09:45:34 +02:00

partition_builder.hh

db: Encapsulate deletable_row fields

2015-05-13 08:56:54 +02:00

query_result_merger.hh

db: split query.hh to reduce header dependencies

2015-04-15 20:44:59 +02:00

query-request.hh

db: Add clarifying description to query_command and partition_slice

2015-05-13 08:56:54 +02:00

query-result-reader.hh

db: Rename "ttl" to "expiry" when it's used as time point

2015-05-06 17:27:22 +02:00

query-result-set.cc

query-result-set.hh: Use data_value instead of boost::any

2015-05-27 11:49:12 +03:00

query-result-set.hh

query-result-set.hh: Add comparison operators

2015-05-27 11:49:12 +03:00

query-result-writer.hh

db: Add atomic_cell::deletion_time()

2015-05-10 12:03:26 +03:00

query-result.hh

db: Rename "ttl" to "expiry" when it's used as time point

2015-05-06 17:27:22 +02:00

query.cc

db: split query.hh to reduce header dependencies

2015-04-15 20:44:59 +02:00

README-OSv.md

README: Rename README-OSv to README-OSv.md

2015-05-05 10:09:18 +03:00

README-urchin.md

README: Add more missing packages for building urchin

2015-05-07 12:51:50 +03:00

README.md

Revert "dpdk: use combined library"

2015-04-15 12:45:06 +03:00

schema_builder.hh

Make schema_builder constructible from schema

2015-06-02 11:22:42 +02:00

schema.cc

Make schema_builder constructible from schema

2015-06-02 11:22:42 +02:00

schema.hh

Make schema_builder constructible from schema

2015-06-02 11:22:42 +02:00

serialization_format.hh

db: add equality compare operators to serialization_format

2015-04-10 22:26:41 +03:00

test.py

tests: add sstable mutation test to the full set of regression tests

2015-05-21 09:01:46 +02:00

timestamp.hh

Decompose database.hh, types.hh into smaller headers

2015-03-04 16:18:48 +02:00

to_string.hh

db: make join() work on any range, not just a vector

2015-04-01 20:12:39 +03:00

tombstone.hh

tombstone: rename ttl to deletion_time

2015-03-30 09:07:01 +02:00

types.cc

database: Remove compose() function

2015-05-27 16:22:35 +03:00

types.hh

database: Fix data_value comparison operator

2015-05-28 11:21:22 +02:00

unimplemented.cc

sstables: implement conversion of range tombstone

2015-05-13 17:38:56 -04:00

unimplemented.hh

sstables: implement conversion of range tombstone

2015-05-13 17:38:56 -04:00

validation.cc

Relax header dependencies

2015-04-24 18:01:01 +02:00

validation.hh

Relax header dependencies

2015-04-24 18:01:01 +02:00

README.md

Seastar

Introduction

SeaStar is an event-driven framework allowing you to write non-blocking, asynchronous code in a relatively straightforward manner (once understood). It is based on futures.

Building Seastar

Building seastar on Fedora 21

Installing required packages:

yum install gcc-c++ libaio-devel ninja-build ragel hwloc-devel numactl-devel libpciaccess-devel cryptopp-devel xen-devel boost-devel

You then need to run the following to create the "build.ninja" file:

./configure.py

Note it is enough to run this once, and you don't need to repeat it before every build. build.ninja includes a rule which will automatically re-run ./configure.py if it changes.

Then finally:

ninja-build

Building seastar on Fedora 20

Installing GCC 4.9 for gnu++1y:

Beware that this installation will replace your current GCC version.

yum install fedora-release-rawhide
yum --enablerepo rawhide update gcc-c++
yum --enablerepo rawhide install libubsan libasan

Installing required packages:

yum install libaio-devel ninja-build ragel hwloc-devel numactl-devel libpciaccess-devel cryptopp-devel

You then need to run the following to create the "build.ninja" file:

./configure.py

Note it is enough to run this once, and you don't need to repeat it before every build. build.ninja includes a rule which will automatically re-run ./configure.py if it changes.

Then finally:

ninja-build

Building seastar on Ubuntu 14.04

Installing required packages:

sudo apt-get install libaio-dev ninja-build ragel libhwloc-dev libnuma-dev libpciaccess-dev libcrypto++-dev libboost-all-dev

Installing GCC 4.9 for gnu++1y. Unlike the Fedora case above, this will not harm the existing installation of GCC 4.8, and will install an additional set of compilers, and additional commands named gcc-4.9, g++-4.9, etc., that need to be used explicitly, while the "gcc", "g++", etc., commands continue to point to the 4.8 versions.

# Install add-apt-repository
sudo apt-get install software-properties-common python-software-properties
# Use it to add Ubuntu's testing compiler repository
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
# Install gcc 4.9 and relatives
sudo apt-get install g++-4.9
# Also set up necessary header file links and stuff (?)
sudo apt-get install gcc-4.9-multilib g++-4.9-multilib

To compile Seastar explicitly using gcc 4.9, use:

./configure.py --compiler=g++-4.9

To compile OSv explicitly using gcc 4.9, use:

make CC=gcc-4.9 CXX=g++-4.9 -j 24

Building seastar in Docker container

To build a Docker image:

docker build -t seastar-dev docker/dev

Create a shell function for building inside the container (bash syntax given):

$ seabuild() { docker run -v $HOME/seastar/:/seastar -u $(id -u):$(id -g) -w /seastar -t seastar-dev "$@"; }

(it is recommended to put this inside your .bashrc or similar)

To build inside a container:

$ seabuild ./configure.py
$ seabuild ninja-build

Building with a DPDK network backend

Setup host to compile DPDK:
- Ubuntu

sudo apt-get install -y build-essential linux-image-extra-`uname -r`

Prepare a DPDK SDK:

Download the latest DPDK release: wget http://dpdk.org/browse/dpdk/snapshot/dpdk-1.8.0.tar.gz
Untar it.
Edit config/common_linuxapp: set CONFIG_RTE_MBUF_REFCNT to 'n'.
For DPDK 1.7.x: edit config/common_linuxapp:
- Set CONFIG_RTE_LIBRTE_PMD_BOND to 'n'.
- Set CONFIG_RTE_MBUF_SCATTER_GATHER to 'n'.
- Set CONFIG_RTE_LIBRTE_IP_FRAG to 'n'.
Start the tools/setup.sh script as root.
Compile a linuxapp target (option 9).
Install IGB_UIO module (option 11).
Bind some physical port to IGB_UIO (option 17).
Configure hugepage mappings (option 14/15).

Run configure.py: ./configure.py --dpdk-target <Path to untarred dpdk-1.8.0 above>/x86_64-native-linuxapp-gcc --compiler=g++-4.9.
Run ninja-build.

To run with the DPDK backend for a native stack, pass the --dpdk-pmd 1 parameter to the seastar application.

Futures and promises

A future is a result of a computation that may not be available yet. Examples include:

a data buffer that we are reading from the network
the expiration of a timer
the completion of a disk write
the result of a computation that requires the values from one or more other futures.

A promise is an object or function that provides you with a future, with the expectation that it will fulfill the future.

Promises and futures simplify asynchronous programming since they decouple the event producer (the promise) and the event consumer (whoever uses the future). Whether the promise is fulfilled before the future is consumed, or vice versa, does not change the outcome of the code.
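
The order-independence claimed above can be illustrated with a toy promise/future pair. This is a minimal sketch, not Seastar's implementation: toy_state, toy_promise and toy_future are hypothetical names, the value type is fixed to int, and the continuation runs inline rather than being scheduled.

```cpp
#include <functional>
#include <memory>
#include <optional>
#include <utility>

// State shared between a toy promise and its toy future.
struct toy_state {
    std::optional<int> value;               // set once the producer delivers
    std::function<void(int)> continuation;  // set once the consumer attaches
};

class toy_future {
    std::shared_ptr<toy_state> _state;
public:
    explicit toy_future(std::shared_ptr<toy_state> s) : _state(std::move(s)) {}

    // Attach a callback: run it now if the value is already there,
    // otherwise store it for the producer to run later.
    void then(std::function<void(int)> cb) {
        if (_state->value) {
            cb(*_state->value);
        } else {
            _state->continuation = std::move(cb);
        }
    }
};

class toy_promise {
    std::shared_ptr<toy_state> _state = std::make_shared<toy_state>();
public:
    toy_future get_future() { return toy_future(_state); }

    // Deliver the value and run the continuation if one is waiting.
    void set_value(int v) {
        _state->value = v;
        if (_state->continuation) {
            _state->continuation(v);
        }
    }
};

// Promise fulfilled before the future is consumed.
int fulfil_first() {
    int seen = 0;
    toy_promise p;
    toy_future f = p.get_future();
    p.set_value(41);
    f.then([&seen](int v) { seen = v + 1; });
    return seen;
}

// Future consumed before the promise is fulfilled.
int consume_first() {
    int seen = 0;
    toy_promise p;
    toy_future f = p.get_future();
    f.then([&seen](int v) { seen = v + 1; });
    p.set_value(41);
    return seen;
}
```

Both helper functions observe the same result, regardless of which side acts first.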

Consuming a future

You consume a future by using its then() method, providing it with a callback (typically a lambda). For example, consider the following operation:

future<int> get();   // promises an int will be produced eventually
future<> put(int);   // promises to store an int

void f() {
    get().then([] (int value) {
        put(value + 1).then([] {
            std::cout << "value stored successfully\n";
        });
    });
}

Here, we initiate a get() operation, requesting that when it completes, a put() operation will be scheduled with an incremented value. We also request that when the put() completes, some text will be printed out.

Chaining futures

If a then() lambda returns a future (call it x), then that then() will return a future (call it y) that will receive the same value. This removes the need for nesting lambda blocks; for example the code above could be rewritten as:

future<int> get();   // promises an int will be produced eventually
future<> put(int);   // promises to store an int

void f() {
    get().then([] (int value) {
        return put(value + 1);
    }).then([] {
        std::cout << "value stored successfully\n";
    });
}

Loops

Loops are achieved with a tail call; for example:

future<int> get();   // promises an int will be produced eventually
future<> put(int);   // promises to store an int

// 'i' counts completed iterations; the loop stops once it reaches 'end'.
future<> loop_to(int i, int end) {
    if (i == end) {
        return make_ready_future<>();
    }
    return get().then([] (int value) {
        return put(value + 1);
    }).then([i, end] {
        return loop_to(i + 1, end);
    });
}

The make_ready_future() function returns a future that is already available --- corresponding to the loop termination condition, where no further I/O needs to take place.

Under the hood

When the loop above runs, both then() method calls execute immediately --- but without executing the bodies. What happens is the following:

get() is called, initiates the I/O operation, and allocates a temporary structure (call it f1).
The first then() call chains its body to f1 and allocates another temporary structure, f2.
The second then() call chains its body to f2.

Again, all this runs immediately without waiting for anything.

After the I/O operation initiated by get() completes, it calls the continuation stored in f1 and frees f1. The continuation calls put(), which initiates the I/O operation required to perform the store, allocates a temporary object f12, and chains some glue code to it.

After the I/O operation initiated by put() completes, it calls the continuation associated with f12, which simply tells it to call the continuation associated with f2. This continuation simply calls loop_to(). Both f12 and f2 are freed. loop_to() then calls get(), which starts the process all over again, allocating new versions of f1 and f2.
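
The key point of the sequence above, that then() only stores its body and returns at once, can be sketched with a toy structure. Everything here is hypothetical (pending_op, event_log and run_demo are illustration-only names), and completion is triggered by hand rather than by real I/O.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Records the order in which things happen.
using event_log = std::vector<std::string>;

// Toy version of the temporary structure (f1, f2 in the text): it holds
// a continuation until the operation it is waiting for completes.
struct pending_op {
    std::function<void()> continuation;

    // Chain a body; this only stores it and returns immediately.
    void then(std::function<void()> body) {
        continuation = std::move(body);
    }

    // Called when the operation finishes: run the stored body once.
    void complete() {
        if (continuation) {
            auto body = std::move(continuation);
            continuation = nullptr;
            body();
        }
    }
};

event_log run_demo() {
    event_log log;
    pending_op io;  // stands in for the I/O started by get()
    log.push_back("get() started");
    io.then([&log] { log.push_back("continuation ran"); });
    log.push_back("then() returned");  // the body has not run yet
    io.complete();                     // the "I/O" finishes here
    return log;
}
```

The log shows then() returning before the continuation body runs, matching the description above.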

Handling exceptions

If a .then() clause throws an exception, the scheduler will catch it and cancel any dependent .then() clauses. If you want to trap the exception, add a .then_wrapped() clause at the end:

future<buffer> receive();
request parse(buffer buf);
future<response> process(request req);
future<> send(response resp);

void f() {
    receive().then([] (buffer buf) {
        return process(parse(std::move(buf)));
    }).then([] (response resp) {
        return send(std::move(resp));
    }).then([] {
        f();
    }).then_wrapped([] (auto&& f) {
        try {
            f.get();
        } catch (std::exception& e) {
            // your handler goes here
        }
    });
}

The previous future is passed as a parameter to the lambda, and its value can be inspected with f.get(). When f.get() is called, it re-throws the exception that aborted processing (if there was one), and you can then apply any needed error handling. It is essentially a transformation of

buffer receive();
request parse(buffer buf);
response process(request req);
void send(response resp);

void f() {
    try {
        while (true) {
            auto req = parse(receive());
            auto resp = process(std::move(req));
            send(std::move(resp));
        }
    } catch (std::exception& e) {
        // your handler goes here
    }
}

Note, however, that the .then_wrapped() clause is scheduled whether or not an exception occurs. Therefore, the mere fact that .then_wrapped() is executed does not mean that an exception was thrown. Only the execution of the catch block can guarantee that.

This is shown below:


future<my_type> my_future();

void f() {
    my_future().then_wrapped([] (future<my_type> f) {
        try {
            my_type x = f.get();
            return do_something(x);
        } catch (std::exception& e) {
            // your handler goes here
        }
    });
}
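
The "get() either returns the value or re-throws" behaviour can be tried outside Seastar with a small stand-in. This is a sketch under assumptions: toy_result and handle() are hypothetical names, the value type is fixed to int, and the object is already resolved when the handler sees it.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Toy resolved future: holds either a value or a captured exception.
class toy_result {
    int _value = 0;
    std::exception_ptr _error;
public:
    static toy_result ok(int v) {
        toy_result r;
        r._value = v;
        return r;
    }
    static toy_result failed(std::exception_ptr e) {
        toy_result r;
        r._error = std::move(e);
        return r;
    }

    // Returns the value, or re-throws the stored exception.
    int get() const {
        if (_error) {
            std::rethrow_exception(_error);
        }
        return _value;
    }
};

// Plays the role of the then_wrapped() lambda: it is called for both
// outcomes and must call get() to learn which one it received.
std::string handle(const toy_result& f) {
    try {
        return "value " + std::to_string(f.get());
    } catch (const std::exception& e) {
        return std::string("error: ") + e.what();
    }
}
```

The handler runs for both a successful and a failed result; only the catch branch indicates that an exception was thrown.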

Setup notes

SeaStar is a high-performance framework, tuned to get the best performance by default. As such, it favors polling over interrupt-driven operation. Our assumption is that applications written for SeaStar will be busy handling 100,000 IOPS and beyond. Polling means that each of our cores will consume 100% CPU even when it is given no work.

Recommended hardware configuration for SeaStar

CPUs - As many as you need. SeaStar is highly friendly to multi-core and NUMA machines.
NICs - As fast as possible; we recommend 10G or 40G cards. It's possible to use 1G too, but you may be limited by their capacity. In addition, the more hardware queues per CPU the better for SeaStar; otherwise we have to emulate them in software.
Disks - Fast SSDs with a high number of IOPS.
Client machines - Usually a single client machine can't load our servers. Both memaslap (memcached) and WRK (httpd) cannot overload their matching server counterparts. We recommend running the clients on different machines from the servers, and using several of them.
