Commit Graph

6040 Commits

Glauber Costa
e1945e473b row consumer: make non_consuming an instance member
It is now a static member that gets the instance members as parameters.  There
is no reason for that, and this will complicate the decoupling, since the
prestate reader won't know about state.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-28 17:18:19 -05:00
Glauber Costa
f45b807f34 row consumer: move proceed class to a separate class
Continuing the work of decoupling the prestate and state parts of the NSM
so we can reuse it, move the proceed class to a different holding class.
Proceeding or not has nothing to do with "rows".

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-28 17:18:06 -05:00
Glauber Costa
49ac04a60a row consumer: fall through more often
Because we previously had no way to know whether or not the read completed,
we would always go back to the main loop, and would only optimize sequential
reads for some kinds of data.

However, as one could see in the previous patch, the new read_X functions will
notify completion, allowing us to just fall through to the next case if that is
the only possibility. In most cases, it isn't. With this, we can apply this
optimization throughout all cases where we don't branch states, and with very
elegant resulting code.

The performance actually increases by 0.75 %. It is not much, but it is more
than the error margin (which sits at 0.20 %), and because the code is not made
unreadable by it, this is a clear win to me.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 16:30:22 -05:00
Glauber Costa
1f930cda4a row consumer: extend use of read for multi-value fields
In an attempt to gain some cycles, we are testing whether we can read many
values at once, and if so, using consume_be directly for those.

What we can do in this situation is read the first value, and let the read
fall through to the next case if the read succeeds.

The code actually looks a lot more elegant this way.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 16:30:22 -05:00
Glauber Costa
0ad8afb0ec row consumer: extend usage of the read_* functions
In some places, we cannot use our read_* functions, because we don't know
whether or not it succeeded, and that is important when passing the state
along.

The fix for this is trivial, since we can just return it from the reader.

Note for reviewers: the comment in one of the functions says we should use
"read_bytes(data, _u32, _key ...". But in the actual code, the buffer there is
_val, not _key.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 16:30:22 -05:00
Glauber Costa
13af0ffbd2 row consumer: fix read_bytes temporary len
It shouldn't be _u16, but rather whatever we passed as len. It currently works
because all callers pass _u16 as len. But this will soon change.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-28 16:30:22 -05:00
Glauber Costa
62a26ef411 row consumer: don't switch state implicitly
Soon enough, the whole state machine will be separated from the prestate
handling. To make that easier, we will decouple them as much as we can.

Not automatically switching states in the read functions is part of this.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-28 16:30:22 -05:00
Avi Kivity
012fd41fc0 db: hard dirty memory limit
Unlike cache, dirty memory cannot be evicted at will, so we must limit it.

This patch establishes a hard limit of 50% of all memory.  Above that,
new requests are not allowed to start.  This allows the system some time
to clean up memory.

Note that we will need more fine-grained bandwidth control than this;
the hard limit is the last line of defense against running out of reclaimable
memory.

Tested with a mixed read/write load; after reads start to dominate writes
(due to the proliferation of small sstables, and the inability of compaction
to keep up), dirty memory usage starts to climb until the hard stop prevents
it from climbing further and OOMing the server.
2015-08-28 14:47:17 +02:00
Avi Kivity
f171d71c16 utils: optimize murmur3_hash data fetch
By using a recognized idiom, gcc can optimize the unaligned little endian
load as a single instruction (actually less than an instruction, as it
combines it with a succeeding arithmetic operation).
2015-08-28 12:37:43 +03:00
Avi Kivity
cb1372523a Merge "CQL code cleanups" from Pekka
"Here's another round of cleanups to the CQL code. Nothing exciting here,
mostly moving code to source files which makes changing the code less
painful in terms of compilation times."
2015-08-27 18:32:45 +03:00
Pekka Enberg
28aad6fa67 cql3: Move ks_prop_defs implementation to source file
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-27 18:16:28 +03:00
Avi Kivity
7e8c6eddbb Merge "Buffer related read performance improvement" from Glauber
"As we could see, the flamegraphs shows a lot of performance still left in the
table.  However, from the I/O point of view, we have determined through our
write performance testing, that 128k is the sweet spot for buffers. Worse yet:
reads are still trapped at 8k.

While it is true that when we want to read just a little data, smaller is
better, it is also true that reads (and now that includes the index), tend to
give hints about the size they want read.

So we can read the whole thing at once if it is smaller than 128k, or chop it
into 128k increments if it is not.

The performance gains coming from doing this are considerable: 39 % for data,
67 % for index."
2015-08-27 18:07:27 +03:00
Pekka Enberg
c2ff7b67ce cql3: Move user_types implementation to source file
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-27 17:50:54 +03:00
Avi Kivity
cf0825182e Merge "New modes for sstable perf tests" from Glauber
"index_read, sequential_read, and write"
2015-08-27 17:26:42 +03:00
Glauber Costa
74a4843d2a sstable index: try to read everything in a single run
We know the correct boundaries now, so we can use that information to
feed the default buffer size: if it is small enough (smaller than 128k),
we can try to bring everything in at once.

For the default key sized 128 that we use in the index read perf:
(smp == 1, partitions = 500000, concurrency == 1)

Before:
423493.26 +- 811.03 partitions / sec (30 runs, 1 concurrent ops)

After:
707311.86 +- 1865.47 partitions / sec (30 runs, 1 concurrent ops)

For a gain of 67 %.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-27 09:19:05 -05:00
Pekka Enberg
a189eb1e93 cql3: Kill some ifdef'd code from cf_prop_defs.hh
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-27 17:18:28 +03:00
Pekka Enberg
675fe7d4c5 cql3: Move cf_prop_defs implementation to source file
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-27 17:17:42 +03:00
Glauber Costa
79db076f20 sstables: allow specification of buffer_size for data_stream
For sstable reads, bigger buffers are not always better, because it can be
the case that we want to read just a piece of data.

However, as it so happens, we already have two variants for read: when we
want to read a single key, we will use read_row(), which will try to bring all
the data in, so it will use a smaller buffer.

For read_rows(), which will naturally span multiple buffers, we have start and
end points: with those, we can make a good estimate of the expected buffer
size, and we can have it grow until it reaches the 128k limit we have for
writes.

Before:
209578.62 +- 135.73 partitions / sec (30 runs, 1 concurrent ops)
After:
291703.98 +- 218.95 partitions / sec (30 runs, 1 concurrent ops)

Gain:
39.19 %

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-27 09:11:13 -05:00
Glauber Costa
bd272fe6aa perf_sstable: test sequential reads from an sstable.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-27 09:02:11 -05:00
Glauber Costa
b194509a6d perf_write: test for full writes
It writes 5 columns (configurable) per row. This will exercise other paths
aside from the index.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-27 09:02:11 -05:00
Glauber Costa
dcd312a982 perf_sstable: more than just the index
My plan was originally to have two separate sets of tests: one for the index,
and one for the data. With most of the code having ended up in the .hh file anyway,
this distinction became a bit pointless.

Let's put everything here.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-27 09:02:11 -05:00
Glauber Costa
b3b0aff85e perf_sstable_index: add test for index_read
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-27 09:02:11 -05:00
Glauber Costa
93e55969f2 sstables: modify read_indexes so it no longer takes a quantity
read_indexes was one of the first functions coded in the sstable read path. At
the time, I made the (now so obviously) wrong decision to code it generically
enough that we could specify the number of items to be read, instead of an
upper bound in the file.

The main reason for that was that without the Summary, we have no way to know
where to stop reading, and the Summary is a relatively new addition to the C*
codebase: while I didn't really check when it got in, the code is full of tests
for its presence.

That turned out to be totally useless: we always read the indexes with the help
of the Summary. While the Summary is a relatively new addition to C*, it is
present in all versions we aim to support, meaning that reads without the
Summary will never happen in our codebase.

Even if, in the future, we happen to ditch the Summary file, we are very likely
to do so in favor of some other structure that also allows us to manipulate precise
borders in the Index.

The code as it is, however, would not be too big of a problem if it weren't
causing us performance problems. But it is, and the majority of that is caused
by the fact that our underlying read_indexes does not know in advance how many
bytes to read, forcing us to do an element-per-element read.

It's time for a change.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-27 16:44:25 +03:00
Pekka Enberg
d63649bf65 cql3: Fix CQL parser error message for counter types
The CQL tokenizer recognizes the "COUNTER" token, but the parser rule for the
counter type is disabled. This causes users to see the following error
in cqlsh, for example:

  CREATE TABLE count (u int PRIMARY KEY, c counter);
  SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message=" : cannot match to any predicted input...  ">

We cannot disable the "COUNTER" token because it's also used in batch
statements. Instead, fix the issue by implementing a stub counter type.

Fixes #195.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-27 15:34:43 +03:00
Avi Kivity
571e4694c6 Merge "Fix CQL CREATE TABLE compaction strategy support" from Pekka
"This fixes CQL "CREATE TABLE" compaction strategy support to actually
use the configured strategy. Fixes #192."
2015-08-27 14:50:09 +03:00
Pekka Enberg
f7e4be6b4a unimplemented: Move warnings under DEBUG log level
Fixes #196.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-27 13:44:47 +02:00
Pekka Enberg
5d41eadd6a cql3: Fix 'CREATE TABLE' compaction strategy support
Fix cf_prop_defs::apply_to_builder() to actually apply the given
compaction strategy to the schema that is being created.

Fixes #192.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-27 14:20:55 +03:00
Pekka Enberg
6b42d0f9b1 compaction_strategy: Fix type() to throw configuration exception
Fix compaction_strategy::type() to throw configuration_exception which
is what Origin throws from CFMetaData.createCompactionStrategy(). This
ensures that the CQL error we send back to the client is the same.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-27 14:20:55 +03:00
Avi Kivity
16d6daba64 mutation_reader: optimize make_reader_returning()
make_reader_returning() is used by the single-key query path, and is slowed
down by needlessly allocating a vector, which is initialized by copying
the mutation (as initializer_list<> can not be moved from).

Fix by giving it its own implementation instead of relying on
make_reader_returning_many().
2015-08-27 11:52:22 +02:00
Avi Kivity
5f62f7a288 Revert "Merge "Commit log replay" from Calle"
Due to test breakage.

This reverts commit 43a4491043, reversing
changes made to 5dcf1ab71a.
2015-08-27 12:39:08 +03:00
Avi Kivity
28bb65525b Merge "API updates for storage_service" from Asias
"Added APIs like get_release_version, get_operation_mode, is_gossip_running and
so on."
2015-08-27 11:40:48 +03:00
Avi Kivity
0fff367230 Merge "test for compaction metadata's ancestors" from Raphael 2015-08-27 11:07:53 +03:00
Avi Kivity
4e3c9c5493 Merge "compaction manager fixes" from Raphael 2015-08-27 11:05:26 +03:00
Asias He
0fcfc017c8 storage_service: Add debug info for loaded tokens and host ids 2015-08-27 11:01:09 +03:00
Asias He
ad4008d50e gms: Fix release_version
With this patch, the release_version column in system.peers is now correct.
2015-08-27 11:01:08 +03:00
Asias He
80c996a315 db/system_keyspace: Fix get_local_host_id
Before:
host_id in system.local is empty

After:
host_id in system.local is inserted correctly

This fixes a nasty problem where we always get a new host_id when
booting up a node with data.
2015-08-27 11:01:07 +03:00
Avi Kivity
43a4491043 Merge "Commit log replay" from Calle
"Initial implementation/transposition of commit log replay.

* Changes replay position to be shard aware
* Commit log segment ID:s now follow basically the same scheme as origin;
  max(previous ID, wall clock time in ms) + shard info (for us)
* SStables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
  sstables are inspected for high water mark, and then replayed from
  those marks to amend mutations potentially lost in a crash
* Note that CPU count change is "handled" in so much as shard matching is
  per the _previous_ run's shards, not the current ones.

Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
  against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
  so watermark ID:s coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
  like origin. Partly because I am lazy, but also partly because our serial
  format differs, and we currently have no tools to do anything useful with it
* No replay filtering (Origin allows a system property to designate a filter
  file, detailing which keyspace/cf:s to replay). Partly because we have no
  system properties.

There is no unit test for the commit log replayer (yet).
Because I could not really come up with a good one given the test
infrastructure that exists (tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not really fully validate whether the resulting DB is
100% valid compared to the one at kill -9 time, but at least it verifies that
replay took place, and mutations were applied.
(Note that origin also lacks validity testing)"
2015-08-27 10:53:36 +03:00
Avi Kivity
5dcf1ab71a Merge seastar upstream
* seastar d1fa2d7...10e09b0 (1):
  > future: balance constructors and destructors in future_state
2015-08-26 23:50:07 +03:00
Avi Kivity
bee26cec12 Merge seastar upstream
* seastar 23f4fae...d1fa2d7 (4):
  > memory: provide some statistic for total memory in debug mode
  > core/thread: Introduce yield()
  > future-util: Move later() into future-util.hh
  > tests: make alloc_test work with many --memory sizes
2015-08-26 21:05:51 +03:00
Avi Kivity
093bf13f4e Merge "SSTables: write performance" from Glauber
"This patchset fixes some small but pertinent issues with perf_sstable_index,
and then goes on to make the sstable buffer size configurable by tests.

With that in place, we can easily run a benchmark across multiple values to
figure out how the system behaves under multiple conditions.

The performance is improved by as much as 5% (smp == 8), although it does not
always do so: smp == 12 yields the same for both buffer sizes, but with a
significantly larger error.

Available at:
    git@github.com:glommer/urchin.git   sstable-write-perf

Benchmark:
==========

run() {
    cpus=$1
    partitions=$2
    for i in 8 16 32 64 128 256; do
        printf "%3dk: " $i;
        ./build/release/tests/perf/perf_sstable_index --smp $cpus --iterations 30 --partitions $partitions --buffer_size $i;
    done
    echo -e "\n"
}

for cpus in 1 2 4 6 8 12; do
    echo "SMP $cpus, 500000 partitions"
    run $cpus 500000
done

SMP 1, 500000 partitions
  8k: 325037.27 +- 335.47 partitions / sec (30 runs, 1 concurrent ops)
 16k: 374839.11 +- 400.24 partitions / sec (30 runs, 1 concurrent ops)
 32k: 405940.98 +- 467.75 partitions / sec (30 runs, 1 concurrent ops)
 64k: 428007.61 +- 522.59 partitions / sec (30 runs, 1 concurrent ops)
128k: 436788.19 +- 539.66 partitions / sec (30 runs, 1 concurrent ops)
256k: 442052.82 +- 656.14 partitions / sec (30 runs, 1 concurrent ops)

SMP 2, 500000 partitions
  8k: 569126.58 +- 641.92 partitions / sec (30 runs, 1 concurrent ops)
 16k: 646362.97 +- 987.67 partitions / sec (30 runs, 1 concurrent ops)
 32k: 718711.55 +- 1170.77 partitions / sec (30 runs, 1 concurrent ops)
 64k: 747707.73 +- 1464.18 partitions / sec (30 runs, 1 concurrent ops)
128k: 774929.13 +- 1540.03 partitions / sec (30 runs, 1 concurrent ops)
256k: 786300.94 +- 1552.16 partitions / sec (30 runs, 1 concurrent ops)

SMP 4, 500000 partitions
  8k: 974641.46 +- 1919.13 partitions / sec (30 runs, 1 concurrent ops)
 16k: 1086237.69 +- 2626.61 partitions / sec (30 runs, 1 concurrent ops)
 32k: 1173998.02 +- 4412.38 partitions / sec (30 runs, 1 concurrent ops)
 64k: 1254343.28 +- 5193.97 partitions / sec (30 runs, 1 concurrent ops)
128k: 1272103.63 +- 6710.27 partitions / sec (30 runs, 1 concurrent ops)
256k: 1277801.52 +- 5529.45 partitions / sec (30 runs, 1 concurrent ops)

SMP 6, 500000 partitions
  8k: 1131322.35 +- 3122.81 partitions / sec (30 runs, 1 concurrent ops)
 16k: 1284103.88 +- 4804.79 partitions / sec (30 runs, 1 concurrent ops)
 32k: 1324921.30 +- 5489.99 partitions / sec (30 runs, 1 concurrent ops)
 64k: 1401296.23 +- 5461.20 partitions / sec (30 runs, 1 concurrent ops)
128k: 1459283.89 +- 6674.87 partitions / sec (30 runs, 1 concurrent ops)
256k: 1449591.69 +- 6105.13 partitions / sec (30 runs, 1 concurrent ops)

SMP 8, 500000 partitions
  8k: 1168346.90 +- 3466.96 partitions / sec (30 runs, 1 concurrent ops)
 16k: 1288961.45 +- 3594.02 partitions / sec (30 runs, 1 concurrent ops)
 32k: 1362826.10 +- 5666.47 partitions / sec (30 runs, 1 concurrent ops)
 64k: 1412672.45 +- 4961.73 partitions / sec (30 runs, 1 concurrent ops)
128k: 1489967.90 +- 9373.07 partitions / sec (30 runs, 1 concurrent ops)
256k: 1449589.43 +- 9772.39 partitions / sec (30 runs, 1 concurrent ops)

SMP 12, 500000 partitions
  8k: 1183805.45 +- 3225.67 partitions / sec (30 runs, 1 concurrent ops)
 16k: 1295929.96 +- 9187.47 partitions / sec (30 runs, 1 concurrent ops)
 32k: 1373621.47 +- 10332.86 partitions / sec (30 runs, 1 concurrent ops)
 64k: 1436798.27 +- 12399.21 partitions / sec (30 runs, 1 concurrent ops)
128k: 1438624.45 +- 8824.31 partitions / sec (30 runs, 1 concurrent ops)
256k: 1482219.58 +- 15888.66 partitions / sec (30 runs, 1 concurrent ops)"
2015-08-26 20:52:27 +03:00
Tzach Livyatan
26bdcef325 README: Add recursive to git submodule update command
Recursive update is required for nested submodules.

Signed-off-by: Tzach Livyatan <tzach@cloudius-systems.com>
2015-08-26 13:33:40 +03:00
Avi Kivity
9a0d8adc04 Merge "Implement system table accessors for peers table" from Glauber
Fixes #177.
2015-08-26 13:32:22 +03:00
Takuya ASADA
c341579568 dist: add make on build dependency
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-08-26 12:47:36 +03:00
Glauber Costa
391eea564e system_tables: implement load_host_id
A simple translation from the original code.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-25 19:16:30 -05:00
Glauber Costa
0fd2861293 system_tables: implement load_tokens
A simple translation from the original code.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-25 19:16:30 -05:00
Glauber Costa
cd5d93fe3a sstables: change default buffer size to 128k
Example run of perf_sstable_index:

 64k: 1401296.23 +- 5461.20 partitions / sec (30 runs, 1 concurrent ops)
128k: 1459283.89 +- 6674.87 partitions / sec (30 runs, 1 concurrent ops)

This is 4 % higher, with a 0.45 % error margin.

For larger buffers, like 256k, this doesn't yield a consistent gain, sometimes
yielding a loss.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-25 18:31:50 -05:00
Glauber Costa
873cf17cf4 sstable tests: allow for the creation of sstables of non-default buffer size.
This can now be used in the sstable_index_write performance test.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-25 18:31:50 -05:00
Glauber Costa
a5e173ec98 sstable: move buffer size inside the sstable object.
We'll pay the price of having this as a variable now instead of a constexpr,
but this pales in comparison with the rest of the operation.

By paying this cost, we gain the ability of actually specifying it during test
runs, making it easy to automate scripts that will measure the performance over
various buffer sizes.

I am also providing a new constructor that allows setting the buffer size.
Said constructor will be private, meaning that only the test class will be
able to use it.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-25 18:31:50 -05:00
Glauber Costa
f4d8310d88 perf_sstable_index: calculate time spent before the map reduce operation.
Not doing that would include the smp communication costs in the total cost of
the operation. This is not very significant when comparing one run against
another when the results clearly differ, but the proposed way yields much
lower error figures, so results are generally better.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-25 18:31:49 -05:00
Glauber Costa
19d25130af perf_sstable_index: make parallelism an explicit option
As we have discussed recently, the sstable writer can't even handle intra-core
parallelism - it has only one writer thread per core, and for reads, it affects
the final throughput a lot.

We don't want to get rid of it, because in real scenarios intra-core
parallelism will be there, especially for reads. So let's make it a tunable so
we can easily test its effect on the final result.

The iterations are now all sequential, and we will run x parallel invocations
at each of them.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-25 18:31:49 -05:00