Commit Graph

6076 Commits

Pekka Enberg
ae9e3e049c schema: Improve column_definition operator<< output
Make operator<< for column_definition print more information.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 13:35:26 +03:00
Pekka Enberg
61d7e8de1c schema: Add to_string() for column_kind and index_type enums
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 13:35:26 +03:00
Pekka Enberg
03e0bcd8cb database: Add operator<< for keyspace_metadata
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 13:35:19 +03:00
Nadav Har'El
f6ae567ab1 repair: implement primaryRange and ranges options
This patch implements repair's "primaryRange" and "ranges" options:

Without these options, a repair defaults to repairing all the ranges for which
this node holds a replica (each range is repaired by contacting the other
replicas of this range).

If the "primaryRange" option is passed, instead of repairing all ranges, only
the "primary ranges" of this node are repaired - for each range, only one node
has this range as its "primary range". The intention is that a user can start
a "primaryRange" repair on all nodes, and the result is that each range
will only be repaired once.

If the "ranges" option is passed, it can explicitly specify the list of ranges to
repair, overriding the automatic determination of ranges explained above.

Fixes #212.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-31 10:02:03 +03:00
Nadav Har'El
cc4117d6c1 repair: do not use an atomic integer
Avi asked not to use an atomic integer to produce ids for repair
operations. The existing code had another bug: It could return some
id immediately, but because our start_repair() hasn't started running
code on cpu 0 yet, the new id was not yet registered and if we were to
call repair_get_status() for this id too quickly, it could fail.

The solution for both issues is that start_repair() should return not
an int, but a future<int>: the integer id is incremented on cpu 0 (so
no atomics are needed), and then returned and the future is fulfilled.

Note that the future returned by start_repair() does not wait for the
repair to be over - just for its id to be registered and usable
in a call to repair_get_status().

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-31 09:31:19 +03:00
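The shape of the fix can be sketched in plain C++ (a simplified stand-in, not the actual Seastar code: a mutex plays the role of "always run on cpu 0", and all names are hypothetical). The key point survives the translation: start_repair() returns a future<int> that only resolves after the id is registered, so a quick repair_get_status() cannot observe an unregistered id, and no atomic integer is needed.

```cpp
#include <future>
#include <map>
#include <mutex>
#include <string>

static std::map<int, std::string> repair_status;
static std::mutex status_mutex;
static int next_repair_id = 0;   // only ever touched under the lock

std::future<int> start_repair() {
    return std::async(std::launch::async, [] {
        std::lock_guard<std::mutex> g(status_mutex);
        int id = ++next_repair_id;      // id allocated in one place
        repair_status[id] = "RUNNING";  // registered before the id is handed out
        return id;  // the future resolves here; the repair itself keeps running
    });
}

std::string repair_get_status(int id) {
    std::lock_guard<std::mutex> g(status_mutex);
    auto it = repair_status.find(id);
    return it == repair_status.end() ? "UNKNOWN" : it->second;
}
```

A caller does `int id = start_repair().get();` and can then immediately query the status without racing against registration.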
Gleb Natapov
821d81786e fix timeout of background read repair request
Do not set _cl_promise on timeout if timeout happens after cl is
achieved. It may happen for background read repair requests.
2015-08-30 19:07:29 +03:00
Gleb Natapov
5bb37bc92e fix race between speculating read timer and request completion
The speculating timer may expire after the request is complete, but before the
continuation that cancels it runs. In this case the timer should not
initiate an additional request; it should do nothing instead.
2015-08-30 19:07:29 +03:00
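The guard being described might look like this minimal sketch (hypothetical names, no real timer machinery): the timer callback re-checks a completion flag before speculating.

```cpp
#include <cassert>

// Toy model of the race fix: if the request finished before the timer
// callback ran, the callback becomes a no-op instead of issuing an
// extra speculative read.
struct speculating_read {
    bool complete = false;
    int extra_reads = 0;

    void on_complete() { complete = true; }

    void on_timer() {
        if (complete) {
            return;        // request finished first: do nothing
        }
        extra_reads++;     // otherwise, speculate with an additional request
    }
};
```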
Avi Kivity
2ef5816996 Merge seastar upstream
* seastar a503442...9cc5cd0 (3):
  > fstream: fix write-behind on filesystems that don't support fallocate()
  > fstream: return correct error
  > fstream: reinitialize _background_writes_done after an error
2015-08-30 15:18:28 +03:00
Avi Kivity
4ec4a4b53c Merge seastar upstream
* seastar 2e041c2...a503442 (4):
  > fstream: write-behind
  > output_stream: improve flush() support
  > thread: initialize stack in debug mode
  > sharded: do not capture remote service pointer on remote invocation lambda
2015-08-30 12:09:51 +03:00
Avi Kivity
554645db91 Revert "Merge "Move the API configuration from command line to configuration" from Amnon"
See issue #59 for details.

This reverts commit 5aa0244d32, reversing
changes made to 7fb109a58d.
2015-08-30 12:09:00 +03:00
Avi Kivity
15987f80cf Merge "Avoid allocations in the read indexes path" from Glauber
"We can avoid small allocations when doing read_index. Doing that will yield
us another 4 % gain.

Before:
839484.65 +- 585.52 partitions / sec (30 runs, 1 concurrent ops)

After:
873323.18 +- 442.52 partitions / sec (30 runs, 1 concurrent ops)"
2015-08-30 08:43:18 +03:00
Glauber Costa
b1bfcda38c column helper: loop once only while gathering statistics.
The code that gathers statistics about the column_name shows up in the
benchmark profile.

If we really want to collect those statistics, I guess they will never be free,
because they involve a byte copy, which implies an allocation.

But one easy thing we can do to make it better is to collect both min and max
statistics in the same loop. There is also no need to special-case an empty
vector, since may_grow will already take care of that.

That yields us a ~ 0.77 % boost, which although not earth-shattering, is too
easy a win to pass up.

Before:
200582.94 +- 293.91 partitions / sec (30 runs, 1 concurrent ops)

After:
202120.06 +- 341.95 partitions / sec (30 runs, 1 concurrent ops)

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-30 08:43:02 +03:00
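A minimal sketch of the single-loop idea (illustrative names, not the actual Scylla code): both bounds are updated in one pass over the sizes.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct column_stats {
    uint64_t min_size = UINT64_MAX;  // sentinel for "no data seen yet"
    uint64_t max_size = 0;
};

// One loop updates both min and max, instead of two separate passes.
column_stats gather_stats(const std::vector<uint64_t>& column_sizes) {
    column_stats s;
    for (auto sz : column_sizes) {
        s.min_size = std::min(s.min_size, sz);
        s.max_size = std::max(s.max_size, sz);
    }
    return s;
}
```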
Glauber Costa
aab1ae9dc1 index_entry: don't generate a temporary bytes element
The one thing that is still showing pretty high in the read_indexes flamegraph
is allocations.

We can, however, do better. Since most of the index is the keys anyway - and we need
all of them - the amount of memory we use by copying the buffers over is about the same
as the space we would use by just keeping the buffers around.

So we can change index_entry to just keep the shared_buffers, and since we always access
it through views anyway, that is perfectly fine. The index_entry destructor will then
release() the temporary_buffer, instead of doing this after the buffer copy.

This gives us a nice additional 4 %.

perf_sstable_g  --smp 1 --iterations 30 --parallelism 1 --mode index_read

Before:
839484.65 +- 585.52 partitions / sec (30 runs, 1 concurrent ops)

After:
873323.18 +- 442.52 partitions / sec (30 runs, 1 concurrent ops)

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:09:53 -05:00
Glauber Costa
a9ab31dd9c index_entry: move its fields to private visibility
And provide accessors. This will give us the freedom to change their internal
storage.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:05:36 -05:00
Glauber Costa
1fbd14354f index_entry: provide a constructor
This is preparation for making its internal fields private.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:05:36 -05:00
Glauber Costa
13d59c9618 index_entry: do away with the disk_string<> fields
Now that we are using the NSM, and not the general parser, for the index, there
is no reason to keep using disk_string<>s in it. Since it is standing in the way
of further optimizations, let's get rid of it and use bytes directly.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:05:36 -05:00
Glauber Costa
b53511b422 sstables: don't return after processing collections
The code as is is blatantly wrong, and is an artifact of the seastar-thread
conversion.

This happened because the way we move to the next element in a do_for_each
future loop is by returning from the current lambda, and so the code was converted
this way. Since we are now using a for loop, we should not return: we should continue.

I found this while searching for a bug, which is unfortunately not fixed by this.
But this is totally wrong, and has to be fixed.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 20:37:39 +03:00
Glauber Costa
2623362d20 continuous_data_consumer: do not pass reference to child
Since this class is the child's base class, we don't need to pass a reference: we can
just cast our 'this' pointer.

By doing that, the move constructor can come back.

Welcome back, move constructor.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 20:32:56 +03:00
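The pattern being described - a base class reaching its child by casting `this`, with no stored reference and no virtual dispatch - is essentially CRTP. A minimal sketch with hypothetical names:

```cpp
#include <cstddef>

// The base is templated on the derived type, so static_cast<Derived*>(this)
// is safe by construction. No reference member means the implicitly
// generated move constructor is available again.
template <typename Derived>
struct continuous_data_consumer {
    size_t total = 0;
    void consume(const char* buf, size_t len) {
        // Reach the child's method without virtual function overhead.
        total += static_cast<Derived*>(this)->process(buf, len);
    }
};

struct row_consumer : continuous_data_consumer<row_consumer> {
    size_t process(const char*, size_t len) { return len; }
};
```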
Avi Kivity
5aa0244d32 Merge "Move the API configuration from command line to configuration" from Amnon
"This series address issues #59 and #23.

It moves the API configuration from the command line argument to the general
config, it also move the api-doc directory to be configurable instead of hard
coded."

Fixes #59
Fixes #23
2015-08-29 12:34:04 +03:00
Avi Kivity
7fb109a58d Merge "Types cleanup" from Pekka
"Remove type name duplication in types.cc."
2015-08-29 11:48:41 +03:00
Glauber Costa
0dd57fbca8 checksummed file writer: some cleanups
- no need to mark us as a friend of file_writer
- we should construct the fields directly instead of assigning them in the constructor's body.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 11:44:48 +03:00
Glauber Costa
66cc546781 sstable writer: compute checksum at larger chunks
What we are doing now is computing the checksum at every write() operation, possibly
over a small byte quantity - like 2 or 4 bytes, since we write those a lot as sizes.

While adler32 allows those computations and makes them very easy, that doesn't mean
they are efficient. It is a lot more efficient to compute the checksum over a larger
buffer.

We can do that by computing it at put() time in a data_sink_impl, instead of
keeping it in the file abstraction. The code for the checksum itself also
becomes remarkably simpler, since there is no longer any need to keep state:
we'll always be presented with full buffers.

The data sink implementation and the file_writer share the full_checksum and
the checksum struct variables: and with that in place, the file writer can
still expose the final results of the computation in the same way it does at
present.

Benchmarked with:
perf_sstable_g  --smp 1 --iterations 30 --parallelism 1 --mode write --num_columns 5 --partitions 500000

Before:
178829.07 +- 141.28 partitions / sec (30 runs, 1 concurrent ops)
After:
199744.71 +- 201.64 partitions / sec (30 runs, 1 concurrent ops)

gain: 11.70 %

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 11:44:47 +03:00
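What makes the chunking safe is that Adler-32 is resumable: feeding the previous checksum back in as the seed continues the computation, so checksumming one large buffer and checksumming it in pieces give the same result. A self-contained sketch of the checksum (per RFC 1950; not the Scylla code):

```cpp
#include <cstddef>
#include <cstdint>

// Adler-32 as defined in RFC 1950. Pass 1 as the initial seed; pass a
// previous result as the seed to continue over the next chunk.
uint32_t adler32(uint32_t seed, const uint8_t* data, size_t len) {
    uint32_t a = seed & 0xffff;          // low half: running byte sum
    uint32_t b = (seed >> 16) & 0xffff;  // high half: sum of the sums
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65521;       // 65521 is the largest prime < 2^16
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}
```

Because of the seed-continuation property, moving the computation from per-write() to per-put() buffers changes only how often the function is called, not the final checksum.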
Avi Kivity
e9917a5862 Merge "Improve read index performance further" from Glauber
"This patch improves the read_indexes performance by an extra 16 %.
The total gain so far is now 98 %, and although there are still things
I believe we can do to improve it further, I consider a 2-fold increase
sufficient to declare Issue #94 fixed.

So:

Fixes #94

The speed up is achieved by converting the reader to the NSM. To do that, I had
to commonize most parts of the NSM. I had attempted this before, and for this
new cycle, I had a new tool to aid me in this task: the sstable performance
microbenchmark.

Every change to the NSM was individually tested to make sure the performance
of the read path was not regressing. When it did regress, I took alternate
approaches and tried my best to discuss the whys in the changelogs, with
the appropriate result.

So I can be quite confident in affirming that we are not taking any drop
here, while read_index performance is increased significantly"
2015-08-29 11:28:03 +03:00
Amnon Heiman
f1cda74c15 API: storage_service - return an error for wrong keyspace name
This patch addresses issue #155. It adds a helper function that throws a
bad parameter exception if a keyspace does not exist.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-08-29 11:22:27 +03:00
Glauber Costa
babccb1112 read_indexes: convert to the NSM
Reading each member individually is not as efficient. Better to convert to
the NSM.

Before:
717101.20 +- 649.77 partitions / sec (30 runs, 1 concurrent ops)
After:
838169.80 +- 575.04 partitions / sec (30 runs, 1 concurrent ops)

Gains:
16.88 %

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 19:07:39 -05:00
Glauber Costa
4b174c754d commonize the NSM
In order to reuse the NSM in other scenarios, we need to push as much code
as possible into a common class.

This patch does that, making the continuous_data_consumer class now the main
placeholder for the NSM class. The actual readers will have to inherit from it.

However, despite using inheritance, I am not using virtual functions at all.
Instead, we let the continuous_data_consumer receive an instance of the derived
class, and then it can safely call its methods without paying the cost of
virtual functions.

In an earlier attempt, I had kept the main process() function in the derived class,
which then had the responsibility of coding the loop.

With the use of the new pattern, we can keep the loop logic in the base class,
which is a lot cleaner. There is a performance penalty associated with it, but
it is fairly small: 0.5 % in the sequential_read perf_sstable test. I think we
can live with it.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 18:56:26 -05:00
Glauber Costa
f8d35ef5ec sstables: move exception to its own file.
I am moving the malformed exception here, to avoid circular dependencies.
But since the file now exists, let's move them all.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 17:30:44 -05:00
Glauber Costa
d9b7f4bde3 row consumer: separate processing of buffers from the main loop
In my previous attempt, I separated the state processor from the main loop,
leaving it to be filled in by a derived class.

That felt a lot more natural, because then we don't have to replicate the loop
logic in the derived classes.

But, oh well, life is hard. Especially on fast paths. Doing that makes us
insert an extra call in this loop, and that is noticeable: we would be 1.5 %
slower, and that is not even counting the cost of making the state processing a
virtual function later on.

I could just argue that this is acceptable due to decoupling gains, but why
would I argue that, if I can just rewrite it in a way that no performance is
lost?

And then I did. The disadvantage is that the derived class will now
have to re-code the loop, but no performance is lost. The advantage is that
the derived class will now be able to call into process_buffer
directly, without using virtual functions in this path.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 17:30:44 -05:00
Glauber Costa
fbd68c3b01 row consumer: move consume_be to consumer.hh
It will be reused by the continuous_data_consumer

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 17:30:43 -05:00
Glauber Costa
e1945e473b row consumer: make non_consuming an instance member
It is currently a static member that gets the instance members as parameters. There
is no reason for that, and it will complicate the decoupling, since the
prestate reader won't know about state.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-28 17:18:19 -05:00
Glauber Costa
f45b807f34 row consumer: move proceed class to a separate class
Continuing the work of decoupling the prestate and state parts of the NSM
so we can reuse it, move the proceed class to a different holding class.
Proceeding or not has nothing to do with "rows".

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-28 17:18:06 -05:00
Glauber Costa
49ac04a60a row consumer: fall through more often
Because we previously had no way to know whether or not the read completed,
we would always go back to the main loop, and would only optimize sequential
reads for some kinds of data.

However, as one can see in the previous patch, the new read_X functions
notify completion, allowing us to just fall through to the next case if that is
the only possibility. In most cases, it isn't. With this, we can apply this
optimization throughout all cases where we don't branch states, with very
elegant resulting code.

The performance actually increases by 0.75 %. It is not much, but it is more
than the error margin (which sits at 0.20 %), and because the code is not made
unreadable by it, this is a clear win to me.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 16:30:22 -05:00
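The fall-through pattern can be sketched with a toy two-state reader (hypothetical names, not the actual consumer): when a read_* helper reports it completed from the current buffer, the switch falls through to the next state instead of bouncing back to the main loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

enum class state { KEY_LEN, KEY, DONE };

struct reader {
    state st = state::KEY_LEN;
    uint16_t key_len = 0;
    std::string key;

    // Returns true when the value was fully read from the buffer.
    bool read_u16(const uint8_t*& p, size_t& avail, uint16_t& out) {
        if (avail < 2) return false;
        out = uint16_t(p[0]) << 8 | p[1];
        p += 2;
        avail -= 2;
        return true;
    }

    void process(const uint8_t* p, size_t avail) {
        switch (st) {
        case state::KEY_LEN:
            if (!read_u16(p, avail, key_len)) break;  // need more data
            st = state::KEY;
            [[fallthrough]];   // read completed: keep consuming, no loop bounce
        case state::KEY:
            if (avail < key_len) break;
            key.assign(reinterpret_cast<const char*>(p), key_len);
            st = state::DONE;
            break;
        case state::DONE:
            break;
        }
    }
};
```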
Glauber Costa
1f930cda4a row consumer: extend use of read for multi-value fields
In an attempt to gain some cycles, we test whether we can read many
values at once, and if so, use consume_be directly for those.

What we can do in this situation is read the first value, and let the read
fall through to the next case if the read succeeds.

The code actually looks a lot more elegant this way.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 16:30:22 -05:00
Glauber Costa
0ad8afb0ec row consumer: extend usage of the read_* functions
In some places, we cannot use our read_* functions, because we don't know
whether or not the read succeeded, and that is important when passing the state
along.

The fix for this is trivial, since we can just return it from the reader.

Note for reviewers: the comment in one of the functions says we should use
"read_bytes(data, _u32, _key ...". But in the actual code, the buffer is
_val, not _key.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 16:30:22 -05:00
Glauber Costa
13af0ffbd2 row consumer: fix read_bytes temporary len
It shouldn't be _u16, but rather whatever we passed as len. It currently works
because all callers pass _u16 as len. But this will soon change.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-28 16:30:22 -05:00
Glauber Costa
62a26ef411 row consumer: don't switch state implicitly
Soon enough, all the state machine will be separated from the prestate handling.
To make it easier, we will decouple them as much as we can.

Not automatically switching states in the read functions is part of this.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-28 16:30:22 -05:00
Avi Kivity
c734ef2b72 Merge seastar upstream
* seastar 10e09b0...2e041c2 (7):
  > Merge "Change app_template::run() to terminate when callback is done" from Tomasz
  > resource: Fix compilation for hwloc version 1.8.0
  > memory: Fix infinite recursion when throwing std::bad_alloc
  > core/reactor: Throw the right error code when connect() fails
  > future: improve exception safety
  > xen: add missing virtual destructors
  > circular_buffer: do not destroy uninitialized object

app_template::run() users updated to call app_template::run_deprecated().
2015-08-28 23:52:49 +03:00
Amnon Heiman
800578f164 API: Take the API doc directory from configuration
The API doc directory will now be taken from configuration instead of
being hard coded.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-08-28 20:26:30 +03:00
Amnon Heiman
9ef7d1ee69 main: Take the http configuration from the configuration object
This replaces the http configuration to use the general configuration
object instead of the command line arguments. This makes it possible to
configure the API from a configuration file and not just from the command
line.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-08-28 20:24:59 +03:00
Amnon Heiman
dd77f7e288 configuration: Add the API configuration to the general configuration
This adds the API configuration parameters to the configuration, so it
will be possible to take them from the configuration file or from the
command line.

The following configuration options were defined:
api_port
api_address
api_ui_dir
api_doc_dir

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-08-28 20:22:55 +03:00
Amnon Heiman
7b1c973884 API: Add doc directory parameter to the http context
Adding a parameter to the http context so it will not be hard coded and
can be configured.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-08-28 20:20:20 +03:00
Avi Kivity
012fd41fc0 db: hard dirty memory limit
Unlike cache, dirty memory cannot be evicted at will, so we must limit it.

This patch establishes a hard limit of 50% of all memory.  Above that,
new requests are not allowed to start.  This allows the system some time
to clean up memory.

Note that we will need more fine-grained bandwidth control than this;
the hard limit is the last line of defense against running out of reclaimable
memory.

Tested with a mixed read/write load; after reads start to dominate writes
(due to the proliferation of small sstables, and the inability of compaction
to keep up), dirty memory usage starts to climb until the hard stop prevents
it from climbing further and OOMing the server.
2015-08-28 14:47:17 +02:00
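The admission policy described above can be sketched as follows (illustrative only, not the actual Scylla accounting; names are hypothetical): above the hard threshold of 50% of memory held dirty, new requests are refused until cleanup frees some.

```cpp
#include <cstddef>

// Toy dirty-memory admission control: dirty memory cannot be evicted at
// will, so past the hard limit new requests must wait for cleanup.
struct dirty_memory_limiter {
    explicit dirty_memory_limiter(size_t total) : total_memory(total) {}

    bool try_admit(size_t bytes) {
        if (dirty + bytes > total_memory / 2) {  // hard limit: 50% of all memory
            return false;                        // caller must wait for cleanup
        }
        dirty += bytes;
        return true;
    }

    void cleaned(size_t bytes) { dirty -= bytes; }

    size_t total_memory;
    size_t dirty = 0;
};
```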
Pekka Enberg
78b8ca1a2c types: Unify type names
Fix duplicate type names in the types map and the classes themselves.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-28 14:39:46 +03:00
Pekka Enberg
dfbf84ce18 types: Introduce ascii_type_impl and utf8_type_impl classes
In preparation for reducing type name duplication, introduce classes for
ascii and utf8 types.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-28 14:01:55 +03:00
Avi Kivity
f171d71c16 utils: optimize murmur3_hash data fetch
By using a recognized idiom, gcc can optimize the unaligned little endian
load as a single instruction (actually less than an instruction, as it
combines it with a succeeding arithmetic operation).
2015-08-28 12:37:43 +03:00
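The recognized idiom is a byte-by-byte shift-and-or little-endian load: gcc pattern-matches it and emits a single (possibly unaligned) load plus the following arithmetic on little-endian targets. A sketch of the idiom (not the actual murmur3_hash code):

```cpp
#include <cstdint>

// Assemble a 64-bit little-endian value from bytes. Endian-independent,
// alignment-safe, and compiled by gcc/clang to one mov on x86.
static inline uint64_t load_le64(const unsigned char* p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v |= uint64_t(p[i]) << (8 * i);  // byte i lands at bit position 8*i
    }
    return v;
}
```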
Avi Kivity
cb1372523a Merge "CQL code cleanups" from Pekka
"Here's another round of cleanups to the CQL code. Nothing exciting here,
mostly moving code to source files which makes changing the code less
painful in terms of compilation times."
2015-08-27 18:32:45 +03:00
Pekka Enberg
28aad6fa67 cql3: Move ks_prop_defs implementation to source file
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-27 18:16:28 +03:00
Avi Kivity
7e8c6eddbb Merge "Buffer related read performance improvement" from Glauber
"As we could see, the flamegraphs show a lot of performance still left on the
table. However, from the I/O point of view, we have determined through our
write performance testing that 128k is the sweet spot for buffers. Worse yet:
reads are still trapped at 8k.

While it is true that when we want to read just a little data, smaller is
better, it is also true that reads (and now that includes the index), tend to
give hints about the size they want read.

So we can read the whole thing at once if smaller than 128k, or chop it into 128k
increments if it is not.

The performance gains coming from doing this are considerable: 39 % for data,
67 % for index."
2015-08-27 18:07:27 +03:00
Pekka Enberg
c2ff7b67ce cql3: Move user_types implementation to source file
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-27 17:50:54 +03:00
Avi Kivity
cf0825182e Merge "New modes for sstable perf tests" from Glauber
"index_read, sequential_read, and write"
2015-08-27 17:26:42 +03:00