Commit Graph

6092 Commits

Author SHA1 Message Date
Calle Wilund
bbf82e80d0 Commitlog: Allow skipping X bytes in commit log reader
Also refactor reader into named methods for debugging sanity.
2015-08-31 14:29:49 +02:00
Calle Wilund
da9ea641e5 Commitlog: Handle full paths in descriptor file name parse. 2015-08-31 14:29:48 +02:00
Calle Wilund
02d2bef1f2 Commitlog: Expose convinience method "list_existing_segments" 2015-08-31 14:29:48 +02:00
Calle Wilund
19052b3c09 Commitlog: Expose list_existing_descriptors 2015-08-31 14:29:48 +02:00
Calle Wilund
e068ffb5a5 Commitlog: Make file reader provide replay_position for entries 2015-08-31 14:29:47 +02:00
Calle Wilund
41b1ad8600 Commitlog: Make descriptor type visible/usable from outside 2015-08-31 14:29:47 +02:00
Calle Wilund
f14e3cf8d0 Database: do not create shard-specific dirs for commitlog
New ID scheme allows for a single dir for all segments from all shards.
2015-08-31 14:29:46 +02:00
Calle Wilund
ea38b223bd Commitlog: change the ID generation scheme
* Make it more like origin, i.e. based on wall clock time of app start
* Encode shard ID in the, RP segement ID, to ensure RP:s and segement names
  are unique per shard
2015-08-31 14:29:46 +02:00
Calle Wilund
4ac07fa87d Commitlog test: remove some hardcoded assumptions on segment IDs
To enable changing the ID generation scheme.
2015-08-31 14:29:45 +02:00
Calle Wilund
c040565bf9 runtime: expose boot_time
(boot == app start, I did not rename the var).
2015-08-31 14:29:45 +02:00
Calle Wilund
d4ae43862d SStables: Use db::commitlog::replay_position (not own type) 2015-08-31 14:29:45 +02:00
Calle Wilund
0fcf7e3e91 Commitlog: Make "position" type 32-bit to align replay_position with
Origin

* Note: removed commitlog_test:test_allocation_failure because with 
  segments limited to 4GB -> mutation limited to 2GB, actually forcing
  a fail is not guaranteed or even likely.
2015-08-31 14:29:44 +02:00
Calle Wilund
3f1a91b89c Commitlog: do not eagerly create first segment on init
Deferring makes it easier to separate old segments from new, which in turn
helps replay logic.
2015-08-31 13:11:44 +02:00
Tomasz Grabiec
fabda26871 Merge branch 'penberg/fix-create-keyspace-validation/v1' from seastar-dev.git
From Pekka:

This adds replication strategy validation and wires it up to CREATE
KEYSPACE via migration manager. Fixes #191.
2015-08-31 12:21:52 +02:00
Pekka Enberg
01900996b9 service/migration_manager: Wire up create keyspace validation
Fixes #191.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:56 +03:00
Pekka Enberg
04a65ec06f database: Add keyspace_metadata::validate() helper
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:56 +03:00
Pekka Enberg
5a9cff9dc0 locator/abstract_replication_strategy: Add validate_replication_strategy() helper
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:56 +03:00
Pekka Enberg
b8211c436b locator/abstract_replication_strategy: Add validate_options() helper
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:56 +03:00
Pekka Enberg
cff9eb520b locator/abstract_replication_strategy: Add recognized_options() helper
Add a helper function for obtaining a vector of supported replication
strategy options. This is needed for validation.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:31 +03:00
Nadav Har'El
f6ae567ab1 repair: implement primaryRange and ranges options
This patch implements repair's "primaryRange" and "ranges" options:

Without these options, a repair defaults to repair all the ranges for which
this nodes holds a replica (each range is repaired by contacting the other
replicas of this range).

If the "primaryRange" option is passed, instead of repairing all ranges, only
the "primary ranges" of this node is repaired - for each range, only one node
has this range as its "primary range". The intention is that a user can start
a "primaryRange" repair on all nodes, and the result would be that each range
will only be repaired once.

If the "ranges" option is passed, it can explicitly list a list of ranges to
repair, overriding the automatic determination of ranges explained above.

Fixes #212.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-31 10:02:03 +03:00
Nadav Har'El
cc4117d6c1 repair: do not use an atomic integer
Avi asked not to use an atomic integer to produce ids for repair
operations. The existing code had another bug: It could return some
id immediately, but because our start_repair() hasn't started running
code on cpu 0 yet, the new id was not yet registered and if we were to
call repair_get_status() for this id too quickly, it could fail.

The solution for both issues is that start_repair() should return not
an int, but a future<int>: the integer id is incremented on cpu 0 (so
no atomics are needed), and then returned and the future is fulfilled.

Note that the future returned by start_repair() does not wait for the
repair to be over - just for its index to be registered and be usable
to a call to repair_get_status().

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-31 09:31:19 +03:00
Gleb Natapov
821d81786e fix timeout of background read repair request
Do not set _cl_promise on timeout if timeout happens after cl is
achieved. It may happen for background read repair requests.
2015-08-30 19:07:29 +03:00
Gleb Natapov
5bb37bc92e fix race between speculating read timer and request completion
Speculating timer may expire after request is complete, but before a
continuation that cancels it runs. In this case the timer should not
initiate additional request and just do nothing instead.
2015-08-30 19:07:29 +03:00
Avi Kivity
2ef5816996 Merge seastar upstream
* seastar a503442...9cc5cd0 (3):
  > fstream: fix write-behind on filesystems that don't support fallocate()
  > fstream: return correct error
  > fstream: reinitialize _background_writes_done after an error
2015-08-30 15:18:28 +03:00
Avi Kivity
4ec4a4b53c Merge seastar upstream
* seastar 2e041c2...a503442 (4):
  > fstream: write-behind
  > output_stream: improve flush() support
  > thread: initialize stack in debug mode
  > sharded: do not capture remote service pointer on remote invocation lambda
2015-08-30 12:09:51 +03:00
Avi Kivity
554645db91 Revert "Merge "Move the API configuration from command line to configuration" from Amnon"
See issue #59 for details.

This reverts commit 5aa0244d32, reversing
changes made to 7fb109a58d.
2015-08-30 12:09:00 +03:00
Avi Kivity
15987f80cf Merge "Avoid allocations in the read indexes path" from Glauber
"We can avoid small allocations when doing read_index. Doing that will yield
us another 4 % gain.

Before:
839484.65 +- 585.52 partitions / sec (30 runs, 1 concurrent ops)

After:
873323.18 +- 442.52 partitions / sec (30 runs, 1 concurrent ops)"
2015-08-30 08:43:18 +03:00
Glauber Costa
b1bfcda38c column helper: loop once only while gathering statistics.
the code to gather statistics about the column_name is showing in the
benchmark.

If we really want to collect those statistics, I guess they will never be free
because they involve a byte copy which implies an allocation.

But one easy thing we can do to make it better, is collect both min and max
statistics in the same loop. There is also no need to special case the case of
an empty vector, since may_grow will already take care of that.

That yields us a ~ 0.77 % boost, which although not earth shattering, is easy
enough for us not to reap.

Before:
200582.94 +- 293.91 partitions / sec (30 runs, 1 concurrent ops)

After:
202120.06 +- 341.95 partitions / sec (30 runs, 1 concurrent ops)

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-30 08:43:02 +03:00
Glauber Costa
aab1ae9dc1 index_entry: don't generate a temporary bytes element
The one thing that is still showing pretty high at the read_indexes flamegraph,
is allocations.

We can, however, do better. Since most of the index is the keys anyway - and we need
all of them, the amount of memory we use by copying the buffers over is about the same
as the space we would use by just keeping the buffers around.

So we can change index_entry to just keep the shared_buffers, and since we always access
it through views anyway, that is perfectly fine. The index_entry destructor will then
release() the temporary_buffer, instead of doing this after the buffer copy.

This gives us a nice additional 4 %.

perf_sstable_g  --smp 1 --iterations 30 --parallelism 1 --mode index_read

Before:
839484.65 +- 585.52 partitions / sec (30 runs, 1 concurrent ops)

After:
873323.18 +- 442.52 partitions / sec (30 runs, 1 concurrent ops)

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:09:53 -05:00
Glauber Costa
a9ab31dd9c index_entry: move its fields to private visibility
And provide accessors. This will give us the freedom to change their internal
storage.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:05:36 -05:00
Glauber Costa
1fbd14354f index_entry: provide a constructor
This is a preparation to have their internal fields as private.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:05:36 -05:00
Glauber Costa
13d59c9618 index_entry: do away with the disk_string<> fields
Now that we are using the NSM, and not the general parser for the index, there
is no reason to keep using disk_string<>s in it. Since it is staying in the way
of further optimizations, let's get rid of it and use bytes directly.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:05:36 -05:00
Glauber Costa
b53511b422 sstables: don't return after processing collections
The code as is is blatantly wrong, and is an artifact of the seastar-thread
conversion.

This happened because the way we move to the next element in a do_for_each
future loop, is by returning the current lambda, and so it was converted this
way. Since we are now using a for loop, we should not return: we should continue.

I found this while searching for a bug, which is unfortunately not fixed by this.
But this is totally wrong, and has to be fixed.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 20:37:39 +03:00
Glauber Costa
2623362d20 continuous_data_consumer: do not pass reference to child
Since the child is a base class, we don't need to pass a reference: we can
just cast our 'this' pointer.

By doing that, the move constructor can come back.

Welcome back, move constructor.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 20:32:56 +03:00
Avi Kivity
5aa0244d32 Merge "Move the API configuration from command line to configuration" from Amnon
"This series address issues #59 and #23.

It moves the API configuration from the command line argument to the general
config, it also move the api-doc directory to be configurable instead of hard
coded."

Fixes #59
Fixes #23
2015-08-29 12:34:04 +03:00
Avi Kivity
7fb109a58d Merge "Types cleanup" from Pekka
"Remove type name duplication in types.cc."
2015-08-29 11:48:41 +03:00
Glauber Costa
0dd57fbca8 checksummed file writer: some cleanups
- no need to mark us as a friend of file_writer
- should be constructing the fields directly instead of using the constructors body.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 11:44:48 +03:00
Glauber Costa
66cc546781 sstable writer: compute checksum at larger chunks
What we are doing now, is computing checksum at every write() operation, possibly
at a small byte quantity - like 2 or 4 bytes, since we write those a lot as sizes.

While adler32 allows those computations and make them very easy, that doesn't mean
they are efficient. It is a lot more efficient to compute the checksum on larger
buffer.

We can do that by doing it at put() time in a data_sink_impl, instead of
keeping that in the file abstraction. The code for the checksum itself now also
becomes remarkably simpler - since there is no need anymore to keep state:
we'll always be presented with full buffers.

The data sink implementation and the file_writer share the full_checksum and
the checksum struct variables: and with that in place, the file writer can
still expose the final results of the computation in the same way it does at
present.

Benchmarked with:
perf_sstable_g  --smp 1 --iterations 30 --parallelism 1 --mode write --num_columns 5 --partitions 500000

Before:
178829.07 +- 141.28 partitions / sec (30 runs, 1 concurrent ops)
After:
199744.71 +- 201.64 partitions / sec (30 runs, 1 concurrent ops)

gain: 11.70 %

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 11:44:47 +03:00
Avi Kivity
e9917a5862 Merge "Improve read index performance further" from Glauber
"This patch improves the read_indexes performance by an extra 16 %.
The total gain so far is now 98 %, and although there are still things
I believe we can do to improve it further, I consider a 2-fold increase
sufficient to declare Issue #94 fixed.

So:

Fixes #94

The speed up is achieved by converting the reader to the NSM. To do that, I had
to commonize most parts of the NSM. I had attempted this before, and for this
new cycle, I had a new tool to aid me in this task: the sstable performance
microbenchmark.

Every change to the NSM was individually tested to make sure the performance
of the read path was not regressing. When it did regress, I took alternate
approaches and tried my best to discuss the whys in the changelogs, with
the appropriate result.

So I can be quite confident in affirming that we are not taking any drop
here, while read_index performance is increased significantly"
2015-08-29 11:28:03 +03:00
Amnon Heiman
f1cda74c15 API: storage_service - return an error for wrong keyspace name
This patch addresses issu #155, it adds a helper function that if a
keyspace does not exists it throw a bad parameter exception.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-08-29 11:22:27 +03:00
Glauber Costa
babccb1112 read_indexes: convert to the NSM
Reading each member individually is not as efficient. Better convert to
the NSM.

Before:
717101.20 +- 649.77 partitions / sec (30 runs, 1 concurrent ops)
After:
838169.80 +- 575.04 partitions / sec (30 runs, 1 concurrent ops)

Gains:
16.88 %

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 19:07:39 -05:00
Glauber Costa
4b174c754d commonize the NSM
In order to reuse the NSM in other scenarios, we need to push as much code
as possible into a common class.

This patch does that, making the continuous_data_consumer class now the main
placeholder for the NSM class. The actual readers will have to inherit from it.

However, despite using inheritance, I am not using virtual functions at all
instead, we let the continuous_data_consumer receive an instance of the derived
class, and then it can safely call its methods without paying the cost of
virtual functions.

In other attempt, I had kept the main process() function in the derived class,
that had the responsibility of then coding the loop.

With the use of the new pattern, we can keep the loop logic in the base class,
which is a lot cleaner. There is a performance penalty associated with it, but
it is fairly small: 0.5 % in the sequential_read perf_sstable test. I think we
can live with it.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 18:56:26 -05:00
Glauber Costa
f8d35ef5ec sstables: move exception to its own file.
I am moving the malformed exception here, to avoid circular dependencies.
But since the file now exists, let's move them all.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 17:30:44 -05:00
Glauber Costa
d9b7f4bde3 row consumer: separate processing of buffers from the main loop
In my previous attempt, I have separated the state processor for the main loop,
leaving that to be filled by a derived class.

That felt a lot more natural, because then we don't have to replicate the loop
logic in the derived classes.

But well, oh, well, life is hard. Specially on fast paths. Doing that makes us
insert an extra call in this loop, and that is noticeable: we would be 1.5 %
slower, and that is not even counting the cost of making the state processing a
virtual function later on.

I could just argue that this is acceptable due to decoupling gains, but why I
would argue that, if I can just rewrite it in a way that no performance is
lost?

And then I did. The disadvantage of this, is that the derived class will now
have to re-code the loop, but no performance is lost. Another advantage of
this, is that the derived class will now be able to call into process_buffer
directly, without using virtual functions in this path for any of them.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 17:30:44 -05:00
Glauber Costa
fbd68c3b01 row consumer: move consume_be to consumer.hh
It will be reused by the continuous_data_consumer

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 17:30:43 -05:00
Glauber Costa
e1945e473b row consumer: make non_consuming an instance member
It is now a static member that gets the instance members as parameters.  There
is no reason for that, and this will complicate the decoupling, since the
prestate reader won't know about state.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-28 17:18:19 -05:00
Glauber Costa
f45b807f34 row consumer: move proceed class to a separate class
Continuing the work of decoupling the the prestate and state parts of the NSM
so we can reuse it, move the proceed class to a different holding class.
Proceeding or not has nothing to do with "rows".

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-28 17:18:06 -05:00
Glauber Costa
49ac04a60a row consumer: fall through more often
Because we didn't had before a way to know whether or not the read completed,
we would always go back to the main loop, and would only optimize sequential
reads for some kinds of data.

However, As one could see in the previous patch, the new read_X functions will
notify completion, allowing us to just fallthrough to the next case if that is
the only possibility. In most cases, it isn't. With this, we can apply this
optimization throughout all cases where we don't branch states, and with a very
elegant resulting code.

The performance actually increases by 0.75 %. It is not much, but it is more
than the error margin (which sits at 0.20 %), and because the code is not made
unreadable by it, this is a clear win to me.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 16:30:22 -05:00
Glauber Costa
1f930cda4a row consumer: extend use of read for multi-value fields
In an attempt to gain some cycles, we are testing whether we can read many
values at once, and if so, using consume_be directly for those.

What we can do in this situation, is read the first value, and let the read
fall through the next case if the read succeeds.

The code actually looks a lot more elegant this way.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 16:30:22 -05:00
Glauber Costa
0ad8afb0ec row consumer: extend usage of the read_* functions
In some places, we cannot use our read_* functions, because we don't know
whether or not it succeeded, and that is important when passing the state
along.

The fix for this is trivial, since we can just return it from the reader.

Note for reviewers: The commend in one of the functions say we should use:
"read_bytes(data, _u32, _key ...". But in the actual code, the where buffer is
_val, not _key.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 16:30:22 -05:00