Commit Graph

53948 Commits

Author SHA1 Message Date
Calle Wilund
f14e3cf8d0 Database: do not create shard-specific dirs for commitlog
New ID scheme allows for a single dir for all segments from all shards.
2015-08-31 14:29:46 +02:00
Calle Wilund
ea38b223bd Commitlog: change the ID generation scheme
* Make it more like origin, i.e. based on wall clock time of app start
* Encode shard ID in the RP segment ID, to ensure RPs and segment names
  are unique per shard
2015-08-31 14:29:46 +02:00
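A minimal sketch of the ID scheme described in ea38b223bd, for illustration only. The bit widths, the function names `make_segment_id` and `shard_of`, and the millisecond time unit are assumptions made here; the commit message does not state the actual layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout (not taken from the commit): the top 8 bits carry the
// shard ID, the low 56 bits carry a base derived from the application start
// time plus a per-shard counter.
constexpr unsigned shard_bits = 8;
constexpr unsigned base_bits = 64 - shard_bits;
constexpr uint64_t base_mask = (uint64_t(1) << base_bits) - 1;

// Build a segment ID that is unique across shards even when two shards
// allocate a segment with the same time base and counter.
uint64_t make_segment_id(unsigned shard, uint64_t boot_time_ms, uint64_t counter) {
    assert(shard < (1u << shard_bits));
    uint64_t base = (boot_time_ms + counter) & base_mask;
    return (uint64_t(shard) << base_bits) | base;
}

// Recover the shard that produced a given segment ID.
unsigned shard_of(uint64_t segment_id) {
    return static_cast<unsigned>(segment_id >> base_bits);
}
```

Because the shard ID is part of every ID, segments from all shards can share a single directory without name collisions, which is the property f14e3cf8d0 relies on.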
Calle Wilund
4ac07fa87d Commitlog test: remove some hardcoded assumptions on segment IDs
To enable changing the ID generation scheme.
2015-08-31 14:29:45 +02:00
Calle Wilund
c040565bf9 runtime: expose boot_time
(boot == app start, I did not rename the var).
2015-08-31 14:29:45 +02:00
Calle Wilund
d4ae43862d SStables: Use db::commitlog::replay_position (not own type)
2015-08-31 14:29:45 +02:00
Calle Wilund
0fcf7e3e91 Commitlog: Make "position" type 32-bit to align replay_position with
Origin

* Note: removed commitlog_test:test_allocation_failure because with 
  segments limited to 4GB -> mutation limited to 2GB, actually forcing
  a fail is not guaranteed or even likely.
2015-08-31 14:29:44 +02:00
Avi Kivity
8c69098c89 Merge "Optimize memtable's scanning_reader" from Tomasz
"I saw about 4% improvement in perf_sstable write on muninn with this. The
decorated_key comparison is gone from the perf profile now. Now most of the
work inside the reader is for copying the mutation."
2015-08-31 15:07:27 +03:00
Tomasz Grabiec
f4038b1c04 memtable: scanning_reader: Avoid lookups when iterators not invalidated
Fixes #230.
2015-08-31 13:58:42 +02:00
Tomasz Grabiec
110a55886c lsa: Introduce region::compaction_counter()
2015-08-31 13:58:42 +02:00
Tomasz Grabiec
9ad3dbe592 lsa: Add region::compaction_enabled()
2015-08-31 13:58:42 +02:00
Tomasz Grabiec
048387782a lsa: Rename region::set_compactible() to set_compaction_enabled()
To avoid confusion with region_impl::is_compactible() when the getter
is added.
2015-08-31 13:58:42 +02:00
Tomasz Grabiec
d9ce307c6a memtable: Add non-const partition_entry::key() variant
Helps moving from memtable to cache.
2015-08-31 14:54:26 +03:00
Avi Kivity
0558a56ab5 transport: improve exception handling on connection close
An exception would cause us not to wait on the _ready_to_respond future,
causing a use-after-free.
2015-08-31 14:45:11 +03:00
Calle Wilund
3f1a91b89c Commitlog: do not eagerly create first segment on init
Deferring makes it easier to separate old segments from new, which in turn
helps replay logic.
2015-08-31 13:11:44 +02:00
Pekka Enberg
dd1126f2c2 service/migration_manager: Fix keyspace and CF creation logging
Fix logging output to match origin. Fixes #112.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 13:36:02 +03:00
Pekka Enberg
7c9eeb519a schema: Add operator<< for 'schema'
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 13:35:26 +03:00
Pekka Enberg
ae9e3e049c schema: Improve column_definition operator<< output
Make operator<< for column_definition print more information.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 13:35:26 +03:00
Pekka Enberg
61d7e8de1c schema: Add to_string() for column_kind and index_type enums
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 13:35:26 +03:00
Pekka Enberg
03e0bcd8cb database: Add operator<< for keyspace_metadata
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 13:35:19 +03:00
Tomasz Grabiec
fabda26871 Merge branch 'penberg/fix-create-keyspace-validation/v1' from seastar-dev.git
From Pekka:

This adds replication strategy validation and wires it up to CREATE
KEYSPACE via migration manager. Fixes #191.
2015-08-31 12:21:52 +02:00
Pekka Enberg
01900996b9 service/migration_manager: Wire up create keyspace validation
Fixes #191.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:56 +03:00
Pekka Enberg
04a65ec06f database: Add keyspace_metadata::validate() helper
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:56 +03:00
Pekka Enberg
5a9cff9dc0 locator/abstract_replication_strategy: Add validate_replication_strategy() helper
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:56 +03:00
Pekka Enberg
b8211c436b locator/abstract_replication_strategy: Add validate_options() helper
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:56 +03:00
Pekka Enberg
cff9eb520b locator/abstract_replication_strategy: Add recognized_options() helper
Add a helper function for obtaining a vector of supported replication
strategy options. This is needed for validation.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:31 +03:00
Nadav Har'El
f6ae567ab1 repair: implement primaryRange and ranges options
This patch implements repair's "primaryRange" and "ranges" options:

Without these options, a repair defaults to repairing all the ranges for which
this node holds a replica (each range is repaired by contacting the other
replicas of this range).

If the "primaryRange" option is passed, instead of repairing all ranges, only
the "primary ranges" of this node are repaired - for each range, only one node
has this range as its "primary range". The intention is that a user can start
a "primaryRange" repair on all nodes, and the result would be that each range
will only be repaired once.

If the "ranges" option is passed, it can explicitly list the ranges to
repair, overriding the automatic determination of ranges explained above.

Fixes #212.

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-31 10:02:03 +03:00
Nadav Har'El
cc4117d6c1 repair: do not use an atomic integer
Avi asked not to use an atomic integer to produce ids for repair
operations. The existing code had another bug: It could return some
id immediately, but because our start_repair() hasn't started running
code on cpu 0 yet, the new id was not yet registered and if we were to
call repair_get_status() for this id too quickly, it could fail.

The solution for both issues is that start_repair() should return not
an int, but a future<int>: the integer id is incremented on cpu 0 (so
no atomics are needed), and then returned and the future is fulfilled.

Note that the future returned by start_repair() does not wait for the
repair to be over - just for its index to be registered and be usable
to a call to repair_get_status().

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-08-31 09:31:19 +03:00
Gleb Natapov
821d81786e fix timeout of background read repair request
Do not set _cl_promise on timeout if timeout happens after cl is
achieved. It may happen for background read repair requests.
2015-08-30 19:07:29 +03:00
Gleb Natapov
5bb37bc92e fix race between speculating read timer and request completion
Speculating timer may expire after request is complete, but before a
continuation that cancels it runs. In this case the timer should not
initiate an additional request and should just do nothing instead.
2015-08-30 19:07:29 +03:00
Avi Kivity
2ef5816996 Merge seastar upstream
* seastar a503442...9cc5cd0 (3):
  > fstream: fix write-behind on filesystems that don't support fallocate()
  > fstream: return correct error
  > fstream: reinitialize _background_writes_done after an error
2015-08-30 15:18:28 +03:00
Avi Kivity
4ec4a4b53c Merge seastar upstream
* seastar 2e041c2...a503442 (4):
  > fstream: write-behind
  > output_stream: improve flush() support
  > thread: initialize stack in debug mode
  > sharded: do not capture remote service pointer on remote invocation lambda
2015-08-30 12:09:51 +03:00
Avi Kivity
554645db91 Revert "Merge "Move the API configuration from command line to configuration" from Amnon"
See issue #59 for details.

This reverts commit 5aa0244d32, reversing
changes made to 7fb109a58d.
2015-08-30 12:09:00 +03:00
Avi Kivity
15987f80cf Merge "Avoid allocations in the read indexes path" from Glauber
"We can avoid small allocations when doing read_index. Doing that will yield
us another 4 % gain.

Before:
839484.65 +- 585.52 partitions / sec (30 runs, 1 concurrent ops)

After:
873323.18 +- 442.52 partitions / sec (30 runs, 1 concurrent ops)"
2015-08-30 08:43:18 +03:00
Glauber Costa
b1bfcda38c column helper: loop once only while gathering statistics.
The code to gather statistics about the column_name is showing in the
benchmark.

If we really want to collect those statistics, I guess they will never be free
because they involve a byte copy which implies an allocation.

But one easy thing we can do to make it better, is collect both min and max
statistics in the same loop. There is also no need to special case the case of
an empty vector, since may_grow will already take care of that.

That yields us a ~ 0.77 % boost, which although not earth shattering, is too
easy for us not to reap.

Before:
200582.94 +- 293.91 partitions / sec (30 runs, 1 concurrent ops)

After:
202120.06 +- 341.95 partitions / sec (30 runs, 1 concurrent ops)

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-30 08:43:02 +03:00
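The single-loop idea from b1bfcda38c can be sketched as follows. This is an illustration and not the project's code: `min_max_once` is a hypothetical name, plain `std::string` stands in for the column name bytes, and the sketch assumes a non-empty input rather than relying on may_grow.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Track both extremes in one pass instead of scanning the names twice.
// Pointers are used during the scan so that only the two winners are copied.
std::pair<std::string, std::string> min_max_once(const std::vector<std::string>& names) {
    assert(!names.empty());  // assumption of this sketch
    const std::string* lo = &names.front();
    const std::string* hi = &names.front();
    for (const auto& n : names) {
        if (n < *lo) lo = &n;
        if (*hi < n) hi = &n;
    }
    return {*lo, *hi};
}
```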
Glauber Costa
aab1ae9dc1 index_entry: don't generate a temporary bytes element
The one thing that is still showing pretty high at the read_indexes flamegraph,
is allocations.

We can, however, do better. Since most of the index is the keys anyway - and we need
all of them, the amount of memory we use by copying the buffers over is about the same
as the space we would use by just keeping the buffers around.

So we can change index_entry to just keep the shared_buffers, and since we always access
it through views anyway, that is perfectly fine. The index_entry destructor will then
release() the temporary_buffer, instead of doing this after the buffer copy.

This gives us a nice additional 4 %.

perf_sstable_g  --smp 1 --iterations 30 --parallelism 1 --mode index_read

Before:
839484.65 +- 585.52 partitions / sec (30 runs, 1 concurrent ops)

After:
873323.18 +- 442.52 partitions / sec (30 runs, 1 concurrent ops)

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:09:53 -05:00
Glauber Costa
a9ab31dd9c index_entry: move its fields to private visibility
And provide accessors. This will give us the freedom to change their internal
storage.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:05:36 -05:00
Glauber Costa
1fbd14354f index_entry: provide a constructor
This is a preparation to have their internal fields as private.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:05:36 -05:00
Glauber Costa
13d59c9618 index_entry: do away with the disk_string<> fields
Now that we are using the NSM, and not the general parser for the index, there
is no reason to keep using disk_string<>s in it. Since it is standing in the way
of further optimizations, let's get rid of it and use bytes directly.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 14:05:36 -05:00
Glauber Costa
b53511b422 sstables: don't return after processing collections
The code as it stands is blatantly wrong, and is an artifact of the seastar-thread
conversion.

This happened because the way we move to the next element in a do_for_each
future loop, is by returning the current lambda, and so it was converted this
way. Since we are now using a for loop, we should not return: we should continue.

I found this while searching for a bug, which is unfortunately not fixed by this.
But this is totally wrong, and has to be fixed.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 20:37:39 +03:00
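The bug class described in b53511b422 can be shown with a small, self-contained sketch. The `cell` type and both functions are hypothetical and are not the sstables code. They only show why a `return` that was correct inside a per-element lambda becomes wrong once the body lives in a plain for loop.

```cpp
#include <cassert>
#include <vector>

// Hypothetical element type: collections are handled separately from scalars.
struct cell {
    bool is_collection;
    int value;
};

// Buggy conversion: inside a do_for_each-style lambda, `return` meant
// "done with this element". Inside a for loop it ends the whole function.
int sum_scalars_buggy(const std::vector<cell>& cells) {
    int sum = 0;
    for (const auto& c : cells) {
        if (c.is_collection) {
            return sum;  // wrong: all remaining elements are skipped
        }
        sum += c.value;
    }
    return sum;
}

// Fixed conversion: `continue` moves on to the next element.
int sum_scalars_fixed(const std::vector<cell>& cells) {
    int sum = 0;
    for (const auto& c : cells) {
        if (c.is_collection) {
            continue;
        }
        sum += c.value;
    }
    return sum;
}
```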
Glauber Costa
2623362d20 continuous_data_consumer: do not pass reference to child
Since the child is a base class, we don't need to pass a reference: we can
just cast our 'this' pointer.

By doing that, the move constructor can come back.

Welcome back, move constructor.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 20:32:56 +03:00
Avi Kivity
5aa0244d32 Merge "Move the API configuration from command line to configuration" from Amnon
"This series addresses issues #59 and #23.

It moves the API configuration from the command line argument to the general
config, and it also makes the api-doc directory configurable instead of hard
coded."

Fixes #59
Fixes #23
2015-08-29 12:34:04 +03:00
Avi Kivity
7fb109a58d Merge "Types cleanup" from Pekka
"Remove type name duplication in types.cc."
2015-08-29 11:48:41 +03:00
Glauber Costa
0dd57fbca8 checksummed file writer: some cleanups
- no need to mark us as a friend of file_writer
- should be constructing the fields directly instead of using the constructor's body.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 11:44:48 +03:00
Glauber Costa
66cc546781 sstable writer: compute checksum at larger chunks
What we are doing now, is computing checksum at every write() operation, possibly
at a small byte quantity - like 2 or 4 bytes, since we write those a lot as sizes.

While adler32 allows those computations and makes them very easy, that doesn't mean
they are efficient. It is a lot more efficient to compute the checksum on larger
buffers.

We can do that by doing it at put() time in a data_sink_impl, instead of
keeping that in the file abstraction. The code for the checksum itself now also
becomes remarkably simpler - since there is no need anymore to keep state:
we'll always be presented with full buffers.

The data sink implementation and the file_writer share the full_checksum and
the checksum struct variables: and with that in place, the file writer can
still expose the final results of the computation in the same way it does at
present.

Benchmarked with:
perf_sstable_g  --smp 1 --iterations 30 --parallelism 1 --mode write --num_columns 5 --partitions 500000

Before:
178829.07 +- 141.28 partitions / sec (30 runs, 1 concurrent ops)
After:
199744.71 +- 201.64 partitions / sec (30 runs, 1 concurrent ops)

gain: 11.70 %

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-29 11:44:47 +03:00
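A rough sketch of the idea in 66cc546781: small writes are buffered and the checksum is updated once per full buffer instead of once per write. The names `adler32_update` and `chunked_checksummer` are made up for this illustration, and the hand-written Adler-32 loop stands in for whatever checksum routine the project actually calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Plain Adler-32 update (modulus 65521). A fresh checksum starts at 1.
uint32_t adler32_update(uint32_t adler, const unsigned char* data, size_t len) {
    uint32_t a = adler & 0xffff;
    uint32_t b = (adler >> 16) & 0xffff;
    for (size_t i = 0; i < len; ++i) {
        a = (a + data[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

// Hypothetical writer: write() only appends to a buffer, flush() checksums
// the whole buffer in one call, so many tiny writes cost a single update.
class chunked_checksummer {
    std::string _buf;
    uint32_t _checksum = 1;
    size_t _updates = 0;
public:
    void write(const std::string& bytes) { _buf += bytes; }
    void flush() {
        if (_buf.empty()) {
            return;
        }
        _checksum = adler32_update(
            _checksum, reinterpret_cast<const unsigned char*>(_buf.data()), _buf.size());
        ++_updates;
        _buf.clear();
    }
    uint32_t checksum() const { return _checksum; }
    size_t updates() const { return _updates; }
};
```

The result is identical to checksumming every small write separately; only the number of update calls changes.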
Avi Kivity
e9917a5862 Merge "Improve read index performance further" from Glauber
"This patch improves the read_indexes performance by an extra 16 %.
The total gain so far is now 98 %, and although there are still things
I believe we can do to improve it further, I consider a 2-fold increase
sufficient to declare Issue #94 fixed.

So:

Fixes #94

The speed up is achieved by converting the reader to the NSM. To do that, I had
to commonize most parts of the NSM. I had attempted this before, and for this
new cycle, I had a new tool to aid me in this task: the sstable performance
microbenchmark.

Every change to the NSM was individually tested to make sure the performance
of the read path was not regressing. When it did regress, I took alternate
approaches and tried my best to discuss the whys in the changelogs, with
the appropriate result.

So I can be quite confident in affirming that we are not taking any drop
here, while read_index performance is increased significantly"
2015-08-29 11:28:03 +03:00
Amnon Heiman
f1cda74c15 API: storage_service - return an error for wrong keyspace name
This patch addresses issue #155: it adds a helper function that throws a bad
parameter exception if a keyspace does not exist.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-08-29 11:22:27 +03:00
Glauber Costa
babccb1112 read_indexes: convert to the NSM
Reading each member individually is not as efficient. Better convert to
the NSM.

Before:
717101.20 +- 649.77 partitions / sec (30 runs, 1 concurrent ops)
After:
838169.80 +- 575.04 partitions / sec (30 runs, 1 concurrent ops)

Gains:
16.88 %

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 19:07:39 -05:00
Glauber Costa
4b174c754d commonize the NSM
In order to reuse the NSM in other scenarios, we need to push as much code
as possible into a common class.

This patch does that, making the continuous_data_consumer class now the main
placeholder for the NSM class. The actual readers will have to inherit from it.

However, despite using inheritance, I am not using virtual functions at all.
Instead, we let the continuous_data_consumer receive an instance of the derived
class, and then it can safely call its methods without paying the cost of
virtual functions.

In another attempt, I had kept the main process() function in the derived class,
which then had the responsibility of coding the loop.

With the use of the new pattern, we can keep the loop logic in the base class,
which is a lot cleaner. There is a performance penalty associated with it, but
it is fairly small: 0.5 % in the sequential_read perf_sstable test. I think we
can live with it.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 18:56:26 -05:00
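The pattern described in 4b174c754d, a base class that owns the loop and calls into the derived class without virtual functions, resembles the curiously recurring template pattern. The sketch below is an assumption-laden illustration and not the continuous_data_consumer code: `consumer_base`, `digit_counter`, and `process_byte` are hypothetical names, and the sketch casts `this` (the variant mentioned in 2623362d20) rather than receiving a reference to the derived instance.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The loop lives in the base class. Each step is dispatched statically to the
// derived class, so no virtual call is paid on the fast path.
template <typename Derived>
class consumer_base {
public:
    // Feeds bytes to the derived class until it asks to stop.
    // Returns the number of bytes processed.
    size_t consume(const std::string& data) {
        auto& self = static_cast<Derived&>(*this);
        size_t processed = 0;
        for (char c : data) {
            if (!self.process_byte(c)) {
                break;
            }
            ++processed;
        }
        return processed;
    }
};

// Hypothetical reader: counts digits and stops at the first ';'.
class digit_counter : public consumer_base<digit_counter> {
    size_t _digits = 0;
public:
    bool process_byte(char c) {
        if (c == ';') {
            return false;
        }
        if (c >= '0' && c <= '9') {
            ++_digits;
        }
        return true;
    }
    size_t digits() const { return _digits; }
};
```

Since the base class holds no reference to the derived object, a type built this way keeps its implicitly generated move constructor.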
Glauber Costa
f8d35ef5ec sstables: move exception to its own file.
I am moving the malformed exception here, to avoid circular dependencies.
But since the file now exists, let's move them all.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 17:30:44 -05:00
Glauber Costa
d9b7f4bde3 row consumer: separate processing of buffers from the main loop
In my previous attempt, I have separated the state processor from the main loop,
leaving that to be filled by a derived class.

That felt a lot more natural, because then we don't have to replicate the loop
logic in the derived classes.

But well, oh, well, life is hard. Especially on fast paths. Doing that makes us
insert an extra call in this loop, and that is noticeable: we would be 1.5 %
slower, and that is not even counting the cost of making the state processing a
virtual function later on.

I could just argue that this is acceptable due to decoupling gains, but why
would I argue that, if I can just rewrite it in a way that no performance is
lost?

And then I did. The disadvantage of this, is that the derived class will now
have to re-code the loop, but no performance is lost. Another advantage of
this, is that the derived class will now be able to call into process_buffer
directly, without using virtual functions in this path for any of them.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-28 17:30:44 -05:00