scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 03:56:42 +00:00

Author	SHA1	Message	Date
Calle Wilund	bbf82e80d0	Commitlog: Allow skipping X bytes in commit log reader Also refactor reader into named methods for debugging sanity.	2015-08-31 14:29:49 +02:00
Calle Wilund	da9ea641e5	Commitlog: Handle full paths in descriptor file name parse.	2015-08-31 14:29:48 +02:00
Calle Wilund	02d2bef1f2	Commitlog: Expose convinience method "list_existing_segments"	2015-08-31 14:29:48 +02:00
Calle Wilund	19052b3c09	Commitlog: Expose list_existing_descriptors	2015-08-31 14:29:48 +02:00
Calle Wilund	e068ffb5a5	Commitlog: Make file reader provide replay_position for entries	2015-08-31 14:29:47 +02:00
Calle Wilund	41b1ad8600	Commitlog: Make descriptor type visible/usable from outside	2015-08-31 14:29:47 +02:00
Calle Wilund	f14e3cf8d0	Database: do not create shard-specific dirs for commitlog New ID scheme allows for a single dir for all segments from all shards.	2015-08-31 14:29:46 +02:00
Calle Wilund	ea38b223bd	Commitlog: change the ID generation scheme * Make it more like origin, i.e. based on wall clock time of app start * Encode shard ID in the, RP segement ID, to ensure RP:s and segement names are unique per shard	2015-08-31 14:29:46 +02:00
Calle Wilund	4ac07fa87d	Commitlog test: remove some hardcoded assumptions on segment IDs To enable changing the ID generation scheme.	2015-08-31 14:29:45 +02:00
Calle Wilund	c040565bf9	runtime: expose boot_time (boot == app start, I did not rename the var).	2015-08-31 14:29:45 +02:00
Calle Wilund	d4ae43862d	SStables: Use db::commitlog::replay_position (not own type)	2015-08-31 14:29:45 +02:00
Calle Wilund	0fcf7e3e91	Commitlog: Make "position" type 32-bit to align replay_position with Origin * Note: removed commitlog_test:test_allocation_failure because with segments limited to 4GB -> mutation limited to 2GB, actually forcing a fail is not guaranteed or even likely.	2015-08-31 14:29:44 +02:00
Calle Wilund	3f1a91b89c	Commitlog: do not eagerly create first segment on init Deferring makes it easier to separate old segments from new, which in turn helps replay logic.	2015-08-31 13:11:44 +02:00
Tomasz Grabiec	fabda26871	Merge branch 'penberg/fix-create-keyspace-validation/v1' from seastar-dev.git From Pekka: This adds replication strategy validation and wires it up to CREATE KEYSPACE via migration manager. Fixes #191.	2015-08-31 12:21:52 +02:00
Pekka Enberg	01900996b9	service/migration_manager: Wire up create keyspace validation Fixes #191. Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>	2015-08-31 11:54:56 +03:00
Pekka Enberg	04a65ec06f	database: Add keyspace_metadata::validate() helper Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>	2015-08-31 11:54:56 +03:00
Pekka Enberg	5a9cff9dc0	locator/abstract_replication_strategy: Add validate_replication_strategy() helper Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>	2015-08-31 11:54:56 +03:00
Pekka Enberg	b8211c436b	locator/abstract_replication_strategy: Add validate_options() helper Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>	2015-08-31 11:54:56 +03:00
Pekka Enberg	cff9eb520b	locator/abstract_replication_strategy: Add recognized_options() helper Add a helper function for obtaining a vector of supported replication strategy options. This is needed for validation. Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>	2015-08-31 11:54:31 +03:00
Nadav Har'El	f6ae567ab1	repair: implement primaryRange and ranges options This patch implements repair's "primaryRange" and "ranges" options: Without these options, a repair defaults to repair all the ranges for which this nodes holds a replica (each range is repaired by contacting the other replicas of this range). If the "primaryRange" option is passed, instead of repairing all ranges, only the "primary ranges" of this node is repaired - for each range, only one node has this range as its "primary range". The intention is that a user can start a "primaryRange" repair on all nodes, and the result would be that each range will only be repaired once. If the "ranges" option is passed, it can explicitly list a list of ranges to repair, overriding the automatic determination of ranges explained above. Fixes #212. Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>	2015-08-31 10:02:03 +03:00
Nadav Har'El	cc4117d6c1	repair: do not use an atomic integer Avi asked not to use an atomic integer to produce ids for repair operations. The existing code had another bug: It could return some id immediately, but because our start_repair() hasn't started running code on cpu 0 yet, the new id was not yet registered and if we were to call repair_get_status() for this id too quickly, it could fail. The solution for both issues is that start_repair() should return not an int, but a future<int>: the integer id is incremented on cpu 0 (so no atomics are needed), and then returned and the future is fulfilled. Note that the future returned by start_repair() does not wait for the repair to be over - just for its index to be registered and be usable to a call to repair_get_status(). Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>	2015-08-31 09:31:19 +03:00
Gleb Natapov	821d81786e	fix timeout of background read repair request Do not set _cl_promise on timeout if timeout happens after cl is achieved. It may happen for background read repair requests.	2015-08-30 19:07:29 +03:00
Gleb Natapov	5bb37bc92e	fix race between speculating read timer and request completion Speculating timer may expire after request is complete, but before a continuation that cancels it runs. In this case the timer should not initiate additional request and just do nothing instead.	2015-08-30 19:07:29 +03:00
Avi Kivity	2ef5816996	Merge seastar upstream * seastar a503442...9cc5cd0 (3): > fstream: fix write-behind on filesystems that don't support fallocate() > fstream: return correct error > fstream: reinitialize _background_writes_done after an error	2015-08-30 15:18:28 +03:00
Avi Kivity	4ec4a4b53c	Merge seastar upstream * seastar 2e041c2...a503442 (4): > fstream: write-behind > output_stream: improve flush() support > thread: initialize stack in debug mode > sharded: do not capture remote service pointer on remote invocation lambda	2015-08-30 12:09:51 +03:00
Avi Kivity	554645db91	Revert "Merge "Move the API configuration from command line to configuration" from Amnon" See issue #59 for details. This reverts commit `5aa0244d32`, reversing changes made to `7fb109a58d`.	2015-08-30 12:09:00 +03:00
Avi Kivity	15987f80cf	Merge "Avoid allocations in the read indexes path" from Glauber "We can avoid small allocations when doing read_index. Doing that will yield us another 4 % gain. Before: 839484.65 +- 585.52 partitions / sec (30 runs, 1 concurrent ops) After: 873323.18 +- 442.52 partitions / sec (30 runs, 1 concurrent ops)"	2015-08-30 08:43:18 +03:00
Glauber Costa	b1bfcda38c	column helper: loop once only while gathering statistics. the code to gather statistics about the column_name is showing in the benchmark. If we really want to collect those statistics, I guess they will never be free because they involve a byte copy which implies an allocation. But one easy thing we can do to make it better, is collect both min and max statistics in the same loop. There is also no need to special case the case of an empty vector, since may_grow will already take care of that. That yields us a ~ 0.77 % boost, which although not earth shattering, is easy enough for us not to reap. Before: 200582.94 +- 293.91 partitions / sec (30 runs, 1 concurrent ops) After: 202120.06 +- 341.95 partitions / sec (30 runs, 1 concurrent ops) Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-30 08:43:02 +03:00
Glauber Costa	aab1ae9dc1	index_entry: don't generate a temporary bytes element The one thing that is still showing pretty high at the read_indexes flamegraph, is allocations. We can, however, do better. Since most of the index is the keys anyway - and we need all of them, the amount of memory we use by copying the buffers over is about the same as the space we would use by just keeping the buffers around. So we can change index_entry to just keep the shared_buffers, and since we always access it through views anyway, that is perfectly fine. The index_entry destructor will then release() the temporary_buffer, instead of doing this after the buffer copy. This gives us a nice additional 4 %. perf_sstable_g --smp 1 --iterations 30 --parallelism 1 --mode index_read Before: 839484.65 +- 585.52 partitions / sec (30 runs, 1 concurrent ops) After: 873323.18 +- 442.52 partitions / sec (30 runs, 1 concurrent ops) Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 14:09:53 -05:00
Glauber Costa	a9ab31dd9c	index_entry: move its fields to private visibility And provide accessors. This will give us the freedom to change their internal storage. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 14:05:36 -05:00
Glauber Costa	1fbd14354f	index_entry: provide a constructor This is a preparation to have their internal fields as private. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 14:05:36 -05:00
Glauber Costa	13d59c9618	index_entry: do away with the disk_string<> fields Now that we are using the NSM, and not the general parser for the index, there is no reason to keep using disk_string<>s in it. Since it is staying in the way of further optimizations, let's get rid of it and use bytes directly. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 14:05:36 -05:00
Glauber Costa	b53511b422	sstables: don't return after processing collections The code as is is blatantly wrong, and is an artifact of the seastar-thread conversion. This happened because the way we move to the next element in a do_for_each future loop, is by returning the current lambda, and so it was converted this way. Since we are now using a for loop, we should not return: we should continue. I found this while searching for a bug, which is unfortunately not fixed by this. But this is totally wrong, and has to be fixed. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 20:37:39 +03:00
Glauber Costa	2623362d20	continuous_data_consumer: do not pass reference to child Since the child is a base class, we don't need to pass a reference: we can just cast our 'this' pointer. By doing that, the move constructor can come back. Welcome back, move constructor. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 20:32:56 +03:00
Avi Kivity	5aa0244d32	Merge "Move the API configuration from command line to configuration" from Amnon "This series address issues #59 and #23. It moves the API configuration from the command line argument to the general config, it also move the api-doc directory to be configurable instead of hard coded." Fixes #59 Fixes #23	2015-08-29 12:34:04 +03:00
Avi Kivity	7fb109a58d	Merge "Types cleanup" from Pekka "Remove type name duplication in types.cc."	2015-08-29 11:48:41 +03:00
Glauber Costa	0dd57fbca8	checksummed file writer: some cleanups - no need to mark us as a friend of file_writer - should be constructing the fields directly instead of using the constructors body. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 11:44:48 +03:00
Glauber Costa	66cc546781	sstable writer: compute checksum at larger chunks What we are doing now, is computing checksum at every write() operation, possibly at a small byte quantity - like 2 or 4 bytes, since we write those a lot as sizes. While adler32 allows those computations and make them very easy, that doesn't mean they are efficient. It is a lot more efficient to compute the checksum on larger buffer. We can do that by doing it at put() time in a data_sink_impl, instead of keeping that in the file abstraction. The code for the checksum itself now also becomes remarkably simpler - since there is no need anymore to keep state: we'll always be presented with full buffers. The data sink implementation and the file_writer share the full_checksum and the checksum struct variables: and with that in place, the file writer can still expose the final results of the computation in the same way it does at present. Benchmarked with: perf_sstable_g --smp 1 --iterations 30 --parallelism 1 --mode write --num_columns 5 --partitions 500000 Before: 178829.07 +- 141.28 partitions / sec (30 runs, 1 concurrent ops) After: 199744.71 +- 201.64 partitions / sec (30 runs, 1 concurrent ops) gain: 11.70 % Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 11:44:47 +03:00
Avi Kivity	e9917a5862	Merge "Improve read index performance further" from Glauber "This patch improves the read_indexes performance by an extra 16 %. The total gain so far is now 98 %, and although there are still things I believe we can do to improve it further, I consider a 2-fold increase sufficient to declare Issue #94 fixed. So: Fixes #94 The speed up is achieved by converting the reader to the NSM. To do that, I had to commonize most parts of the NSM. I had attempted this before, and for this new cycle, I had a new tool to aid me in this task: the sstable performance microbenchmark. Every change to the NSM was individually tested to make sure the performance of the read path was not regressing. When it did regress, I took alternate approaches and tried my best to discuss the whys in the changelogs, with the appropriate result. So I can be quite confident in affirming that we are not taking any drop here, while read_index performance is increased significantly"	2015-08-29 11:28:03 +03:00
Amnon Heiman	f1cda74c15	API: storage_service - return an error for wrong keyspace name This patch addresses issu #155, it adds a helper function that if a keyspace does not exists it throw a bad parameter exception. Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>	2015-08-29 11:22:27 +03:00
Glauber Costa	babccb1112	read_indexes: convert to the NSM Reading each member individually is not as efficient. Better convert to the NSM. Before: 717101.20 +- 649.77 partitions / sec (30 runs, 1 concurrent ops) After: 838169.80 +- 575.04 partitions / sec (30 runs, 1 concurrent ops) Gains: 16.88 % Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-28 19:07:39 -05:00
Glauber Costa	4b174c754d	commonize the NSM In order to reuse the NSM in other scenarios, we need to push as much code as possible into a common class. This patch does that, making the continuous_data_consumer class now the main placeholder for the NSM class. The actual readers will have to inherit from it. However, despite using inheritance, I am not using virtual functions at all instead, we let the continuous_data_consumer receive an instance of the derived class, and then it can safely call its methods without paying the cost of virtual functions. In other attempt, I had kept the main process() function in the derived class, that had the responsibility of then coding the loop. With the use of the new pattern, we can keep the loop logic in the base class, which is a lot cleaner. There is a performance penalty associated with it, but it is fairly small: 0.5 % in the sequential_read perf_sstable test. I think we can live with it. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-28 18:56:26 -05:00
Glauber Costa	f8d35ef5ec	sstables: move exception to its own file. I am moving the malformed exception here, to avoid circular dependencies. But since the file now exists, let's move them all. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-28 17:30:44 -05:00
Glauber Costa	d9b7f4bde3	row consumer: separate processing of buffers from the main loop In my previous attempt, I have separated the state processor for the main loop, leaving that to be filled by a derived class. That felt a lot more natural, because then we don't have to replicate the loop logic in the derived classes. But well, oh, well, life is hard. Specially on fast paths. Doing that makes us insert an extra call in this loop, and that is noticeable: we would be 1.5 % slower, and that is not even counting the cost of making the state processing a virtual function later on. I could just argue that this is acceptable due to decoupling gains, but why I would argue that, if I can just rewrite it in a way that no performance is lost? And then I did. The disadvantage of this, is that the derived class will now have to re-code the loop, but no performance is lost. Another advantage of this, is that the derived class will now be able to call into process_buffer directly, without using virtual functions in this path for any of them. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-28 17:30:44 -05:00
Glauber Costa	fbd68c3b01	row consumer: move consume_be to consumer.hh It will be reused by the continuous_data_consumer Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-28 17:30:43 -05:00
Glauber Costa	e1945e473b	row consumer: make non_consuming an instance member It is now a static member that gets the instance members as parameters. There is no reason for that, and this will complicate the decoupling, since the prestate reader won't know about state. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com> Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>	2015-08-28 17:18:19 -05:00
Glauber Costa	f45b807f34	row consumer: move proceed class to a separate class Continuing the work of decoupling the the prestate and state parts of the NSM so we can reuse it, move the proceed class to a different holding class. Proceeding or not has nothing to do with "rows". Signed-off-by: Glauber Costa <glommer@cloudius-systems.com> Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>	2015-08-28 17:18:06 -05:00
Glauber Costa	49ac04a60a	row consumer: fall through more often Because we didn't had before a way to know whether or not the read completed, we would always go back to the main loop, and would only optimize sequential reads for some kinds of data. However, As one could see in the previous patch, the new read_X functions will notify completion, allowing us to just fallthrough to the next case if that is the only possibility. In most cases, it isn't. With this, we can apply this optimization throughout all cases where we don't branch states, and with a very elegant resulting code. The performance actually increases by 0.75 %. It is not much, but it is more than the error margin (which sits at 0.20 %), and because the code is not made unreadable by it, this is a clear win to me. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-28 16:30:22 -05:00
Glauber Costa	1f930cda4a	row consumer: extend use of read for multi-value fields In an attempt to gain some cycles, we are testing whether we can read many values at once, and if so, using consume_be directly for those. What we can do in this situation, is read the first value, and let the read fall through the next case if the read succeeds. The code actually looks a lot more elegant this way. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-28 16:30:22 -05:00
Glauber Costa	0ad8afb0ec	row consumer: extend usage of the read_* functions In some places, we cannot use our read_* functions, because we don't know whether or not it succeeded, and that is important when passing the state along. The fix for this is trivial, since we can just return it from the reader. Note for reviewers: The commend in one of the functions say we should use: "read_bytes(data, _u32, _key ...". But in the actual code, the where buffer is _val, not _key. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-28 16:30:22 -05:00

1 2 3 4 5 ...

6092 Commits