scylladb

Author	SHA1	Message	Date
Avi Kivity	9322c07c71	Merge "Use binary search in sstable promoted index" from Tomasz " The "promoted index" is what the sstable format calls the clustering key index within a given partition. Large partitions with many rows have it. It's embedded in the partition index entry. Currently, lookups in the promoted index are done by scanning the index linearly so the lookup is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O. We could do better and use binary search in the index. This patch series switches the mc-format index reader to do that. Other formats use the old way. The "mc" format promoted index has an extra structure at the end of the index called "offset map". It's a vector of offsets of consecutive promoted index entries. This allows us to access random entries in the index without reading the whole index. The location of the offset entry for a given promoted index entry can be derived by knowing where the offset vector ends in the index file, so the offset map also doesn't have to be read completely into memory. The trickiest part is caching. We need to cache blocks read from the index file to amortize the cost of binary search: - if the promoted index fits in the 32 KiB which was read from the index when looking for the partition entry, we don't want to issue any additional I/O to search the promoted index. - with large promoted indexes, the last few bisections will fall into the same I/O block and we want to reuse that block. - we don't want the cache to grow too big; in particular, we don't want to cache the whole promoted index as the read progresses over the index. Scanning reads may skip multiple times. This series implements a rather simple approach which meets all the above requirements and is not worse than the current state of affairs: - Each index cursor has its own cache of the index file area which corresponds to the promoted index. This is managed by the cached_file class. 
- Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to reuse information obtained during lower bound lookup. This estimation is used to limit read-aheads in the data file. - Each cursor drops entries that it walked past so that memory footprint stays O(log N). - Cached buffers are accounted to the read's reader_permit. Later, we could have a single cache shared by many readers. For that, we need to come up with an eviction policy. Fixes #4007. TESTING RESULTS * Point reads, large promoted index: Config: rows: 10000000, value size: 2000 Partition size: 20 GB Index size: 7 MB Notes: - Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search: time: 1.9ms vs 22.9ms CPU utilization: 8.9% vs 92.3% I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB It's 12x faster, CPU utilization is 10x smaller, disk utilization is 20x smaller. - Slicing at the front (offset=0) is a mixed bag. time is similar: 1.8ms CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7% disk bandwidth utilization is smaller for bsearch but it uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch) bsearch uses less bandwidth because the series reduces the buffer size used for index file I/O. scan is issuing: 2 * 128 KB (index page) 2 * 32 KB (data file) bsearch is issuing: 1 * 64 KB (index page) 15 * 4 KB (promoted index) 1 * 64 KB (data file) The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead). 32 KB is the minimum I/O currently. Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work so that it uses 1 * 4 KB when it suffices. This is left for the follow-up. 
Command: perf_fast_forward --datasets=large-part-ds1 \ --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1 Before: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.001836 172 1 545 9 563 175 4.0 4 320 2 2 0 1 1 0 0 0 57.7% 0 0 32 0.001858 502 32 17220 126 17776 11526 3.2 3 324 2 1 0 1 1 0 0 0 56.4% 0 0 256 0.002833 339 256 90374 427 91757 85931 7.0 7 776 3 1 0 1 1 0 0 0 41.1% 0 0 4096 0.017211 58 4096 237984 2011 241802 233870 66.1 66 8376 59 2 0 1 1 0 0 0 21.4% 0 5000000 1 0.022952 42 1 44 1 45 41 29.2 29 3520 22 2 0 1 1 0 0 0 92.3% 0 5000000 32 0.023052 43 32 1388 14 1414 1331 31.1 32 3588 26 2 0 1 1 0 0 0 91.7% 0 5000000 256 0.024795 41 256 10325 129 10721 9993 43.1 39 4544 29 2 0 1 1 0 0 0 86.4% 0 5000000 4096 0.038856 27 4096 105414 398 106918 103162 95.2 95 12160 78 5 0 1 1 0 0 0 61.4% 0 After (v2): offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.001831 248 1 546 21 581 252 17.6 17 188 2 0 0 1 1 0 0 0 8.5% 0 0 32 0.001910 535 32 16751 626 17770 13896 17.9 19 160 3 0 0 1 1 0 0 0 8.8% 0 0 256 0.003545 266 256 72207 2333 89076 62852 26.9 24 764 7 0 0 1 1 0 0 0 9.7% 0 0 4096 0.016800 56 4096 243812 524 245430 239736 83.6 83 8700 64 0 0 1 1 0 0 0 16.6% 0 5000000 1 0.001968 351 1 508 19 538 380 21.3 21 172 2 0 0 1 1 0 0 0 8.9% 0 5000000 32 0.002273 431 32 14077 436 15503 11551 22.7 22 268 3 0 0 1 1 0 0 0 8.9% 0 5000000 256 0.003889 257 256 65824 2197 81833 57813 34.0 37 652 18 0 0 1 1 0 0 0 11.2% 0 5000000 4096 0.017115 54 4096 239324 834 241310 231993 88.3 88 8844 65 0 0 1 1 0 0 0 16.8% 0 After (v1): offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.001886 259 1 530 4 545 261 18.0 18 376 2 2 0 1 1 0 0 0 9.1% 
0 0 32 0.001954 513 32 16381 93 16844 15618 19.0 19 408 3 2 0 1 1 0 0 0 9.3% 0 0 256 0.003266 318 256 78393 1820 81567 61663 30.8 26 1272 7 2 0 1 1 0 0 0 10.4% 0 0 4096 0.017991 57 4096 227666 855 231915 225781 83.1 83 8888 55 5 0 1 1 0 0 0 15.5% 0 5000000 1 0.002353 232 1 425 2 432 232 23.0 23 396 2 2 0 1 1 0 0 0 8.7% 0 5000000 32 0.002573 384 32 12437 47 12571 429 25.0 25 460 4 2 0 1 1 0 0 0 8.5% 0 5000000 256 0.003994 259 256 64101 2904 67924 51427 37.0 35 1484 11 2 0 1 1 0 0 0 10.6% 0 5000000 4096 0.018567 56 4096 220609 448 227395 219029 89.8 89 9036 59 5 0 1 1 0 0 0 15.1% 0 * Point reads, small promoted index (two blocks): Config: rows: 400, value size: 200 Partition size: 84 KiB Index size: 65 B Notes: - No significant difference in time - the same disk utilization - similar CPU utilization Command: perf_fast_forward --datasets=large-part-ds1 \ --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1 Before: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.000279 470 1 3587 31 3829 478 3.0 3 68 2 1 0 1 1 0 0 0 21.1% 0 0 32 0.000276 3498 32 116038 811 122756 104033 3.0 3 68 2 1 0 1 1 0 0 0 24.0% 0 0 256 0.000412 2554 256 621044 1778 732150 559221 2.0 2 72 2 0 0 1 1 0 0 0 32.6% 0 0 4096 0.000510 1901 400 783883 4078 819058 665616 2.0 2 88 2 0 0 1 1 0 0 0 36.4% 0 200 1 0.000339 2712 1 2951 8 3001 2569 2.0 2 72 2 0 0 1 1 0 0 0 17.8% 0 200 32 0.000352 2586 32 91019 266 92427 83411 2.0 2 72 2 0 0 1 1 0 0 0 20.8% 0 200 256 0.000458 2073 200 436503 1618 453945 385501 2.0 2 88 2 0 0 1 1 0 0 0 29.4% 0 200 4096 0.000458 2097 200 436475 1676 458349 381558 2.0 2 88 2 0 0 1 1 0 0 0 29.0% 0 After (v1): Testing slicing of large partition using clustering keys: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.000278 492 1 3598 30 
3831 500 3.0 3 68 2 1 0 1 1 0 0 0 19.4% 0 0 32 0.000275 3433 32 116153 753 122915 92559 3.0 3 68 2 1 0 1 1 0 0 0 22.5% 0 0 256 0.000458 2576 256 559437 2978 728075 504375 2.1 2 88 2 0 0 1 1 0 0 0 29.0% 0 0 4096 0.000506 1888 400 790064 3306 822360 623109 2.0 2 88 2 0 0 1 1 0 0 0 36.6% 0 200 1 0.000382 2493 1 2619 10 2675 2268 2.0 2 88 2 0 0 1 1 0 0 0 16.3% 0 200 32 0.000398 2393 32 80422 333 84759 22281 2.0 2 88 2 0 0 1 1 0 0 0 19.0% 0 200 256 0.000459 2096 200 435943 1608 453989 380749 2.0 2 88 2 0 0 1 1 0 0 0 30.5% 0 200 4096 0.000458 2097 200 436410 1651 455779 382485 2.0 2 88 2 0 0 1 1 0 0 0 29.2% 0 * Scan with skips, large index: Config: rows: 10000000, value size: 2000 Partition size: 20 GB Index size: 7 MB Notes: - Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch) - Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch) Binary search reads 828 KB more and issues 1719 more IOs. It does more I/O to read the promoted index offset map. - similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan we would end up caching the whole index. But this is protected against by eviction as demonstrated by the last "mem" column. 
Command: perf_fast_forward --datasets=large-part-ds1 \ --run-tests=large-partition-skips -c1 --test-case-duration=1 Before: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 1 1 36.103451 4 5000000 138491 38 138601 138453 153932.0 153932 19703260 153561 1 0 1 1 0 0 0 31.5% 502690 After (v2): read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 1 1 37.000145 4 5000000 135135 6 135146 135128 155651.0 155651 19704088 138968 0 0 1 1 0 0 0 34.2% 0 After (v1): read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 1 1 36.965520 4 5000000 135261 30 135311 135231 155628.0 155628 19704216 139133 1 0 1 1 0 0 0 33.9% 248738 Also in: git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2 Tests: - unit (all modes) - manual using perf_fast_forward " * tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla: sstables: Add promoted index cache metrics position_in_partition: Introduce external_memory_usage() cached_file, sstables: Add tracing to index binary search and page cache sstables: Dynamically adjust I/O size for index reads sstables, tests: Allow disabling binary search in promoted index from perf tests sstables: mc: Use binary search over the promoted index utils: Introduce cached_file sstables: clustered_index: Relax scope of validity of entry_info sstables: index_entry: Introduce owning promoted_index_block_position compound_compat: Allow constructing composite from a view sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view sstables: mc: Extract parser for promoted index block sstables: mc: Extract parser for clustering out of the promoted index block parser sstables: consumer: Extract primitive_consumer 
sstables: Abstract the clustering index cursor behavior sstables: index_reader: Rearrange to reduce branching and optionals	2020-06-18 12:09:39 +03:00
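The offset-map lookup described in the merge message above can be illustrated with a small C++ sketch. Everything in it is an assumption made for illustration: the byte layout, and the names `toy_index`, `toy_entry`, `build_toy_index`, `read_entry` and `find_block`, are hypothetical and far simpler than the real mc format, and no caching or I/O is modeled. It only shows how a fixed-width offset vector at the end of the index lets a reader locate entry i from the end of the buffer and bisect over variable-length entries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified layout (NOT the real mc format):
//   [entry 0][entry 1]...[entry N-1][off 0][off 1]...[off N-1]
// Each entry is: 1-byte key length, key bytes, 8-byte data-file offset.
// Each "off i" is the fixed-width 4-byte position of entry i in the buffer.
struct toy_index {
    std::vector<uint8_t> bytes;
    size_t entry_count = 0;
};

struct toy_entry {
    std::string first_key;   // first clustering key covered by the block
    uint64_t data_offset;    // where the block starts in the data file
};

// Serializes entries (which must already be sorted by first_key),
// then appends the offset map.
toy_index build_toy_index(const std::vector<toy_entry>& sorted_entries) {
    toy_index idx;
    std::vector<uint32_t> offsets;
    for (const auto& e : sorted_entries) {
        assert(e.first_key.size() < 256);
        offsets.push_back(static_cast<uint32_t>(idx.bytes.size()));
        idx.bytes.push_back(static_cast<uint8_t>(e.first_key.size()));
        idx.bytes.insert(idx.bytes.end(), e.first_key.begin(), e.first_key.end());
        uint8_t raw[sizeof(uint64_t)];
        std::memcpy(raw, &e.data_offset, sizeof(raw));
        idx.bytes.insert(idx.bytes.end(), raw, raw + sizeof(raw));
    }
    for (uint32_t off : offsets) {
        uint8_t raw[sizeof(uint32_t)];
        std::memcpy(raw, &off, sizeof(raw));
        idx.bytes.insert(idx.bytes.end(), raw, raw + sizeof(raw));
    }
    idx.entry_count = sorted_entries.size();
    return idx;
}

// Reads entry i. Its slot in the offset map is computed from the end of
// the buffer, so no other entry has to be parsed.
toy_entry read_entry(const toy_index& idx, size_t i) {
    assert(i < idx.entry_count);
    const size_t map_start = idx.bytes.size() - idx.entry_count * sizeof(uint32_t);
    uint32_t pos;
    std::memcpy(&pos, idx.bytes.data() + map_start + i * sizeof(uint32_t), sizeof(pos));
    const size_t key_len = idx.bytes[pos];
    toy_entry e;
    e.first_key.assign(reinterpret_cast<const char*>(idx.bytes.data() + pos + 1), key_len);
    std::memcpy(&e.data_offset, idx.bytes.data() + pos + 1 + key_len, sizeof(e.data_offset));
    return e;
}

// Returns the data-file offset of the last block whose first key is <= key,
// i.e. the only block that may contain `key`. Uses O(log N) entry reads.
std::optional<uint64_t> find_block(const toy_index& idx, const std::string& key) {
    size_t lo = 0, hi = idx.entry_count;  // search window is [lo, hi)
    while (lo < hi) {
        const size_t mid = lo + (hi - lo) / 2;
        if (read_entry(idx, mid).first_key <= key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if (lo == 0) {
        return std::nullopt;  // key sorts before the first block
    }
    return read_entry(idx, lo - 1).data_offset;
}
```

In this toy model each `read_entry` call touches one offset-map slot and one entry; in the series described above those touches are what the per-cursor block cache is meant to amortize.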
Tomasz Grabiec	ab274b8203	sstables: clustered_index: Relax scope of validity of entry_info entry_info holds views, which may get invalidated when the containing index blocks are removed. The current implementations of next_entry() keep the blocks in memory as long as the cursor is alive, but that will change in new implementations of the cursor. Adjust the assumption of tests accordingly.	2020-06-16 16:15:23 +02:00
Tomasz Grabiec	f2e52c433f	sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view	2020-06-16 16:15:23 +02:00
Tomasz Grabiec	d5bf540079	sstables: Abstract the clustering index cursor behavior In preparation for supporting more than one algorithm for lookups in the promoted index, extract relevant logic out of the index_reader (which is a partition index cursor). The clustered index cursor implementation is now hidden behind an abstract interface called clustered_index_cursor. The current implementation is put into the scanning_clustered_index_cursor. It's mostly code movement with minor adjustments. In order to encapsulate iteration over promoted index entries, clustered_index_cursor::next_entry() was introduced. No change in behavior intended in this patch.	2020-06-16 16:14:17 +02:00
Pavel Emelyanov	60e283b23e	auth: Move away from storage_service Now after the auth start/stop is standalone, we can remove reference from storage service to it. This frees some tests from the need to carry the auth service around for nothing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-06-12 22:14:33 +03:00
Pavel Emelyanov	6a46721fb7	auth: Move start-stop code into main The auth service management is currently sitting in storage service, but it was needed there just for cql/thrift start code. After the latter have been moved away there are no other reasons for the auth to be integrated with the storage service, so move it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-06-12 22:14:33 +03:00
Rafael Ávila de Espíndola	555d8fe520	build: Be consistent about system versus regular headers We were not consistent about using '#include "foo.hh"' instead of '#include <foo.hh>' for scylla's own headers. This patch fixes that inconsistency and, to enforce it, changes the build to use -iquote instead of -I to find those headers. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200608214208.110216-1-espindola@scylladb.com>	2020-06-10 15:49:51 +03:00
Dejan Mircevski	9027b6636f	Use sstring_view in execute_cql and assertions This lets the functions operate on a wider variety of arguments and may also be faster. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-06-10 08:10:43 +03:00
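The point about operating on a wider variety of arguments can be sketched as follows. This is a minimal illustration that uses `std::string_view` as a stand-in for `sstring_view`; the helper `starts_with_keyword` is hypothetical and is not part of `execute_cql` or the test assertions.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper. Taking views lets callers pass string literals,
// std::string objects, or substrings without first building a temporary
// owning string, which is where the possible speed-up comes from.
bool starts_with_keyword(std::string_view query, std::string_view keyword) {
    return query.size() >= keyword.size()
        && query.substr(0, keyword.size()) == keyword;
}
```

A literal such as `"SELECT * FROM t"` and a `std::string` variable can both be passed to the same overload; neither call copies the character data.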
Glauber Costa	3972628fc0	compaction: split compaction.hh header compaction.hh is one of our heavy headers, but some users just want to use information on it about how to describe a compaction, not how to perform one. For that reason this patch splits the compaction_descriptor into a new header. The compaction_descriptor has, as a member type, compaction_options. That is moved too, and brings with it the compaction_type. Both of those structures would make sense in a separate header anyway. The compaction_descriptor also wants the creator_fn and replacer_fn functions. We also take this opportunity to rename them into something more descriptive. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-08 16:06:00 -04:00
Avi Kivity	6f394e8e90	tombstone: use comparison operator instead of ad-hoc compare() function and with_relational_operators The comparison operator (<=>) default implementation happens to exactly match tombstone::compare(), so use the compiler-generated defaults. Also default operator== and operator!= (these are not brought in by operator<=>). These become slightly faster as they perform just an equality comparison, not three-way compare. shadowable_tombstone and row_tombstone depend on tombstone::compare(), so convert them too in a similar way. with_relational_operations.hh becomes unused, so delete it. Tests: unit (dev) Message-Id: <20200602055626.2874801-1-avi@scylladb.com>	2020-06-02 09:28:52 +03:00
Piotr Sarna	160e2b06f9	test: move random string helpers to .cc ... since there's no reason for them to reside in a header, and .cc is our default destination. Message-Id: <2509410f0f71df036a7829f1f799503c1a671404.1591078777.git.sarna@scylladb.com>	2020-06-02 09:27:59 +03:00
Avi Kivity	a4c44cab88	treewide: update concepts language from the Concepts TS to C++20 Seastar recently lost support for the experimental Concepts Technical Specification (TS) and gained support for C++20 concepts. Re-enable concepts in Scylla by updating our use of concepts to the C++20 standard. This change: - peels off uses of the GCC6_CONCEPT macro - removes inclusions of <seastar/gcc6-concepts.hh> - replaces function-style concepts (no longer supported) with equation-style concepts - semicolons added and removed as needed - deprecated std::is_pod replaced by recommended replacement - updates return type constraints to use concepts instead of type names (either std::same_as or std::convertible_to, with std::same_as chosen when possible) No attempt is made to improve the concepts; this is a specification update only. Message-Id: <20200531110254.2555854-1-avi@scylladb.com>	2020-06-02 09:12:21 +03:00
Piotr Sarna	91e02ed3ad	test/lib: add generating random numeric string Useful for testing random numeric inputs, e.g. big decimals.	2020-06-01 16:11:49 +02:00
Botond Dénes	c5b0e8a45a	test: move thread-safe test macro alternatives to lib/test_utils.hh Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200529130706.149603-2-bdenes@scylladb.com>	2020-05-31 16:08:02 +03:00
Botond Dénes	d68ac8bf18	treewide: remove all uses of no_reader_permit()	2020-05-28 11:34:35 +03:00
Botond Dénes	b5aa08ed77	sstables: pass valid permits to all internal reads We will soon require a valid permit for all reads, including low level index reads. The sstable layer has several internal reads which can not be associated with either the user or the system read semaphores or it would be very hard to obtain the correct semaphore, for limited/no gain. To be able to pass a valid permit still, we either expose a permit parameter so upper layers can pass down one, or create a local semaphore for these reads and use that to obtain a permit. The following methods now require a permit to be passed to them: * `sstables::sstable::read_data()`: only used in tests. The following methods use internal semaphores: * `sstables::sstable::generate_summary()` used when loading an sstable. * `sstables::sstable::has_partition_key()`: used by a REST API method.	2020-05-28 11:34:35 +03:00
Botond Dénes	734e995639	database: add compaction read concurrency semaphore All reads will soon require a valid permit, including those done during compaction. To allow creating valid permits for these reads create a compaction specific semaphore. This semaphore is unlimited as compaction concurrency is managed by a higher level layer; we use it just for resource usage accounting.	2020-05-28 11:34:35 +03:00
Botond Dénes	a08467da29	test: move away from reader_concurrency_semaphore::wait_admission() And use the reader_permit for this instead. This refactoring has revealed a pre-existing bug in the `test_lifecycle_policy`, which is also addressed in this patch. The bug is that said policy executes reader destructions in the background, and these are not waited for. For some reason, the semaphore -> permit transition pushes these races over the edge and we start seeing some of these destruction fibers still being unfinished when test scopes are exited, causing all sorts of trouble. The solution is to introduce a special gate that tests can use to wait for all background work to finish, before the test scope is exited.	2020-05-28 11:34:35 +03:00
Botond Dénes	9ede82ebf8	memtable: pass a valid permit to the delegate reader All readers are soon going to require a valid permit, so make sure we have a valid permit which we can pass to the delegate reader when creating it. This means `memtable::make_flat_reader()` now also requires a permit to be passed to it. Internally the permit is stored in `scanning_reader`, which is used both for flushes and normal reads. In the former case a permit is not required.	2020-05-28 11:34:35 +03:00
Botond Dénes	cc5137ffe3	table: require a valid permit to be passed to most read methods Now that the most prevalent users (range scan and single partition reads) all pass valid permits we require all users to do so and propagate the permit down towards `make_sstable_reader()`. The plan is to use this permit for restricting the sstable readers, instead of the semaphore the table is configured with. The various `make_streaming_*reader()` overloads keep using the internal semaphores as before, but they also create the permit before the read starts and pass it to `make_sstable_reader()`.	2020-05-28 11:34:35 +03:00
Botond Dénes	0ee58d1d47	test: lib/reader_permit.hh: add make_query_class_config() To be used by tests to obtain a query_class_config to pass to APIs that require one. The class config contains the test semaphore.	2020-05-28 11:34:35 +03:00
Botond Dénes	97af2d98d2	test: lib: introduce reader_permit.{hh,cc} This contains a reader concurrency semaphore for the tests, that they can use to obtain a valid permit for reads. Soon we are going to start working towards a point where all APIs taking a permit will require a valid one. Before we start this work we must ensure test code is able to obtain a valid permit.	2020-05-28 11:34:35 +03:00
Pavel Emelyanov	70391feb8e	storage_service: Tossing bits around The goal is to have main.cc add code between prepare_to_join and join_token_ring. As a side effect this drives us closer to proper split of storage service into sharded service itself vs start/boot/join code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-05-25 13:21:08 +03:00
Pavel Emelyanov	bb3a71529a	features: Get rid of per-features booleans The set of bool enable_something-s on feature_config duplicates the disabled_features set on it, so remove the former and make full use of the latter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-05-25 13:09:12 +03:00
Glauber Costa	7423ccc318	compaction_manager: allow early aborts through abort sources. The shutdown process of compaction manager starts with an explicit call from the database object. However that can only happen once everything is already initialized. This works well today, but I am soon to change the resharding process to operate before the node is fully ready. One can still stop the database in this case, but reshardings will have to finish before the abort signal is processed. This patch passes the existing abort source to the construction of the compaction_manager and subscribes to it. If the abort source is triggered, the compaction manager will react to it firing and all compactions it manages will be stopped. We still want the database object to be able to wait for the compaction manager, since the database is the object that owns the lifetime of the compaction manager. To make that possible we'll use a future that is returned from stop(): no matter what triggered the abort, either an early abort during initial resharding or a database-level event like drain, everything will shut down in the right order. The abort source is passed to the database, which is responsible for constructing the compaction manager. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-05-13 16:51:25 -04:00
Glauber Costa	e29701ca1c	compaction_manager: expand state to be able to differentiate between enabled and stopped We are having many issues with the stop code in the compaction_manager. Part of the reason is that the "stopped" state has its meaning overloaded to indicate both "compaction manager is not accepting compactions" and "compaction manager is not ready or destructed". In a later step we could default to enabled-at-start, but right now we maintain current behavior to minimize noise. It is only possible to stop the compaction manager once. It is possible to enable / disable the compaction manager many times. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-05-13 16:51:25 -04:00
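The split between "not accepting compactions" and "stopped for good" described in the commit above can be sketched as a small state machine. The class and member names here (`toy_compaction_manager`, `can_submit`, the `state` enumerators) are assumptions for illustration, not the actual compaction_manager code. The sketch starts in a not-enabled state, matching the message's note that enabled-at-start is left for later.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical sketch: enable/disable may be toggled many times,
// while stop is terminal and may happen only once.
class toy_compaction_manager {
public:
    enum class state { none, enabled, disabled, stopped };

    void enable() {
        throw_if_stopped();
        _state = state::enabled;
    }

    void disable() {
        throw_if_stopped();
        _state = state::disabled;
    }

    void stop() {
        throw_if_stopped();  // a second stop() is rejected
        _state = state::stopped;
    }

    // Only an enabled manager accepts new compactions.
    bool can_submit() const { return _state == state::enabled; }

    state current() const { return _state; }

private:
    void throw_if_stopped() const {
        if (_state == state::stopped) {
            throw std::logic_error("compaction manager already stopped");
        }
    }

    state _state = state::none;  // not enabled at start
};
```

Keeping `disabled` and `stopped` as separate enumerators is the design choice the message argues for: a caller can tell "temporarily not accepting work" apart from "shut down".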
Botond Dénes	e0f5ef5ef0	test: lib/sstable_utils: add make_keys_for_shard A variant of make_keys() which creates keys for the requested shard. As this version is more generic than the existing local_shards_only variant, the latter is reimplemented on top of the former.	2020-05-12 12:07:21 +03:00
Avi Kivity	5b971397aa	Revert "compaction_manager: allow early aborts through abort sources." This reverts commit `e8213fb5c3`. It results in an assertion failure in remove_index_file_test. Fixes #6413.	2020-05-10 12:32:18 +03:00
Glauber Costa	e8213fb5c3	compaction_manager: allow early aborts through abort sources. The shutdown process of compaction manager starts with an explicit call from the database object. However that can only happen once everything is already initialized. This works well today, but I am soon to change the resharding process to operate before the node is fully ready. One can still stop the database in this case, but reshardings will have to finish before the abort signal is processed. This patch passes the existing abort source to the construction of the compaction_manager and subscribes to it. If the abort source is triggered, the compaction manager will react to it firing and all compactions it manages will be stopped. We still want the database object to be able to wait for the compaction manager, since the database is the object that owns the lifetime of the compaction manager. To make that possible we'll use a future that is returned from stop(): no matter what triggered the abort, either an early abort during initial resharding or a database-level event like drain, everything will shut down in the right order. The abort source is passed to the database, which is responsible for constructing the compaction manager. Tests: unit (dev), manual start+stop, manual drain + stop Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20200506184749.98288-1-glauber@scylladb.com>	2020-05-07 13:24:47 +03:00
Avi Kivity	2b0c317dec	test: lib: exception_utils: fix crash with fmt-6.2.0 fmt, the formatting library we use, detects types with conversion to std::string_view (and formats them as strings) and types that support operator<<(std::ostream, const T&) (and performs custom formatting on them). However, if <fmt/ostream.h> is not included, the latter is not done. The problem happens with seastar::sstring, which implements both, and debug mode, which disables inlining. Some translation units do include <fmt/ostream.h>, and so generate code to do custom formatting. exception_utils.cc doesn't, and so generates code to format via string_view conversion. At link time, the compiler picks one of the generated functions and includes it in the final binary; it happened to pick one generated outside exception_utils.cc, using custom formatting. However, there is also code in fmt to encode which path fmt chose - string_view or custom. This code is constexpr and so is evaluated in exception_utils.cc. The result is that the function to perform formatting of seastar::sstring uses custom formatting, while the descriptor containing the method used says it is formatting via string_view. This is enough to cause a crash. The problem is limited to debug mode, since in other modes all this code is inlined, and so is consistent within the translation unit. We need a more general fix (hopefully in fmt), but for now a simple fix is to add the missing include. Ref https://github.com/fmtlib/fmt/issues/1662	2020-05-07 08:59:02 +03:00
Calle Wilund	08d069f78d	messaging_service: Use reloadable TLS certificates Changes messaging service rpc to use reloadable TLS certificates iff TLS is enabled. Note that this means that the service cannot start listening at construction time if TLS is active, and the user needs to call start_listen_ex to initialize and actually start the service. Since "normal" messaging service is actually started from gms, this route too is made a continuation.	2020-05-04 11:32:21 +00:00
Tomasz Grabiec	3e74dd4df3	sstables: Move all_sstable_versions to version.hh	2020-04-17 11:34:02 +02:00
Avi Kivity	88ade3110f	treewide: replace calls to engine().some_api() with some_api() This removes the need to include reactor.hh, a source of compile time bloat. In some places, the call is qualified with seastar:: in order to resolve ambiguities with a local name. Includes are adjusted to make everything compile. We end up having 14 translation units including reactor.hh, primarily for deprecated things like reactor::at_exit(). Ref #1	2020-04-05 12:46:04 +03:00
Avi Kivity	5e32ecb514	test: sstable-utils: deinline do_make_keys() This hides a call to engine_is_ready() which is only available in reactor.hh. Dependencies are adjusted so tests link. Ref #1.	2020-04-05 12:46:04 +03:00
Rafael Ávila de Espíndola	3f3634ece1	test: Use feature_config_from_db_config to setup feature_config This reduces code duplication and uses the same code path that is used in scylla itself. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200403170235.113558-1-espindola@scylladb.com>	2020-04-03 19:59:00 +02:00
Pekka Enberg	75b55cea88	Merge "Resharding through compact sstables" from Glauber " This patchseries is part of my effort to make resharding less special - and hopefully less problematic. The next steps are a bit heavy, so I'd like to, if possible, get this out of the way. After these two patches, there is no more need to ever call reshard_sstables: compact_sstables will do, and it will be able to recognize resharding compactions. To do that we need to unify the creator function, which is trivially done by adding a shard parameter to regular compactions as well: they can just ignore it. I have considered just making the compaction_descriptor have a virtual create() function and specializing it, but because we have to store the creator in the compaction object I decided to keep the virtual function for now. In a later cleanup step, if we can for instance store the entire compaction_descriptor object in the compaction object we could do that. Reviewed-by: Benny Halevy <bhalevy@scylladb.com> Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Reviewed-by: Botond Dénes <bdenes@scylladb.com> Tests: unit tests (dev), dtest (resharding.py) " * 'resharding-through-compact-sstables' of github.com:glommer/scylla: resharding: get rid of special reshard_sstables compaction: enhance compaction_descriptor with creator and replace function	2020-04-02 14:43:35 +02:00
Konstantin Osipov	9948f548a5	lwt: remove Paxos from experimental list Always enable lightweight transactions. Remove the check for the command line switch from the feature service, assuming LWT is always enabled. Remove the check for LWT from Alternator. Note that in order for the cluster to work with LWT, all nodes need to support it. Rename LWT to UNUSED in db/config.hh, to keep accepting lwt keyword in --experimental-features command line option, but do nothing with it. Changes in v2: * remove enable_lwt feature flag, it's always there Closes #6102 test: unit (dev, debug) Message-Id: <20200401071149.41921-1-kostja@scylladb.com>	2020-04-01 09:12:21 +02:00
Glauber Costa	e8801cd77b	compaction: enhance compaction_descriptor with creator and replace function There are many differences between resharding and compaction that are artificial, arising more from the way we ended up implementing it than necessity. This patch attempts to pass the creator and replacer functions through the compaction_descriptor. There is a difference between the creator function for resharding and regular compaction: resharding has to pass the shard number on behalf of which the SSTable is created. However regular compactions can just ignore this. No need to have a special path just for this. After this is done, the constructor for the compaction object can be greatly simplified. In further patches I intend to simplify it a bit further, but some more cleanup has to happen first. To make that happen we have to construct a compaction_descriptor object inside the resharding function. This is temporary: resharding currently works with a descriptor, but at some point that descriptor is lost and broken into pieces to be passed to this function. The overarching goal of this work is exactly to be able to keep that descriptor for as long as possible, which should simplify things a lot. Callers are patched, but there are plenty for sstable_datafile_test.cc. For their benefit, a helper function is provided to keep the previous signature (test only). Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-03-31 19:41:25 -04:00
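The idea of carrying the creator and replacer functions in the descriptor, with a shard parameter that regular compactions simply ignore, can be sketched like this. All names (`toy_compaction_descriptor`, `toy_sstable`, `run_compaction`) and signatures are hypothetical stand-ins chosen for illustration; the real compaction_descriptor differs.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

using shard_id = unsigned;
using toy_sstable = std::string;  // hypothetical stand-in for an sstable handle

struct toy_compaction_descriptor {
    std::vector<toy_sstable> inputs;
    // Creates an output sstable on behalf of `shard`. A regular compaction
    // can ignore the argument; a resharding one uses it.
    std::function<toy_sstable(shard_id)> creator;
    // Swaps the consumed inputs for the produced outputs.
    std::function<void(const std::vector<toy_sstable>& removed,
                       const std::vector<toy_sstable>& added)> replacer;
};

// A single code path serves both kinds of compaction: it never needs to
// know whether the creator cares about the shard.
std::vector<toy_sstable> run_compaction(const toy_compaction_descriptor& desc,
                                        const std::vector<shard_id>& output_shards) {
    std::vector<toy_sstable> outputs;
    for (shard_id s : output_shards) {
        outputs.push_back(desc.creator(s));
    }
    desc.replacer(desc.inputs, outputs);
    return outputs;
}
```

With this shape, a resharding caller would supply a creator that encodes the shard in its output, while a regular caller supplies one that discards it, and both go through `run_compaction` unchanged.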
Piotr Jastrzebski	c44f019eee	dummy_sharder: rename dummy_sharding_info.* to dummy_sharder.* Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-30 18:42:33 +02:00
Piotr Jastrzebski	e72696a8e6	sharding_info: rename the class to sharder Also rename all variables that were named si or sinfo to sharder. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-30 18:42:33 +02:00
Piotr Jastrzebski	2e850421a0	i_partitioner:remove embedded sharding_info sharding_info embedded into partitioner is no longer used anywhere and can be removed. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-30 18:42:33 +02:00
Piotr Jastrzebski	14ad965733	sstable-utils: use sharding_info::shard_of Create sharding_info with the same parameters as the partitioner and use it instead of the partitioner. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-30 18:42:33 +02:00
Piotr Jastrzebski	dc2e060313	create_token_range_from_keys: use sharding info for shard_of Replace i_partitioner::shard_of with sharding_info::shard_of Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-30 18:42:33 +02:00
Piotr Jastrzebski	41591f15d2	tests: rename dummy_partitioner.* to dummy_sharding_info.* dummy_partitioner was renamed to dummy_sharding_info in the previous patch. This patch cleans up the names of files. It's done in a separate patch to not obstruct the diff of previous patch. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-30 18:42:33 +02:00
Piotr Jastrzebski	031f589dba	multishard_combining_reader: use token_for_next_shard from sharding info not partitioner Previously this function was accessing sharding logic through partitioner obtained from the schema. While converting tests, dummy_partitioner is turned into dummy_sharding_info. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-03-30 18:42:25 +02:00
Rafael Ávila de Espíndola	c5795e8199	everywhere: Replace engine().cpu_id() with this_shard_id() This is a bit simpler and might allow removing a few includes of reactor.hh. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200326194656.74041-1-espindola@scylladb.com>	2020-03-27 11:40:03 +03:00
Botond Dénes	ec36c7cb2f	test: random_schema: remove redundant gc grace period from tombstone expiry Compaction automatically adds gc grace period to expiry times already, no need to add it when creating the tombstones. Remove the redundant additions from the code. The direct impact is really minor as this is only used in tests, but it might confuse readers who are looking at how tombstones are created across the codebase. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200323120948.92104-1-bdenes@scylladb.com>	2020-03-23 15:12:25 +02:00
Avi Kivity	0d885dbb00	Merge "Make all headers standalone" from Botond " Make sure all headers compile on their own, without requiring any additional includes externally. Even though this requirement is not documented in our coding guides it is still quasi enforced and we semi-regularly get and merge patches adding missing includes to headers. This patch-set fixes all headers and adds a `{mode}-headers` target that can be used to verify each header. This target should be built by promotion to ensure no new non-conforming code sneaks in. Individual headers can be verified using the `build/dev/path/to/header.hh.o` target, that is generated for every header. The majority of the headers was just missing `seastarx.hh`. I think we should just include this via a compiler flag to remove the noise from our code (in a followup). " * 'compiling-headers/v2' of https://github.com/denesb/scylla: configure.py: add {mode}-headers phony target treewide: add missing headers and/or forward declarations test/boost/sstable_test.hh: move generic stuff to test/lib/sstable_utils.hh sstables: size_tiered_backlog_tracker: move methods out-of-line sstables: date_tiered_compaction_strategy.hh: move methods out-of-line	2020-03-23 13:09:09 +02:00
Botond Dénes	e0284bb9ee	treewide: add missing headers and/or forward declarations	2020-03-23 09:29:45 +02:00
Botond Dénes	575466b2cf	test/boost/sstable_test.hh: move generic stuff to test/lib/sstable_utils.hh sstable_test.hh started as collection of utilities shared between the various `_sstable_test.cc` files. Predictably other tests started using it as well, among them some that are non boost unit tests. This poses a problem as if we add the missing boost/test/unit_test.hpp include to sstable_test.hh these tests will suddenly have missing symbols from boost::test. To avoid linking boost::test into all these users, extract utilities more widely used into sstable_utils.hh	2020-03-23 09:29:45 +02:00

138 Commits