scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 04:06:59 +00:00

Author	SHA1	Message	Date
Botond Dénes	0e78399051	test/lib: migrate off the global test reader semaphore	2021-07-08 15:28:39 +03:00
Botond Dénes	5fff314739	test/lib/simple_schema: migrate off the global test reader semaphore	2021-07-08 15:28:39 +03:00
Botond Dénes	d520655730	test/lib/sstable_utils: migrate off the global test reader semaphore	2021-07-08 15:28:39 +03:00
Botond Dénes	3679418e62	test/lib/test_services: migrate off the global test reader semaphore	2021-07-08 15:28:39 +03:00
Botond Dénes	0acc4d63da	test/lib/sstable_test_env: add reader_concurrency_semaphore member To enable tests using the test env to conveniently create permits for themselves, reducing the pain of migrating to local semaphores.	2021-07-08 15:28:39 +03:00
Botond Dénes	7174d1beee	test/lib/cql_test_env: add make_reader_permit() A convenience method, allowing tests using the cql test env to conveniently create a permit, reducing the pain of migrating to local semaphores.	2021-07-08 15:28:39 +03:00
Botond Dénes	b739525fb6	test/lib: add reader_concurrency_semaphore.hh Supplying a convenience semaphore wrapper, which stops the contained semaphore when destroyed. It also provides a more convenient `make_permit()`. This class is intended to make the migration to local semaphores less painful.	2021-07-08 15:28:36 +03:00
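The wrapper described in b739525fb6 can be pictured roughly as follows. This is a minimal sketch, not the contents of test/lib/reader_concurrency_semaphore.hh: `toy_semaphore`, `toy_permit`, `test_semaphore_wrapper` and `stopped_by_wrapper()` are hypothetical stand-ins, and stopping is modelled as a synchronous flag rather than an asynchronous operation.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a reader concurrency semaphore. It only tracks
// its name, the number of outstanding permits and whether it was stopped.
class toy_semaphore {
    std::string _name;
    int _outstanding = 0;
    bool _stopped = false;
public:
    // The name is mandatory, so a timeout can be traced back to its semaphore.
    explicit toy_semaphore(std::string name) : _name(std::move(name)) {}
    void acquire() { assert(!_stopped); ++_outstanding; }
    void release() { --_outstanding; }
    void stop() { _stopped = true; }
    bool stopped() const { return _stopped; }
    int outstanding() const { return _outstanding; }
    const std::string& name() const { return _name; }
};

// Counts semaphores stopped by a wrapper destructor (illustration only).
inline int& stopped_by_wrapper() { static int n = 0; return n; }

// Move-only permit that returns its unit to the semaphore when destroyed.
class toy_permit {
    toy_semaphore* _sem;
public:
    explicit toy_permit(toy_semaphore& sem) : _sem(&sem) { _sem->acquire(); }
    toy_permit(toy_permit&& o) noexcept : _sem(o._sem) { o._sem = nullptr; }
    toy_permit(const toy_permit&) = delete;
    toy_permit& operator=(const toy_permit&) = delete;
    ~toy_permit() { if (_sem) { _sem->release(); } }
};

// Owns a semaphore, hands out permits, and stops the semaphore on destruction.
class test_semaphore_wrapper {
    toy_semaphore _sem;
public:
    explicit test_semaphore_wrapper(std::string name) : _sem(std::move(name)) {}
    ~test_semaphore_wrapper() { _sem.stop(); ++stopped_by_wrapper(); }
    toy_permit make_permit() { return toy_permit(_sem); }
    toy_semaphore& semaphore() { return _sem; }
};
```

A test would keep one wrapper per test case and call `make_permit()` wherever a permit is needed; the semaphore is stopped when the wrapper goes out of scope, so no explicit cleanup is required.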
Botond Dénes	46d21e842d	test/lib/reader_lifecycle_policy: add permit parameter to factory function The factory method doesn't match the signature of `reader_lifecycle_policy::make_reader()`, notably the permit is missing. Add it as it is important that the wrapping evictable reader and underlying reader share the permits.	2021-07-08 12:31:36 +03:00
Botond Dénes	c4e71fb9b8	reader_concurrency_semaphore: remove default name parameter Naming the concurrency semaphore is currently optional, unnamed semaphores defaulting to "Unnamed semaphore". Although the most important semaphores are named, many still aren't, which makes for a poor debugging experience when one of these times out. To prevent this, remove the name parameter defaults from those constructors that have it and require a unique name to be passed in. Also update all sites creating a semaphore and make sure they use a unique name.	2021-07-08 12:31:36 +03:00
Raphael S. Carvalho	1924e8d2b6	treewide: Move compaction code into a new top-level compaction dir Since compaction is layered on top of sstables, let's move all compaction code into a new top-level directory. This change will give me extra motivation to remove all layer violations, like sstable calling compaction-specific code, and compaction entanglement with other components like table and storage service. Next steps: - remove all layer violations - move compaction code in sstables namespace into a new one for compaction. - move compaction unit tests into its own file Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>	2021-07-07 23:21:51 +03:00
Avi Kivity	99d5355007	Merge "Cache sstable indexes in memory" from Tomasz " The main goal of this series is to improve efficiency of reads from large partitions by reducing amount of I/O needed to read the sstable index. This is achieved by caching index file pages and partition index entries in memory. Currently, the pages are cached by individual reads only for the duration of the read. This was done to facilitate binary search in the promoted index (intra-partition index). After this series, all reads share the index file page cache, which stays around even after reads stop. The page cache is subject to eviction. It uses the same region as the current row cache and shares the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes an easy approach and does this by introducing a virtual base class. This adds an overhead to row cache entry to store the vtable pointer. SStable indexes have a hierarchy. There is a summary, which is a sparse partition key index into the full partition index. This one is already kept in memory. The partition index is divided by the summary into pages. Each entry in the partition index contains promoted index, which is a sparse index into atoms identified by the clustering key (rows, tombstones). In order to read the promoted index, the reader needs to read the partition index entry first. To speed this up, this series also adds caching of partition index entries. This cache survives reads and is subject to eviction, just like the index file page cache. The unit of caching is the partition index page. Without this cache, each access to promoted index would have to be preceded with the parsing of the partition index page containing the partition key. Performance testing results follow. 
1) scylla-bench large partition reads

Populated with:

perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \
    -c1 -m1G --populate --value-size=1024 --rows=10000000

Single partition, 9G data file, 4MB index file

Test execution:

build/release/scylla -c1 -m4G
scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \
    -clustering-row-count 10000000 -duration 60m

TL;DR: after: 2x throughput, 0.5 median latency

Before (`c1daf2bb24`):

Results
Time (avg): 5m21.033180213s
Total ops: 966951
Total rows: 966951
Operations/s: 3011.997048812112
Rows/s: 3011.997048812112
Latency:
  max: 74.055679ms
  99.9th: 63.569919ms
  99th: 41.320447ms
  95th: 38.076415ms
  90th: 37.158911ms
  median: 34.537471ms
  mean: 33.195994ms

After:

Results
Time (avg): 5m14.706669345s
Total ops: 2042831
Total rows: 2042831
Operations/s: 6491.22243800942
Rows/s: 6491.22243800942
Latency:
  max: 60.096511ms
  99.9th: 35.520511ms
  99th: 27.000831ms
  95th: 23.986175ms
  90th: 21.659647ms
  median: 15.040511ms
  mean: 15.402076ms

2) scylla-bench small partitions

I tested several scenarios with a varying data set size, e.g. data fully fitting in memory, half fitting, and being much larger. The improvement varied a bit but in all cases the "after" code performed slightly better. Below is a representative run over data set which does not fit in memory.

scylla -c1 -m4G
scylla-bench -workload uniform -mode read -concurrency 400 -partition-count 10000000 \
    -clustering-row-count 1 -duration 60m -no-lower-bound

Before:

Time (avg): 51.072411913s
Total ops: 3165885
Total rows: 3165885
Operations/s: 61988.164024260645
Rows/s: 61988.164024260645
Latency:
  max: 34.045951ms
  99.9th: 25.985023ms
  99th: 23.298047ms
  95th: 19.070975ms
  90th: 17.530879ms
  median: 3.899391ms
  mean: 6.450616ms

After:

Time (avg): 50.232410679s
Total ops: 3778863
Total rows: 3778863
Operations/s: 75227.58014424688
Rows/s: 75227.58014424688
Latency:
  max: 37.027839ms
  99.9th: 24.805375ms
  99th: 18.219007ms
  95th: 14.090239ms
  90th: 12.124159ms
  median: 4.030463ms
  mean: 5.315111ms

The results include the warmup phase which populates the partition index cache, so the hot-cache effect is dampened in the statistics. See the 99th percentile. Latency gets better after the cache warms up which moves it lower.

3) perf_fast_forward --run-tests=large-partition-skips

Caching is not used here, included to show there are no regressions for the cold cache case.
TL;DR: No significant change

perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G

Config: rows: 10000000, value size: 2000

Before:

read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
1 0 36.429822 4 10000000 274500 62 274521 274429 153889.2 153883 19696986 153853 0 0 0 0 0 0 0 22.5%
1 1 36.856236 4 5000000 135662 7 135670 135650 155652.0 155652 19704117 139326 1 0 1 1 0 0 0 38.1%
1 8 36.347667 4 1111112 30569 0 30570 30569 155652.0 155652 19704117 139071 1 0 1 1 0 0 0 19.5%
1 16 36.278866 4 588236 16214 1 16215 16213 155652.0 155652 19704117 139073 1 0 1 1 0 0 0 16.6%
1 32 36.174784 4 303031 8377 0 8377 8376 155652.0 155652 19704117 139056 1 0 1 1 0 0 0 12.3%
1 64 36.147104 4 153847 4256 0 4256 4256 155652.0 155652 19704117 139109 1 0 1 1 0 0 0 11.1%
1 256 9.895288 4 38911 3932 1 3933 3930 100869.2 100868 3178298 59944 38912 0 1 1 0 0 0 14.3%
1 1024 2.599921 4 9757 3753 0 3753 3753 26604.0 26604 801850 15071 9758 0 1 1 0 0 0 14.6%
1 4096 0.784568 4 2441 3111 1 3111 3109 7982.0 7982 205946 3772 2442 0 1 1 0 0 0 13.8%
64 1 36.553975 4 9846154 269359 10 269369 269337 155663.8 155652 19704117 139230 1 0 1 1 0 0 0 28.2%
64 8 36.509694 4 8888896 243467 8 243475 243449 155652.0 155652 19704117 139120 1 0 1 1 0 0 0 26.5%
64 16 36.466282 4 8000000 219381 4 219385 219374 155652.0 155652 19704117 139232 1 0 1 1 0 0 0 24.8%
64 32 36.395926 4 6666688 183171 6 183180 183165 155652.0 155652 19704117 139158 1 0 1 1 0 0 0 21.8%
64 64 36.296856 4 5000000 137753 4 137757 137737 155652.0 155652 19704117 139105 1 0 1 1 0 0 0 17.7%
64 256 20.590392 4 2000000 97133 18 97151 94996 135248.8 131395 7877402 98335 31282 0 1 1 0 0 0 15.7%
64 1024 6.225773 4 588288 94492 1436 95434 88748 46066.5 41321 2324378 30360 9193 0 1 1 0 0 0 15.8%
64 4096 1.856069 4 153856 82893 54 82948 82721 16115.0 16043 583674 11574 2675 0 1 1 0 0 0 16.3%

After:

read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
1 0 36.429240 4 10000000 274505 38 274515 274417 153887.8 153883 19696986 153849 0 0 0 0 0 0 0 22.4%
1 1 36.933806 4 5000000 135377 15 135385 135354 155658.0 155658 19704085 139398 1 0 1 1 0 0 0 40.0%
1 8 36.419187 4 1111112 30509 2 30510 30507 155658.0 155658 19704085 139233 1 0 1 1 0 0 0 22.0%
1 16 36.353475 4 588236 16181 0 16182 16181 155658.0 155658 19704085 139183 1 0 1 1 0 0 0 19.2%
1 32 36.251356 4 303031 8359 0 8359 8359 155658.0 155658 19704085 139120 1 0 1 1 0 0 0 14.8%
1 64 36.203692 4 153847 4249 0 4250 4249 155658.0 155658 19704085 139071 1 0 1 1 0 0 0 13.0%
1 256 9.965876 4 38911 3904 0 3906 3904 100875.2 100874 3178266 60108 38912 0 1 1 0 0 0 17.9%
1 1024 2.637501 4 9757 3699 1 3700 3697 26610.0 26610 801818 15071 9758 0 1 1 0 0 0 19.5%
1 4096 0.806745 4 2441 3026 1 3027 3024 7988.0 7988 205914 3773 2442 0 1 1 0 0 0 18.3%
64 1 36.611243 4 9846154 268938 5 268942 268921 155669.8 155705 19704085 139330 2 0 1 1 0 0 0 29.9%
64 8 36.559471 4 8888896 243135 11 243156 243124 155658.0 155658 19704085 139261 1 0 1 1 0 0 0 28.1%
64 16 36.510319 4 8000000 219116 15 219126 219101 155658.0 155658 19704085 139173 1 0 1 1 0 0 0 26.3%
64 32 36.439069 4 6666688 182954 9 182964 182943 155658.0 155658 19704085 139274 1 0 1 1 0 0 0 23.2%
64 64 36.334808 4 5000000 137609 11 137612 137596 155658.0 155658 19704085 139258 2 0 1 1 0 0 0 19.1%
64 256 20.624759 4 2000000 96971 88 97059 92717 138296.0 131401 7877370 98332 31282 0 1 1 0 0 0 17.2%
64 1024 6.260598 4 588288 93967 1429 94905 88051 45939.5 41327 2324346 30361 9193 0 1 1 0 0 0 17.8%
64 4096 1.881338 4 153856 81780 140 81920 81520 16109.8 16092 582714 11617 2678 0 1 1 0 0 0 18.2%

4) perf_fast_forward --run-tests=large-partition-slicing

Caching enabled, each line shows the median run from many iterations

TL;DR: We can observe reduction in IO which translates to reduction in execution time, especially for slicing in the middle of partition.

perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases

Config: rows: 10000000, value size: 2000

Before:

offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
0 1 0.000491 127 1 2037 24 2109 127 4.0 4 128 2 2 0 1 1 0 0 0 157 80 3058208 15.0%
0 32 0.000561 1740 32 56995 410 60031 47208 5.0 5 160 3 2 0 1 1 0 0 0 386 111 113353 17.5%
0 256 0.002052 488 256 124736 7111 144762 89053 16.6 17 672 14 2 0 1 1 0 0 0 2113 446 52669 18.6%
0 4096 0.016437 61 4096 249199 692 252389 244995 69.4 69 8640 57 5 0 1 1 0 0 0 26638 1717 23321 22.4%
5000000 1 0.002171 221 1 461 2 466 221 25.0 25 268 3 3 0 1 1 0 0 0 638 376 14311524 10.2%
5000000 32 0.002392 404 32 13376 48 13528 13015 27.0 27 332 5 3 0 1 1 0 0 0 931 432 489691 11.9%
5000000 256 0.003659 279 256 69967 764 73130 52563 39.5 41 780 19 3 0 1 1 0 0 0 2689 825 93756 15.8%
5000000 4096 0.018592 55 4096 220313 433 234214 218803 94.2 94 9484 62 9 0 1 1 0 0 0 27349 2213 26562 21.0%

After:

offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
0 1 0.000229 115 1 4371 85 4585 115 2.1 2 64 1 1 1 0 0 0 0 0 90 31 1314749 22.2%
0 32 0.000277 2174 32 115674 1015 128109 14144 3.0 3 96 2 1 1 0 0 0 0 0 319 62 52508 26.1%
0 256 0.001786 576 256 143298 5534 179142 113715 14.7 17 544 15 1 1 0 0 0 0 0 2110 453 45419 21.4%
0 4096 0.015498 61 4096 264289 2006 268850 259342 67.4 67 8576 59 4 1 0 0 0 0 0 26657 1738 22897 23.7%
5000000 1 0.000415 233 1 2411 15 2456 234 4.1 4 128 2 2 1 0 0 0 0 0 199 72 2644719 16.8%
5000000 32 0.000635 1413 32 50398 349 51149 46439 6.0 6 192 4 2 1 0 0 0 0 0 458 128 125893 18.6%
5000000 256 0.002028 486 256 126228 3024 146327 82559 17.8 18 1024 13 4 1 0 0 0 0 0 2123 385 51787 19.6%
5000000 4096 0.016836 61 4096 243294 814 263434 241660 73.0 73 9344 62 8 1 0 0 0 0 0 26922 1920 24389 22.4%

Future work:

- Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache which may reduce the hit ratio.
- Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size.
- Disable cache population for "bypass cache" reads
- Add a switch to disable sstable index caching, per-node, maybe per-table
- Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in the partition index page can be hot.
- Add heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by disk's bandwidth it's wasteful to read the front of promoted index using 32K IO, better use 4K which should cover the partition entry and then let binary search read the rest.

In V2:

- Fixed perf_fast_forward regression in the number of IOs used to read partition index page. The reader uses 32K reads, which were split by page cache into 4K reads. Fix by propagating IO size hints to page cache and using single IO to populate it. New patch: "cached_file: Issue single I/O for the whole read range on miss"
- Avoid large allocations to store partition index page entries (due to managed_vector storage). There is a unit test which detects this and fails. Fixed by implementing chunked_managed_vector, based on chunked_vector.
- fixed bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks
- Simplify region_impl::free_buf() according to Avi's suggestions
- Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind
- Workaround sigsegv which was most likely due to coroutine miscompilation. Worked around by manipulating local object scope.
- Wire up system/drop_sstable_caches RESTful API
- Fix use-after-move on permit for the old scanning ka/la index reader
- Fixed more cases of double open_data() in tests leading to assert failure
- Adjusted cached_file class doc to account for changes in behavior.
- Rebased

Fixes #7079. Refs #363. "

* tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits)
  api: Drop sstable index caches on system/drop_sstable_caches
  cached_file: Issue single I/O for the whole read range on miss
  row_cache: cache_tracker: Do not register metrics when constructed for tests
  sstables, cached_file: Evict cache gently when sstable is destroyed
  sstables: Hide partition_index_cache implementation away from sstables.hh
  sstables: Drop shared_index_lists alias
  sstables: Destroy partition index cache gently
  sstables: Cache partition index pages in LSA and link to LRU
  utils: Introduce lsa::weak_ptr<>
  sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
  sstables, cached_file: Avoid copying buffers from cache when parsing promoted index
  cached_file: Introduce get_page_units()
  sstables: read: Document that primitive_consumer::read_32() is alloc-free
  sstables: read: Count partition index page evictions
  sstables: Drop the _use_binary_search flag from index entries
  sstables: index_reader: Keep index objects under LSA
  lsa: chunked_managed_vector: Adapt more to managed_vector
  utils: lsa: chunked_managed_vector: Make LSA-aware
  test: chunked_managed_vector_test: Make exception_safe_class standard layout
  lsa: Copy chunked_vector to chunked_managed_vector
  ...	2021-07-07 18:17:10 +03:00
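The "virtual base class" approach from the 99d5355007 merge message, where row cache entries and index pages share one LRU, might look like the sketch below. All names (`evictable`, `lru`, `row_cache_entry`, `index_page`) are hypothetical, and a `std::list` of pointers is used only for brevity; this does not reproduce how the series actually links entries or manages memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Common base of everything that can sit in the shared LRU. The virtual
// function is what costs each entry a vtable pointer.
struct evictable {
    virtual ~evictable() = default;
    virtual void on_evicted() = 0;
};

// One LRU shared by all entry kinds; front of the list = least recently used.
class lru {
    std::list<evictable*> _items;
public:
    void add(evictable& e) { _items.push_back(&e); }
    // Mark an entry as most recently used.
    void touch(evictable& e) {
        auto it = std::find(_items.begin(), _items.end(), &e);
        assert(it != _items.end() && "touch() requires a linked entry");
        _items.splice(_items.end(), _items, it);
    }
    // Evict the least recently used entry, whatever its concrete type.
    bool evict_one() {
        if (_items.empty()) {
            return false;
        }
        evictable* victim = _items.front();
        _items.pop_front();
        victim->on_evicted();
        return true;
    }
    std::size_t size() const { return _items.size(); }
};

// Two unrelated entry kinds; each records its eviction in a shared log.
struct row_cache_entry : evictable {
    std::vector<std::string>& log;
    explicit row_cache_entry(std::vector<std::string>& l) : log(l) {}
    void on_evicted() override { log.push_back("row_cache_entry"); }
};

struct index_page : evictable {
    std::vector<std::string>& log;
    explicit index_page(std::vector<std::string>& l) : log(l) {}
    void on_evicted() override { log.push_back("index_page"); }
};
```

Because eviction order depends only on recency, a hot index page can outlive a cold row and vice versa, which is the trade-off the "Future work" list mentions for non-uniform workloads.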
Pavel Solodovnikov	b959f5d394	test: lib: copy `query_options` in `single_node_cql_env::execute_cql()` `query_processor::execute_direct()` takes a non-const ref to query options, meaning it's not safe to pass the same instance to subsequent invocations of `execute_direct()` in the tests. Copy default query options at each invocation of `execute_cql()` so no possible side-effects can occur. Tests: unit(dev, debug) Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Message-Id: <20210705094824.243573-2-pa.solodovnikov@scylladb.com>	2021-07-07 11:46:50 +03:00
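The hazard fixed in b959f5d394 can be illustrated with a small sketch. `toy_query_options`, `execute_direct()` and `toy_cql_env` here are hypothetical stand-ins, not the real `query_options`, `query_processor` or `single_node_cql_env`; the only point is that a callee taking options by non-const reference may change them, so each call gets its own copy.

```cpp
#include <cassert>
#include <string>

// Hypothetical options object with state a callee might modify.
struct toy_query_options {
    int page_size = 100;
    bool consumed = false;
};

// Stand-in for a callee that takes the options by non-const reference
// and leaves side effects behind.
inline int execute_direct(const std::string& query, toy_query_options& opts) {
    opts.consumed = true;
    opts.page_size = 1;
    return static_cast<int>(query.size());
}

class toy_cql_env {
    toy_query_options _default_options;
public:
    int execute_cql(const std::string& query) {
        // Copy the defaults for every invocation, so side effects of one
        // call cannot leak into the next.
        toy_query_options opts = _default_options;
        assert(!opts.consumed);
        return execute_direct(query, opts);
    }
    const toy_query_options& default_options() const { return _default_options; }
};
```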
Tomasz Grabiec	2b673478aa	sstables: index_reader: Do not expose index_entry references index_entry will be an LSA-managed object. Those have to be accessed with care, with the LSA region locked. This patch hides most of direct index_entry accesses inside the index_reader so that users are safe.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	a955e7971d	sstables: index_reader: Don't store schema reference inside index_entry To save space.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	a5c72ed899	sstables, database: Keep cache_tracker reference inside sstables_manager So that sstable code can pick it up for caching (lru and region).	2021-07-02 10:25:58 +02:00
Konstantin Osipov	bd410da77a	raft: (service) rename raft_services service to raft_group_registry This is a more informative name. Helps see that, say, group0 is a separate service and not bundle all raft services together. Message-Id: <20210619211412.3035835-3-kostja@scylladb.com>	2021-06-21 14:53:54 +03:00
Konstantin Osipov	025f18325e	raft: (service) move raft service to namespace service Message-Id: <20210619211412.3035835-2-kostja@scylladb.com>	2021-06-21 14:53:54 +03:00
Avi Kivity	0948908502	Merge "mutation_reader: multishard_combining_reader clean-up close path" from Botond " The close path of the multishard combining reader is riddled with workarounds for the fact that the flat mutation reader couldn't wait on futures when destroyed. Now that we have a close() method that can do just that, all these workarounds can be removed. Even more workarounds can be found in tests, where resources like the reader concurrency semaphore are created separately for each tested multishard reader and then destroyed after it doesn't need it, so we had to come up with all sorts of creative and ugly workarounds to keep these alive until background cleanup is finished. This series fixes all this. Now, after calling close on the multishard reader, all resources it used, including the life-cycle policy and the semaphores created by it, can be safely destroyed. This greatly simplifies the handling of the multishard reader, and makes it much easier to reason about life-cycle dependencies. 
Tests: unit(dev, release:v2, debug:v2, mutation_reader_test:debug -t test_multishard, multishard_mutation_query_test:debug, multishard_combining_reader_as_mutation_source:debug) " * 'multishard-combining-reader-close-cleanup/v3' of https://github.com/denesb/scylla: mutation_reader: reader_lifecycle_policy: remove convenience methods mutation_reader: multishard_combining_reader: store shard_reader via unique ptr test/lib/reader_lifecycle_policy: destroy_reader: cleanup context test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore test/lib/reader_lifecycle_policy: use a more robust eviction mechanism reader_concurrency_semaphore: wait for all permits to be destroyed in stop() test/lib/reader_lifcecycle_policy: fix indentation mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard reader_lifecycle_policy implementations: fix indentation mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter mutation_reader: shard_reader::close(): wait on the remote reader multishard_mutation_query: destroy remote parts in the foreground mutation_reader: shard_reader::close(): close _reader mutation_reader: reader_lifcecycle_policy::destroy_reader(): remove out-of-date comment	2021-06-16 17:25:50 +03:00
Botond Dénes	b4e69cf63d	test/lib/test_utils: require(): also log failed conditions Currently `require()` throws an exception when the condition fails. The problem with this is that the error is only printed at the end of the test, with no trace in the logs on where exactly it happened, compared to other logged events. This patch also adds an error-level log line to address this. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20210616065711.46224-1-bdenes@scylladb.com>	2021-06-16 12:05:25 +03:00
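The behavior described in b4e69cf63d, logging at the point of failure and then throwing, can be sketched as below. The signature is an assumption made for illustration: it logs to a plain `std::ostream` and takes a textual description, which is not necessarily how test/lib/test_utils declares `require()`.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Sketch of a require() that leaves a trace in the log where the failure
// happened, and still reports it to the test framework via an exception.
inline void require(bool condition, const std::string& description, std::ostream& log) {
    assert(!description.empty()); // a failed requirement should say what failed
    if (condition) {
        return;
    }
    log << "ERROR require() failed: " << description << '\n';
    throw std::runtime_error("require() failed: " + description);
}

// Helper for callers that want a boolean instead of an exception.
inline bool require_throws(bool condition, const std::string& description, std::ostream& log) {
    try {
        require(condition, description, log);
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

With this shape, the error line is interleaved with other logged events in order, so the failure can be located relative to them.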
Botond Dénes	a69db31b5c	test/lib/reader_lifecycle_policy: destroy_reader: cleanup context Now that we don't rely on any external machinery to keep the relevant parts of the context alive until needed as its life-cycle is effectively enclosed in that of the life-cycle policy itself, we can cleanup the context in `destroy_reader()` itself, avoiding a background trip back to this shard.	2021-06-16 11:29:36 +03:00
Botond Dénes	d2ddaced4e	test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds The lifecycle of the reader lifecycle policy and all the resources the reads use is now enclosed in that of the multishard reader thanks to its close() method. We can now remove all the workarounds we had in place to keep different resources alive until background reader cleanup finishes.	2021-06-16 11:29:36 +03:00
Botond Dénes	5a271e42a5	test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore So that when this method returns the semaphore is safe to destroy. This in turn will enable us to get rid of all the machinery we have in place to deal with the semaphore having to out-live the lifecycle policy without a clear time as to when it can be safe to destroy.	2021-06-16 11:29:36 +03:00
Botond Dénes	c09c62a0fb	test/lib/reader_lifecycle_policy: use a more robust eviction mechanism The test reader lifecycle policy has a mode in which it wants to ensure all inactive readers are evicted, so tests can stress reader recreation logic. For this it currently employs a trick of creating a waiter on the semaphore. I don't even know how this works (or if it even does) but it sure complicates the lifecycle policy code a lot. So switch to the much more reliable and simple method of creating the semaphore with a single count and no memory. This ensures that all inactive reads are immediately evicted, while still allowing a single read to be admitted at all times.	2021-06-16 11:29:36 +03:00
Botond Dénes	a10a6e253e	test/lib/reader_lifcecycle_policy: fix indentation Left broken from the previous patch.	2021-06-16 11:29:36 +03:00
Botond Dénes	8c7447effd	mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard Currently shard_reader::close() (its caller) goes to the remote shard, copies back all fragments left there to the local shard, then calls `destroy_reader()`, which in the case of the multishard mutation query copies it all back to the native shard. This was required before because `shard_reader::stop()` (`close()`'s predecessor) couldn't wait on `smp::submit_to()`. But close can, so we can get rid of all this back-and-forth and just call `destroy_reader()` on the shard the reader lives on, just like we do with `create_reader()`.	2021-06-16 11:29:35 +03:00
Botond Dénes	4ecf061c90	reader_lifecycle_policy implementations: fix indentation Left broken from the previous patch.	2021-06-16 11:21:38 +03:00
Botond Dénes	a7e59d3e2c	mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter The shard reader is now able to wait on the stopped reader and pass the already stopped reader to `destroy_reader()`, so we can de-futurize the reader parameter of said method. The shard reader was already patched to pass a ready future so adjusting the call-site is trivial. The most prominent implementation, the multishard mutation query, can now also drop its `_dismantling_gate` which was put in place so it can wait on the background stopping of readers. A consequence of this move is that errors that might happen while stopping the reader are now handled in the shard reader, not in every lifecycle policy implementation.	2021-06-16 11:21:38 +03:00
Tomasz Grabiec	3fcd1f43ba	tests: mutation_source_test: Run tests with conversions inserted in the middle	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	cddcba27de	tests: mutation_source_tests: Unroll run_flat_mutation_reader_tests() All readers are now flat so there is no need for this grouping. Will be needed for the next patch, which needs a single function with all test cases.	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	ffb616fef6	tests: Add tests for flat_mutation_reader_v2	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	ed055db63e	tests: flat_mutation_reader_assertions_v2: Adapt to the v2 stream	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	276c68c867	Clone flat_reader_assertions into flat_reader_assertions_v2	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	a13e7b30b7	test: lib: simple_schema: Reuse new_tombstone()	2021-06-16 00:23:49 +02:00
Tomasz Grabiec	7e01679c99	test: lib: simple_schema: Accept tombstone in delete_range()	2021-06-16 00:23:49 +02:00
Nadav Har'El	3645c7104b	Merge: Wrap alternator start-stop into controller Merged patch series by Pavel Emelyanov: Alternator start and stop code is sitting inside the main() and it's a big piece of code out there. Having it all in main complicates rework of start-stop sequences, it's much more handy to have it in alternator/. This set puts the mentioned code into transport- and thrift- like controller model. While doing it one more call for global storage service goes away. * 'br-alternator-clientize' of https://github.com/xemul/scylla: alternator: Move start-stop code into controller alternator: Move the whole starting code into a sched group alternator: Dont capture db, use cfg alternator: Controller skeleton alternator: Controller basement alternator: Drop storage service from executor	2021-06-14 15:44:10 +03:00
Michael Livshin	2bbc293e22	tests: improve error reporting of test_env::reusable_sst() Distinguish the "no such sstable" case from any reading errors. While at it, coroutinize the function. Refs #8785. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20210610113304.264922-1-michael.livshin@scylladb.com>	2021-06-11 19:06:43 +02:00
Pavel Emelyanov	773d2fe2a4	alternator: Drop storage service from executor It's completely unused in it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-06-11 18:05:11 +03:00
Tomasz Grabiec	419ee84d86	Merge "sstable: validate first and last keys ordering" from Benny In #8772, an assert validating first token <= last token failed in leveled_manifest::overlapping. It is unclear how we got to that state, so add validation in sstable::set_first_and_last_keys() that the to-be-set first and last keys are well ordered. Otherwise, throw malformed_sstable_exception. set_first_and_last_keys is called both on the write path from the sstable writer before the sstable is sealed, and on the open/load path via update_info_for_opened_data(). This series also fixes issues with unit tests with regards to first/last keys so they won't fail the validation. Refs #8772 Test: unit(dev) DTest: next-gating(dev), materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug) * tag 'validate-first-and-last-keys-ordering-v1': sstable: validate first and last keys ordering test: lib: reusable_sst: save unexpected errors test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard test: sstable_test: define primary key in schema for compressed sstable	2021-06-09 14:43:02 +02:00
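The validation added by the 419ee84d86 series, rejecting a first key that sorts after the last key, can be sketched as follows. The types are hypothetical: `toy_key` orders by token and then by key text as an assumed stand-in for ring order, and `malformed_sstable_error` stands in for the `malformed_sstable_exception` named in the message.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>

// Stand-in for the exception thrown on a malformed sstable.
struct malformed_sstable_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical key: ordered by token first, then by key text (an assumption).
struct toy_key {
    std::int64_t token;
    std::string partition_key;
};

inline bool key_less(const toy_key& a, const toy_key& b) {
    if (a.token != b.token) {
        return a.token < b.token;
    }
    return a.partition_key < b.partition_key;
}

class toy_sstable {
    toy_key _first{0, ""};
    toy_key _last{0, ""};
public:
    // Refuses keys that are not well ordered instead of storing them and
    // tripping an assert much later (e.g. in compaction code).
    void set_first_and_last_keys(toy_key first, toy_key last) {
        if (key_less(last, first)) {
            throw malformed_sstable_error("first key is greater than last key");
        }
        _first = std::move(first);
        _last = std::move(last);
        assert(!key_less(_last, _first));
    }
    const toy_key& first() const { return _first; }
    const toy_key& last() const { return _last; }
};

// Convenience helper: true if the pair passes validation.
inline bool accepts(toy_key first, toy_key last) {
    toy_sstable sst;
    try {
        sst.set_first_and_last_keys(std::move(first), std::move(last));
    } catch (const malformed_sstable_error&) {
        return false;
    }
    return true;
}
```

Since the check sits in the setter, it covers both callers named in the message: the writer before sealing and the open/load path.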
Tomasz Grabiec	ce7a404f17	Merge "Cleanups/refactoring for Raft Group 0" from Kostja * scylla-dev/raft-group-0-part-1-rebase: raft: (service) pass Raft service into storage_service raft: (service) add comments for boot steps raft: add ordering for raft::server_address based on id raft: (internal) simplify construction of tagged_id raft: (internal) tagged_id minor improvements	2021-06-09 10:48:05 +02:00
Konstantin Osipov	267a8e99ad	raft: (service) pass Raft service into storage_service Raft group 0 initialization and configuration changes should be integrated with Scylla cluster assembly, happening when starting the storage service and joining the cluster. Prepare for this. Since Raft service depends on query processor, and query processor depends on storage service, to break a dependency loop split Raft initialization into two steps: first start an under-constructed instance of the "sharded" Raft service, which accepts an under-constructed instance of the "sharded" query_processor and is then passed into the storage service start function; then load the local state of Raft groups from system tables once the query processor starts. Consistently abbreviate raft_services instance raft_svcs, as is the convention at Scylla. Update the tests.	2021-06-08 14:52:32 +03:00
Pavel Solodovnikov	76bea23174	treewide: reduce header interdependencies Use forward declarations wherever possible. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Closes #8813	2021-06-07 15:58:35 +03:00
Avi Kivity	872cd8f692	test: adjust copyright statement to use ScyllaDB rather than old name	2021-06-06 19:18:49 +03:00
Avi Kivity	a55b434a2b	treewide: extent copyright statements to present day	2021-06-06 19:18:49 +03:00
Pavel Solodovnikov	2187a59089	treewide: move `service::cas_request` out from `storage_proxy.hh` And remove all remaining inclusions of `storage_proxy.hh` in the headers. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-06 19:18:49 +03:00
Benny Halevy	7a4591119b	test: lib: reusable_sst: save unexpected errors reusable_sst tries opening an sstable using all sstable format versions in descending order. It is expected to see "file not found" if the actual sstable version is not the latest one. That said, we may hit other errors if the sstable is malformed in any way, so do not override this kind of error if "file not found" errors are hit after it, and return the unexpected error instead. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-06-02 12:25:29 +03:00
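The error handling described in 7a4591119b can be sketched like this. It is an illustration under assumptions, not the code of `reusable_sst`: `open_any_version()`, `toy_sstable`, `file_not_found` and the `opener` callback are hypothetical, and the version strings passed in are just labels.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the expected "no such file for this format version" error.
struct file_not_found : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct toy_sstable {
    std::string version;
};

using opener = std::function<toy_sstable(const std::string&)>;

// Try every format version, newest first. "File not found" is expected and
// skipped; the first error of any other kind is remembered and reported
// even if later versions only produce "file not found".
inline toy_sstable open_any_version(const std::vector<std::string>& versions_newest_first,
                                    const opener& open) {
    assert(!versions_newest_first.empty());
    std::exception_ptr unexpected;
    for (const std::string& version : versions_newest_first) {
        try {
            return open(version);
        } catch (const file_not_found&) {
            // Expected when the sstable was written in an older format.
        } catch (...) {
            if (!unexpected) {
                unexpected = std::current_exception();
            }
        }
    }
    if (unexpected) {
        std::rethrow_exception(unexpected);
    }
    throw file_not_found("no sstable found in any known format version");
}
```

Keeping only the first unexpected error means the caller sees the failure for the newest format that actually had a problem, rather than a misleading "file not found" from the last version tried.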
Avi Kivity	5f8484897b	Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged. --- Currently when a node wants to create and broadcast a new CDC generation it performs the following steps: 1. choose the generation's stream IDs and mapping (how this is done is irrelevant for the current discussion) 2. choose the generation's timestamp by taking the current time (according to its local clock) and adding 2 * ring_delay 3. insert the generation's data (mapping and stream IDs) into system_distributed.cdc_generation_descriptions, using the generation's timestamp as the partition key (we call this table the "old internal table" below) 4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP" application state. The timestamp spreads epidemically through the gossip protocol. When nodes see the timestamp, they retrieve the generation data from the old internal table. Unfortunately, due to the schema of the old internal table, where the entire generation data is stored in a single cell, step 3 may fail for sufficiently large generations (there is a size threshold for which step 3 will always fail - retrying the operation won't help). Also the old internal table lies in the system_distributed keyspace that uses SimpleStrategy with replication factor 3, which is also problematic; for example, when nodes restart, they must reach at least 2 out of these 3 specific replicas in order to retrieve the current generation (we write and read the generation data with QUORUM, unless we're a single-node cluster, where we use ONE). Until this happens, a restarting node can't coordinate writes to CDC-enabled tables. It would be better if the node could access the last known generation locally. 
The commit introduces a new table for broadcasting generation data with the following properties: - it uses a better schema that stores the data in multiple rows, each of manageable size - it resides in a new keyspace that uses EverywhereStrategy so the data will be written to every node in the cluster that has a token in the token ring - the data will be written using CL=ALL and read using CL=ONE; thanks to this, a restarting node won't have to communicate with other nodes to retrieve the data of the last known generation. Note that writing with CL=ALL does not reduce availability: creating a new generation requires all nodes to be available anyway, because they must learn about the generation before their clocks go past the generation's timestamp; if they don't, partitions won't be mapped to stream IDs consistently across the cluster - the partition key is no longer the generation's timestamp. Because it was that way in the old internal table, it forced the algorithm to choose the timestamp before the generation data was inserted into the table. What if the insert took a long time? It increased the chance that nodes would learn about the generation too late (after their clocks moved past its timestamp). With the new schema we will first insert the generation data using a randomly generated UUID as the partition key, then choose the timestamp, then gossip both the timestamp and the UUID. Observe that after a node learns through gossip about a generation broadcast using this new method, it will retrieve its data very quickly, since it is one of the replicas and it can use CL=ONE as the data was written using CL=ALL. The generation's timestamp and the UUID mentioned in the last point form a "generation identifier" for this new generation. For passing these new identifiers around, we introduce the cdc::generation_id_v2 type. Fixes #7961.
--- For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order. dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/ unit tests (dev) passed locally Closes #8643 * github.com:scylladb/scylla: docs: update cdc.md with info about the new internal table sys_dist_ks: don't create old CDC generations table on service initialization sys_dist_ks: rename all_tables() to ensured_tables() cdc: when creating new generations, use format v2 if possible main: pass feature_service to cdc::generation_service gms: introduce CDC_GENERATIONS_V2 feature cdc: introduce retrieve_generation_data test: cdc: include new generations table in permissions test sys_dist_ks: increase timeout for create_cdc_desc sys_dist_ks: new table for exchanging CDC generations tree-wide: introduce cdc::generation_id_v2	2021-05-27 17:13:44 +03:00
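Note: the new ordering described in the merge message above (insert the data under a random UUID, then choose the timestamp, then gossip both) can be sketched as below. This is only an illustration under stated assumptions: `generation_id_sketch`, `publish_generation` and all callbacks are hypothetical, timestamps are simplified to milliseconds, and none of this is the actual cdc::generation_id_v2 or ScyllaDB API.

```cpp
#include <chrono>
#include <functional>
#include <string>

// Hypothetical stand-in for a (timestamp, UUID) generation identifier.
struct generation_id_sketch {
    std::chrono::milliseconds ts;  // generation timestamp, ms since some epoch
    std::string uuid;              // partition key of the generation data
};

// All callbacks are assumptions made for this sketch.
generation_id_sketch publish_generation(
        const std::function<std::string()>& make_uuid,
        const std::function<void(const std::string&)>& insert_data,
        const std::function<std::chrono::milliseconds()>& now,
        std::chrono::milliseconds ring_delay,
        const std::function<void(const generation_id_sketch&)>& gossip) {
    // 1. Insert the generation data first, keyed by a random UUID.
    //    This step may be slow; no timestamp has been chosen yet.
    std::string uuid = make_uuid();
    insert_data(uuid);

    // 2. Only now choose the timestamp: current time plus 2 * ring_delay.
    std::chrono::milliseconds ts = now() + 2 * ring_delay;

    // 3. Gossip both the timestamp and the UUID.
    generation_id_sketch id{ts, uuid};
    gossip(id);
    return id;
}
```

Because the clock is read after the insert completes, a slow insert no longer eats into the 2 * ring_delay window that nodes have to learn about the generation.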
Pavel Emelyanov	d2442a1bb3	tests: Ditch storage_service_for_tests The purpose of the class in question is to start sharded storage service to make its global instance alive. I don't know when exactly it happened but no code that instantiates this wrapper really needs the global storage service. Ref: #2795 tests: unit(dev), perf_sstable(dev) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20210526170454.15795-1-xemul@scylladb.com>	2021-05-27 14:39:13 +03:00
Avi Kivity	e2e723cc4c	build: enable -Wrange-loop-construct warning This warning triggers when a range for ("for (auto x : range)") causes non-trivial copies, prompting the developer to replace with a capture by reference. A few minor violations in the test suite are corrected. Closes #8699	2021-05-26 10:32:56 +03:00
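Note: the kind of change this warning prompts can be sketched as below. The function names are hypothetical and the snippet is not taken from the commit; whether a particular compiler version reports the first loop is an assumption, since the exact trigger conditions differ between compilers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// By-value loop variable: every iteration copy-constructs a std::string.
// This is the pattern the warning is meant to flag.
std::size_t total_length_by_value(const std::vector<std::string>& names) {
    std::size_t total = 0;
    for (const std::string name : names) {
        total += name.size();
    }
    return total;
}

// By-reference loop variable: same result, no per-element copy.
std::size_t total_length_by_reference(const std::vector<std::string>& names) {
    std::size_t total = 0;
    for (const std::string& name : names) {
        total += name.size();
    }
    return total;
}
```

Both functions compute the same value; only the second avoids the per-iteration copy.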
Kamil Braun	4d3870b24b	main: pass feature_service to cdc::generation_service	2021-05-25 16:07:23 +02:00
Kamil Braun	f25e77c202	test: cdc: include new generations table in permissions test	2021-05-25 16:07:23 +02:00

328 Commits