scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-01 13:45:53 +00:00

Author	SHA1	Message	Date
Avi Kivity	f756f34392	Merge "Add scylla-bench datasets to perf_fast_forward" from Tomasz " After this series one can use perf_fast_forward to generate the data set. It takes a lot less time this way than to use scylla-bench. " * 'perf-fast-forward-scylla-bench-dataset' of github.com:tgrabiec/scylla: tests: perf_fast_forward: Use data_source::make_ck() tests: perf_fast_forward: Move declaration of clustered_ds up tests: perf_fast_forward: Make scylla_bench_small_part_ds1 not included by default tests: perf_fast_forward: Add data sets which conform to scylla-bench schema	2021-07-08 17:33:30 +03:00
Nadav Har'El	d0546a9bb5	cql-pytest: improve README This patch adds to cql-pytest/README.md a paragraph on where run / run-cassandra expect to find Scylla or Cassandra, and how to override that choice. Also make a couple of trivial formatting changes. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210708142730.813660-1-nyh@scylladb.com>	2021-07-08 17:29:20 +03:00
Avi Kivity	4f1e21ceac	Merge "reader_concurrency_semaphore: get rid of global semaphores" from Botond " When obtaining a valid permit was made mandatory, code which now had to create reader permits but didn't have a semaphore handy suddenly found itself in a difficult situation. Many places and most prominently tests solved the problem by creating a thread-local semaphore to source permits from. This was fine at the time but as usual, globals came back to haunt us when `reader_concurrency_semaphore::stop()` was introduced, as these global semaphores had no easy way to be stopped before being destroyed. This patch-set cleans up this wart, by getting rid of all global semaphores, replacing them with appropriately scoped local semaphores, that are stopped after being used. With that, the FIXME in `~reader_concurrency_semaphore()` can be resolved and we can finally `assert()` that the semaphore was stopped before being destroyed. This series is another preparatory one for the series which moves the semaphore in front of the cache. 
tests: unit(dev) " * 'reader-concurrency-semaphore-mandatory-stop/v2' of https://github.com/denesb/scylla: (26 commits) reader_concurrency_semaphore: assert(_stopped) in the destructor test/lib: remove now unused reader_permit.{hh,cc} test/boost: migrate off the global test reader semaphore test/manual: migrate off the global test reader semaphore test/unit: migrate off the global test reader semaphore test/perf: migrate off the global test reader semaphore test/perf: perf.hh: add reader_concurrency_semaphore_wrapper test/lib: migrate off the global test reader semaphore test/lib/simple_schema: migrate off the global test reader semaphore test/lib/sstable_utils: migrate off the global test reader semaphore test/lib/test_services: migrate off the global test reader semaphore test/lib/sstable_test_env: add reader_concurrency_semaphore member test/lib/cql_test_env: add make_reader_permit() test/lib: add reader_concurrency_semaphore.hh test/boost/sstable_test: migrate row counting tests to seastar thread test/boost/sstable_test: test_using_reusable_sst(): pass env to func test/lib/reader_lifecycle_policy: add permit parameter to factory function test/boost/mutation_reader_test: share permit between readers in a read memtable: migrate off the global reader concurrency semaphore mutation_writer: multishard_writer: migrate off the global reader concurrency semaphore ...	2021-07-08 17:28:13 +03:00
Botond Dénes	42bd5c980f	reader_concurrency_semaphore: assert(_stopped) in the destructor Now that there are no more global semaphores, which were impossible to stop properly, we can resolve the related FIXME and arm the assert in the semaphore destructor. We can also remove all the other cleanup code from the destructor, as it is taken care of by stop(), which we now assert to have been run.	2021-07-08 16:53:38 +03:00
Botond Dénes	6b941c4d34	test/lib: remove now unused reader_permit.{hh,cc} Finally getting rid of the global test reader concurrency semaphore.	2021-07-08 16:53:38 +03:00
Botond Dénes	2d2b9e7b36	test/boost: migrate off the global test reader semaphore	2021-07-08 16:53:38 +03:00
Botond Dénes	0bf07cde7b	test/manual: migrate off the global test reader semaphore	2021-07-08 16:53:38 +03:00
Botond Dénes	18e0c40c5d	test/unit: migrate off the global test reader semaphore	2021-07-08 16:53:38 +03:00
Botond Dénes	37a1e506b1	test/perf: migrate off the global test reader semaphore	2021-07-08 16:53:38 +03:00
Botond Dénes	2454811dd6	test/perf: perf.hh: add reader_concurrency_semaphore_wrapper A convenience, self-closing wrapper for those perf tests that have no way to stop the semaphore and wait for it too.	2021-07-08 16:53:38 +03:00
Nadav Har'El	e22a52e69c	cql-pytest: fix tests on Cassandra 3 After commit `76227fa` ("cql-pytest: use NetworkTopologyStrategy, not SimpleStrategy"), the cql-pytest tests now use NetworkTopologyStrategy instead of SimpleStrategy in the test keyspaces. The tests continued to use the "replication_factor" option. Support for this option is relatively recent, and was only added to Cassandra in the 4.0 release series (see https://issues.apache.org/jira/browse/CASSANDRA-14303). So users who happen to have Cassandra 3 installed and want to run a cql-pytest against it will see the test failing when it can't create a keyspace. This patch trivially fixes the problem by using the name of the current DC (automatically determined) instead of the word 'replication_factor'. Almost all tests are fixed by a single fix to the test_keyspace fixture which creates one keyspace used by most tests. Additional changes were needed in test_keyspace.py, for tests which explicitly create keyspaces. I tested the result on Cassandra 3.11.10, Cassandra 4 (git master) and Scylla. Fixes #8990 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210708123428.811184-1-nyh@scylladb.com>	2021-07-08 15:35:21 +02:00
Nadav Har'El	eb11ce046c	cql-pytest: add reproducer for concurrent DROP KEYSPACE bug We know that today in Scylla concurrent schema changes done on different coordinators are not safe - and we plan to address this problem with Raft. However, the test in this patch - reproducing issue #8968 - demonstrates that even on a single node concurrent schema changes are not safe: The test involves one thread which constantly creates a keyspace and then a table in it - and a second thread which constantly deletes this keyspace. After doing this for a while, the schema reaches an inconsistent state: The keyspace is in a state of limbo where it cannot be dropped (dropping it succeeds, but doesn't actually drop it, and a new keyspace cannot be created under the same name). Note that to reproduce this bug, it was important that the test create both a keyspace and a table. Were the test to just create an empty keyspace, without a table in it, the bug would not be reproduced. Refs #8968. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210704121049.662169-1-nyh@scylladb.com>	2021-07-08 15:35:03 +02:00
Botond Dénes	0e78399051	test/lib: migrate off the global test reader semaphore	2021-07-08 15:28:39 +03:00
Botond Dénes	5fff314739	test/lib/simple_schema: migrate off the global test reader semaphore	2021-07-08 15:28:39 +03:00
Botond Dénes	d520655730	test/lib/sstable_utils: migrate off the global test reader semaphore	2021-07-08 15:28:39 +03:00
Botond Dénes	3679418e62	test/lib/test_services: migrate off the global test reader semaphore	2021-07-08 15:28:39 +03:00
Botond Dénes	0acc4d63da	test/lib/sstable_test_env: add reader_concurrency_semaphore member To enable tests using the test env to conveniently create permits for themselves, reducing the pain of migrating to local semaphores.	2021-07-08 15:28:39 +03:00
Botond Dénes	7174d1beee	test/lib/cql_test_env: add make_reader_permit() A convenience method, allowing tests using the cql test env to conveniently create a permit, reducing the pain of migrating to local semaphores.	2021-07-08 15:28:39 +03:00
Botond Dénes	b739525fb6	test/lib: add reader_concurrency_semaphore.hh Supplying a convenience semaphore wrapper, which stops the contained semaphore when destroyed. It also provides a more convenient `make_permit()`. This class is intended to make the migration to local semaphores less painful.	2021-07-08 15:28:36 +03:00
Benny Halevy	fa5d70da32	storage_proxy: abstract_read_resolver: handle semaphore_timed_out error semaphore_timed_out errors should be ignored, similar to rpc::timeout_error or seastar::timed_out_error, so that they are eventually converted to `read_timeout_exception` via the data/digest read resolver on_timeout() method. Otherwise, the semaphore timeout is mistranslated to read_failure_exception, via on_error(). Note that originally the intention was to change the exception thrown by the reader_concurrency_semaphore expiry_handler, but there are already several places in the code that catch and handle the semaphore_timed_out exception that would need to be changed, increasing the risk in this change. Fixes #8958 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210708083252.1934651-2-bhalevy@scylladb.com>	2021-07-08 15:23:30 +03:00
Benny Halevy	023d103fee	utils: exceptions: is_timeout_exception: add timed_out_error Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210708083252.1934651-1-bhalevy@scylladb.com>	2021-07-08 15:23:29 +03:00
Nadav Har'El	814c4ad4ce	cql-pytest: fix run-cassandra for older versions of Cassandra In older versions of Cassandra (such as 3.11.10 which I tried), the CQL server is not turned on by default, unless the configuration file explicitly has "start_native_transport: true" - without it only the Thrift server is started. So fix the cql-pytest/run-cassandra to pass this option. It also works correctly in Cassandra 4. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210708113423.804980-1-nyh@scylladb.com>	2021-07-08 14:59:09 +03:00
Avi Kivity	7d214800d0	Merge 'Generate view updates in smaller parts' from Piotr Sarna In order to avoid large allocations and too large mutations generated from large view updates, granularity of the process is broken down from per-partition to smaller chunks. The view update builder now produces partial updates, no more than 100 view rows at a time. The series was tested manually with a particular scenario in mind - deleting a large base partition, which results in creating a view update per each deleted row - which, with sufficiently large partitions, can reach millions. Before the series, Scylla experienced an out-of-memory condition after the view update generation mechanism tried to load too much data into a contiguous buffer. Multiple large allocation warnings and reactor stalls were observed as well. After the series, the operation is still rather slow, but does not induce reactor stalls nor allocator problems. A reduced version of the above test is added as a unit test - it does not check for huge partitions, but instead uses a number just large enough to cause the update generation process to be split into multiple chunks. Fixes #8852 Closes #8906 * github.com:scylladb/scylla: cql-pytest: add a test case for base range deletion cql-pytest: add a test case for base partition deletion table: elaborate on why exceptions are ignored for view updates view: generate view updates in smaller parts table: coroutinize generating view updates db,view: move view_update_builder to the header	2021-07-08 12:57:05 +03:00
Piotr Sarna	bc0038913c	cql-pytest: add a test case for base range deletion The test case checks that deleting a base table clustering range works fine. This operation is potentially heavy, as it involves generating a view update for every row. With large enough ranges, the number can reach millions and beyond.	2021-07-08 11:43:08 +02:00
Piotr Sarna	ef47b4565c	cql-pytest: add a test case for base partition deletion The test case checks that deleting a whole base table partition works fine. This operation is potentially heavy, as it involves generating a view update for every row. With large enough partitions, the number can reach millions and beyond.	2021-07-08 11:42:54 +02:00
Botond Dénes	b9a5fd57bf	test/boost/sstable_test: migrate row counting tests to seastar thread To facilitate further patching.	2021-07-08 12:38:21 +03:00
Botond Dénes	fb310ec6e7	test/boost/sstable_test: test_using_reusable_sst(): pass env to func To facilitate further patching.	2021-07-08 12:38:19 +03:00
Botond Dénes	46d21e842d	test/lib/reader_lifecycle_policy: add permit parameter to factory function The factory method doesn't match the signature of `reader_lifecycle_policy::make_reader()`, notably the permit is missing. Add it as it is important that the wrapping evictable reader and underlying reader share the permits.	2021-07-08 12:31:36 +03:00
Botond Dénes	2a45d643b6	test/boost/mutation_reader_test: share permit between readers in a read Permits were designed such that there is one permit per read, being shared by all readers in that read. Make sure readers created by tests adhere to this.	2021-07-08 12:31:36 +03:00
Botond Dénes	0f36e5c498	memtable: migrate off the global reader concurrency semaphore Require the caller of `create_flush_reader()` to pass a permit instead.	2021-07-08 12:31:36 +03:00
Botond Dénes	7a4381b491	mutation_writer: multishard_writer: migrate off the global reader concurrency semaphore Use a local one instead, and stop it when the writer is destroyed.	2021-07-08 12:31:36 +03:00
Botond Dénes	17a0e22cb1	sstables: mx/writer: migrate off the global reader concurrency_semaphore And use a local one instead, stopping it when the writer is destroyed.	2021-07-08 12:31:36 +03:00
Botond Dénes	f1c1e05a05	sstables: stop semaphores	2021-07-08 12:31:36 +03:00
Botond Dénes	c51892f02e	sstables: sstable::has_partition_key(): convert to coroutine	2021-07-08 12:31:36 +03:00
Botond Dénes	c0a8068c16	sstables: generate_summary(): fix indentation	2021-07-08 12:31:36 +03:00
Botond Dénes	fec137f3f6	sstables: generate_summary(): make it a coroutine Indentation is left broken.	2021-07-08 12:31:36 +03:00
Botond Dénes	c4e71fb9b8	reader_concurrency_semaphore: remove default name parameter Naming the concurrency semaphore is currently optional, unnamed semaphores defaulting to "Unnamed semaphore". Although the most important semaphores are named, many still aren't, which makes for a poor debugging experience when one of these times out. To prevent this, remove the name parameter defaults from those constructors that have it and require a unique name to be passed in. Also update all sites creating a semaphore and make sure they use a unique name.	2021-07-08 12:31:36 +03:00
Piotr Sarna	6a461d00c6	table: elaborate on why exceptions are ignored for view updates The generate_and_propagate_view_updates() function explicitly ignores exceptions reported from the underlying view update propagation layer. This decision is now explained in the comment.	2021-07-08 11:21:55 +02:00
Piotr Sarna	bf0777e97a	view: generate view updates in smaller parts In order to avoid large allocations and too large mutations generated from large view updates, granularity of the process is broken down from per-partition to smaller chunks. The view update builder now produces partial updates, no more than 100 view rows at a time.	2021-07-08 11:17:27 +02:00
Piotr Sarna	1000d52cfa	table: coroutinize generating view updates ... which will make the incoming changes easier to review.	2021-07-08 11:17:27 +02:00
Piotr Sarna	679dc4d824	db,view: move view_update_builder to the header The builder is going to be used directly by the callers, which requires making its definition public. No semantic changes were intended.	2021-07-08 11:17:27 +02:00
Raphael S. Carvalho	1924e8d2b6	treewide: Move compaction code into a new top-level compaction dir Since compaction is layered on top of sstables, let's move all compaction code into a new top-level directory. This change will give me extra motivation to remove all layer violations, like sstable calling compaction-specific code, and compaction entanglement with other components like table and storage service. Next steps: - remove all layer violations - move compaction code in sstables namespace into a new one for compaction. - move compaction unit tests into its own file Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>	2021-07-07 23:21:51 +03:00
Tomasz Grabiec	33cba08735	tests: perf_fast_forward: Use data_source::make_ck() Data sources differ in clustering key type. Make sure to use the right data_value instance to produce correct keys.	2021-07-07 20:27:44 +02:00
Tomasz Grabiec	fa481e92c1	tests: perf_fast_forward: Move declaration of clustered_ds up	2021-07-07 20:27:44 +02:00
Tomasz Grabiec	407e42f5d8	tests: perf_fast_forward: Make scylla_bench_small_part_ds1 not included by default This dataset exists for convenience, to be able to run scylla-bench against the data set generated by perf_fast_forward. It doesn't increase coverage. So do not include it by default to not waste resources on it.	2021-07-07 20:27:44 +02:00
Tomasz Grabiec	d7250a12fd	tests: perf_fast_forward: Add data sets which conform to scylla-bench schema Useful for fast generation of test data.	2021-07-07 20:27:44 +02:00
Avi Kivity	5571ef0d6d	compression: define 'class' attribute for compression and deprecate 'sstable_compression' Cassandra 3.0 deprecated the 'sstable_compression' attribute and added 'class' as a replacement. Follow by supporting both. The SSTABLE_COMPRESSION variable is renamed to SSTABLE_COMPRESSION_DEPRECATED to detect all uses and prevent future misuse. To prevent old-version nodes from seeing the new name, the compression_parameters class preserves the key name when it is constructed from an options map, and emits the same key name when asked to generate an options map. Existing unit tests are modified to use the new name, and a test is added to ensure the old name is still supported. Fixes #8948. Closes #8949	2021-07-07 19:15:20 +02:00
Avi Kivity	99d5355007	Merge "Cache sstable indexes in memory" from Tomasz " The main goal of this series is to improve efficiency of reads from large partitions by reducing amount of I/O needed to read the sstable index. This is achieved by caching index file pages and partition index entries in memory. Currently, the pages are cached by individual reads only for the duration of the read. This was done to facilitate binary search in the promoted index (intra-partition index). After this series, all reads share the index file page cache, which stays around even after reads stop. The page cache is subject to eviction. It uses the same region as the current row cache and shares the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes an easy approach and does this by introducing a virtual base class. This adds an overhead to row cache entry to store the vtable pointer. SStable indexes have a hierarchy. There is a summary, which is a sparse partition key index into the full partition index. This one is already kept in memory. The partition index is divided by the summary into pages. Each entry in the partition index contains promoted index, which is a sparse index into atoms identified by the clustering key (rows, tombstones). In order to read the promoted index, the reader needs to read the partition index entry first. To speed this up, this series also adds caching of partition index entries. This cache survives reads and is subject to eviction, just like the index file page cache. The unit of caching is the partition index page. Without this cache, each access to promoted index would have to be preceded with the parsing of the partition index page containing the partition key. Performance testing results follow. 
1) scylla-bench large partition reads
Populated with:
  perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \
      -c1 -m1G --populate --value-size=1024 --rows=10000000
Single partition, 9G data file, 4MB index file
Test execution:
  build/release/scylla -c1 -m4G
  scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \
      -clustering-row-count 10000000 -duration 60m
TL;DR: after: 2x throughput, 0.5 median latency
Before (`c1daf2bb24`):
  Results
  Time (avg): 5m21.033180213s
  Total ops: 966951
  Total rows: 966951
  Operations/s: 3011.997048812112
  Rows/s: 3011.997048812112
  Latency:
    max: 74.055679ms
    99.9th: 63.569919ms
    99th: 41.320447ms
    95th: 38.076415ms
    90th: 37.158911ms
    median: 34.537471ms
    mean: 33.195994ms
After:
  Results
  Time (avg): 5m14.706669345s
  Total ops: 2042831
  Total rows: 2042831
  Operations/s: 6491.22243800942
  Rows/s: 6491.22243800942
  Latency:
    max: 60.096511ms
    99.9th: 35.520511ms
    99th: 27.000831ms
    95th: 23.986175ms
    90th: 21.659647ms
    median: 15.040511ms
    mean: 15.402076ms
2) scylla-bench small partitions
I tested several scenarios with a varying data set size, e.g. data fully fitting in memory, half fitting, and being much larger. The improvement varied a bit but in all cases the "after" code performed slightly better. Below is a representative run over data set which does not fit in memory. 
  scylla -c1 -m4G
  scylla-bench -workload uniform -mode read -concurrency 400 -partition-count 10000000 \
      -clustering-row-count 1 -duration 60m -no-lower-bound
Before:
  Time (avg): 51.072411913s
  Total ops: 3165885
  Total rows: 3165885
  Operations/s: 61988.164024260645
  Rows/s: 61988.164024260645
  Latency:
    max: 34.045951ms
    99.9th: 25.985023ms
    99th: 23.298047ms
    95th: 19.070975ms
    90th: 17.530879ms
    median: 3.899391ms
    mean: 6.450616ms
After:
  Time (avg): 50.232410679s
  Total ops: 3778863
  Total rows: 3778863
  Operations/s: 75227.58014424688
  Rows/s: 75227.58014424688
  Latency:
    max: 37.027839ms
    99.9th: 24.805375ms
    99th: 18.219007ms
    95th: 14.090239ms
    90th: 12.124159ms
    median: 4.030463ms
    mean: 5.315111ms
The results include the warmup phase which populates the partition index cache, so the hot-cache effect is dampened in the statistics. See the 99th percentile. Latency gets better after the cache warms up which moves it lower.
3) perf_fast_forward --run-tests=large-partition-skips
Caching is not used here, included to show there are no regressions for the cold cache case. 
TL;DR: No significant change
  perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G
Config: rows: 10000000, value size: 2000
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
1 0 36.429822 4 10000000 274500 62 274521 274429 153889.2 153883 19696986 153853 0 0 0 0 0 0 0 22.5%
1 1 36.856236 4 5000000 135662 7 135670 135650 155652.0 155652 19704117 139326 1 0 1 1 0 0 0 38.1%
1 8 36.347667 4 1111112 30569 0 30570 30569 155652.0 155652 19704117 139071 1 0 1 1 0 0 0 19.5%
1 16 36.278866 4 588236 16214 1 16215 16213 155652.0 155652 19704117 139073 1 0 1 1 0 0 0 16.6%
1 32 36.174784 4 303031 8377 0 8377 8376 155652.0 155652 19704117 139056 1 0 1 1 0 0 0 12.3%
1 64 36.147104 4 153847 4256 0 4256 4256 155652.0 155652 19704117 139109 1 0 1 1 0 0 0 11.1%
1 256 9.895288 4 38911 3932 1 3933 3930 100869.2 100868 3178298 59944 38912 0 1 1 0 0 0 14.3%
1 1024 2.599921 4 9757 3753 0 3753 3753 26604.0 26604 801850 15071 9758 0 1 1 0 0 0 14.6%
1 4096 0.784568 4 2441 3111 1 3111 3109 7982.0 7982 205946 3772 2442 0 1 1 0 0 0 13.8%
64 1 36.553975 4 9846154 269359 10 269369 269337 155663.8 155652 19704117 139230 1 0 1 1 0 0 0 28.2%
64 8 36.509694 4 8888896 243467 8 243475 243449 155652.0 155652 19704117 139120 1 0 1 1 0 0 0 26.5%
64 16 36.466282 4 8000000 219381 4 219385 219374 155652.0 155652 19704117 139232 1 0 1 1 0 0 0 24.8%
64 32 36.395926 4 6666688 183171 6 183180 183165 155652.0 155652 19704117 139158 1 0 1 1 0 0 0 21.8%
64 64 36.296856 4 5000000 137753 4 137757 137737 155652.0 155652 19704117 139105 1 0 1 1 0 0 0 17.7%
64 256 20.590392 4 2000000 97133 18 97151 94996 135248.8 131395 7877402 98335 31282 0 1 1 0 0 0 15.7%
64 1024 6.225773 4 588288 94492 1436 95434 88748 46066.5 41321 2324378 30360 9193 0 1 1 0 0 0 15.8%
64 4096 1.856069 4 153856 82893 54 82948 82721 16115.0 16043 583674 11574 2675 0 1 1 0 0 0 16.3%
After:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
1 0 36.429240 4 10000000 274505 38 274515 274417 153887.8 153883 19696986 153849 0 0 0 0 0 0 0 22.4%
1 1 36.933806 4 5000000 135377 15 135385 135354 155658.0 155658 19704085 139398 1 0 1 1 0 0 0 40.0%
1 8 36.419187 4 1111112 30509 2 30510 30507 155658.0 155658 19704085 139233 1 0 1 1 0 0 0 22.0%
1 16 36.353475 4 588236 16181 0 16182 16181 155658.0 155658 19704085 139183 1 0 1 1 0 0 0 19.2%
1 32 36.251356 4 303031 8359 0 8359 8359 155658.0 155658 19704085 139120 1 0 1 1 0 0 0 14.8%
1 64 36.203692 4 153847 4249 0 4250 4249 155658.0 155658 19704085 139071 1 0 1 1 0 0 0 13.0%
1 256 9.965876 4 38911 3904 0 3906 3904 100875.2 100874 3178266 60108 38912 0 1 1 0 0 0 17.9%
1 1024 2.637501 4 9757 3699 1 3700 3697 26610.0 26610 801818 15071 9758 0 1 1 0 0 0 19.5%
1 4096 0.806745 4 2441 3026 1 3027 3024 7988.0 7988 205914 3773 2442 0 1 1 0 0 0 18.3%
64 1 36.611243 4 9846154 268938 5 268942 268921 155669.8 155705 19704085 139330 2 0 1 1 0 0 0 29.9%
64 8 36.559471 4 8888896 243135 11 243156 243124 155658.0 155658 19704085 139261 1 0 1 1 0 0 0 28.1%
64 16 36.510319 4 8000000 219116 15 219126 219101 155658.0 155658 19704085 139173 1 0 1 1 0 0 0 26.3%
64 32 36.439069 4 6666688 182954 9 182964 182943 155658.0 155658 19704085 139274 1 0 1 1 0 0 0 23.2%
64 64 36.334808 4 5000000 137609 11 137612 137596 155658.0 155658 19704085 139258 2 0 1 1 0 0 0 19.1%
64 256 20.624759 4 2000000 96971 88 97059 92717 138296.0 131401 7877370 98332 31282 0 1 1 0 0 0 17.2%
64 1024 6.260598 4 588288 93967 1429 94905 88051 45939.5 41327 2324346 30361 9193 0 1 1 0 0 0 17.8%
64 4096 1.881338 4 153856 81780 140 81920 81520 16109.8 16092 582714 11617 2678 0 1 1 0 0 0 18.2%
4) perf_fast_forward --run-tests=large-partition-slicing
Caching enabled, each line shows the median run from many iterations
TL;DR: We can observe reduction in IO which translates to reduction in execution time, 
especially for slicing in the middle of partition.
  perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases
Config: rows: 10000000, value size: 2000
Before:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
0 1 0.000491 127 1 2037 24 2109 127 4.0 4 128 2 2 0 1 1 0 0 0 157 80 3058208 15.0%
0 32 0.000561 1740 32 56995 410 60031 47208 5.0 5 160 3 2 0 1 1 0 0 0 386 111 113353 17.5%
0 256 0.002052 488 256 124736 7111 144762 89053 16.6 17 672 14 2 0 1 1 0 0 0 2113 446 52669 18.6%
0 4096 0.016437 61 4096 249199 692 252389 244995 69.4 69 8640 57 5 0 1 1 0 0 0 26638 1717 23321 22.4%
5000000 1 0.002171 221 1 461 2 466 221 25.0 25 268 3 3 0 1 1 0 0 0 638 376 14311524 10.2%
5000000 32 0.002392 404 32 13376 48 13528 13015 27.0 27 332 5 3 0 1 1 0 0 0 931 432 489691 11.9%
5000000 256 0.003659 279 256 69967 764 73130 52563 39.5 41 780 19 3 0 1 1 0 0 0 2689 825 93756 15.8%
5000000 4096 0.018592 55 4096 220313 433 234214 218803 94.2 94 9484 62 9 0 1 1 0 0 0 27349 2213 26562 21.0%
After:
offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
0 1 0.000229 115 1 4371 85 4585 115 2.1 2 64 1 1 1 0 0 0 0 0 90 31 1314749 22.2%
0 32 0.000277 2174 32 115674 1015 128109 14144 3.0 3 96 2 1 1 0 0 0 0 0 319 62 52508 26.1%
0 256 0.001786 576 256 143298 5534 179142 113715 14.7 17 544 15 1 1 0 0 0 0 0 2110 453 45419 21.4%
0 4096 0.015498 61 4096 264289 2006 268850 259342 67.4 67 8576 59 4 1 0 0 0 0 0 26657 1738 22897 23.7%
5000000 1 0.000415 233 1 2411 15 2456 234 4.1 4 128 2 2 1 0 0 0 0 0 199 72 2644719 16.8%
5000000 32 0.000635 1413 32 50398 349 51149 46439 6.0 6 192 4 2 1 0 0 0 0 0 458 128 125893 18.6%
5000000 256 0.002028 486 256 126228 3024 146327 82559 17.8 18 1024 13 4 1 0 0 0 0 0 2123 385 51787 19.6%
5000000 4096 0.016836 61 4096 243294 814 263434 241660 73.0 73 9344 62 8 1 0 0 0 0 0 26922 1920 24389 22.4%
Future work: - Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache which may reduce the hit ratio. - Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size. - Disable cache population for "bypass cache" reads - Add a switch to disable sstable index caching, per-node, maybe per-table - Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in the partition index page can be hot. - Add heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by disk's bandwidth it's wasteful to read the front of promoted index using 32K IO, better use 4K which should cover the partition entry and then let binary search read the rest. In V2: - Fixed perf_fast_forward regression in the number of IOs used to read partition index page The reader uses 32K reads, which were split by page cache into 4K reads Fix by propagating IO size hints to page cache and using single IO to populate it. New patch: "cached_file: Issue single I/O for the whole read range on miss" - Avoid large allocations to store partition index page entries (due to managed_vector storage). There is a unit test which detects this and fails. Fixed by implementing chunked_managed_vector, based on chunked_vector. - fixed bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks - Simplify region_impl::free_buf() according to Avi's suggestions - Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind - Workaround sigsegv which was most likely due to coroutine miscompilation. 
Worked around by manipulating local object scope. - Wire up system/drop_sstable_caches RESTful API - Fix use-after-move on permit for the old scanning ka/la index reader - Fixed more cases of double open_data() in tests leading to assert failure - Adjusted cached_file class doc to account for changes in behavior. - Rebased Fixes #7079. Refs #363. " * tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits) api: Drop sstable index caches on system/drop_sstable_caches cached_file: Issue single I/O for the whole read range on miss row_cache: cache_tracker: Do not register metrics when constructed for tests sstables, cached_file: Evict cache gently when sstable is destroyed sstables: Hide partition_index_cache implementation away from sstables.hh sstables: Drop shared_index_lists alias sstables: Destroy partition index cache gently sstables: Cache partition index pages in LSA and link to LRU utils: Introduce lsa::weak_ptr<> sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache sstables, cached_file: Avoid copying buffers from cache when parsing promoted index cached_file: Introduce get_page_units() sstables: read: Document that primitive_consumer::read_32() is alloc-free sstables: read: Count partition index page evictions sstables: Drop the _use_binary_search flag from index entries sstables: index_reader: Keep index objects under LSA lsa: chunked_managed_vector: Adapt more to managed_vector utils: lsa: chunked_managed_vector: Make LSA-aware test: chunked_managed_vector_test: Make exception_safe_class standard layout lsa: Copy chunked_vector to chunked_managed_vector ...	2021-07-07 18:17:10 +03:00
Takuya ASADA	def81807aa	scylla-fstrim.timer: drop BindsTo=scylla-server.service To avoid restarting scylla-server.service unexpectedly, drop BindsTo= from scylla-fstrim.timer. Fixes #8921 Closes #8973	2021-07-07 17:36:24 +03:00
Dejan Mircevski	7d6ef0de8d	cql3: Drop more dead code After `845e36e76` "cql3: Use expr for global-index partition slice", there is actually more dead code than was initially dropped. Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8981	2021-07-07 13:59:58 +02:00
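
The commit `bf0777e97a` listed above ("view: generate view updates in smaller parts") says the view update builder now produces partial updates of no more than 100 view rows at a time, rather than one update per partition. The batching idea can be sketched in Python as below. This is only an illustration under stated assumptions, not Scylla's C++ implementation: `batched_view_updates` and `MAX_ROWS_PER_BATCH` are hypothetical names, and rows are stand-in values.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Hypothetical constant mirroring the "no more than 100 view rows at a time"
# limit described in the commit message; the real code may name it differently.
MAX_ROWS_PER_BATCH = 100


def batched_view_updates(
    rows: Iterable[T], max_rows: int = MAX_ROWS_PER_BATCH
) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `max_rows` rows.

    Consuming the input lazily means no batch larger than `max_rows`
    is ever held in memory, regardless of how many rows the input has.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == max_rows:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


if __name__ == "__main__":
    # 250 stand-in rows are split into batches of 100, 100 and 50.
    print([len(b) for b in batched_view_updates(range(250))])
```

Because the function is a generator, a caller can apply each batch before the next one is built, which is the property the commit relies on to avoid one large contiguous allocation.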


27251 Commits