scylladb

Author	SHA1	Message	Date
Avi Kivity	392e24d199	Merge "Unglobal messaging service" from Pavel E " The messaging service is (as many other services) present in the global namespace and is widely accessed from where needed with global get(_local)?_messaging_service() calls. There's a long-term task to get rid of this globality and make services and components reference each other and, for and due to this, start and stop in specific order. This set makes this for the messaging service. The service is very low level and doesn't depend on anything. It's used by gossiper, streaming, repair, migration manager, storage proxy, storage service and API. According to these dependencies the set consists of several parts: patches 1-9 are preparatory, they encapsulate messaging service init/fini stuff in its own module and decouple it from the db::config patch 10-12 introduce local service reference in main and set its init/fini calls at the early stage so that this reference can later be passed to those depending on it patches 13-42 replace global referencing of messaging service from other subsystems with local references initialized from main. patch 43 finalizes tests. patch 44 wraps things up with removing global messaging service instance along with get(_local)?_messaging_service calls. The service's stopping part is deliberately left incomplete (as it is now), the sharded service remains alive, only the instance's stop() method is called (and is empty for a while). Since the messaging service's users still do not stop cleanly, its instances should better continue leaking on exit. Once (if) the seastar gets the helper rpc::has_handlers() method merged the messaging_service::stop() will be able to check if all the verbs had been unregistered (spoiler: not yet, more fixes to come). For debugging purposes the pointer on now-local messaging service instance is kept in service::debug namespace. 
tests: unit(dev) dtest(dev: simple_boot_shutdown, repair, update_cluster_layout) manual start-stop " * 'br-unglobal-messaging-service-2' of https://github.com/xemul/scylla: (44 commits) messaging_service: Unglobal messaging service instance tests: Use own instances of messaging_service storage_service: Use local messaging reference storage_service: Keep reference on sharded messaging service migration_manager: Add messaging service as argument to get_schema_definition migration_manager: Use local messaging reference in simple cases migration_manager: Keep reference on messaging migration_manager: Make push_schema_mutation private non-static method migration_manager: Move get_schema_version verb handling from proxy repair: Stop using global messaging_service references repair: Keep sharded messaging service reference on repair_meta repair: Keep sharded messaging service reference on repair_info repair: Keep reference on messaging in row-level code repair: Keep sharded messaging service in API repair: Unset API endpoints on stop repair: Setup API endpoints in separate helper repair: Push the sharded<messaging_service> reference down to sync_data_using_repair repair: Use existing sharded db reference repair: Mark repair.cc local functions as static streaming: Keep messaging service on send_info ...	2020-08-20 12:20:36 +03:00
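The pattern this series describes, replacing a global accessor with a reference handed down from main, can be sketched in plain C++. This is only an illustration under stated assumptions: `messaging_service_sketch`, `gossiper_sketch`, `run_sketch` and the "ping" verb are hypothetical stand-ins, not Scylla's classes, and Seastar's sharded services are left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a low-level service with no dependencies.
class messaging_service_sketch {
    std::vector<std::string> _sent;
public:
    void send(std::string verb) { _sent.push_back(std::move(verb)); }
    std::size_t sent_count() const { return _sent.size(); }
};

// Hypothetical dependent service: it keeps a reference passed in at
// construction instead of calling a global accessor when it needs one.
class gossiper_sketch {
    messaging_service_sketch& _ms;
public:
    explicit gossiper_sketch(messaging_service_sketch& ms) : _ms(ms) {}
    void gossip() { _ms.send("ping"); }
};

// Mimics the start order in main: the dependency is created first and
// outlives every object that holds a reference to it.
inline std::size_t run_sketch() {
    messaging_service_sketch ms;   // started early, depends on nothing
    gossiper_sketch g(ms);         // receives the local reference
    g.gossip();
    g.gossip();
    return ms.sent_count();
}
```

Because the reference is a constructor argument, the start order (messaging first, its users afterwards) is visible in the code that wires the objects together, which is the property the series is after.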
Pavel Emelyanov	ee41645a1a	tests: Use own instances of messaging_service The global one is going away, no core code uses it, so all tests can be safely switched to use their own instances. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:53 +03:00
Pavel Emelyanov	4ea3c2797c	storage_service: Keep reference on sharded messaging service It is a bit of a step backward in the storage-service decomposition campaign, but... Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:53 +03:00
Pavel Emelyanov	6c49127d04	migration_manager: Keep reference on messaging That's another user of messaging service, init it with private reference. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:53 +03:00
Pavel Emelyanov	24cb1b781f	storage_proxy: Keep reference on messaging The proxy is another user of messaging, so keep the reference on it. Its real usage will come in next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:52 +03:00
Pavel Emelyanov	65bd54604d	gossiper: Use messaging service by reference Gossiper needs messaging service, the messaging is started before the gossiper, so we can push the former reference into it. Gossiper is not stopped for real, neither the messaging service is, so the memory usage is still safe. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 20:50:52 +03:00
Botond Dénes	6ad80f0adb	test/lib/cql_test_env: set debug::db pointer To allow using scylla-gdb.py scripts for debugging tests. These scripts expect a valid database pointer in `debug::db`. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200819145632.2423462-1-bdenes@scylladb.com>	2020-08-19 19:13:05 +03:00
Pavel Emelyanov	dc0918e255	tests: Keep local reference on global messaging Some tests directly reference the global messaging service. For the sake of simpler patching wrap this global reference with a local one. Once the global messaging service goes away tests will get their own instances. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-08-19 13:08:12 +03:00
Raphael S. Carvalho	81ec49c82f	sstables/sstable_set: rename method to retrieve sstable runs select() is too generic for the method that retrieves sstable runs, and it has a completely different meaning than the former select method, which was used to select sstables based on token range. Let's give it a more descriptive name. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200811193401.22749-1-raphaelsc@scylladb.com>	2020-08-16 17:41:16 +03:00
Piotr Jastrzebski	c001374636	codebase wide: replace count with contains C++20 introduced `contains` member functions for maps and sets for checking whether an element is present in the collection. Previously, the `count` function was often used for this in various ways. `contains` not only expresses the intent of the code better but also does so in a more unified way. This commit replaces all occurrences of `count` with `contains`. Tests: unit(dev) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>	2020-08-15 20:26:02 +03:00
Botond Dénes	1d48442ae7	test/lib/mutation_source_test: test-monotonic-positions: test the reader-under-test Instead of always testing `flat_mutation_reader_from_mutations()`. Tests: unit(dev, debug) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200812073406.1681250-1-bdenes@scylladb.com>	2020-08-12 10:52:26 +03:00
Rafael Ávila de Espíndola	aa2476d7ac	test: Move code in sstable_run_based_compaction_strategy_for_tests.hh out of line Most of this is virtual and it is all test code. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-08-11 11:49:49 -07:00
Rafael Ávila de Espíndola	ef6a52a407	test: Drop ifdef now that we always use c++20 Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-08-11 11:49:20 -07:00
Rafael Ávila de Espíndola	bd2f9fc685	test: Move sstable_run_based_compaction_strategy_for_tests.hh to test/lib This is in preparation to moving the code to a .cc file. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-08-11 11:48:41 -07:00
Avi Kivity	3530e80ce1	Merge "Support md format" from Benny " This series adds support for the "md" sstable format. Support is based on the following: * do not use clustering based filtering in the presence of static row, tombstones. * Disabling min/max column names in the metadata for formats older than "md". * When updating the metadata, reset and disable min/max in the presence of range tombstones (like Cassandra does and until we process them accurately). * Fix the way we maintain min/max column names by: keeping whole clustering key prefixes as min/max rather than calculating min/max independently for each component, like Cassandra does in the "md" format. Fixes #4442 Tests: unit(dev), cql_query_test -t test_clustering_filtering* (debug) md migration_test dtest from git@github.com:bhalevy/scylla-dtest.git migration_test-md-v1 " * tag 'md-format-v4' of github.com:bhalevy/scylla: (27 commits) config: enable_sstables_md_format by default test: cql_query_test: add test_clustering_filtering unit tests table: filter_sstable_for_reader: allow clustering filtering md-format sstables table: create_single_key_sstable_reader: emit partition_start/end for empty filtered results table: filter_sstable_for_reader: adjust to md-format table: filter_sstable_for_reader: include non-scylla sstables with tombstones table: filter_sstable_for_reader: do not filter if static column is requested table: filter_sstable_for_reader: refactor clustering filtering conditional expression features: add MD_SSTABLE_FORMAT cluster feature config: add enable_sstables_md_format database: add set_format_by_config test: sstable_3_x_test: test both mc and md versions test: Add support for the "md" format sstables: mx/writer: use version from sstable for write calls sstables: mx/writer: update_min_max_components for partition tombstone sstables: metadata_collector: support min_max_components for range tombstones sstable: validate_min_max_metadata: drop outdated logic sstables: rename mc folder to mx 
sstables: may_contain_rows: always true for old formats sstables: add may_contain_rows ...	2020-08-11 13:29:11 +03:00
Benny Halevy	65239a6e50	config: add enable_sstables_md_format MD format is disabled by default at this point. The option extends enable_sstables_mc_format so that both are needed to be set for supporting the md format. The MD_FORMAT cluster feature will be added in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-08-10 18:53:04 +03:00
Benny Halevy	8e0e2c8a48	database: add set_format_by_config This is required for test applications that may select a sstable format different than the default mc format, like perf_fast_forward. These apps don't use the gossip-based sstables_format_selector to set the format based on the cluster feature and so they need to rely on the db config. Call set_format_by_config in single_node_cql_env::do_with. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-08-10 18:53:04 +03:00
Benny Halevy	69f7454d88	sstable: mark read_toc and methods calling it noexcept read_toc can be marked as noexcept now that new_sstable_component_file is. With that, other methods that call it can be marked noexcept too. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2020-08-09 12:04:36 +03:00
Rafael Ávila de Espíndola	b1315a2120	cql_test_env: Delay starting the compaction manager In case of an initialization failure after db.get_compaction_manager().enable(); But before stop_database, we would never stop the compaction manager and it would assert during destruction. I am trying to add a test for this using the memory failure injector, but that will require fixing other crashes first. Found while debugging #6831. Refs #6831. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200805181840.196064-1-espindola@scylladb.com>	2020-08-06 16:07:16 +03:00
Rafael Ávila de Espíndola	ef0bed7253	Drop duplicated 'if' in comment Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200730170109.5789-1-espindola@scylladb.com>	2020-08-04 07:53:34 +03:00
Calle Wilund	30a700c5b0	system_keyspace: Remove support for legacy truncation records Fixes #6341 Since scylla no longer supports upgrading from a version without the "new" (dedicated) truncation record table, we can remove support for these and the migration thereof. Make sure the above holds wherever this is committed. Note that this does not remove the "truncated_at" field in system.local.	2020-08-03 17:16:26 +03:00
Avi Kivity	257c17a87a	Merge "Don't depend on seastar::make_(lw_)?shared idiosyncrasies" from Rafael " While working on another patch I was getting odd compiler errors saying that a call to ::make_shared was ambiguous. The reason was that seastar has both: template <typename T, typename... A> shared_ptr<T> make_shared(A&&... a); template <typename T> shared_ptr<T> make_shared(T&& a); The second variant doesn't exist in std::make_shared. This series drops the dependency in scylla, so that a future change can make seastar::make_shared a bit more like std::make_shared. " * 'espindola/make_shared' of https://github.com/espindola/scylla: Everywhere: Explicitly instantiate make_lw_shared Everywhere: Add a make_shared_schema helper Everywhere: Explicitly instantiate make_shared cql3: Add a create_multi_column_relation helper main: Return a shared_ptr from defer_verbose_shutdown	2020-08-02 19:51:24 +03:00
Rafael Ávila de Espíndola	a548e5f5d1	test: Mark tmpdir::remove noexcept Also disable the allocation failure injection in it. Refs #6831. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200729200019.250908-2-espindola@scylladb.com>	2020-07-30 09:55:52 +03:00
Rafael Ávila de Espíndola	d8ba9678b4	test: Move tmpdir code to a .cc file This is not hot, so we can move it out of the header. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200729200019.250908-1-espindola@scylladb.com>	2020-07-30 09:55:52 +03:00
Botond Dénes	43c0da4b63	test: cql_test_env: set the max_memory_unlimited_query_{soft,hard}_limit To an unlimited value, in order to avoid aborting any unpaged queries executed by tests, that would exceed the default result limit of 1MB/100MB.	2020-07-28 18:00:29 +03:00
Botond Dénes	92ce39f014	query: query_class_config: use max_result_size for the max_memory_for_unlimited_query field We want to switch from using a single limit to a dual soft/hard limit. As a first step we switch the limit field of `query_class_config` to use the recently introduced type for this. As this field has a single user at the moment -- reverse queries (and not a lot of propagation) -- we update it in this same patch to use the soft/hard limit: warn on reaching the soft limit and abort on the hard limit (the previous behaviour).	2020-07-28 18:00:29 +03:00
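The soft/hard behaviour described above (warn on reaching the soft limit, abort on the hard limit) can be sketched as a small decision function. The type and function names below are hypothetical, and whether the real check uses `>` or `>=` is an assumption made only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a dual soft/hard memory limit.
struct result_size_limits_sketch {
    std::size_t soft_limit;
    std::size_t hard_limit;
};

enum class limit_action { proceed, warn, abort_query };

// Decide what to do for a query that has accumulated `used` bytes.
// Assumption: a limit counts as exceeded only when `used` is strictly greater.
inline limit_action check_limits(const result_size_limits_sketch& limits,
                                 std::size_t used) {
    assert(limits.soft_limit <= limits.hard_limit);
    if (used > limits.hard_limit) {
        return limit_action::abort_query;   // the previous single-limit behaviour
    }
    if (used > limits.soft_limit) {
        return limit_action::warn;          // new: log a warning, keep going
    }
    return limit_action::proceed;
}
```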
Botond Dénes	517a941feb	query_class_config: move into the query namespace It belongs there, its name even starts with "query".	2020-07-28 18:00:29 +03:00
Rafael Ávila de Espíndola	e15c8ee667	Everywhere: Explicitly instantiate make_lw_shared seastar::make_lw_shared has a constructor taking a T&&. There is no such constructor in std::make_shared: https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared This means that we have to move from make_lw_shared(T(...) to make_lw_shared<T>(...) If we don't want to depend on the idiosyncrasies of seastar::make_lw_shared. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-07-21 10:33:49 -07:00
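To illustrate the explicit form this commit moves to, here is a sketch that uses `std::make_shared` in place of `seastar::make_lw_shared` (Seastar is not needed to show the point). `column_sketch` and `make_column` are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical type standing in for whatever is being allocated.
struct column_sketch {
    std::string name;
    int id;
    column_sketch(std::string n, int i) : name(std::move(n)), id(i) {}
};

// Explicit form: the type is named and the constructor arguments are
// forwarded, so the object is built in place. This spelling works with
// std::make_shared as well.
inline std::shared_ptr<column_sketch> make_column(std::string name, int id) {
    return std::make_shared<column_sketch>(std::move(name), id);
}

// The deduced form, make_shared(column_sketch("c", 1)), depends on an extra
// overload taking T&&. std::make_shared has no such overload, so that
// spelling would not compile with the standard library.
```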
Botond Dénes	f264d2b00f	test: cql_test_env: allow overriding database_config	2020-07-20 11:23:39 +03:00
Rafael Ávila de Espíndola	b10beead61	memtable_snapshot_source: Avoid a std::bad_alloc crash _should_compact is a condition_variable and condition_variable::wait() allocates memory. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200706223201.903072-1-espindola@scylladb.com>	2020-07-08 15:21:50 +02:00
Rafael Ávila de Espíndola	33af0c293f	cql_test_env: Make ks_name a constexpr std::string_view Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-07-03 12:28:20 -07:00
Botond Dénes	63309f925c	mutation_reader: reader_lifecycle_policy: make semaphore() available early Currently all reader lifecycle policy implementations assume that `semaphore()` will only be called after at least one call to `make_reader()`. This assumption will soon not hold, so make sure `semaphore()` can be called at any time, including before any calls are made to `make_reader()`.	2020-06-23 10:01:38 +03:00
Raphael S. Carvalho	a82afa68aa	test/lib/cql_test_env: reenable auto compaction After `e40aa042a7`, auto compaction is explicitly disabled on all tables being populated and only enabled later on in the boot process. We forgot to update cql_test_env to also reenable auto compaction, so unit tests based on cql_test_env were not compacting at all. database_test, for example, was running out of file descriptors because the number kept growing without bound due to lack of compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200618225621.15937-1-raphaelsc@scylladb.com>	2020-06-22 14:03:13 +03:00
Avi Kivity	9322c07c71	Merge "Use binary search in sstable promoted index" from Tomasz " The "promoted index" is how the sstable format calls the clustering key index within a given partition. Large partitions with many rows have it. It's embedded in the partition index entry. Currently, lookups in the promoted index are done by scanning the index linearly so the lookup is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O. We could do better and use binary search in the index. This patch series switches the mc-format index reader to do that. Other formats use the old way. The "mc" format promoted index has an extra structure at the end of the index called "offset map". It's a vector of offsets of consecutive promoted index entries. This allows us to access random entries in the index without reading the whole index. The location of the offset entry for a given promoted index entry can be derived by knowing where the offset vector ends in the index file, so the offset map also doesn't have to be read completely into the memory. The most tricky part is caching. We need to cache blocks read from the index file to amortize the cost of binary search: - if the promoted index fits in the 32 KiB which was read from the index when looking for the partition entry, we don't want to issue any additional I/O to search the promoted index. - with large promoted indexes, the last few bisections will fall into the same I/O block and we want to reuse that block. - we don't want the cache to grow too big, we don't want to cache the whole promoted index as the read progresses over the index. Scanning reads may skip multiple times. This series implements a rather simple approach which meets all the above requirements and is not worse than the current state of affairs: - Each index cursor has its own cache of the index file area which corresponds to promoted index This is managed by the cached_file class. 
- Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to reuse information obtained during lower bound lookup. This estimation is used to limit read-aheads in the data file. - Each cursor drops entries that it walked past so that memory footprint stays O(log N) - Cached buffers are accounted to read's reader_permit. Later, we could have a single cache shared by many readers. For that, we need to come up with eviction policy. Fixes #4007. TESTING RESULTS * Point reads, large promoted index: Config: rows: 10000000, value size: 2000 Partition size: 20 GB Index size: 7 MB Notes: - Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search: time: 1.9ms vs 22.9ms CPU utilization: 8.9% vs 92.3% I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB It's 12x faster, CPU utilization is 10x times smaller, disk utilization is 20x smaller. - Slicing at the front (offset=0) is a mixed bag. time is similar: 1.8ms CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7% disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch) bsearch uses less bandwidth because the series reduces buffer size used for index file I/O. scan is issuing: 2 * 128 KB (index page) 2 * 32 KB (data file) bsearch is issuing: 1 * 64 KB (index page) 15 * 4 KB (promoted index) 1 * 64 KB (data file) The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead). 32 KB is the minimum I/O currently. Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work so that it uses 1 * 4 KB when it suffices. This is left for the follow-up. 
Command: perf_fast_forward --datasets=large-part-ds1 \ --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1 Before: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.001836 172 1 545 9 563 175 4.0 4 320 2 2 0 1 1 0 0 0 57.7% 0 0 32 0.001858 502 32 17220 126 17776 11526 3.2 3 324 2 1 0 1 1 0 0 0 56.4% 0 0 256 0.002833 339 256 90374 427 91757 85931 7.0 7 776 3 1 0 1 1 0 0 0 41.1% 0 0 4096 0.017211 58 4096 237984 2011 241802 233870 66.1 66 8376 59 2 0 1 1 0 0 0 21.4% 0 5000000 1 0.022952 42 1 44 1 45 41 29.2 29 3520 22 2 0 1 1 0 0 0 92.3% 0 5000000 32 0.023052 43 32 1388 14 1414 1331 31.1 32 3588 26 2 0 1 1 0 0 0 91.7% 0 5000000 256 0.024795 41 256 10325 129 10721 9993 43.1 39 4544 29 2 0 1 1 0 0 0 86.4% 0 5000000 4096 0.038856 27 4096 105414 398 106918 103162 95.2 95 12160 78 5 0 1 1 0 0 0 61.4% 0 After (v2): offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.001831 248 1 546 21 581 252 17.6 17 188 2 0 0 1 1 0 0 0 8.5% 0 0 32 0.001910 535 32 16751 626 17770 13896 17.9 19 160 3 0 0 1 1 0 0 0 8.8% 0 0 256 0.003545 266 256 72207 2333 89076 62852 26.9 24 764 7 0 0 1 1 0 0 0 9.7% 0 0 4096 0.016800 56 4096 243812 524 245430 239736 83.6 83 8700 64 0 0 1 1 0 0 0 16.6% 0 5000000 1 0.001968 351 1 508 19 538 380 21.3 21 172 2 0 0 1 1 0 0 0 8.9% 0 5000000 32 0.002273 431 32 14077 436 15503 11551 22.7 22 268 3 0 0 1 1 0 0 0 8.9% 0 5000000 256 0.003889 257 256 65824 2197 81833 57813 34.0 37 652 18 0 0 1 1 0 0 0 11.2% 0 5000000 4096 0.017115 54 4096 239324 834 241310 231993 88.3 88 8844 65 0 0 1 1 0 0 0 16.8% 0 After (v1): offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.001886 259 1 530 4 545 261 18.0 18 376 2 2 0 1 1 0 0 0 9.1% 
0 0 32 0.001954 513 32 16381 93 16844 15618 19.0 19 408 3 2 0 1 1 0 0 0 9.3% 0 0 256 0.003266 318 256 78393 1820 81567 61663 30.8 26 1272 7 2 0 1 1 0 0 0 10.4% 0 0 4096 0.017991 57 4096 227666 855 231915 225781 83.1 83 8888 55 5 0 1 1 0 0 0 15.5% 0 5000000 1 0.002353 232 1 425 2 432 232 23.0 23 396 2 2 0 1 1 0 0 0 8.7% 0 5000000 32 0.002573 384 32 12437 47 12571 429 25.0 25 460 4 2 0 1 1 0 0 0 8.5% 0 5000000 256 0.003994 259 256 64101 2904 67924 51427 37.0 35 1484 11 2 0 1 1 0 0 0 10.6% 0 5000000 4096 0.018567 56 4096 220609 448 227395 219029 89.8 89 9036 59 5 0 1 1 0 0 0 15.1% 0 * Point reads, small promoted index (two blocks): Config: rows: 400, value size: 200 Partition size: 84 KiB Index size: 65 B Notes: - No significant difference in time - the same disk utilization - similar CPU utilization Command: perf_fast_forward --datasets=large-part-ds1 \ --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1 Before: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.000279 470 1 3587 31 3829 478 3.0 3 68 2 1 0 1 1 0 0 0 21.1% 0 0 32 0.000276 3498 32 116038 811 122756 104033 3.0 3 68 2 1 0 1 1 0 0 0 24.0% 0 0 256 0.000412 2554 256 621044 1778 732150 559221 2.0 2 72 2 0 0 1 1 0 0 0 32.6% 0 0 4096 0.000510 1901 400 783883 4078 819058 665616 2.0 2 88 2 0 0 1 1 0 0 0 36.4% 0 200 1 0.000339 2712 1 2951 8 3001 2569 2.0 2 72 2 0 0 1 1 0 0 0 17.8% 0 200 32 0.000352 2586 32 91019 266 92427 83411 2.0 2 72 2 0 0 1 1 0 0 0 20.8% 0 200 256 0.000458 2073 200 436503 1618 453945 385501 2.0 2 88 2 0 0 1 1 0 0 0 29.4% 0 200 4096 0.000458 2097 200 436475 1676 458349 381558 2.0 2 88 2 0 0 1 1 0 0 0 29.0% 0 After (v1): Testing slicing of large partition using clustering keys: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.000278 492 1 3598 30 
3831 500 3.0 3 68 2 1 0 1 1 0 0 0 19.4% 0 0 32 0.000275 3433 32 116153 753 122915 92559 3.0 3 68 2 1 0 1 1 0 0 0 22.5% 0 0 256 0.000458 2576 256 559437 2978 728075 504375 2.1 2 88 2 0 0 1 1 0 0 0 29.0% 0 0 4096 0.000506 1888 400 790064 3306 822360 623109 2.0 2 88 2 0 0 1 1 0 0 0 36.6% 0 200 1 0.000382 2493 1 2619 10 2675 2268 2.0 2 88 2 0 0 1 1 0 0 0 16.3% 0 200 32 0.000398 2393 32 80422 333 84759 22281 2.0 2 88 2 0 0 1 1 0 0 0 19.0% 0 200 256 0.000459 2096 200 435943 1608 453989 380749 2.0 2 88 2 0 0 1 1 0 0 0 30.5% 0 200 4096 0.000458 2097 200 436410 1651 455779 382485 2.0 2 88 2 0 0 1 1 0 0 0 29.2% 0 * Scan with skips, large index: Config: rows: 10000000, value size: 2000 Partition size: 20 GB Index size: 7 MB Notes: - Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch) - Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch) Binary search reads more by 828 KB and by 1719 IOs. It does more I/O to read the the promoted index offset map. - similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan we would end up caching the whole index. But this is protected against by eviction as demonstrated by the last "mem" column. 
Command: perf_fast_forward --datasets=large-part-ds1 \ --run-tests=large-partition-skips -c1 --test-case-duration=1 Before: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 1 1 36.103451 4 5000000 138491 38 138601 138453 153932.0 153932 19703260 153561 1 0 1 1 0 0 0 31.5% 502690 After (v2): read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 1 1 37.000145 4 5000000 135135 6 135146 135128 155651.0 155651 19704088 138968 0 0 1 1 0 0 0 34.2% 0 After (v1): read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 1 1 36.965520 4 5000000 135261 30 135311 135231 155628.0 155628 19704216 139133 1 0 1 1 0 0 0 33.9% 248738 Also in: git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2 Tests: - unit (all modes) - manual using perf_fast_forward " * tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla: sstables: Add promoted index cache metrics position_in_partition: Introduce external_memory_usage() cached_file, sstables: Add tracing to index binary search and page cache sstables: Dynamically adjust I/O size for index reads sstables, tests: Allow disabling binary search in promoted index from perf tests sstables: mc: Use binary search over the promoted index utils: Introduce cached_file sstables: clustered_index: Relax scope of validity of entry_info sstables: index_entry: Introduce owning promoted_index_block_position compound_compat: Allow constructing composite from a view sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view sstables: mc: Extract parser for promoted index block sstables: mc: Extract parser for clustering out of the promoted index block parser sstables: consumer: Extract primitive_consumer 
sstables: Abstract the clustering index cursor behavior sstables: index_reader: Rearrange to reduce branching and optionals	2020-06-18 12:09:39 +03:00
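The core idea of the series above, binary search over variable-length index entries made possible by a vector of entry offsets, can be sketched without any of the I/O or caching. Everything here is a hypothetical in-memory model: `promoted_index_sketch`, its byte layout and `find_block` are assumptions for illustration, not the mc format or Scylla's reader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model: entries of varying length packed back to back, plus an
// "offset map" holding the start of each entry. The offset map is what
// allows jumping to entry i without parsing entries 0..i-1.
struct promoted_index_sketch {
    std::string bytes;                   // packed entries (here: first key of each block)
    std::vector<std::uint32_t> offsets;  // offsets[i] = start of entry i in `bytes`

    void append(const std::string& first_key) {
        offsets.push_back(static_cast<std::uint32_t>(bytes.size()));
        bytes += first_key;
    }

    // Random access to one entry using only two neighbouring offsets.
    std::string key_at(std::size_t i) const {
        std::size_t begin = offsets[i];
        std::size_t end = (i + 1 < offsets.size()) ? offsets[i + 1] : bytes.size();
        return bytes.substr(begin, end - begin);
    }
};

// Returns the index of the last block whose first key is <= `key`
// (or 0 if every block starts after `key`). `probes` counts how many
// entries were inspected; it grows as O(log N) rather than O(N).
inline std::size_t find_block(const promoted_index_sketch& idx,
                              const std::string& key, std::size_t& probes) {
    assert(!idx.offsets.empty());
    std::size_t lo = 0;
    std::size_t hi = idx.offsets.size();  // answer lies in [lo, hi)
    probes = 0;
    while (hi - lo > 1) {
        std::size_t mid = lo + (hi - lo) / 2;
        ++probes;
        if (idx.key_at(mid) <= key) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return lo;
}
```

In the real reader each `key_at` call would translate into reading a small part of the index file, which is why the series pairs the search with a block cache: the last few bisections tend to land in the same I/O block.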
Tomasz Grabiec	ab274b8203	sstables: clustered_index: Relax scope of validity of entry_info entry_info holds views, which may get invalidated when the containing index blocks are removed. Current implementations of next_entry() keep the blocks in memory as long as the cursor is alive but that will change in new implementations of the cursor. Adjust the assumption of tests accordingly.	2020-06-16 16:15:23 +02:00
Tomasz Grabiec	f2e52c433f	sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view	2020-06-16 16:15:23 +02:00
Tomasz Grabiec	d5bf540079	sstables: Abstract the clustering index cursor behavior In preparation for supporting more than one algorithm for lookups in the promoted index, extract relevant logic out of the index_reader (which is a partition index cursor). The clustered index cursor implementation is now hidden behind abstract interface called clustered_index_cursor. The current implementation is put into the scanning_clustered_index_cursor. It's mostly code movement with minor adjustments. In order to encapsulate iteration over promoted index entries, clustered_index_cursor::next_entry() was introduced. No change in behavior intended in this patch.	2020-06-16 16:14:17 +02:00
Pavel Emelyanov	60e283b23e	auth: Move away from storage_service Now after the auth start/stop is standalone, we can remove reference from storage service to it. This frees some tests from the need to carry the auth service around for nothing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-06-12 22:14:33 +03:00
Pavel Emelyanov	6a46721fb7	auth: Move start-stop code into main The auth service management is currently sitting in storage service, but it was needed there just for cql/thrift start code. After the latter were moved away, there are no other reasons for the auth to be integrated with the storage service, so move it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-06-12 22:14:33 +03:00
Rafael Ávila de Espíndola	555d8fe520	build: Be consistent about system versus regular headers We were not consistent about using '#include "foo.hh"' instead of '#include <foo.hh>' for scylla's own headers. This patch fixes that inconsistency and, to enforce it, changes the build to use -iquote instead of -I to find those headers. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200608214208.110216-1-espindola@scylladb.com>	2020-06-10 15:49:51 +03:00
Dejan Mircevski	9027b6636f	Use sstring_view in execute_cql and assertions This lets the functions operate on a wider variety of arguments and may also be faster. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-06-10 08:10:43 +03:00
Glauber Costa	3972628fc0	compaction: split compaction.hh header compaction.hh is one of our heavy headers, but some users just want to use information on it about how to describe a compaction, not how to perform one. For that reason this patch splits the compaction_descriptor into a new header. The compaction_descriptor has, as a member type, compaction_options. That is moved too, and brings with it the compaction_type. Both of those structures would make sense in a separate header anyway. The compaction_descriptor also wants the creator_fn and replacer_fn functions. We also take this opportunity to rename them into something more descriptive Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-08 16:06:00 -04:00
Avi Kivity	6f394e8e90	tombstone: use comparison operator instead of ad-hoc compare() function and with_relational_operators The comparison operator (<=>) default implementation happens to exactly match tombstone::compare(), so use the compiler-generated defaults. Also default operator== and operator!= (these are not brought in by operator<=>). These become slightly faster as they perform just an equality comparison, not three-way compare. shadowable_tombstone and row_tombstone depend on tombstone::compare(), so convert them too in a similar way. with_relational_operations.hh becomes unused, so delete it. Tests: unit (dev) Message-Id: <20200602055626.2874801-1-avi@scylladb.com>	2020-06-02 09:28:52 +03:00
Piotr Sarna	160e2b06f9	test: move random string helpers to .cc ... since there's no reason for them to reside in a header, and .cc is our default destination. Message-Id: <2509410f0f71df036a7829f1f799503c1a671404.1591078777.git.sarna@scylladb.com>	2020-06-02 09:27:59 +03:00
Avi Kivity	a4c44cab88	treewide: update concepts language from the Concepts TS to C++20 Seastar recently lost support for the experimental Concepts Technical Specification (TS) and gained support for C++20 concepts. Re-enable concepts in Scylla by updating our use of concepts to the C++20 standard. This change: - peels off uses of the GCC6_CONCEPT macro - removes inclusions of <seastar/gcc6-concepts.hh> - replaces function-style concepts (no longer supported) with equation-style concepts - semicolons added and removed as needed - deprecated std::is_pod replaced by recommended replacement - updates return type constraints to use concepts instead of type names (either std::same_as or std::convertible_to, with std::same_as chosen when possible) No attempt is made to improve the concepts; this is a specification update only. Message-Id: <20200531110254.2555854-1-avi@scylladb.com>	2020-06-02 09:12:21 +03:00
Piotr Sarna	91e02ed3ad	test/lib: add generating random numeric string Useful for testing random numeric inputs, e.g. big decimals.	2020-06-01 16:11:49 +02:00
Botond Dénes	c5b0e8a45a	test: move thread-safe test macro alternatives to lib/test_utils.hh Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200529130706.149603-2-bdenes@scylladb.com>	2020-05-31 16:08:02 +03:00
Botond Dénes	d68ac8bf18	treewide: remove all uses of no_reader_permit()	2020-05-28 11:34:35 +03:00
Botond Dénes	b5aa08ed77	sstables: pass valid permits to all internal reads We will soon require a valid permit for all reads, including low level index reads. The sstable layer has several internal reads which can not be associated with either the user or the system read semaphores or it would be very hard to obtain the correct semaphore, for limited/no gain. To be able to pass a valid permit still, we either expose a permit parameter so upper layers can pass down one, or create a local semaphore for these reads and use that to obtain a permit. The following methods now require a permit to be passed to them: * `sstables::sstable::read_data()`: only used in tests. The following methods use internal semaphores: * `sstables::sstable::generate_summary()` used when loading an sstable. * `sstables::sstable::has_partition_key()`: used by a REST API method.	2020-05-28 11:34:35 +03:00
Botond Dénes	734e995639	database: add compaction read concurrency semaphore All reads will soon require a valid permit, including those done during compaction. To allow creating valid permits for these reads create a compaction specific semaphore. This semaphore is unlimited as compaction concurrency is managed by a higher-level layer; we use it just for resource usage accounting.	2020-05-28 11:34:35 +03:00


171 Commits