scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-27 11:55:15 +00:00

Author	SHA1	Message	Date
Dejan Mircevski	aec1acd1d5	range_test: Add cases for singular intersection Intersection was previously not tested for singular ranges. This ensures it will always work for singular ranges, too. Tests: unit(dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-06-18 12:38:31 +03:00
Avi Kivity	9322c07c71	Merge "Use binary search in sstable promoted index" from Tomasz " The "promoted index" is how the sstable format calls the clustering key index within a given partition. Large partitions with many rows have it. It's embedded in the partition index entry. Currently, lookups in the promoted index are done by scanning the index linearly so the lookup is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O. We could do better and use binary search in the index. This patch series switches the mc-format index reader to do that. Other formats use the old way. The "mc" format promoted index has an extra structure at the end of the index called "offset map". It's a vector of offsets of consecutive promoted index entries. This allows us to access random entries in the index without reading the whole index. The location of the offset entry for a given promoted index entry can be derived by knowing where the offset vector ends in the index file, so the offset map also doesn't have to be read completely into memory. The trickiest part is caching. We need to cache blocks read from the index file to amortize the cost of binary search: - if the promoted index fits in the 32 KiB which was read from the index when looking for the partition entry, we don't want to issue any additional I/O to search the promoted index. - with large promoted indexes, the last few bisections will fall into the same I/O block and we want to reuse that block. - we don't want the cache to grow too big, we don't want to cache the whole promoted index as the read progresses over the index. Scanning reads may skip multiple times. This series implements a rather simple approach which meets all the above requirements and is not worse than the current state of affairs: - Each index cursor has its own cache of the index file area which corresponds to the promoted index. This is managed by the cached_file class. 
- Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to reuse information obtained during lower bound lookup. This estimation is used to limit read-aheads in the data file. - Each cursor drops entries that it walked past so that memory footprint stays O(log N) - Cached buffers are accounted to the read's reader_permit. Later, we could have a single cache shared by many readers. For that, we need to come up with an eviction policy. Fixes #4007. TESTING RESULTS * Point reads, large promoted index: Config: rows: 10000000, value size: 2000 Partition size: 20 GB Index size: 7 MB Notes: - Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search: time: 1.9ms vs 22.9ms CPU utilization: 8.9% vs 92.3% I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB It's 12x faster, CPU utilization is 10x smaller, disk utilization is 20x smaller. - Slicing at the front (offset=0) is a mixed bag. time is similar: 1.8ms CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7% disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch) bsearch uses less bandwidth because the series reduces buffer size used for index file I/O. scan is issuing: 2 * 128 KB (index page) 2 * 32 KB (data file) bsearch is issuing: 1 * 64 KB (index page) 15 * 4 KB (promoted index) 1 * 64 KB (data file) The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead). 32 KB is the minimum I/O currently. Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work so that it uses 1 * 4 KB when it suffices. This is left for a follow-up. 
Command: perf_fast_forward --datasets=large-part-ds1 \ --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1 Before: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.001836 172 1 545 9 563 175 4.0 4 320 2 2 0 1 1 0 0 0 57.7% 0 0 32 0.001858 502 32 17220 126 17776 11526 3.2 3 324 2 1 0 1 1 0 0 0 56.4% 0 0 256 0.002833 339 256 90374 427 91757 85931 7.0 7 776 3 1 0 1 1 0 0 0 41.1% 0 0 4096 0.017211 58 4096 237984 2011 241802 233870 66.1 66 8376 59 2 0 1 1 0 0 0 21.4% 0 5000000 1 0.022952 42 1 44 1 45 41 29.2 29 3520 22 2 0 1 1 0 0 0 92.3% 0 5000000 32 0.023052 43 32 1388 14 1414 1331 31.1 32 3588 26 2 0 1 1 0 0 0 91.7% 0 5000000 256 0.024795 41 256 10325 129 10721 9993 43.1 39 4544 29 2 0 1 1 0 0 0 86.4% 0 5000000 4096 0.038856 27 4096 105414 398 106918 103162 95.2 95 12160 78 5 0 1 1 0 0 0 61.4% 0 After (v2): offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.001831 248 1 546 21 581 252 17.6 17 188 2 0 0 1 1 0 0 0 8.5% 0 0 32 0.001910 535 32 16751 626 17770 13896 17.9 19 160 3 0 0 1 1 0 0 0 8.8% 0 0 256 0.003545 266 256 72207 2333 89076 62852 26.9 24 764 7 0 0 1 1 0 0 0 9.7% 0 0 4096 0.016800 56 4096 243812 524 245430 239736 83.6 83 8700 64 0 0 1 1 0 0 0 16.6% 0 5000000 1 0.001968 351 1 508 19 538 380 21.3 21 172 2 0 0 1 1 0 0 0 8.9% 0 5000000 32 0.002273 431 32 14077 436 15503 11551 22.7 22 268 3 0 0 1 1 0 0 0 8.9% 0 5000000 256 0.003889 257 256 65824 2197 81833 57813 34.0 37 652 18 0 0 1 1 0 0 0 11.2% 0 5000000 4096 0.017115 54 4096 239324 834 241310 231993 88.3 88 8844 65 0 0 1 1 0 0 0 16.8% 0 After (v1): offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.001886 259 1 530 4 545 261 18.0 18 376 2 2 0 1 1 0 0 0 9.1% 
0 0 32 0.001954 513 32 16381 93 16844 15618 19.0 19 408 3 2 0 1 1 0 0 0 9.3% 0 0 256 0.003266 318 256 78393 1820 81567 61663 30.8 26 1272 7 2 0 1 1 0 0 0 10.4% 0 0 4096 0.017991 57 4096 227666 855 231915 225781 83.1 83 8888 55 5 0 1 1 0 0 0 15.5% 0 5000000 1 0.002353 232 1 425 2 432 232 23.0 23 396 2 2 0 1 1 0 0 0 8.7% 0 5000000 32 0.002573 384 32 12437 47 12571 429 25.0 25 460 4 2 0 1 1 0 0 0 8.5% 0 5000000 256 0.003994 259 256 64101 2904 67924 51427 37.0 35 1484 11 2 0 1 1 0 0 0 10.6% 0 5000000 4096 0.018567 56 4096 220609 448 227395 219029 89.8 89 9036 59 5 0 1 1 0 0 0 15.1% 0 * Point reads, small promoted index (two blocks): Config: rows: 400, value size: 200 Partition size: 84 KiB Index size: 65 B Notes: - No significant difference in time - the same disk utilization - similar CPU utilization Command: perf_fast_forward --datasets=large-part-ds1 \ --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1 Before: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.000279 470 1 3587 31 3829 478 3.0 3 68 2 1 0 1 1 0 0 0 21.1% 0 0 32 0.000276 3498 32 116038 811 122756 104033 3.0 3 68 2 1 0 1 1 0 0 0 24.0% 0 0 256 0.000412 2554 256 621044 1778 732150 559221 2.0 2 72 2 0 0 1 1 0 0 0 32.6% 0 0 4096 0.000510 1901 400 783883 4078 819058 665616 2.0 2 88 2 0 0 1 1 0 0 0 36.4% 0 200 1 0.000339 2712 1 2951 8 3001 2569 2.0 2 72 2 0 0 1 1 0 0 0 17.8% 0 200 32 0.000352 2586 32 91019 266 92427 83411 2.0 2 72 2 0 0 1 1 0 0 0 20.8% 0 200 256 0.000458 2073 200 436503 1618 453945 385501 2.0 2 88 2 0 0 1 1 0 0 0 29.4% 0 200 4096 0.000458 2097 200 436475 1676 458349 381558 2.0 2 88 2 0 0 1 1 0 0 0 29.0% 0 After (v1): Testing slicing of large partition using clustering keys: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.000278 492 1 3598 30 
3831 500 3.0 3 68 2 1 0 1 1 0 0 0 19.4% 0 0 32 0.000275 3433 32 116153 753 122915 92559 3.0 3 68 2 1 0 1 1 0 0 0 22.5% 0 0 256 0.000458 2576 256 559437 2978 728075 504375 2.1 2 88 2 0 0 1 1 0 0 0 29.0% 0 0 4096 0.000506 1888 400 790064 3306 822360 623109 2.0 2 88 2 0 0 1 1 0 0 0 36.6% 0 200 1 0.000382 2493 1 2619 10 2675 2268 2.0 2 88 2 0 0 1 1 0 0 0 16.3% 0 200 32 0.000398 2393 32 80422 333 84759 22281 2.0 2 88 2 0 0 1 1 0 0 0 19.0% 0 200 256 0.000459 2096 200 435943 1608 453989 380749 2.0 2 88 2 0 0 1 1 0 0 0 30.5% 0 200 4096 0.000458 2097 200 436410 1651 455779 382485 2.0 2 88 2 0 0 1 1 0 0 0 29.2% 0 * Scan with skips, large index: Config: rows: 10000000, value size: 2000 Partition size: 20 GB Index size: 7 MB Notes: - Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch) - Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch) Binary search reads more by 828 KB and by 1719 IOs. It does more I/O to read the promoted index offset map. - similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan we would end up caching the whole index. But this is protected against by eviction as demonstrated by the last "mem" column. 
Command: perf_fast_forward --datasets=large-part-ds1 \ --run-tests=large-partition-skips -c1 --test-case-duration=1 Before: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 1 1 36.103451 4 5000000 138491 38 138601 138453 153932.0 153932 19703260 153561 1 0 1 1 0 0 0 31.5% 502690 After (v2): read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 1 1 37.000145 4 5000000 135135 6 135146 135128 155651.0 155651 19704088 138968 0 0 1 1 0 0 0 34.2% 0 After (v1): read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 1 1 36.965520 4 5000000 135261 30 135311 135231 155628.0 155628 19704216 139133 1 0 1 1 0 0 0 33.9% 248738 Also in: git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2 Tests: - unit (all modes) - manual using perf_fast_forward " * tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla: sstables: Add promoted index cache metrics position_in_partition: Introduce external_memory_usage() cached_file, sstables: Add tracing to index binary search and page cache sstables: Dynamically adjust I/O size for index reads sstables, tests: Allow disabling binary search in promoted index from perf tests sstables: mc: Use binary search over the promoted index utils: Introduce cached_file sstables: clustered_index: Relax scope of validity of entry_info sstables: index_entry: Introduce owning promoted_index_block_position compound_compat: Allow constructing composite from a view sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view sstables: mc: Extract parser for promoted index block sstables: mc: Extract parser for clustering out of the promoted index block parser sstables: consumer: Extract primitive_consumer 
sstables: Abstract the clustering index cursor behavior sstables: index_reader: Rearrange to reduce branching and optionals	2020-06-18 12:09:39 +03:00
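The lookup described in the series above can be sketched roughly as follows. This is a minimal illustration, not the code from the series: `block_entry`, `lower_bound_block` and the `read_entry` callback are hypothetical names, integer keys stand in for serialized clustering positions, and `read_entry(i)` models "resolve entry `i` through the offset map, then parse the entry at that offset".

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for one promoted index block, covering the
// clustering keys in [first_key, last_key].
struct block_entry {
    int first_key;
    int last_key;
};

// Returns the index of the first block whose last_key >= key, or
// entry_count if there is no such block. Only O(log N) entries are
// read, instead of the O(N) entries touched by a linear scan.
inline std::size_t lower_bound_block(
        std::size_t entry_count,
        const std::function<block_entry(std::size_t)>& read_entry,
        int key) {
    std::size_t lo = 0;
    std::size_t hi = entry_count;
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (read_entry(mid).last_key < key) {
            lo = mid + 1;   // the key lies after block `mid`
        } else {
            hi = mid;       // block `mid` may contain the key
        }
    }
    return lo;
}
```

With 1000 blocks the loop reads at most 10 entries, which is why caching the few I/O blocks touched by the last bisections matters more than caching the whole index.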
Tomasz Grabiec	c95dd67d11	utils: Introduce cached_file It is a read-through cache of a file. Will be used to cache contents of the promoted index area from the index file. Currently, cached pages are evicted manually using the invalidate_*() method family, or when the object is destroyed. The cached_file represents a subset of the file. The reason for this is to satisfy two requirements. One is that we have a page-aligned caching, where pages are aligned relative to the start of the underlying file. This matches requirements of the seastar I/O engine on I/O requests. Another requirement is to have an effective way to populate the cache using an unaligned buffer which starts in the middle of the file when we know that we won't need to access bytes located before the buffer's position. See populate_front(). If we couldn't assume that, we wouldn't be able to insert an unaligned buffer into the cache.	2020-06-16 16:15:23 +02:00
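A read-through, page-aligned cache of this kind can be sketched as below. This is an assumption-laden illustration rather than the actual `cached_file`: the class name `page_cache_sketch`, the 4096-byte page size, the synchronous `loader` callback and `drop_pages_before()` are all hypothetical, and populating from an unaligned buffer (the `populate_front()` case) is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

class page_cache_sketch {
public:
    static constexpr std::size_t page_size = 4096;
    // Loads `len` bytes starting at the page-aligned file offset `pos`.
    // The sketch assumes the loader always returns a full page.
    using loader = std::function<std::string(std::size_t pos, std::size_t len)>;

    explicit page_cache_sketch(loader load) : _load(std::move(load)) {}

    // Reads [pos, pos + len), loading only the pages that are not cached yet.
    std::string read(std::size_t pos, std::size_t len) {
        std::string out;
        const std::size_t end = pos + len;
        while (pos < end) {
            const std::size_t page = pos / page_size;
            auto it = _pages.find(page);
            if (it == _pages.end()) {
                it = _pages.emplace(page, _load(page * page_size, page_size)).first;
            }
            const std::size_t in_page = pos % page_size;
            const std::size_t n = std::min(end - pos, page_size - in_page);
            out.append(it->second, in_page, n);
            pos += n;
        }
        return out;
    }

    // Manual eviction: drops every page that lies entirely before `pos`.
    void drop_pages_before(std::size_t pos) {
        _pages.erase(_pages.begin(), _pages.lower_bound(pos / page_size));
    }

    std::size_t cached_pages() const { return _pages.size(); }

private:
    loader _load;
    std::map<std::size_t, std::string> _pages; // page index -> page contents
};
```

Keeping pages aligned to the start of the underlying file means every load is an aligned request, which is the property the commit message attributes to the seastar I/O engine's requirements.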
Avi Kivity	d17b05e911	Merge 'Adding Optimized pseudo floating point estimated histogram' from Amnon " This series adds a pseudo-floating-point histogram implementation. The histogram is used for time_estimated_histogram, a histogram for latency tracking, which is then used in storage_proxy as a more efficient, higher-resolution histogram. Follow-up series will use the new histogram in other places in the system and will add an implementation that supports lower values. Fixes #5815 Fixes #4746 " * amnonh-quicker_estimated_histogram: storage_proxy: use time_estimated_histogram for latencies test/boost/estimated_histogram_test utils/histogram_metrics_helper Adding histogram converter utils/estimated_histogram: Adding approx_exponential_histogram	2020-06-15 10:19:36 +03:00
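One way to picture a pseudo-floating-point histogram is the sketch below. It is not the `approx_exponential_histogram` from the series: the class name, the template parameters and the overflow bucket are assumptions made for illustration. The bucket index is built like a tiny floating-point number: the exponent selects a power-of-two range and the top `PrecisionBits` bits below the leading bit select a linear sub-bucket inside it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// floor(log2(v)) for v > 0, written as a loop to stay portable.
constexpr unsigned floor_log2(std::uint64_t v) {
    unsigned r = 0;
    while (v >>= 1) {
        ++r;
    }
    return r;
}

// Hypothetical parameters: Min and Max are powers of two, and every
// power-of-two range is split into 2^PrecisionBits equal sub-buckets.
template <std::uint64_t Min, std::uint64_t Max, unsigned PrecisionBits>
class pseudo_fp_histogram_sketch {
    static_assert(Min > 0 && (Min & (Min - 1)) == 0, "Min must be a power of two");
    static_assert(Max > Min && (Max & (Max - 1)) == 0, "Max must be a power of two");
    static_assert(floor_log2(Min) >= PrecisionBits, "Min too small for this precision");

public:
    static constexpr unsigned min_exp = floor_log2(Min);
    // The last bucket collects every value >= Max.
    static constexpr std::size_t num_buckets =
        (static_cast<std::size_t>(floor_log2(Max) - min_exp) << PrecisionBits) + 1;

    static constexpr std::size_t bucket_index(std::uint64_t v) {
        if (v < Min) {
            return 0;                   // values below Min share bucket 0
        }
        if (v >= Max) {
            return num_buckets - 1;     // overflow bucket
        }
        const unsigned e = floor_log2(v);
        const std::uint64_t mantissa =
            (v >> (e - PrecisionBits)) & ((std::uint64_t(1) << PrecisionBits) - 1);
        return (static_cast<std::size_t>(e - min_exp) << PrecisionBits)
             + static_cast<std::size_t>(mantissa);
    }

    static constexpr std::uint64_t bucket_lower_bound(std::size_t b) {
        const std::uint64_t base = Min << (b >> PrecisionBits);
        return base + (base >> PrecisionBits) * (b & ((std::size_t(1) << PrecisionBits) - 1));
    }

    void add(std::uint64_t v) { ++_counts[bucket_index(v)]; }
    std::uint64_t count(std::size_t b) const { return _counts[b]; }

private:
    std::array<std::uint64_t, num_buckets> _counts{};
};
```

Because the index needs only shifts and masks, `add()` stays cheap, and with `PrecisionBits = 2` each bucket is a quarter as wide as the power of two it starts from.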
Amnon Heiman	1cbc2e3d3e	test/boost/estimated_histogram_test This patch adds basic testing for the approx_exponential_histogram implementations. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2020-06-15 08:22:57 +03:00
Pavel Emelyanov	60e283b23e	auth: Move away from storage_service Now after the auth start/stop is standalone, we can remove reference from storage service to it. This frees some tests from the need to carry the auth service around for nothing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-06-12 22:14:33 +03:00
Rafael Ávila de Espíndola	555d8fe520	build: Be consistent about system versus regular headers We were not consistent about using '#include "foo.hh"' instead of '#include <foo.hh>' for scylla's own headers. This patch fixes that inconsistency and, to enforce it, changes the build to use -iquote instead of -I to find those headers. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200608214208.110216-1-espindola@scylladb.com>	2020-06-10 15:49:51 +03:00
Glauber Costa	aebd965f0e	distributed_load: initial handling of off-strategy SSTables Off-strategy SSTables are SSTables that do not conform to the invariants that the compaction strategies define. Examples of off-strategy SSTables are SSTables acquired over bootstrap, resharding when the cpu count changes or imported from other databases through our upload directory. This patch introduces a new class, sstable_directory, that will handle SSTables that are present in a directory that is not one of the directories where the table expects its SSTables. There is much to be done to support off-strategy compactions fully. To make sure we make incremental progress, this patch implements enough code to handle resharding of SSTables in the upload directory. SSTables are resharded in place, before we start accessing the files. Later, we will take other steps before we finally move the SSTables into the main directory. But for now, starting with resharding will not only allow us to start small, but it will also allow us to start unleashing much needed cleanups in many places. For instance, once we start resharding on boot before making the SSTables available, we will be able to expunge all places in Scylla where, during normal operations, we have extra handler code for the fact that SSTables could be shared. Tests: a new test is added and it passes in debug mode. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-08 16:06:00 -04:00
Kamil Braun	a1e235b1a4	CDC: Don't split collection tombstone away from base update Overwriting a collection cell using timestamp T is a process with the following steps: 1. inserting a row marker (if applicable) with timestamp T; 2. writing a collection tombstone with timestamp T-1; 3. writing the new collection value with timestamp T. Since CDC does clustering of the operations by timestamp, this would result in 3 separate calls to `transform` (in case of INSERT, or 2 - in the case of UPDATE), which seems excessive, especially when pre-/postimage is enabled. This patch makes collection tombstones be treated as if they had the same TS as the base write and thus they are processed in one call to `transform` (as long as TTLs are not used). Also, `cdc_test` had to be updated in places that relied on the former splitting strategy. Fixes #6084	2020-06-07 17:09:05 +03:00
Raphael S. Carvalho	8e47f61df7	compaction: Enable tombstone expiration based on the presence of the sstable set For tombstone expiration to proceed correctly without the risk of resurrecting data, the sstable set must be present. Regular compaction and derivatives provide the sstable set, so they're able to expire tombstones with no resurrection risk. Resharding, on the other hand, can run on any shard, not necessarily on the same shard that one of the input sstables belongs to, so it currently cannot provide an sstable set for tombstone expiration to proceed safely. That being said, let's only do expiration based on the presence of the set. This makes room for the sstable set to be fed to compaction via descriptor, allowing even resharding to do expiration. Currently, compaction thinks that the sstable set can only come from the table, and that also needs to be changed for further flexibility. It's theoretically possible that a given resharding job will resurrect data if a fully expired SSTable is resharded at a shard which it doesn't belong to. Resharding will have no way to tell that expiring all that data will lead to resurrection because the relevant SSTables are at different shards. This is fixed by checking for fully expired sstables only on presence of the sstable set. Fixes #6600. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200605200954.24696-1-raphaelsc@scylladb.com>	2020-06-07 11:46:48 +03:00
Kamil Braun	1b7f1806ac	test: improve comments on test_schema_digest_does_not_change This test tends to cause a lot of discussion resulting from not understanding what is actually being tested. Closes https://github.com/scylladb/scylla/issues/6582.	2020-06-05 14:30:02 +02:00
Kamil Braun	d89b7a0548	cdc: rename CDC description tables Commit `968177da04` has changed the schema of cdc_topology_description and cdc_description tables in the system_distributed keyspace. Unfortunately this was a backwards-incompatible change: these tables would always be created, irrespective of whether or not "experimental" was enabled. They just wouldn't be populated with experimental=off. If the user now tries to upgrade Scylla from a version before this change to a version after this change, it will work as long as CDC is protected by the experimental flag and the flag is off. However, if we drop the flag, or if the user turns experimental on, weird things will happen, such as nodes refusing to start because they try to populate cdc_topology_description while assuming a different schema for this table. The simplest fix for this problem is to rename the tables. This fix must get merged in before CDC goes out of experimental. If the user upgrades his cluster from a pre-rename version, he will simply have two garbage tables that he is free to delete after upgrading. sstables and digests need to be regenerated for schema_digest_test since this commit effectively adds new tables to the system_distributed keyspace. This doesn't result in schema disagreement because the table is announced to all nodes through the migration manager.	2020-06-05 09:59:16 +02:00
Avi Kivity	0c34e114e2	Merge "Upgrade to seastar api version 3" (make_file_output_stream returns future) from Rafael " The new seastar api changes make_file_output_stream and make_file_data_sink to return futures. This series includes a few refactoring patches and the actual transition. " * 'espindola/api-v3-v3' of https://github.com/espindola/scylla: table: Fix indentation everywhere: Move to seastar api level 3 sstables: Pass an output_stream to make_compressed_file_.*_format_output_stream sstables: Pass a data_sink to checksummed_file_writer's constructor sstables: Convert a file_writer constructor to a static make sstables: Move file_writer constructor out of line	2020-06-03 23:09:49 +03:00
Rafael Ávila de Espíndola	e5876f6696	everywhere: Move to seastar api level 3 Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-06-03 10:32:46 -07:00
Rafael Ávila de Espíndola	13282b3d4c	sstables: Pass an output_stream to make_compressed_file_.*_format_output_stream This is a bit simpler as we don't have to pass in the options and moves the calls to make_file_output_stream to places where we can handle futures. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-06-03 10:32:46 -07:00
Raphael S. Carvalho	fb6976f1b9	Make sure SSTables created by streaming are added to backlog tracker New SSTables are only added to the backlog tracker if set_unshared() was called on their behalf. SSTables created for streaming are not being added to the tracker because make_streaming_sstable_for_write() doesn't call set_unshared(), nor does its caller. As a result, the backlog does not account for their existence, which means the backlog will be much lower than expected. This problem could be fixed by adding a set_unshared() call but it turns out we don't even need set_unshared() anymore. It was introduced when Scylla metadata didn't exist; now an SSTable has built-in knowledge of whether or not it's shared. Relying on every SSTable creator calling set_unshared() is bug prone. Let's get rid of it and let the SSTable itself say whether or not it's shared. If an imported SSTable has no Scylla metadata, Scylla will still be able to compute shards using token range metadata. Refs #6021. Refs #6227. Fixes #6441. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200512220226.134481-1-raphaelsc@scylladb.com>	2020-06-03 17:35:22 +03:00
Tomasz Grabiec	087fa42c1d	Merge "utils: inject errors around paxos stages" from Alejo Add Paxos error injections before/after save promise, proposal, decision, paxos_response_handler, delete decision. Adds a method to inject an error providing a lambda while avoiding adding a continuation when the error injection is disabled. For this, provide an error exception and enter() to allow flow control (i.e. return) on simple error injections without lambdas. Also includes Pavel's patch for CQL API for error injections, updated to the current error injection API and with added one_shot support. Also added some basic CQL API boost tests. For CQL API there's a limitation of the current grammar not supporting f(<terminal>) so values have to be inserted in a table until this is resolved. See #5411 * https://github.com/alecco/scylla/tree/error_injection_v11: paxos: fix indentation paxos: add error injections utils: add timeout error injection with lambda utils: error injection add enter() for control flow utils: error injections provide error exceptions failure_injector: implement CQL API for failure injector class lwt: fix disabled error injection templates	2020-06-03 15:42:10 +02:00
Alejo Sanchez	a8b14b0227	utils: add timeout error injection with lambda Even though calling then() on a ready future does not allocate a continuation, calling then() on the result of it will allocate. This error injection only adds a continuation in the dependency chain if error injections are enabled at compile time and this particular error injection is enabled. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2020-06-03 14:44:00 +02:00
Alejo Sanchez	0321172677	utils: error injection add enter() for control flow For control flow (i.e. return) and simplicity add enter() method. For disabled injections, this method is const returning false, therefore it has no overhead. Add boost test. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2020-06-03 14:42:48 +02:00
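The `enter()` pattern can be illustrated with the sketch below. It is a hypothetical model, not the injector from the patch: the class name `error_injection_sketch`, the `one_shot` bookkeeping, the `save_promise()` caller and the injection name `"paxos_save_promise"` are all assumptions. The point it shows is that the disabled specialization is a stateless stub whose `enter()` always returns false, so call sites can use a plain `if` for control flow.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

template <bool Enabled>
class error_injection_sketch;

// Enabled variant: tracks which injections are active.
template <>
class error_injection_sketch<true> {
    std::unordered_map<std::string, bool> _active; // name -> one_shot
public:
    void enable(std::string name, bool one_shot = false) {
        _active[std::move(name)] = one_shot;
    }
    void disable(const std::string& name) { _active.erase(name); }

    // Returns true when the caller should take the injected-error path.
    bool enter(const std::string& name) {
        auto it = _active.find(name);
        if (it == _active.end()) {
            return false;
        }
        if (it->second) {
            _active.erase(it); // one-shot injections fire only once
        }
        return true;
    }
};

// Disabled variant: no state, enter() is const and always false.
template <>
class error_injection_sketch<false> {
public:
    void enable(const std::string&, bool = false) {}
    void disable(const std::string&) {}
    bool enter(const std::string&) const { return false; }
};

// Hypothetical call site: a simple early return instead of a lambda.
template <typename Injector>
int save_promise(Injector& injector) {
    if (injector.enter("paxos_save_promise")) {
        return -1; // injected failure
    }
    return 0;      // normal path
}
```

Since both specializations expose the same interface, the call site compiles unchanged in either build mode.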
Piotr Sarna	ecc4a87a24	test: add test cases to big_decimal_test Test cases for big decimals were quite complete, but since the implementation was recently changed, some corner cases are added: - incorrect strings - numbers not fitting into uint64_t - numbers less than uint64_t::max themselves, but with the unscaled value exceeding the maximum	2020-06-01 16:11:49 +02:00
Botond Dénes	7c56e79355	test/multishard_mutation_query_test: eliminate another unsafely used boost test macro Boost test macros are not thread safe, using them from multiple threads results in garbled XML test report output. `3f1823a4f0` replaced most of the thread-unsafe boost test macros in multishard_mutation_query_test, but one still managed to slip through the cracks. This patch removes that as well. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200529130706.149603-3-bdenes@scylladb.com>	2020-05-31 16:08:02 +03:00
Botond Dénes	c5b0e8a45a	test: move thread-safe test macro alternatives to lib/test_utils.hh Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200529130706.149603-2-bdenes@scylladb.com>	2020-05-31 16:08:02 +03:00
Botond Dénes	7ea64b1838	test: mutation_reader_test: use <ranges> Replace all the ranges stuff we use from boost with the std equivalents. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200529141407.158960-3-bdenes@scylladb.com>	2020-05-31 12:58:59 +03:00
Avi Kivity	0c6bbc84cd	Merge "Classify queries based on their initiator, rather than their target" from Botond " Currently we classify queries as "system" or "user" based on the table they target. The class of a query determines how the query is treated, currently: timeout, limits for reverse queries and the concurrency semaphore. The catch is that users are also allowed to query system tables and when doing so they will bypass the limits intended for user queries. This has caused performance problems in the past, yet the reason we decided to finally address this is that we want to introduce a memory limit for unpaged queries. Internal (system) queries are all unpaged and we don't want to impose the same limit on them. This series uses scheduling groups to distinguish user and system workloads, based on the assumption that user workloads will run in the statement scheduling group, while system workloads will run in the main (or default) scheduling group, or perhaps something else, but in any case not in the statement one. Currently the scheduling group of reads and writes is lost when going through the messaging service, so to be able to use scheduling groups to distinguish user and system reads this series refactors the messaging service to retain this distinction across verb calls. Furthermore, we execute some system reads/writes as part of user reads/writes, such as auth and schema sync. These processes are tagged to run in the main group. This series also centralises query classification on the replica and moves it to a higher level. More specifically, queries are now classified -- the scheduling group they run in is translated to the appropriate query class specific configuration -- on the database level and the configuration is propagated down to the lower layers. Currently this query class specific configuration consists of the reader concurrency semaphore and the max memory limit for otherwise unlimited queries. 
A corollary of the semaphore being selected on the database level is that the read permit is now created before the read starts. A valid permit is now available during all stages of the read, enabling tracking the memory consumption of e.g. the memtable and cache readers. This change aligns nicely with the needs of more accurate reader memory tracking, which also wants a valid permit that is available in every layer. The series can be divided roughly into the following distinct patch groups: * 01-02: Give system read concurrency a boost during startup. * 03-06: Introduce user/system statement isolation to messaging service. * 07-13: Various infrastructure changes to prepare for using read permits in all stages of reads. * 14-19: Propagate the semaphore and the permit from database to the various table methods that currently create the permit. * 20-23: Migrate away from using the reader concurrency semaphore for waiting for admission, use the permit instead. * 24: Introduce `database::make_query_config()` and switch the database methods needing such a config to use it. * 25-31: Get rid of all uses of `no_reader_permit()`. * 32-33: Ban empty permits for good. * 34: querier_cache: use the queriers' permits to obtain the semaphore. Fixes: #5919 Tests: unit(dev, release, debug), dtest(bootstrap_test.py:TestBootstrap.start_stop_test_node), manual testing with a 2 node mixed cluster with extra logging. 
" * 'query-class/v6' of https://github.com/denesb/scylla: (34 commits) querier_cache: get semaphore from querier reader_permit: forbid empty permits reader_permit: fix reader_resources::operator bool treewide: remove all uses of no_reader_permit() database: make_multishard_streaming_reader: pass valid permit to multi range reader sstables: pass valid permits to all internal reads compaction: pass a valid permit to sstable reads database: add compaction read concurrency semaphore view: use valid permits for reads from the base table database: use valid permit for counter read-before-write database: introduce make_query_class_config() reader_concurrency_semaphore: remove wait_admission and consume_resources() test: move away from reader_concurrency_semaphore::wait_admission() reader_permit: resource_units: introduce add() mutation_reader: restricted_reader: work in terms of reader_permit row_cache: pass a valid permit to underlying read memtable: pass a valid permit to the delegate reader table: require a valid permit to be passed to most read methods multishard_mutation_query: pass a valid permit to shard mutation sources querier: add reader_permit parameter and forward it to the mutation_source ...	2020-05-29 10:11:44 +03:00
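The classification idea from the series above, choosing the query class from the scheduling group the query runs in rather than from the table it targets, might be sketched like this. Every type here is a hypothetical stand-in (`sched_group`, `semaphore_sketch`, `query_class_config_sketch`, `database_sketch`), and the rule "statement group is user, everything else is system" is taken from the stated assumption in the cover letter.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-ins for the scheduling group of the current task
// and for a reader concurrency semaphore.
enum class sched_group { main_group, statement };
struct semaphore_sketch { int max_concurrent_reads; };

// Per-class configuration: which semaphore admits the read and how much
// memory an otherwise unlimited (unpaged) query may consume.
struct query_class_config_sketch {
    semaphore_sketch* semaphore;
    std::uint64_t max_memory_for_unlimited_query;
};

class database_sketch {
    semaphore_sketch _user_sem{100};
    semaphore_sketch _system_sem{10};
    std::uint64_t _user_unpaged_limit;
public:
    explicit database_sketch(std::uint64_t user_unpaged_limit)
        : _user_unpaged_limit(user_unpaged_limit) {}

    semaphore_sketch* user_semaphore() { return &_user_sem; }
    semaphore_sketch* system_semaphore() { return &_system_sem; }

    // Classify by initiator: the statement group means a user query,
    // any other group is treated as a system query with no memory limit.
    query_class_config_sketch make_query_class_config(sched_group current) {
        if (current == sched_group::statement) {
            return {&_user_sem, _user_unpaged_limit};
        }
        return {&_system_sem, std::numeric_limits<std::uint64_t>::max()};
    }
};
```

Under this scheme a user query against a system table still gets the user semaphore and the user memory limit, which is the loophole the series sets out to close.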
Raphael S. Carvalho	097a5e9e07	compaction: Disable garbage collected writer if interposer consumer is used GC writer, used for incremental compaction, cannot be currently used if interposer consumer is used. That's because compaction assumes that GC writer will be operated only by a single compaction writer at a given point in time. With interposer consumer, multiple writers will concurrently operate on the same GC writer, leading to a race condition which can potentially result in a use-after-free. Let's disable GC writer if interposer consumer is enabled. We're not losing anything because GC writer is currently only needed on strategies which don't implement an interposer consumer. Resharding will always disable GC writer, which is the expected behavior because it doesn't support incremental compaction yet. The proper fix, which allows GC writer and interposer consumer to work together, will require more time to implement and test, and for that reason, I am postponing it as #6472 is a showstopper for the current release. Fixes #6472. tests: mode(dev). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Reviewed-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20200526195428.230472-1-raphaelsc@scylladb.com>	2020-05-29 08:26:43 +02:00
Alejo Sanchez	bb08b5ad5a	utils: error injections provide error exceptions Provide non-timeout error exception to facilitate control flow in injected errors. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2020-05-28 11:13:55 +02:00
Pavel Solodovnikov	014883d560	failure_injector: implement CQL API for failure injector class The following UDFs are defined to control failure injector API usage: * enable_injection(name, args) * disable_injection(name) All arguments have string type. As currently function(terminal) is not supported by the parser, the arguments must come from selected rows. Added boost test for CQL API. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2020-05-28 11:13:55 +02:00
Alejo Sanchez	2c7e01a3b6	lwt: fix disabled error injection templates Fix disabled injection templates to match enabled ones. Fix corresponding test to not be a continuation. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2020-05-28 11:13:55 +02:00
Botond Dénes	e678f06a5e	querier_cache: get semaphore from querier Currently the `querier_cache` is passed a semaphore during its construction and it uses this semaphore to do all the inactive reader registering/unregistering. This is inaccurate as in theory cached reads could belong to different semaphores (although currently this is not yet the case). As all queriers store a valid permit now, use this permit to obtain the semaphore the querier is associated with, and register the inactive read with this semaphore.	2020-05-28 11:34:35 +03:00
Botond Dénes	d68ac8bf18	treewide: remove all uses of no_reader_permit()	2020-05-28 11:34:35 +03:00
Botond Dénes	e4c591aa67	database: introduce make_query_class_config() And use it to obtain any query-class specific configuration that was obtained from `table::config` before, such as the read concurrency semaphore and the max memory limit for unlimited queries. As all users of these items get these from the query class config now, we can remove them from `table::config`.	2020-05-28 11:34:35 +03:00
Botond Dénes	a08467da29	test: move away from reader_concurrency_semaphore::wait_admission() And use the reader_permit for this instead. This refactoring has revealed a pre-existing bug in the `test_lifecycle_policy`, which is also addressed in this patch. The bug is that said policy executes reader destructions in the background, and these are not waited for. For some reason, the semaphore -> permit transition pushes these races over the edge and we start seeing some of these destruction fibers still being unfinished when test scopes are exited, causing all sorts of trouble. The solution is to introduce a special gate that tests can use to wait for all background work to finish, before the test scope is exited.	2020-05-28 11:34:35 +03:00
Botond Dénes	4409579352	mutation_reader: restricted_reader: work in terms of reader_permit We want to refactor all read resource tracking code to work through the reader_permit, so refactor the restricted reader to also do so.	2020-05-28 11:34:35 +03:00
Botond Dénes	fe024cecdc	row_cache: pass a valid permit to underlying read All readers are soon going to require a valid permit, so make sure we have a valid permit which we can pass to the underlying reader when creating it. This means `row_cache::make_reader()` now also requires a permit to be passed to it.	2020-05-28 11:34:35 +03:00
Botond Dénes	9ede82ebf8	memtable: pass a valid permit to the delegate reader All readers are soon going to require a valid permit, so make sure we have a valid permit which we can pass to the delegate reader when creating it. This means `memtable::make_flat_reader()` now also requires a permit to be passed to it. Internally the permit is stored in `scanning_reader`, which is used both for flushes and normal reads. In the former case a permit is not required.	2020-05-28 11:34:35 +03:00
Botond Dénes	cc5137ffe3	table: require a valid permit to be passed to most read methods Now that the most prevalent users (range scan and single partition reads) all pass valid permits, we require all users to do so and propagate the permit down towards `make_sstable_reader()`. The plan is to use this permit for restricting the sstable readers, instead of the semaphore the table is configured with. The various `make_streaming_*reader()` overloads keep using the internal semaphores, but they also create the permit before the read starts and pass it to `make_sstable_reader()`.	2020-05-28 11:34:35 +03:00
Botond Dénes	d5ebd763ff	multishard_mutation_query: pass a valid permit to shard mutation sources In preparation of a valid permit being required to be passed to all mutation sources, create a permit before creating the shard readers and pass it to the mutation source when doing so. The permit is also persisted in the `shard_mutation_querier` object when saving the reader, which is another forward looking change, to allow the querier-cache to use it to obtain the semaphore the read is actually registered with.	2020-05-28 11:34:35 +03:00
Botond Dénes	bad53c4245	querier: add reader_permit parameter and forward it to the mutation_source In preparation of a valid permit being required to be passed to all mutation sources, also add a permit to the querier object, which is then passed to the source when it is used to create a reader.	2020-05-28 11:34:35 +03:00
Botond Dénes	14743c4412	data_query, mutation_query: use query_class_config We want to move away from the current practice of selecting the relevant read concurrency semaphore inside `table` and instead want to pass it down from `database` so that we can pass down a semaphore that is appropriate for the class of the query. Use the recently created `query_class_config` struct for this. This is added as a parameter to `data_query`, `mutation_query` and propagated down to the point where we create the `querier` to execute the read. We are already propagating a parameter down the same route -- max_memory_reverse_query -- which also happens to be part of `query_class_config`, so simply replace this parameter with a `query_class_config` one. As the lower layers are not prepared for a semaphore passed from above, make sure this semaphore is the same that is selected inside `table`. After the lower layers are prepared for a semaphore arriving from above, we will switch it to be the appropriate one for the class of the query.	2020-05-28 11:34:35 +03:00
Botond Dénes	0b4ec62332	flat_mutation_reader: flat_multi_range_reader: add reader_permit parameter Mutation sources will soon require a valid permit so make sure we have one and pass it to the mutation sources when creating the underlying readers. For now, pass no_reader_permit() on call sites, deferring the obtaining of a valid permit to later patches.	2020-05-28 11:34:35 +03:00
Avi Kivity	829e2508d0	logalloc: fix entropy depletion in test_compaction_with_multiple_regions() test_compaction_with_multiple_regions() has two calls to std::shuffle(), one using std::default_random_engine() as the PRNG, but the other, later on, using the std::random_device directly. This can cause failures due to entropy pool exhaustion. Fix by making the `random` variable refer to the PRNG, not the random_device, and adjust the first std::shuffle() call. This hides the random_device so it can't be used more than once. Message-Id: <20200527124247.2187364-1-avi@scylladb.com>	2020-05-27 15:51:16 +03:00
Botond Dénes	3f1823a4f0	multishard_mutation_query_test: don't use boost test macros in multiple shards Boost test macros are not safe to use in multiple shards (threads). Doing so will result in their output being interwoven, making it unreadable and generating invalid XML test reports. There was a lot of back-and-forth on how to solve this, including introducing thread-safe wrappers of the boost test macros that use locks. This patch does something much simpler: it defines a bunch of replacement utility functions for the used macros. These functions use the thread-safe seastar logger to log messages and throw exceptions when the test has to be failed, which is pretty much what boost test does too. With this the previously seen complaint about invalid XML is gone. Example log messages from the utility functions: DEBUG 2020-05-27 13:32:54,248 [shard 1] testlog - check_equal(): OK @ validate_result() test/boost/multishard_mutation_query_test.cc:863: ckp{0004fe57c8d2} == ckp{0004fe57c8d2} DEBUG 2020-05-27 13:32:54,248 [shard 1] testlog - require(): OK @ validate_result() test/boost/multishard_mutation_query_test.cc:855 Fixes: #4774 Tests: unit(dev) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200527104426.176342-1-bdenes@scylladb.com>	2020-05-27 15:50:05 +03:00
Pekka Enberg	8721534dfb	Merge "tests: avoid exhausting random_device entropy" from Avi " In several tests we were calling random_device::operator() in a tight loop. This is a slow operation, and in gcc 10 can fail if called too frequently due to a bug [1]. Change to use a random_engine instead, seeded once from the random_device. Tests: unit (dev) [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087 " * 'entropy' of git://github.com/avikivity/scylla: tests: lsa_sync_eviction_test: don't exhaust random number entropy tests: querier_cache_test: don't exhaust random number entropy tests: loading_cache_test: don't exhaust random number entropy tests: dynamic_bitset_test: don't exhaust random number entropy	2020-05-27 08:40:06 +03:00
Kamil Braun	7a98db2ab3	cdc: set ttl column in log rows which update only collections	2020-05-27 08:40:05 +03:00
Avi Kivity	8d27e1b4a9	Merge 'Propagate tracing to materialized view update path' from Piotr S In order to improve materialized views' debuggability, tracing points are added to view update generation path. Example trace: ``` ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+----------- Execute CQL3 query \| 2020-04-27 13:13:46.834000 \| 127.0.0.1 \| 0 \| 127.0.0.1 Parsing a statement [shard 0] \| 2020-04-27 13:13:46.834346 \| 127.0.0.1 \| 1 \| 127.0.0.1 Processing a statement [shard 0] \| 2020-04-27 13:13:46.834426 \| 127.0.0.1 \| 80 \| 127.0.0.1 Creating write handler for token: -3248873570005575792 natural: {127.0.0.1, 127.0.0.3} pending: {} [shard 0] \| 2020-04-27 13:13:46.834494 \| 127.0.0.1 \| 148 \| 127.0.0.1 Creating write handler with live: {127.0.0.3, 127.0.0.1} dead: {} [shard 0] \| 2020-04-27 13:13:46.834507 \| 127.0.0.1 \| 161 \| 127.0.0.1 Sending a mutation to /127.0.0.3 [shard 0] \| 2020-04-27 13:13:46.834519 \| 127.0.0.1 \| 173 \| 127.0.0.1 Executing a mutation locally [shard 0] \| 2020-04-27 13:13:46.834532 \| 127.0.0.1 \| 186 \| 127.0.0.1 View updates for ks.t require read-before-write - base table reader is created [shard 0] \| 2020-04-27 13:13:46.834570 \| 127.0.0.1 \| 224 \| 127.0.0.1 Reading key {{-3248873570005575792, pk{000400000002}}} from sstable /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Data.db [shard 0] \| 2020-04-27 13:13:46.834608 \| 127.0.0.1 \| 262 \| 127.0.0.1 /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Index.db: scheduling bulk DMA read of size 8 at offset 0 [shard 0] \| 2020-04-27 13:13:46.834635 \| 127.0.0.1 \| 289 \| 127.0.0.1 /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Index.db: finished bulk DMA read of size 8 at offset 0, 
successfully read 8 bytes [shard 0] \| 2020-04-27 13:13:46.834975 \| 127.0.0.1 \| 629 \| 127.0.0.1 Message received from /127.0.0.1 [shard 0] \| 2020-04-27 13:13:46.834988 \| 127.0.0.3 \| 11 \| 127.0.0.1 /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Data.db: scheduling bulk DMA read of size 41 at offset 0 [shard 0] \| 2020-04-27 13:13:46.835015 \| 127.0.0.1 \| 669 \| 127.0.0.1 View updates for ks.t require read-before-write - base table reader is created [shard 0] \| 2020-04-27 13:13:46.835020 \| 127.0.0.3 \| 44 \| 127.0.0.1 Generated 1 view update mutations [shard 0] \| 2020-04-27 13:13:46.835080 \| 127.0.0.3 \| 104 \| 127.0.0.1 Sending view update for ks.t_v2_idx_index to 127.0.0.2, with pending endpoints = {}; base token = -3248873570005575792; view token = 3728482343045213994 [shard 0] \| 2020-04-27 13:13:46.835095 \| 127.0.0.3 \| 119 \| 127.0.0.1 Sending a mutation to /127.0.0.2 [shard 0] \| 2020-04-27 13:13:46.835105 \| 127.0.0.3 \| 129 \| 127.0.0.1 View updates for ks.t were generated and propagated [shard 0] \| 2020-04-27 13:13:46.835117 \| 127.0.0.3 \| 141 \| 127.0.0.1 /home/sarna/.ccm/scylla-1/node1/data/ks/t-162ef290887811eaa4bf000000000000/mc-1-big-Data.db: finished bulk DMA read of size 41 at offset 0, successfully read 41 bytes [shard 0] \| 2020-04-27 13:13:46.835160 \| 127.0.0.1 \| 813 \| 127.0.0.1 Sending mutation_done to /127.0.0.1 [shard 0] \| 2020-04-27 13:13:46.835164 \| 127.0.0.3 \| 188 \| 127.0.0.1 Mutation handling is done [shard 0] \| 2020-04-27 13:13:46.835177 \| 127.0.0.3 \| 201 \| 127.0.0.1 Generated 1 view update mutations [shard 0] \| 2020-04-27 13:13:46.835215 \| 127.0.0.1 \| 869 \| 127.0.0.1 Locally applying view update for ks.t_v2_idx_index; base token = -3248873570005575792; view token = 3728482343045213994 [shard 0] \| 2020-04-27 13:13:46.835226 \| 127.0.0.1 \| 880 \| 127.0.0.1 Successfully applied local view update for 127.0.0.1 and 0 remote endpoints [shard 0] \| 2020-04-27 13:13:46.835253 \| 
127.0.0.1 \| 907 \| 127.0.0.1 View updates for ks.t were generated and propagated [shard 0] \| 2020-04-27 13:13:46.835256 \| 127.0.0.1 \| 910 \| 127.0.0.1 Got a response from /127.0.0.1 [shard 0] \| 2020-04-27 13:13:46.835274 \| 127.0.0.1 \| 928 \| 127.0.0.1 Delay decision due to throttling: do not delay, resuming now [shard 0] \| 2020-04-27 13:13:46.835276 \| 127.0.0.1 \| 930 \| 127.0.0.1 Mutation successfully completed [shard 0] \| 2020-04-27 13:13:46.835279 \| 127.0.0.1 \| 933 \| 127.0.0.1 Done processing - preparing a result [shard 0] \| 2020-04-27 13:13:46.835286 \| 127.0.0.1 \| 941 \| 127.0.0.1 Message received from /127.0.0.3 [shard 0] \| 2020-04-27 13:13:46.835331 \| 127.0.0.2 \| 14 \| 127.0.0.1 Sending mutation_done to /127.0.0.3 [shard 0] \| 2020-04-27 13:13:46.835399 \| 127.0.0.2 \| 82 \| 127.0.0.1 Mutation handling is done [shard 0] \| 2020-04-27 13:13:46.835413 \| 127.0.0.2 \| 96 \| 127.0.0.1 Got a response from /127.0.0.2 [shard 0] \| 2020-04-27 13:13:46.835639 \| 127.0.0.3 \| 662 \| 127.0.0.1 Delay decision due to throttling: do not delay, resuming now [shard 0] \| 2020-04-27 13:13:46.835640 \| 127.0.0.3 \| 664 \| 127.0.0.1 Successfully applied view update for 127.0.0.2 and 1 remote endpoints [shard 0] \| 2020-04-27 13:13:46.835649 \| 127.0.0.3 \| 673 \| 127.0.0.1 Got a response from /127.0.0.3 [shard 0] \| 2020-04-27 13:13:46.835841 \| 127.0.0.1 \| 1495 \| 127.0.0.1 Request complete \| 2020-04-27 13:13:46.834944 \| 127.0.0.1 \| 944 \| 127.0.0.1 ``` Fixes #6175 Tests: unit(dev), manual * psarna-propagate_tracing_to_more_write_paths: db,view: add tracing to view update generation path treewide: propagate trace state to write path	2020-05-27 08:40:05 +03:00
Avi Kivity	11698aafc1	tests: querier_cache_test: don't exhaust random number entropy rand_int() re-creates a random device each time it is called. Change it to use a static random_device, and get random numbers from a random_engine instead of from the device directly. This avoids exhausting entropy, see [1] for details. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087	2020-05-26 20:51:16 +03:00
Avi Kivity	e2f4c689b1	tests: loading_cache_test: don't exhaust random number entropy rand_int() re-creates a random device each time it is called. Change it to use a static random_device, and get random numbers from a random_engine instead of from the device directly. This avoids exhausting entropy, see [1] for details. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087	2020-05-26 20:49:58 +03:00
Avi Kivity	85da266cf4	tests: dynamic_bitset_test: don't exhaust random number entropy tests_random_ops() extracts a real random number from a random_device. Change it to use a random number engine. This avoids exhausting entropy, see [1] for details. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94087	2020-05-26 20:46:45 +03:00
Piotr Sarna	032a531ea6	test: add unit tests for alternator base64 conversions The test cases verify that base64 operations encode and decode their data properly. Tests: unit(dev)	2020-05-21 18:26:59 +03:00
Piotr Sarna	92aadb94e5	treewide: propagate trace state to write path In order to add tracing to places where it can be useful, e.g. materialized view updates and hinted handoff, tracing state is propagated to all applicable call sites.	2020-05-18 16:05:23 +02:00


397 Commits