scylladb

Author	SHA1	Message	Date
Dejan Mircevski	aec1acd1d5	range_test: Add cases for singular intersection Intersection was previously not tested for singular ranges. This ensures it will always work for singular ranges, too. Tests: unit(dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-06-18 12:38:31 +03:00
Avi Kivity	9322c07c71	Merge "Use binary search in sstable promoted index" from Tomasz

"
The "promoted index" is how the sstable format calls the clustering key index within a given partition. Large partitions with many rows have it. It's embedded in the partition index entry.

Currently, lookups in the promoted index are done by scanning the index linearly so the lookup is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.

We could do better and use binary search in the index. This patch series switches the mc-format index reader to do that. Other formats use the old way.

The "mc" format promoted index has an extra structure at the end of the index called "offset map". It's a vector of offsets of consecutive promoted index entries. This allows us to access random entries in the index without reading the whole index.

The location of the offset entry for a given promoted index entry can be derived by knowing where the offset vector ends in the index file, so the offset map also doesn't have to be read completely into the memory.

The most tricky part is caching. We need to cache blocks read from the index file to amortize the cost of binary search:

- if the promoted index fits in the 32 KiB which was read from the index when looking for the partition entry, we don't want to issue any additional I/O to search the promoted index.
- with large promoted indexes, the last few bisections will fall into the same I/O block and we want to reuse that block.
- we don't want the cache to grow too big, we don't want to cache the whole promoted index as the read progresses over the index. Scanning reads may skip multiple times.

This series implements a rather simple approach which meets all the above requirements and is not worse than the current state of affairs:

- Each index cursor has its own cache of the index file area which corresponds to promoted index. This is managed by the cached_file class.
- Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to reuse information obtained during lower bound lookup. This estimation is used to limit read-aheads in the data file.
- Each cursor drops entries that it walked past so that memory footprint stays O(log N)
- Cached buffers are accounted to read's reader_permit.

Later, we could have a single cache shared by many readers. For that, we need to come up with eviction policy.

Fixes #4007.

TESTING RESULTS

* Point reads, large promoted index:

Config: rows: 10000000, value size: 2000
Partition size: 20 GB
Index size: 7 MB

Notes:

- Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search:
  time: 1.9ms vs 22.9ms
  CPU utilization: 8.9% vs 92.3%
  I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB
  It's 12x faster, CPU utilization is 10x smaller, disk utilization is 20x smaller.
- Slicing at the front (offset=0) is a mixed bag.
  time is similar: 1.8ms
  CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7%
  disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch)
  bsearch uses less bandwidth because the series reduces buffer size used for index file I/O.
  scan is issuing:
    2 * 128 KB (index page)
    2 * 32 KB (data file)
  bsearch is issuing:
    1 * 64 KB (index page)
    15 * 4 KB (promoted index)
    1 * 64 KB (data file)
  The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead). 32 KB is the minimum I/O currently.
  Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work so that it uses 1 * 4 KB when it suffices. This is left for the follow-up.

Command:

perf_fast_forward --datasets=large-part-ds1 \
    --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

Before:

offset  read  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  cpu  mem
0  1  0.001836  172  1  545  9  563  175  4.0  4  320  2  2  0  1  1  0  0  0  57.7%  0
0  32  0.001858  502  32  17220  126  17776  11526  3.2  3  324  2  1  0  1  1  0  0  0  56.4%  0
0  256  0.002833  339  256  90374  427  91757  85931  7.0  7  776  3  1  0  1  1  0  0  0  41.1%  0
0  4096  0.017211  58  4096  237984  2011  241802  233870  66.1  66  8376  59  2  0  1  1  0  0  0  21.4%  0
5000000  1  0.022952  42  1  44  1  45  41  29.2  29  3520  22  2  0  1  1  0  0  0  92.3%  0
5000000  32  0.023052  43  32  1388  14  1414  1331  31.1  32  3588  26  2  0  1  1  0  0  0  91.7%  0
5000000  256  0.024795  41  256  10325  129  10721  9993  43.1  39  4544  29  2  0  1  1  0  0  0  86.4%  0
5000000  4096  0.038856  27  4096  105414  398  106918  103162  95.2  95  12160  78  5  0  1  1  0  0  0  61.4%  0

After (v2):

offset  read  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  cpu  mem
0  1  0.001831  248  1  546  21  581  252  17.6  17  188  2  0  0  1  1  0  0  0  8.5%  0
0  32  0.001910  535  32  16751  626  17770  13896  17.9  19  160  3  0  0  1  1  0  0  0  8.8%  0
0  256  0.003545  266  256  72207  2333  89076  62852  26.9  24  764  7  0  0  1  1  0  0  0  9.7%  0
0  4096  0.016800  56  4096  243812  524  245430  239736  83.6  83  8700  64  0  0  1  1  0  0  0  16.6%  0
5000000  1  0.001968  351  1  508  19  538  380  21.3  21  172  2  0  0  1  1  0  0  0  8.9%  0
5000000  32  0.002273  431  32  14077  436  15503  11551  22.7  22  268  3  0  0  1  1  0  0  0  8.9%  0
5000000  256  0.003889  257  256  65824  2197  81833  57813  34.0  37  652  18  0  0  1  1  0  0  0  11.2%  0
5000000  4096  0.017115  54  4096  239324  834  241310  231993  88.3  88  8844  65  0  0  1  1  0  0  0  16.8%  0

After (v1):

offset  read  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  cpu  mem
0  1  0.001886  259  1  530  4  545  261  18.0  18  376  2  2  0  1  1  0  0  0  9.1%  0
0  32  0.001954  513  32  16381  93  16844  15618  19.0  19  408  3  2  0  1  1  0  0  0  9.3%  0
0  256  0.003266  318  256  78393  1820  81567  61663  30.8  26  1272  7  2  0  1  1  0  0  0  10.4%  0
0  4096  0.017991  57  4096  227666  855  231915  225781  83.1  83  8888  55  5  0  1  1  0  0  0  15.5%  0
5000000  1  0.002353  232  1  425  2  432  232  23.0  23  396  2  2  0  1  1  0  0  0  8.7%  0
5000000  32  0.002573  384  32  12437  47  12571  429  25.0  25  460  4  2  0  1  1  0  0  0  8.5%  0
5000000  256  0.003994  259  256  64101  2904  67924  51427  37.0  35  1484  11  2  0  1  1  0  0  0  10.6%  0
5000000  4096  0.018567  56  4096  220609  448  227395  219029  89.8  89  9036  59  5  0  1  1  0  0  0  15.1%  0

* Point reads, small promoted index (two blocks):

Config: rows: 400, value size: 200
Partition size: 84 KiB
Index size: 65 B

Notes:

- No significant difference in time
- the same disk utilization
- similar CPU utilization

Command:

perf_fast_forward --datasets=large-part-ds1 \
    --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

Before:

offset  read  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  cpu  mem
0  1  0.000279  470  1  3587  31  3829  478  3.0  3  68  2  1  0  1  1  0  0  0  21.1%  0
0  32  0.000276  3498  32  116038  811  122756  104033  3.0  3  68  2  1  0  1  1  0  0  0  24.0%  0
0  256  0.000412  2554  256  621044  1778  732150  559221  2.0  2  72  2  0  0  1  1  0  0  0  32.6%  0
0  4096  0.000510  1901  400  783883  4078  819058  665616  2.0  2  88  2  0  0  1  1  0  0  0  36.4%  0
200  1  0.000339  2712  1  2951  8  3001  2569  2.0  2  72  2  0  0  1  1  0  0  0  17.8%  0
200  32  0.000352  2586  32  91019  266  92427  83411  2.0  2  72  2  0  0  1  1  0  0  0  20.8%  0
200  256  0.000458  2073  200  436503  1618  453945  385501  2.0  2  88  2  0  0  1  1  0  0  0  29.4%  0
200  4096  0.000458  2097  200  436475  1676  458349  381558  2.0  2  88  2  0  0  1  1  0  0  0  29.0%  0

After (v1):

Testing slicing of large partition using clustering keys:

offset  read  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  cpu  mem
0  1  0.000278  492  1  3598  30  3831  500  3.0  3  68  2  1  0  1  1  0  0  0  19.4%  0
0  32  0.000275  3433  32  116153  753  122915  92559  3.0  3  68  2  1  0  1  1  0  0  0  22.5%  0
0  256  0.000458  2576  256  559437  2978  728075  504375  2.1  2  88  2  0  0  1  1  0  0  0  29.0%  0
0  4096  0.000506  1888  400  790064  3306  822360  623109  2.0  2  88  2  0  0  1  1  0  0  0  36.6%  0
200  1  0.000382  2493  1  2619  10  2675  2268  2.0  2  88  2  0  0  1  1  0  0  0  16.3%  0
200  32  0.000398  2393  32  80422  333  84759  22281  2.0  2  88  2  0  0  1  1  0  0  0  19.0%  0
200  256  0.000459  2096  200  435943  1608  453989  380749  2.0  2  88  2  0  0  1  1  0  0  0  30.5%  0
200  4096  0.000458  2097  200  436410  1651  455779  382485  2.0  2  88  2  0  0  1  1  0  0  0  29.2%  0

* Scan with skips, large index:

Config: rows: 10000000, value size: 2000
Partition size: 20 GB
Index size: 7 MB

Notes:

- Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch)
- Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch)
  Binary search reads more by 828 KB and by 1719 IOs. It does more I/O to read the promoted index offset map.
- similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan we would end up caching the whole index. But this is protected against by eviction as demonstrated by the last "mem" column.

Command:

perf_fast_forward --datasets=large-part-ds1 \
    --run-tests=large-partition-skips -c1 --test-case-duration=1

Before:

read  skip  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  cpu  mem
1  1  36.103451  4  5000000  138491  38  138601  138453  153932.0  153932  19703260  153561  1  0  1  1  0  0  0  31.5%  502690

After (v2):

read  skip  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  cpu  mem
1  1  37.000145  4  5000000  135135  6  135146  135128  155651.0  155651  19704088  138968  0  0  1  1  0  0  0  34.2%  0

After (v1):

read  skip  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  cpu  mem
1  1  36.965520  4  5000000  135261  30  135311  135231  155628.0  155628  19704216  139133  1  0  1  1  0  0  0  33.9%  248738

Also in: git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2

Tests:

- unit (all modes)
- manual using perf_fast_forward
"

* tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla:
  sstables: Add promoted index cache metrics
  position_in_partition: Introduce external_memory_usage()
  cached_file, sstables: Add tracing to index binary search and page cache
  sstables: Dynamically adjust I/O size for index reads
  sstables, tests: Allow disabling binary search in promoted index from perf tests
  sstables: mc: Use binary search over the promoted index
  utils: Introduce cached_file
  sstables: clustered_index: Relax scope of validity of entry_info
  sstables: index_entry: Introduce owning promoted_index_block_position
  compound_compat: Allow constructing composite from a view
  sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view
  sstables: mc: Extract parser for promoted index block
  sstables: mc: Extract parser for clustering out of the promoted index block parser
  sstables: consumer: Extract primitive_consumer
  sstables: Abstract the clustering index cursor behavior
  sstables: index_reader: Rearrange to reduce branching and optionals	2020-06-18 12:09:39 +03:00
Juliusz Stasiewicz	8628ede009	cdc: Fix segfault when stream ID key is too short When a token is calculated for stream_id, we check that the key is exactly 16 bytes long. If it's not - `minimum_token` is returned and client receives empty result. This used to be the expected behavior for empty keys; now it's extended to keys of any incorrect length. Fixes #6570	2020-06-17 18:19:37 +03:00
Nadav Har'El	095ddf0d41	alternator test: use ConsistentRead=True where missing All tests that write some data and then read it back need to use ConsistentRead=True, otherwise the test may sporadically fail on a multi-node cluster. In the previous patch we fixed the full_query()/full_scan() convenience functions. In this patch, I audited the calls to the boto3 read methods - get_item(), batch_get_item(), query(), scan(), and although most of them did use ConsistentRead=True as needed, I found some missing and this patch fixes them. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200616080334.825893-1-nyh@scylladb.com>	2020-06-17 14:57:45 +02:00
Nadav Har'El	c298088375	alternator test: use ConsistentRead=True for full_query/scan Many of the Alternator tests use the convenience functions full_query()/full_scan() to read from the table. Almost all these tests need to be able to read their own writes, i.e., want ConsistentRead=True, but none of them explicitly specified this parameter. Such tests may sporadically fail when running on a cluster with multiple nodes. So this patch follows a TODO in the code, and makes ConsistentRead=True the default for the full_() functions. The caller can still override it with ConsistentRead=False - and this is necessary in the GSI tests, because ConsistentRead=True is not allowed in GSIs. Note that while ConsistentRead=True is now the default for the full_() convenience functions, it is still not the default for the lower level boto3 functions scan(), query() and get_item() - so usages of those should be evaluated as well and missing ConsistentRead=True, if any, should be added. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200616073821.824784-1-nyh@scylladb.com>	2020-06-17 14:57:45 +02:00
Tomasz Grabiec	19501d9ef2	sstables, tests: Allow disabling binary search in promoted index from perf tests	2020-06-16 16:15:23 +02:00
Tomasz Grabiec	c95dd67d11	utils: Introduce cached_file It is a read-through cache of a file. Will be used to cache contents of the promoted index area from the index file. Currently, cached pages are evicted manually using the invalidate_*() method family, or when the object is destroyed. The cached_file represents a subset of the file. The reason for this is to satisfy two requirements. One is that we have page-aligned caching, where pages are aligned relative to the start of the underlying file. This matches requirements of the seastar I/O engine on I/O requests. Another requirement is to have an effective way to populate the cache using an unaligned buffer which starts in the middle of the file when we know that we won't need to access bytes located before the buffer's position. See populate_front(). If we couldn't assume that, we wouldn't be able to insert an unaligned buffer into the cache.	2020-06-16 16:15:23 +02:00
Tomasz Grabiec	ab274b8203	sstables: clustered_index: Relax scope of validity of entry_info entry_info holds views, which may get invalidated when the containing index blocks are removed. Current implementations of next_entry() keep the blocks in memory as long as the cursor is alive but that will change in new implementations of the cursor. Adjust the assumption of tests accordingly.	2020-06-16 16:15:23 +02:00
Tomasz Grabiec	f2e52c433f	sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view	2020-06-16 16:15:23 +02:00
Tomasz Grabiec	d5bf540079	sstables: Abstract the clustering index cursor behavior In preparation for supporting more than one algorithm for lookups in the promoted index, extract relevant logic out of the index_reader (which is a partition index cursor). The clustered index cursor implementation is now hidden behind an abstract interface called clustered_index_cursor. The current implementation is put into the scanning_clustered_index_cursor. It's mostly code movement with minor adjustments. In order to encapsulate iteration over promoted index entries, clustered_index_cursor::next_entry() was introduced. No change in behavior intended in this patch.	2020-06-16 16:14:17 +02:00
Avi Kivity	d17b05e911	Merge 'Adding Optimized pseudo floating point estimated histogram' from Amnon " This series adds a pseudo-floating-point histogram implementation. The histogram is used for time_estimated_histogram, a histogram for latency tracking, which is then used in storage_proxy as a more efficient, higher-resolution histogram. A follow-up series will use the new histogram in other places in the system and will add an implementation that supports lower values. Fixes #5815 Fixes #4746 " * amnonh-quicker_estimated_histogram: storage_proxy: use time_estimated_histogram for latencies test/boost/estimated_histogram_test utils/histogram_metrics_helper Adding histogram converter utils/estimated_histogram: Adding approx_exponential_histogram	2020-06-15 10:19:36 +03:00
Amnon Heiman	1cbc2e3d3e	test/boost/estimated_histogram_test This patch adds basic testing for the approx_exponential_histogram implementations. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2020-06-15 08:22:57 +03:00
Piotr Sarna	23c63ec19d	Merge 'alternator: implement FilterExpression' from Nadav The main goal of this series is to implement FilterExpression - the newer syntax for filtering results of Query and Scan requests. This feature itself is just one simple patch - it just needs to have the already-existing filtering code call the already-existing expression evaluation code. However, before we can do this, we need a patch to refactor the expression-evaluation interface (this patch also fixes pre-existing bugs). Then we need three additional patches to fix pre-existing bugs in the various corner cases of expressions (these bugs already existed in ConditionExpression but now became visible in tests for FilterExpression). Finally, at the end of the series, we also do a bit of code cleanup. After this series, the FilterExpression feature is complete, and all tests for this feature pass. Tests: unit(dev) * 'alternator-filterexpression' of git://github.com/nyh/scylla: alternator: avoid unnecessary conversion to string alternator: move some code out of executor.cc alternator: implement FilterExpression alternator: improve error path of attribute_type() function alternator: fix begins_with() error path alternator: fix corner case of contains() function in conditions alternator: refactor resolving of references in expressions	2020-06-14 19:42:46 +02:00
Nadav Har'El	0b9f25ab50	alternator: implement FilterExpression This patch provides a complete implementation for the FilterExpression parameter - the newer syntax for filtering the results of the Query or Scan operations. The implementation is pretty straightforward - we already added earlier a result-filtering framework to Alternator, and used it for the older filtering syntax - QueryFilter and ScanFilter. All we had to do now was to run the FilterExpression (which has the same syntax as a ConditionExpression) on each individual item. The previous cleanup patches were important to reduce the friction of running these expressions on the items. After the previous patches fixing small esoteric bugs in a few expression functions, with this patch all the tests in test_filter_expression.py now pass, and so do the two FilterExpression tests in test_query.py and test_scan.py. As far as I know (and of course minus any bugs we'll discover later), this marks the FilterExpression feature complete. Fixes #5038. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2020-06-14 12:16:26 +03:00
Nadav Har'El	f87259a762	alternator: improve error path of attribute_type() function The attribute_type() function, which can be used in expressions like ConditionExpression and FilterExpression, is supposed to generate an error if its second parameter is not one of the known types. What we did until now was to just report a failed check in this case. We already had a reproducing test with FilterExpression, but in this patch we also add a test with ConditionExpression - which fails before this patch and passes afterwards (and of course, passes with DynamoDB). Fixes #6641. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2020-06-14 12:16:20 +03:00
Nadav Har'El	11d86dfb06	alternator: fix begins_with() error path The begins_with() function should report an error if a constant is passed to it which isn't one of the supported types - string or bytes (e.g., a number). The code we had to check this had wrong logic, though. If the item attribute was also a number, we silently returned false, and didn't go on to detect that the second parameter - a constant - was a number too and should generate an error - not be silent. Fixed and added a reproducing test case and another test to validate my understanding of the type of parameters that begins_with() accepts. Fixes #6640. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2020-06-14 12:13:23 +03:00
Nadav Har'El	f79a4e0e78	alternator: fix corner case of contains() function in conditions It turns out that the contains() function in the new syntax of conditions (ConditionExpression, FilterExpression) is not identical to the CONTAINS operator in the old-syntax conditions (Expected). In the new syntax, one can check whether any constant object is contained in a list. In the old syntax, the constant object must be of specific types. So we need to move the testing out of the check_CONTAINS() functions that both implementations used, and into just the implementation of the old syntax (in conditions.cc). This bug broke one of the FilterExpression tests, but this patch also adds new tests for the different behaviour of ConditionExpression and Expected - tests which also reproduce this issue and verify its fix. Fixes #6639. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2020-06-14 12:02:14 +03:00
Nadav Har'El	13ef31f38b	alternator: refactor resolving of references in expressions In the DynamoDB API, expressions (e.g., ConditionExpression and many more) may contain references to column names ("#name") or to values (":val") given in a separate part of the request - ExpressionAttributeNames and ExpressionAttributeValues respectively. Before this patch, we resolved these references as part of the expression's evaluation. This approach had two downsides: 1. It often misdiagnosed (both false negatives and false positives) cases of unused names and values in expressions. We already had two xfailing tests with examples - which pass after this patch. This patch also adds two additional tests, which failed before this patch and pass with it. 2. In one of the following patches we will add support for FilterExpression, where the same expression is used repeatedly on many items. It is a waste (as well as makes the code uglier) to resolve the same references again and again each time the expression is evaluated. We should be able to do it just once. So this patch introduces an intermediate step between parsing and evaluating an expression - "resolving" the expression. The new resolve_() functions modify the already parsed expression, replacing references to attribute names and constant values by the actual names and values taken from the request. The resolve_() functions also keep track of which references were used, making it very easy to check (as DynamoDB does) if there are any unused names or values, before starting the evaluation. The interface of the evaluate() functions becomes much simpler - they no longer need to know the original request (which was previously needed for ExpressionAttributeNames/Values) or the table's schema (which was previously needed only for some error checking), or keep track of which references were used.
This simplification is helpful for using the expressions in contexts where these things (request and schema) are no longer conveniently available, namely in FilterExpression. A small side-benefit of this patch is that it moves a bit of code, which handled resolving of references in expressions, from executor.cc to expressions.cc. This is just the first step in a bigger effort to reduce the size of executor.cc by moving code to smaller source files. There is no attempt in this patch to move as much code as we can. We will move more code in a separate patch in this series. Fixes #6572. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2020-06-14 11:57:13 +03:00
Pavel Emelyanov	60e283b23e	auth: Move away from storage_service Now that the auth start/stop is standalone, we can remove the reference from storage service to it. This frees some tests from the need to carry the auth service around for nothing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-06-12 22:14:33 +03:00
Pavel Emelyanov	6a46721fb7	auth: Move start-stop code into main The auth service management is currently sitting in storage service, but it was needed there just for cql/thrift start code. After the latter has been moved away there are no other reasons for the auth to be integrated with the storage service, so move it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-06-12 22:14:33 +03:00
Rafael Ávila de Espíndola	555d8fe520	build: Be consistent about system versus regular headers We were not consistent about using '#include "foo.hh"' instead of '#include <foo.hh>' for scylla's own headers. This patch fixes that inconsistency and, to enforce it, changes the build to use -iquote instead of -I to find those headers. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200608214208.110216-1-espindola@scylladb.com>	2020-06-10 15:49:51 +03:00
Nadav Har'El	65d3e3992f	alternator test: small fixes for test_key_condition_expression_multi The test test_key_condition_expression_multi() had a small typo, which was hidden by the fact that the request was expected to fail for other reasons, but nevertheless should be fixed. Moreover, it appears that Amazon DynamoDB changed its error message for this case, so running the test with "--aws" failed. So this patch makes it work again by being more forgiving on the exact error message. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200609205628.562351-1-nyh@scylladb.com>	2020-06-10 07:34:20 +02:00
Dejan Mircevski	9027b6636f	Use sstring_view in execute_cql and assertions This lets the functions operate on a wider variety of arguments and may also be faster. Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-06-10 08:10:43 +03:00
Glauber Costa	aebd965f0e	distributed_load: initial handling of off-strategy SSTables Off-strategy SSTables are SSTables that do not conform to the invariants that the compaction strategies define. Examples of off-strategy SSTables are SSTables acquired over bootstrap, resharding when the cpu count changes or imported from other databases through our upload directory. This patch introduces a new class, sstable_directory, that will handle SSTables that are present in a directory that is not one of the directories where the table expects its SSTables. There is much to be done to support off-strategy compactions fully. To make sure we make incremental progress, this patch implements enough code to handle resharding of SSTables in the upload directory. SSTables are resharded in place, before we start accessing the files. Later, we will take other steps before we finally move the SSTables into the main directory. But for now, starting with resharding will not only allow us to start small, but it will also allow us to start unleashing much needed cleanups in many places. For instance, once we start resharding on boot before making the SSTables available, we will be able to expunge all places in Scylla where, during normal operations, we have extra handler code for the fact that SSTables could be shared. Tests: a new test is added and it passes in debug mode. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-08 16:06:00 -04:00
Glauber Costa	3972628fc0	compaction: split compaction.hh header compaction.hh is one of our heavy headers, but some users just want to use information on it about how to describe a compaction, not how to perform one. For that reason this patch splits the compaction_descriptor into a new header. The compaction_descriptor has, as a member type, compaction_options. That is moved too, and brings with it the compaction_type. Both of those structures would make sense in a separate header anyway. The compaction_descriptor also wants the creator_fn and replacer_fn functions. We also take this opportunity to rename them into something more descriptive. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-08 16:06:00 -04:00
Kamil Braun	a1e235b1a4	CDC: Don't split collection tombstone away from base update Overwriting a collection cell using timestamp T is a process with the following steps: 1. inserting a row marker (if applicable) with timestamp T; 2. writing a collection tombstone with timestamp T-1; 3. writing the new collection value with timestamp T. Since CDC does clustering of the operations by timestamp, this would result in 3 separate calls to `transform` (in case of INSERT, or 2 - in the case of UPDATE), which seems excessive, especially when pre-/postimage is enabled. This patch makes collection tombstones be treated as if they had the same TS as the base write and thus they are processed in one call to `transform` (as long as TTLs are not used). Also, `cdc_test` had to be updated in places that relied on the former splitting strategy. Fixes #6084	2020-06-07 17:09:05 +03:00
Raphael S. Carvalho	8e47f61df7	compaction: Enable tombstone expiration based on the presence of the sstable set For tombstone expiration to proceed correctly without the risk of resurrecting data, the sstable set must be present. Regular compaction and derivatives provide the sstable set, so they're able to expire tombstones with no resurrection risk. Resharding, on the other hand, can run on any shard, not necessarily on the same shard that one of the input sstables belongs to, so it currently cannot provide a sstable set for tombstone expiration to proceed safely. That being said, let's only do expiration based on the presence of the set. This makes room for the sstable set to be fed to compaction via descriptor, allowing even resharding to do expiration. Currently, compaction thinks that sstable set can only come from the table, and that also needs to be changed for further flexibility. It's theoretically possible that a given resharding job will resurrect data if a fully expired SSTable is resharded at a shard which it doesn't belong to. Resharding will have no way to tell that expiring all that data will lead to resurrection because the relevant SSTables are at different shards. This is fixed by checking for fully expired sstables only on presence of the sstable set. Fixes #6600. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200605200954.24696-1-raphaelsc@scylladb.com>	2020-06-07 11:46:48 +03:00
Kamil Braun	1b7f1806ac	test: improve comments on test_schema_digest_does_not_change This test tends to cause a lot of discussion resulting from not understanding what is actually being tested. Closes https://github.com/scylladb/scylla/issues/6582.	2020-06-05 14:30:02 +02:00
Kamil Braun	d89b7a0548	cdc: rename CDC description tables Commit `968177da04` has changed the schema of cdc_topology_description and cdc_description tables in the system_distributed keyspace. Unfortunately this was a backwards-incompatible change: these tables would always be created, irrespective of whether or not "experimental" was enabled. They just wouldn't be populated with experimental=off. If the user now tries to upgrade Scylla from a version before this change to a version after this change, it will work as long as CDC is protected by the experimental flag and the flag is off. However, if we drop the flag, or if the user turns experimental on, weird things will happen, such as nodes refusing to start because they try to populate cdc_topology_description while assuming a different schema for this table. The simplest fix for this problem is to rename the tables. This fix must get merged in before CDC goes out of experimental. If the user upgrades his cluster from a pre-rename version, he will simply have two garbage tables that he is free to delete after upgrading. sstables and digests need to be regenerated for schema_digest_test since this commit effectively adds new tables to the system_distributed keyspace. This doesn't result in schema disagreement because the table is announced to all nodes through the migration manager.	2020-06-05 09:59:16 +02:00
Piotr Sarna	9a4394327a	Merge 'CDC: Disallowed CDC for tables with counter column(s)' from Juliusz. CDC for counters is unimplemented as of now, therefore any attempt to enable CDC log on counter table needs to be clearly disallowed. This patch does exactly this. The check whether schema has counter columns is performed in `cdc_service::impl` in: - `on_before_create_column_family`, - `on_before_update_column_family` and, if so, results in `invalid_request_exception` thrown. Fixes #6553 * jul-stas-6553-disallow-cdc-for-counters: test/cql: Check that CDC for counters is disallowed CDC: Disallowed CDC for tables with counter column(s)	2020-06-05 07:46:53 +02:00
Nadav Har'El	ace1697aa9	alternator test: reproducer for unjustly refused condition expression This patch adds a test reproducing issue #6572, where the perfectly good condition expression: #name1 = :val1 OR #name2 = :val2 Gets refused because of the following combination in our implementation: 1. Short-circuit evaluation, i.e., after we discover #name1 = :val1 we don't evaluate the second half of the expression. 2. The list of "used" references is collected at evaluation time, instead of at parsing time. Because evaluation never reaches #name2 (or :val2) our implementation complains that they are not used, and refuses the request - which should have been allowed. This test xfails on Alternator. It passes on DynamoDB. Refs #6572 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200604171954.444291-1-nyh@scylladb.com>	2020-06-05 07:43:50 +02:00
Piotr Sarna	0ba23d2b40	test: add manual test for tagging return value While not very interesting by itself, the test case shows that in case of TagResource and UntagResource it's actually correct to return empty HTTP body instead of an empty JSON object, which was the case for PutItem. Message-Id: <6331963179c5174a695f0e9eeed17de6c9f9a3be.1591269516.git.sarna@scylladb.com>	2020-06-04 16:17:24 +03:00
Avi Kivity	0c34e114e2	Merge "Upgrade to seastar api version 3" (make_file_output_stream returns future) from Rafael " The new seastar api changes make_file_output_stream and make_file_data_sink to return futures. This series includes a few refactoring patches and the actual transition. " * 'espindola/api-v3-v3' of https://github.com/espindola/scylla: table: Fix indentation everywhere: Move to seastar api level 3 sstables: Pass an output_stream to make_compressed_file_.*_format_output_stream sstables: Pass a data_sink to checksummed_file_writer's constructor sstables: Convert a file_writer constructor to a static make sstables: Move file_writer constructor out of line	2020-06-03 23:09:49 +03:00
Rafael Ávila de Espíndola	e5876f6696	everywhere: Move to seastar api level 3 Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-06-03 10:32:46 -07:00
Rafael Ávila de Espíndola	13282b3d4c	sstables: Pass an output_stream to make_compressed_file_.*_format_output_stream This is a bit simpler as we don't have to pass in the options and moves the calls to make_file_output_stream to places where we can handle futures. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-06-03 10:32:46 -07:00
Juliusz Stasiewicz	bf4050ed15	test/cql: Check that CDC for counters is disallowed This test must be removed once we have implementation of CDC for tables with counter columns.	2020-06-03 18:31:44 +02:00
Raphael S. Carvalho	fb6976f1b9	Make sure SSTables created by streaming are added to backlog tracker New SSTables are only added to backlog tracker if set_unshared() was called on their behalf. SSTables created for streaming are not being added to the tracker because make_streaming_sstable_for_write() doesn't call set_unshared(), nor does its caller. This results in the backlog not accounting for their existence, which means the backlog will be much lower than expected. This problem could be fixed by adding a set_unshared() call but it turns out we don't even need set_unshared() anymore. It was introduced when Scylla metadata didn't exist, now a SSTable has built-in knowledge of whether or not it's shared. Relying on every SSTable creator calling set_unshared() is bug prone. Let's get rid of it and let the SSTable itself say whether or not it's shared. If an imported SSTable has no Scylla metadata, Scylla will still be able to compute shards using token range metadata. Refs #6021. Refs #6227. Fixes #6441. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200512220226.134481-1-raphaelsc@scylladb.com>	2020-06-03 17:35:22 +03:00
Tomasz Grabiec	087fa42c1d	Merge "utils: inject errors around paxos stages" from Alejo Add Paxos error injections before/after save promise, proposal, decision, paxos_response_handler, delete decision. Adds a method to inject an error providing a lambda while avoiding to add a continuation when the error injection is disabled. For this provide error exception and enter() to allow flow control (i.e. return) on simple error injections without lambdas. Also includes Pavel's patch for CQL API for error injections, updated to current error injection API and added one_shot support. Also added some basic CQL API boost tests. For CQL API there's a limitation of the current grammar not supporting f(<terminal>) so values have to be inserted in a table until this is resolved. See #5411 * https://github.com/alecco/scylla/tree/error_injection_v11: paxos: fix indentation paxos: add error injections utils: add timeout error injection with lambda utils: error injection add enter() for control flow utils: error injections provide error exceptions failure_injector: implement CQL API for failure injector class lwt: fix disabled error injection templates	2020-06-03 15:42:10 +02:00
Piotr Sarna	8fc3ca855e	alternator: fix the return type of PutItem Even if there are no attributes to return from PutItem requests, we should return a valid JSON object, not an empty string. Fixes #6568 Tests: unit(dev)	2020-06-03 16:03:13 +03:00
Piotr Sarna	3aff52f56e	alternator: fix returning UnprocessedKeys unconditionally Client libraries (e.g. PynamoDB) expect the UnprocessedKeys and UnprocessedItems attributes to appear in the response unconditionally - it's hereby added, along with a simple test case. Fixes #6569 Tests: unit(dev)	2020-06-03 15:48:16 +03:00
Alejo Sanchez	a8b14b0227	utils: add timeout error injection with lambda Even though calling then() on a ready future does not allocate a continuation, calling then() on the result of it will allocate. This error injection only adds a continuation in the dependency chain if error injections are enabled at compile time and this particular error injection is enabled. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2020-06-03 14:44:00 +02:00
Alejo Sanchez	0321172677	utils: error injection add enter() for control flow For control flow (i.e. return) and simplicity add enter() method. For disabled injections, this method is const returning false, therefore it has no overhead. Add boost test. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2020-06-03 14:42:48 +02:00
Nadav Har'El	bea9629031	alternator: implement remaining QueryFilter / ScanFilter functionality This patch implements the missing QueryFilter (and ScanFilter) functionality: 1. All operators. Previously, only the "EQ" operator was implemented. 2. Either "OR" or "AND" of conditions (previously only "AND"). 3. Correctly returning Count and ScannedCount for post-filter and pre-filter item counts, respectively. All of the previously-xfailing tests in test_query_filter.py are now passing. The implementation in this patch abandons our previous attempts to translate the DynamoDB API filters into Scylla's CQL filters. Doing this correctly for all operators would have been exceedingly difficult (for reasons explained in #5028), and simply not worth the effort: CQL's filters receive a page of results and then filter them, and we can do exactly the same without CQL's filters: The new code just retrieves an unfiltered page of items, and then for each of these items checks whether it passes the filters. The great thing is that we already had code for this checking - the QueryFilter syntax is identical to the "Expected" syntax (for conditional operations) that we already supported, so we already had code for checking these conditions, including all the different operators. This patch prepares for the future need to support also the newer FilterExpression syntax (see issue #5038), and the "filter" class supports either type of filter - the implementation for the second syntax is just missing and can be added (fairly easily) later. Fixes #5028. Refs #5038. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200603110118.399325-1-nyh@scylladb.com>	2020-06-03 13:16:45 +02:00
Nadav Har'El	f6b1f45d69	alternator: fix order conditions on binary attributes We implemented the order operators (LT, GT, LE, GE, BETWEEN) incorrectly for binary attributes: DynamoDB requires that the bytes be treated as unsigned for the purpose of order (so byte 128 is higher than 127), but our implementation uses Scylla's "bytes" type which has signed bytes. The solution is simple - we can continue to use the "bytes" type, but we need to use its compare_unsigned() function, not its "<" operator. This bug affected conditional operations ("Expected" and "ConditionExpression") and also filters ("QueryFilter", "ScanFilter", "FilterExpression"). The bug did not affect Query's key conditions ("KeyConditions", "KeyConditionExpression") because those already used Scylla's key comparison functions - which correctly compare binary blobs as unsigned bytes (in fact, this is why we have the compare_unsigned() function). The patch also adds tests that reproduce the bugs in conditional operations, and show that the bug did not exist in key conditions. Fixes #6573 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200603084257.394136-1-nyh@scylladb.com>	2020-06-03 10:55:50 +02:00
Nadav Har'El	0d337a716b	alternator test: confirm understanding of query paging with filtering This test (which passes successfully on both Alternator and DynamoDB) was written to confirm our understanding of how the paging feature works. Our understanding, based on DynamoDB documentation, has been that the "Limit" parameter determines the number of pre-filtering items, not the actual number of items returned after having passed the filter. So the number of items actually returned may be lower than Limit - in some cases even zero. This test tries an extreme case: We scan a collection of 20 items with a filter matching only 10 (or so) of them, with Limit=1, and count the number of pages that we needed to request until collecting all these 10 (or so) matches. We note that the result is 21 - i.e., DynamoDB and Alternator really went through the 20 pre-filtering items one by one, and for the items which didn't match the filter returned an empty page. The last page (the 21st) is always empty: DynamoDB or Alternator doesn't know whether or not there is a 21st item, and it takes a 21st request to discover there isn't. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200602145015.361694-1-nyh@scylladb.com>	2020-06-02 16:57:49 +02:00
Nadav Har'El	43138c0e5e	alternator test: test Count/ScannedCount return of Query This test reproduces a bug in the current implementation of QueryFilter, which returns for ScannedCount the count of post-filter items, whereas it should return the pre-filter count. The test tests both ScannedCount and Count, when QueryFilter is used and when it isn't used. The test currently xfails on Alternator, passes on DynamoDB. Refs #5028 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200602125924.358636-1-nyh@scylladb.com>	2020-06-02 16:57:49 +02:00
Avi Kivity	6f394e8e90	tombstone: use comparison operator instead of ad-hoc compare() function and with_relational_operators The comparison operator (<=>) default implementation happens to exactly match tombstone::compare(), so use the compiler-generated defaults. Also default operator== and operator!= (these are not brought in by operator<=>). These become slightly faster as they perform just an equality comparison, not three-way compare. shadowable_tombstone and row_tombstone depend on tombstone::compare(), so convert them too in a similar way. with_relational_operations.hh becomes unused, so delete it. Tests: unit (dev) Message-Id: <20200602055626.2874801-1-avi@scylladb.com>	2020-06-02 09:28:52 +03:00
Piotr Sarna	160e2b06f9	test: move random string helpers to .cc ... since there's no reason for them to reside in a header, and .cc is our default destination. Message-Id: <2509410f0f71df036a7829f1f799503c1a671404.1591078777.git.sarna@scylladb.com>	2020-06-02 09:27:59 +03:00
Avi Kivity	a4c44cab88	treewide: update concepts language from the Concepts TS to C++20 Seastar recently lost support for the experimental Concepts Technical Specification (TS) and gained support for C++20 concepts. Re-enable concepts in Scylla by updating our use of concepts to the C++20 standard. This change: - peels off uses of the GCC6_CONCEPT macro - removes inclusions of <seastar/gcc6-concepts.hh> - replaces function-style concepts (no longer supported) with equation-style concepts - semicolons added and removed as needed - deprecated std::is_pod replaced by recommended replacement - updates return type constraints to use concepts instead of type names (either std::same_as or std::convertible_to, with std::same_as chosen when possible) No attempt is made to improve the concepts; this is a specification update only. Message-Id: <20200531110254.2555854-1-avi@scylladb.com>	2020-06-02 09:12:21 +03:00
Piotr Sarna	d1f5d42a25	test: add big_decimal perf test In order to be able to measure the impact of rewriting the parsing mechanism from std::regex to a hand-written state machine.	2020-06-01 16:11:49 +02:00


562 Commits