scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 08:23:29 +00:00

Author	SHA1	Message	Date
Avi Kivity	4d587e0c3d	cql3: raw_value: deduplicate view() and to_view() Commit `e739f2b779` ("cql3: expr: make evaluate() return a cql3::raw_value rather than an expr::constant") introduced raw_value::view() as a synonym to raw_value::to_view() to reduce churn. To fix this duplication, we now remove raw_value::to_view(). raw_value::to_view() was picked for removal because it has fewer call sites, reducing churn again. Closes #10819	2022-06-17 09:32:58 +02:00
Avi Kivity	19a6e69001	cql3: accept and type-check reused named bind variables A named bind-variable can be reused: SELECT * FROM tab WHERE a = :var AND b = :var Currently, the grammar just ignores the possibility and creates a new variable with the same name. The new variable cannot be referenced by name since the first one shadows it. Catch variable reuse by maintaining a map from bind variable names to indexes, and checking that when a bind variable is reused the types match. A unit test is added. Fixes #10810 Closes #10813	2022-06-17 09:09:49 +02:00
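The name-to-index map described in `19a6e69001` above can be illustrated with a small Python sketch. This is not Scylla's C++ implementation: `BindVariableRegistry`, `register`, and the use of plain strings for types are hypothetical choices made only for illustration.

```python
class BindVariableRegistry:
    """Hypothetical sketch: one bind index per variable name, with a type check on reuse."""

    def __init__(self):
        self._index_by_name = {}  # variable name -> bind index
        self._types = []          # bind index -> type name (a plain string here)

    def register(self, name, type_name):
        """Return the bind index for `name`, reusing the existing one if present."""
        if name in self._index_by_name:
            index = self._index_by_name[name]
            if self._types[index] != type_name:
                # Reuse with a different type is rejected instead of silently
                # creating a second, shadowed variable.
                raise TypeError(
                    f"bind variable :{name} used as {self._types[index]} and {type_name}"
                )
            return index
        index = len(self._types)
        self._index_by_name[name] = index
        self._types.append(type_name)
        return index


registry = BindVariableRegistry()
first = registry.register("var", "int")   # a = :var
second = registry.register("var", "int")  # b = :var reuses the same index
```

With this shape, a statement such as `WHERE a = :var AND b = :var` ends up with a single bind slot, and a mismatch between the column types of `a` and `b` is reported at prepare time.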
Konstantin Osipov	670b2562a1	lwt: Cassandra compatibility when incarnating a row for UPDATE When evaluating an LWT condition involving both static and non-static cells, and matching no regular row, the static row must be used UNLESS the IF condition is IF EXISTS/IF NOT EXISTS, in which case special rules apply. Before this fix, Scylla used to assume a row doesn't exist if there is no matching primary key. In Cassandra, if there is a non-empty static row in the partition, a regular row based on the static row's cell values is created in this case, and then this row is used to evaluate the condition. This problem was reported as gh-10081. The reason for Scylla behaviour before the patch was that when implementing LWT I tried to converge Cassandra data model (or lack thereof) with a relational data model, and assumed a static row is a "shared" portion of a regular row, i.e. a storage level concept intended to save space, and doesn't have independent existence. This was an oversimplification. This patch fixes gh-10081, making Scylla semantics match the one of Cassandra. I will now list other known examples when a static row has its own independent existence as part of a table, for cataloguing purposes. SELECT * from a partition which has a partition key and a static cell set returns 1 row. If later a regular row is added to the partition, the SELECT would still return 1 row, i.e. the static row will disappear, and a regular row will appear instead. Another example showing a static row has an independent existence below: CREATE TABLE t (p int, c int, s int static, PRIMARY KEY(p, c)); INSERT INTO t (p, c) VALUES(1, 1); INSERT INTO t (p, s) VALUES(1, 1) IF NOT EXISTS; In Cassandra (and Scylla), IF NOT EXISTS evaluates to TRUE, even though both the regular row and the partition exist. But the static cells are not set, and the insert only provides a partition key, so the database assumes the insert is operating against a static row. 
It would be wrong to assume that a static row exists when the partition key exists: INSERT INTO t (p, c, s) VALUES(1, 1, 1) IF NOT EXISTS; [applied] \| p \| c \| s -----------+---+---+------ False \| 1 \| 1 \| null evaluates to False, i.e. the regular row does exist when p and c exist. Issue CREATE TABLE t (p INT, c INT, r INT, s INT static, PRIMARY KEY(p, c)) INSERT INTO t (p, s) VALUES (1, 1); UPDATE t SET s=2, r=1 WHERE p=1 AND c=1 IF s=1 and r=null; - in this case, even though the regular row doesn't exist, the static row does, and should be used for condition evaluation. In other words, IF EXISTS/IF NOT EXISTS have contextual semantics. They apply to the regular row if clustering key is used in the WHERE clause, otherwise they apply to static row. One analogy for static rows is that it is like a static member of C++ or Java class. It's an attribute of the class (assuming class = partition), which is accessible through every object of the class (object = regular row). It is also present if there are no objects of the class, but the class itself exists: i.e. a partition could have no regular rows, but some static cells set, in this case it has a static row. Unlike C++/Java static class members a static row is an optional attribute of the partition. A partition may exist, but the static row may be absent (e.g. no static cell is set). If the static row does exist, all regular rows share its contents, even if they do not exist. A regular row exists when its clustering key is present in the table. A static row exists when at least one static cell is set. Tests are updated because now when no matching row is found for the update we show the value of the static row as the previous value, instead of a non-matching clustering row. Changes in v2: - reworded the commit message - added select tests Closes #10711	2022-06-16 19:23:46 +03:00
Petr Gusev	d606966597	cql3::column_condition.cc: fix _in_marker handling The commit scylladb@5dee55d introduced a regression: type of in_list_receiver was taken from receiver instead of value_spec as it was before. This regression was caught by dtest test_lwt_update_prepared_listlike_and_tuples. This commit reverts to original behavior and adds a specific boost-test for this scenario. Fixes: #10821 Closes #10812	2022-06-16 10:57:12 +03:00
Botond Dénes	0b80b5850f	Merge 'allow view snapshots when automatic' from Michael Livshin A pre-scrub view snapshot cannot be attributed to user error, so no call to bail out. Closes #10760. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #10783 * github.com:scylladb/scylla: api-doc: correct spelling allow pre-scrub snapshots of materialized views and secondary indices	2022-06-16 08:47:33 +03:00
Botond Dénes	4bd4aa2e88	Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec When memtable receives a tombstone it can happen under some workloads that it covers data which is still in the memtable. Some workloads may insert and delete data within a short time frame. We could reduce the rate of memtable flushes if we eagerly drop tombstoned data. One workload which benefits is the raft log. It stores a row for each uncommitted raft entry. When entries are committed they are deleted. So the live set is expected to be short under normal conditions. Fixes #652. Closes #10807 * github.com:scylladb/scylla: memtable: Add counters for tombstone compaction memtable, cache: Eagerly compact data with tombstones memtable: Subtract from flushed memory when cleaning mvcc: Introduce apply_resume to hold state for partition version merging test: mutation: Compare against compacted mutations compacting_reader: Drop irrelevant tombstones mutation_partition: Extract deletable_row::compact_and_expire() mvcc: Apply mutations in memtable with preemption enabled test: memtable: Make failed_flush_prevents_writes() immune to background merging	2022-06-15 18:12:42 +03:00
Nadav Har'El	665e8c1a23	test/cql-pytest: add tests for collection indexing This patch adds an extensive array of tests for the Cassandra feature that Scylla hasn't implemented yet (issues #2962, #8745, #10707) of indexing the keys, values or entries of a collection column. The goal of these tests is to explicitly exercise every corner case I could think of by looking at the documentation of this feature and considering its possible implementation - and as usual, making sure that the tests actually pass on Cassandra. These tests overlap some of the existing unit tests that we translated from Cassandra, as well as some randomized tests that do not necessarily cover the same edge cases as these tests cover. All tests added in this patch pass on Cassandra, but currently fail on Scylla due to the above issues. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10771	2022-06-15 16:10:36 +02:00
Botond Dénes	6242f3fef8	Merge 'process_sstables_dir: close directory_lister on error' from Benny Halevy `sstable_directory::process_sstables_dir` may hit an exception when calling `handle_component`. In this case we currently destroy the `sstable_dir_lister` variable without closing the `directory_lister` first - leading to terminate in `~directory_lister` as seen in #10697. This mini-series handles this exception and always closes the `directory_lister`. Add unit test to reproduce this issue. Fixes #10697 Closes #10754 * github.com:scylladb/scylla: sstable_directory: process_sstable_dir: fixup indentation sstable_directory: process_sstable_dir: close directory_lister on error	2022-06-15 16:40:30 +03:00
Tomasz Grabiec	3bec1cc19f	test: memtable: Make failed_flush_prevents_writes() immune to background merging Before the change, the test artificially set the soft pressure condition hoping that the background flusher will flush the memtable. It won't happen if by the time the background flusher runs the LSA region is updated and soft pressure (which is not really there) is lifted. Once apply() becomes preemptible, background partition version merging can lift the soft pressure, making the memtable flush not occur and making the test fail. Fix by triggering soft pressure on retries. Fixes #10801 Refs #10793 (cherry picked from commit `0e78ad50ea`) Closes #10802	2022-06-15 14:33:19 +02:00
Avi Kivity	aa8f135f64	Merge 'Block flush until compaction finishes if sstables accumulate' from Mikołaj Sielużycki If we reach a situation where flush rate exceeds compaction rate, we may end up with arbitrarily large number of sstables on disk. If a read is executed in such case, the amount of memory required is proportional to the number of sstables for the given shard, which in extreme cases can lead to OOM. In the wild, this was observed in 2 scenarios: - A node with >10 shards creates a keyspace with thousands of tables, drops the keyspace and shuts down before compaction finishes. Dropping keyspace drops tables, and each dropped table is smp::count writes to system.local table with flush after write, which creates tens of thousands of sstables. Bootstrap read from system.local will run OOM. - A failure to agree on table schema (due to a code bug) between nodes during repair resulted in excessive flushing of small sstables which compaction couldn't keep up with. In the unit test introduced in this patch series it can be proved that even hard setting maximum shares for compaction and minimum shares for flushing doesn't tilt the balance towards compaction enough to prevent the problem. Since it's a fast producer, slow consumer problem, the remaining solution is to block producer until the consumer catches up. If there are too many table runs originating from memtable, we block the current flush until the number of sstables is reduced (via ongoing compaction or a truncate operation). Fixes https://github.com/scylladb/scylla/issues/4116 Changelog: v5: - added a nicer way of timing the stalls caused by waiting for flush - added predicate on signal when waiting for reduction of the number of sstables to correctly handle spurious wake ups - added comment why we trigger compaction before waiting for sstable count reduction - removed unnecessary cv.signal from table::stop v4: - removed conversion of table::stop to coroutines. 
It's an orthogonal change and doesn't need to go into this patchset v3: - removed unnecessary change to scheduling groups from v2 - moved sstables_changed signalling to suggested place in table::stop - added log how long the table flush was blocked for - changed the threshold to max(schema()->max_compaction_threshold(), 32) and comparison to <= v2: - Reimplemented waiting algorithm based on reviewers' feedback. It's confined to the table class and it waits in a loop until the number of sstable runs goes below threshold. It uses condition variable which is signaled on sstable set refresh. It handles node shutdown as well. - Converted table::stop to coroutines. - Reordered commits so that test is committed after fix, so it doesn't trip up bisection. Closes #10717 * github.com:scylladb/scylla: table: Add test where compaction doesn't keep up with flush rate. random_mutation_generator: Add option to specify ks_name and cf_name table: Prevent creating unbounded number of sstables	2022-06-15 14:51:08 +03:00
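The waiting scheme described in `aa8f135f64` above (block the flush until the number of sstable runs falls to the threshold, using a condition variable with a predicate so that spurious wake-ups are harmless) can be sketched roughly as follows. This is a Python illustration, not Scylla's C++ code: `FlushGate`, `wait_for_room` and `set_run_count` are hypothetical names, and reading "comparison to <=" as "proceed while runs <= threshold" is an assumption.

```python
import threading


class FlushGate:
    """Hypothetical sketch of a fast-producer/slow-consumer gate for flushes."""

    def __init__(self, max_compaction_threshold):
        # Threshold as stated in the v3 changelog: max(max_compaction_threshold, 32).
        self.threshold = max(max_compaction_threshold, 32)
        self._runs = 0
        self._cv = threading.Condition()

    def wait_for_room(self, timeout=None):
        """Block until the run count is at or below the threshold.

        Returns False if the timeout expires first. The predicate is
        re-checked after every wake-up, so spurious wake-ups are safe.
        """
        with self._cv:
            return self._cv.wait_for(lambda: self._runs <= self.threshold, timeout)

    def set_run_count(self, runs):
        """Called when the sstable set changes (e.g. after a compaction or truncate)."""
        with self._cv:
            self._runs = runs
            self._cv.notify_all()


gate = FlushGate(max_compaction_threshold=4)
gate.set_run_count(40)                            # too many runs: a flush would block
blocked = not gate.wait_for_room(timeout=0.01)    # times out while over the threshold
threading.Timer(0.05, gate.set_run_count, args=(10,)).start()  # "compaction" catches up
resumed = gate.wait_for_room(timeout=2)           # wakes once runs <= threshold
```

The design point is that the producer (flush) is throttled by the consumer (compaction) rather than by share tuning, which the merge message reports was not enough on its own.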
Benny Halevy	6cafd83e1c	sstable_directory: process_sstable_dir: close directory_lister on error Otherwise, if we don't consume all lister's entries, ~directory_lister terminates since the directory_lister is destroyed without being closed. Add unit test to reproduce this issue. Fixes #10697 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-15 13:56:10 +03:00
Tomasz Grabiec	94f9109bea	memtable, cache: Eagerly compact data with tombstones When memtable receives a tombstone it can happen under some workloads that it covers data which is still in the memtable. Some workloads may insert and delete data within a short time frame. We could reduce the rate of memtable flushes if we eagerly drop tombstoned data. One workload which benefits is the raft log. It stores a row for each uncommitted raft entry. When entries are committed they are deleted. So the live set is expected to be short under normal conditions. Fixes #652.	2022-06-15 11:30:25 +02:00
Tomasz Grabiec	53026f3ba6	memtable: Subtract from flushed memory when cleaning This patch prevents virtual dirty from going negative during memtable flush in case partition version merging erases data previously accounted by the flush reader. There is an assert in ~flush_memory_accounter which guards for this. This will start happening after tombstones are compacted with rows on partition version merging. This problem is prevented by the patch by having the cleaner notify the memtable layer via callback about the amount of dirty memory released during merging, so that the memtable layer can adjust its accounting.	2022-06-15 11:30:25 +02:00
Tomasz Grabiec	02c92d5ea2	test: mutation: Compare against compacted mutations Memtables and cache will compact eagerly, so tests should not expect readers to produce exact mutations written, only those which are equivalent after applying compaction.	2022-06-15 11:30:01 +02:00
Tomasz Grabiec	570b76bc5b	compacting_reader: Drop irrelevant tombstones The compacting reader created using make_compacting_reader() was not dropping range_tombstone_change fragments which were shadowed by the partition tombstones. As a result the output fragment stream was not minimal. Lack of this change would cause problems in unit tests later in the series after the change which makes memtables lazily compact partition versions. In test_reverse_reader_reads_in_native_reverse_order we compare output of two readers, and assume that compacted streams are the same. If compacting reader doesn't produce minimal output, then the streams could differ if one of them went through the compaction in the memtable (which is minimal).	2022-06-15 11:30:01 +02:00
Tomasz Grabiec	a4e96960b8	mvcc: Apply mutations in memtable with preemption enabled Prerequisite for eagerly applying tombstones, which we want to be preemptible. Before the patch, apply path to the memtable was not preemptible. Because merging can now be deferred, we need to involve snapshots to kick off background merging in case of preemption. This requires us to propagate region and cleaner objects, in order to create a snapshot.	2022-06-15 11:29:43 +02:00
Tomasz Grabiec	c682521ac7	test: memtable: Make failed_flush_prevents_writes() immune to background merging Before the change, the test artificially set the soft pressure condition hoping that the background flusher will flush the memtable. It won't happen if by the time the background flusher runs the LSA region is updated and soft pressure (which is not really there) is lifted. Once apply() becomes preemptible, background partition version merging can lift the soft pressure, making the memtable flush not occur and making the test fail. Fix by triggering soft pressure on retries.	2022-06-15 11:29:43 +02:00
Mikołaj Sielużycki	25407a7e41	table: Add test where compaction doesn't keep up with flush rate. The test simulates a situation where 2 threads issue flushes to 2 tables. Both issue small flushes, but one has injected reactor stalls. This can lead to a situation where lots of small sstables accumulate on disk, and, if compaction never has a chance to keep up, resources can be exhausted.	2022-06-15 10:57:28 +02:00
Mikołaj Sielużycki	b5684aa96d	random_mutation_generator: Add option to specify ks_name and cf_name	2022-06-15 10:57:28 +02:00
Pavel Emelyanov	9a88bc260c	Merge 'various group0 start/stop issues' from Gleb The series fixes a couple of crashes that were found during starting and stopping Scylla with raft while doing ddl operations. Most of them related to shutdown order between different components. Also in scylla-dev gleb/group0-fixes-v1 CI https://jenkins.scylladb.com/job/releng/job/Scylla-CI/749/ * origin-dev/gleb/group0-fixes-v1: migration manager: remove unused code db/system_distributed_keyspace: do not announce empty schema main: stop raft before the migration manager storage_service: do not pass the raft group manager to storage_service constructor main: destroy the group0_client after stopping the group0	2022-06-15 11:44:03 +03:00
Michael Livshin	aab4cd850c	allow pre-scrub snapshots of materialized views and secondary indices Previously, any attempt to take a materialized view or secondary index snapshot was considered a mistake and caused the snapshot operation to abort, with a suggestion to snapshot the base table instead. But an automatic pre-scrub snapshot of a view cannot be attributed to user error, so the operation should not be aborted in that case. (It is an open question whether the more correct thing to do during pre-scrub snapshot would be to silently ignore views. Or perhaps they should be ignored in all cases except when the user explicitly asks to snapshot them, by name) Closes #10760. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-06-15 11:30:58 +03:00
Avi Kivity	e739f2b779	cql3: expr: make evaluate() return a cql3::raw_value rather than an expr::constant An expr::constant is an expression that happens to represent a constant, so it's too heavyweight to be used for evaluation. Right now the extra weight is just a type (which causes extra work by having to maintain the shared_ptr reference count), but it will grow in the future to include source location (for error reporting) and maybe other things. Prior to `e9b6171b5` ("Merge 'cql3: expr: unify left-hand-side and right-hand-side of binary_operator prepares' from Avi Kivity"), we had to use expr::constant since there was not enough type information in expressions. But now every expression carries its type (in programming language terms, expressions are now statically typed), so carrying types in values is not needed. So change evaluate() to return cql3::raw_value. The majority of the patch just changes that. The rest deals with some fallout: - cql3::raw_value gains a view() helper to convert to a raw_value_view, and is_null_or_unset() to match with expr::constant and reduce further churn. - some helpers that worked on expr::constant and now receive a raw_value now need the type passed via an additional argument. The type is computed from the expression by the caller. - many type checks during expression evaluation were dropped. This is a consequence of static typing - we must trust the expression prepare phase to perform full type checking since values no longer carry type information. Closes #10797	2022-06-15 08:47:24 +02:00
Avi Kivity	5129280f45	Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec" This reverts commit `e0670f0bb5`, reversing changes made to `605ee74c39`. It causes failures in debug mode in database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain, though with low probability. Fixes #10780 Reopens #652.	2022-06-14 18:06:22 +03:00
Benny Halevy	5bd2e0ccce	test: memtable_test: failed_flush_prevents_writes: validate flush using min_memtable_timestamp active_memtable().empty() becomes true once seal_active_memtable succeeds with _memtables->add_memtable(), not when it is able to flush the (once active) memtable. In contrast, min_memtable_timestamp() returns api::max_timestamp only if there is no data in any memtable. Fixes #10793 Backport notes: - Introduced in `f6d9d6175f` (currently in branch-5.0) - backport requires also `0e78ad50ea` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10798	2022-06-14 16:13:35 +03:00
Piotr Sarna	61ae0a46e3	Merge 'Three small fixes to Alternator's handling of GSIs... and LSIs' from Nadav Har'El This series includes three small fixes (and of course, tests) for various edge cases of GSI and LSI handling in Alternator: 1. We add the IndexArn that were missing in DescribeTable for indexes (GSI and LSI) 2. We forbid the same name to be used for both GSI and LSI (allowing it was a bug, not a feature) 3. We improve the error handling when trying to tag a GSI or LSI, which is not currently allowed (it's also not allowed in DynamoDB). Closes #10791 * github.com:scylladb/scylla: alternator: improve error handling when trying to tag a GSI or LSI alternator: forbid duplicate index (LSI and GSI) names alternator: add ARN for indexes (LSI and GSI)	2022-06-14 07:39:44 +02:00
Nadav Har'El	e20233dab1	alternator: improve error handling when trying to tag a GSI or LSI In issue #10786, we raised the idea of maybe allowing to tag (with TagResource) GSIs and LSIs, not just base tables. However, currently, neither DynamoDB nor Scylla allows it. So in this patch we add a test that confirms this. And while at it, we fix Alternator to return the same error message as DynamoDB in this case. Refs #10786. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-13 18:14:42 +03:00
Nadav Har'El	8866c326de	alternator: forbid duplicate index (LSI and GSI) names Adding an LSI and GSI with the same name to the same Alternator table should be forbidden - because if both exist only one of them (the GSI) would actually be usable. DynamoDB also forbids such duplicate names. So in this patch we add a test for this issue, and fix it. Since the patch involves a few more uses of the IndexName string, we also clean up its handling a bit, to use std::string_view instead of the old-style std::string&. Fixes #10789 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-13 18:14:42 +03:00
Nadav Har'El	00866a75d8	alternator: add ARN for indexes (LSI and GSI) DynamoDB gives an ARN ("Amazon Resource Name") to LSIs and GSIs. These look like BASEARN/index/INDEXNAME, where BASEARN is the ARN of the base table, and INDEXNAME is the name of the LSI or the GSI. These ARNs should be returned by DescribeTable as part of its description of each index, and this patch adds that missing IndexArn field. The ARN we're adding here is hardly useful (e.g., as explained in issue #10786, it can't be used to add tags to the index table), but nevertheless should exist for compatibility with DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-13 18:14:42 +03:00
Botond Dénes	b820aad3e0	Merge 'test/cql-pytest: skip another test on older, buggy, drivers' from Nadav Har'El Older versions of the Python Cassandra driver had a bug where a single empty page aborts a scan. The test test_secondary_index.py::test_filter_and_limit uses filtering and deliberately tiny pages, so it turns out that some of them are empty, so the test breaks on buggy versions of the driver, which cause the test to fail when run by developers who happen to have old versions of the driver. So in this small series we skip this test when running on a buggy version of the driver. Fixes #10763 Closes #10766 * github.com:scylladb/scylla: test/cql-pytest: skip another test on older, buggy, drivers test/cql-pytest: de-duplicate code checking for an old buggy driver	2022-06-13 16:06:11 +03:00
Kamil Braun	e87ca733f0	Merge 'test.py: fix bugs, add support for flaky tests' from Konstantin Osipov Marking a test as flaky allows to keep running it in CI rather than disable it when it's discovered that a test is flaky. Flaky tests, if they fail, show up as flaky in the output, but don't fail the CI. ``` kostja@hulk:~/work/scylla/scylla$ ./test.py cdc_with --repeat=30 --verbose Found 30 tests. ================================================================================ [N/TOTAL] SUITE MODE RESULT TEST ------------------------------------------------------------------------------ [1/30] cql debug [ FLKY ] cdc_with_lwt_test.2 9.36s [2/30] cql debug [ FLKY ] cdc_with_lwt_test.1 9.53s [3/30] cql debug [ PASS ] cdc_with_lwt_test.7 9.37s [4/30] cql debug [ PASS ] cdc_with_lwt_test.8 9.41s [5/30] cql debug [ PASS ] cdc_with_lwt_test.10 9.76s [6/30] cql debug [ FLKY ] cdc_with_lwt_test.9 9.71s ``` Closes #10721 * github.com:scylladb/scylla: test.py: add support for flaky tests test.py: make Test hierarchy resettable test.py: proper suite name in the log test.py: shutdown cassandra-python connection before exit	2022-06-10 19:00:36 +02:00
Konstantin Osipov	2b92d96c87	test.py: proper suite name in the log Use a nice suite name rather than an internal Python object key in the log. Fixes a regression introduced when addressing a style-related review remark.	2022-06-10 14:10:21 +03:00
Konstantin Osipov	950d606e38	test.py: shutdown cassandra-python connection before exit Shutdown cassandra-python connections before exit, to avoid warnings/exceptions at shutdown. Cassandra-python runs a thread pool and if connections are not shut down before exit, there could be a warning that the thread pool is not destroyed before exiting main.	2022-06-10 14:10:21 +03:00
Kamil Braun	aeba88cc29	Merge 'test.py: fixes for connection handling' from Alecco Change port type passed to Cassandra Python driver to int to avoid format errors in exceptions. Manually shutdown connections to avoid reconnects after tests are done (required by upcoming async pytests). Tests: (dev) Closes #10722 * github.com:scylladb/scylla: test.py: shutdown connection manually test.py: fix port type passed to Cassandra driver	2022-06-10 11:40:47 +02:00
Botond Dénes	1c8c693ff7	Merge "Redefine Leveled compaction backlog" from Raphael S. Carvalho " This series is a consequence of the work started by: "compaction: LCS: Fix inefficiency when pushing SSTables to higher levels" `9de7abdc80` "Redefine Compaction Backlog to tame compaction aggressiveness" `d8833de3bb` The backlog definition for leveled is incorrectly built on the assumption that the world must reach the state of zero amplification, i.e. everything in the last level. The actual goal is space amplification of 1.1. In reality, LCS just wants that for every level L, level L is fan_out=10 times larger than L-1. See more in commit `9de7abdc80` which adjusts LCS to conform to this goal. If level 3 = 1000G, level 2 = 100G, level 1 = 10G, level 0 = 1G, that should return zero backlog as space amplification is (1000+100+10+1)/1000 = ~1.1 But today, LCS calculates high backlog for the layout above, as it will only be satisfied once everything is promoted to the maximum level. That's completely disconnected from what the strategy actually wants. Therefore, a mismatch. With today's definition, the backlog for any SSTable is: sizeof(sstable) * (Lmax - levelof(sstable)) * fan_out where Lmax = maximum level, and fan_out = LCS' fan out which is 10 by default That's essentially calculating the total cost for data in the SSTable to climb up to the maximum level. Of course, if a SSTable is at the maximum level, (Lmax - levelof(sstable)) returns zero, therefore backlog for it is zero. Take a look at this example: If L0 sstable is 0.16G, then its backlog = 0.16G * (3 - 0) * 10 = 4.8G 0.16G = LCS' default fragment size Maximum level (Lmax in formula) can be easily 3 as: log10 of (30G/0.16G=~187 sstables)) = ~2.27 ~2.27 means that data has exceeded level 2 capacity and so needs 3 levels. So 3 L0 sstables could add ~15G of backlog. With 1G memory per shard (30:1 disk memory ratio), that's normalized backlog of ~15, which translates into additional ~500 shares. 
That's halfway to full compaction speed. With more files in higher levels, we can easily get to a normalized backlog above 30, resulting in 1k shares. The suboptimal backlog definition causes either table using LCS or coexisting tables to run with more shares than needed, causing compaction to steal resources, resulting in higher latency and reduced throughput. To solve this problem, a new formula is used which will basically calculate the amount of work needed to achieve the layout goal. We no longer want to promote everything to the last level, but instead we'll incrementally calculate the backlog in each level L, which is the amount of work needed such that the next level L + 1 is at least fan_out times bigger. Fixes #10583. Results ===== image: https://user-images.githubusercontent.com/1409139/168713675-d5987d09-7011-417c-9f91-70831c069382.png The patched version correctly clears the backlog, meaning that once LCS is satisfied, backlog is 0. Therefore, next compaction either from this table or another won't run unnecessarily aggressive. p99 read and write latency have clearly improved. throughput is also more stable. " * 'LCS_backlog_revamp' of https://github.com/raphaelsc/scylla: tests: sstable_compaction_test: Adjust controller unit test for LCS compaction: Redefine Leveled compaction backlog	2022-06-10 09:21:13 +03:00
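The arithmetic in the `1c8c693ff7` merge message above can be reproduced with a short Python sketch. It only restates the old per-SSTable formula and the space amplification example from the message; the function names are hypothetical, and the new backlog formula is not reproduced here because the message does not spell it out.

```python
FAN_OUT = 10  # LCS default fan out, as stated in the merge message


def old_backlog_gb(sstable_size_gb, level, max_level, fan_out=FAN_OUT):
    """Old definition: cost for the SSTable's data to climb to the maximum level."""
    return sstable_size_gb * (max_level - level) * fan_out


def space_amplification(level_sizes_gb):
    """Total data across levels divided by the size of the last (largest) level."""
    return sum(level_sizes_gb) / level_sizes_gb[-1]


# One 0.16G L0 fragment with Lmax = 3 contributes about 4.8G of backlog.
one_fragment = old_backlog_gb(0.16, level=0, max_level=3)

# Three such fragments contribute about 14.4G, the "~15G" quoted in the message.
three_fragments = 3 * one_fragment

# An SSTable already at the maximum level contributes nothing.
at_max_level = old_backlog_gb(0.16, level=3, max_level=3)

# Levels of 1G, 10G, 100G, 1000G give roughly 1.1 space amplification,
# which is the layout the strategy actually aims for.
amplification = space_amplification([1, 10, 100, 1000])
```

The mismatch the series addresses is visible here: the layout with amplification of about 1.1 already satisfies LCS, yet under the old formula every SSTable below the maximum level still reports a non-zero backlog.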
Raphael S. Carvalho	079283193a	tests: sstable_compaction_test: Adjust controller unit test for LCS The controller unit test for LCS was only creating level 0 SSTables. As level 0 falls back to STCS controller, it means that we weren't actually testing LCS controller. So let's adjust the unit test to account for LCS fan_out, which is 10 instead of 4, and also allow creation of SSTables on higher levels. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-06-09 14:21:40 -03:00
Nadav Har'El	b2444b6e9f	test/cql-pytest: skip another test on older, buggy, drivers Older versions of the Python Cassandra driver had a bug, detected by the driver_bug_1 fixture, where a single empty page aborts a scan. The test test_secondary_index.py::test_filter_and_limit uses filtering and deliberately tiny pages, so it turns out that some of them are empty, so the test breaks on buggy versions of the driver, which causes the test to fail when run by developers who happen to have old versions of the driver. So in this patch we use the driver_bug_1 fixture, to skip this test when running on a buggy version of the driver. Fixes #10763 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-09 14:37:45 +03:00
Nadav Har'El	c8a3d0758a	test/cql-pytest: de-duplicate code checking for an old buggy driver We have in test_filtering.py two tests which fail when running on an old version of the Python driver which has a specific bug, so we skip those tests if the buggy driver is installed. But the code to check the driver version is duplicated twice, so in this patch we move the version-checking-and-skipping code to a fixture, which we can use twice. The motivation is that in the next patch we will want to introduce a third use of the same code - and a fixture is cleaner than a third duplicate. This patch is supposed to be code-movement only, without functional changes. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2022-06-09 14:23:40 +03:00
Gleb Natapov	70b7b2b4d6	storage_service: do not pass the raft group manager to storage_service constructor Reduce the storage_service's dependency on the raft group manager. The group manager is needed only during bootstrap and in an rpc handler, so pass it to those functions directly.	2022-06-09 09:40:55 +03:00
Nadav Har'El	75c2bd78ae	test/alternator: reproducer for GetBatchItem duplicate keys It turns out that DynamoDB forbids requesting the same item more than once in a GetBatchItem request. Trying to do it would obviously be a waste, but DynamoDB outright refuses it - and Alternator currently doesn't (refs #10757). The test currently passes on DynamoDB and fails on Alternator, so it is marked xfail. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #10758	2022-06-09 07:04:50 +02:00
Petr Gusev	0450974057	cql3_type::raw_collection: handle unknown types first The issue is about handling errors when the user specifies something strange instead of a type, e.g. CREATE TABLE try1 (a int PRIMARY KEY, b list<zzz>): * the error message only talks about collections, while zzz could also be a UDT; * the same error message is given even when zzz is not a valid collection or UDT name. The first point has already been fixed, now Scylla says 'Non-frozen user types or collections are not allowed inside collections: list<zzz>'. This commit fixes the second. Whether the type is a valid UDT or not is checked in cql3_type::raw_ut::prepare_internal, but the 'non-frozen' check triggers first in cql3_type::raw_collection::prepare_internal, before we recursively get to the argument types of the collection. The patch reverses the order here: first we recurse and ensure that the collection argument types are valid, and only then do we apply the collection checks. A side effect of this is that the error messages of the checks in raw_collection will include the keyspace name, because it will now be assigned in raw_ut::prepare_internal before them. The patch affects the validation order, so in case of list<zzz<xxx>> the message could be different, but it doesn't seem to be possible according to the CQL grammar. Examples: create type ut2 (a int, b list<ut1>); --> error('Unknown type ks.ut1') create type ut1 (a int); create type ut2 (a int, b list<ut1>); --> error('Non-frozen user types or collections are not allowed inside collections: list<ks.ut1>') create type ut2 (a int, b list<frozen<ut1>>); --> OK Fixes: scylladb#3541 Closes #10726	2022-06-07 11:16:12 +02:00
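The validation order described in the commit above (check the element types first, then apply the collection checks) can be sketched as a toy model. This is not Scylla's C++ code: the classes `Native`, `UserType`, and `ListOf`, the `prepare` and `render` functions, and the `InvalidRequest` exception are hypothetical stand-ins. Only the two error strings are taken from the commit message.

```python
from dataclasses import dataclass


class InvalidRequest(Exception):
    """Hypothetical stand-in for the error returned to the client."""


@dataclass
class Native:
    name: str


@dataclass
class UserType:
    name: str
    frozen: bool = False


@dataclass
class ListOf:
    element: object
    frozen: bool = False


def render(t, keyspace):
    """Format a type for error messages, qualifying UDTs with the keyspace."""
    if isinstance(t, Native):
        return t.name
    if isinstance(t, UserType):
        text = f"{keyspace}.{t.name}"
    else:
        text = f"list<{render(t.element, keyspace)}>"
    return f"frozen<{text}>" if t.frozen else text


def prepare(t, keyspace, known_udts):
    """Validate a type: recurse into element types before collection checks."""
    if isinstance(t, Native):
        return
    if isinstance(t, UserType):
        if t.name not in known_udts:
            raise InvalidRequest(f"Unknown type {keyspace}.{t.name}")
        return
    # Collection: validate the element type first, so an unknown type
    # is reported as unknown rather than as "non-frozen".
    prepare(t.element, keyspace, known_udts)
    if isinstance(t.element, (UserType, ListOf)) and not t.element.frozen:
        raise InvalidRequest(
            "Non-frozen user types or collections are not allowed "
            "inside collections: " + render(t, keyspace)
        )
```

With an empty set of known UDTs, `prepare(ListOf(UserType("ut1")), "ks", set())` reports the unknown type. Once `ut1` is known, the same call reports the non-frozen error, and wrapping the element as `UserType("ut1", frozen=True)` passes.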
Tomasz Grabiec	beadd248e3	memtable, cache: Eagerly compact data with tombstones When memtable receives a tombstone it can happen under some workloads that it covers data which is still in the memtable. Some workloads may insert and delete data within a short time frame. We could reduce the rate of memtable flushes if we eagerly drop tombstoned data. One workload which benefits is the raft log. It stores a row for each uncommitted raft entry. When entries are committed they are deleted. So the live set is expected to be short under normal conditions. Fixes #652.	2022-06-06 19:25:41 +02:00
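The idea of eagerly dropping tombstoned data can be shown with a toy model. This is a sketch under simplifying assumptions, not Scylla's memtable: the `ToyMemtable` class and its methods are hypothetical, it tracks only per-row tombstones, and it assumes a tombstone wins over a write with the same timestamp.

```python
class ToyMemtable:
    """Hypothetical in-memory table that compacts eagerly on delete."""

    def __init__(self):
        self.rows = {}        # key -> (timestamp, value)
        self.tombstones = {}  # key -> newest deletion timestamp

    def write(self, key, timestamp, value):
        # A write covered by an existing tombstone is already dead,
        # so it is never stored.
        tomb = self.tombstones.get(key)
        if tomb is not None and timestamp <= tomb:
            return
        current = self.rows.get(key)
        if current is None or current[0] < timestamp:
            self.rows[key] = (timestamp, value)

    def delete(self, key, timestamp):
        # Keep only the newest tombstone for the key.
        tomb = self.tombstones.get(key)
        if tomb is None or timestamp > tomb:
            tomb = timestamp
        self.tombstones[key] = tomb
        # Eager compaction: drop the covered row now instead of
        # keeping it in memory until the next flush.
        row = self.rows.get(key)
        if row is not None and row[0] <= tomb:
            del self.rows[key]
```

In a workload shaped like the raft log from the commit message, where each entry is written and deleted shortly afterwards, `rows` in this model holds only the entries that are still live, which is what keeps memory use and flush frequency down.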
Tomasz Grabiec	9135d1fd1f	memtable: Subtract from flushed memory when cleaning This patch prevents virtual dirty from going negative during memtable flush in case partition version merging erases data previously accounted by the flush reader. There is an assert in ~flush_memory_accounter which guards for this. This will start happening after tombstones are compacted with rows on partition version merging. This problem is prevented by the patch by having the cleaner notify the memtable layer via callback about the amount of dirty memory released during merging, so that the memtable layer can adjust its accounting.	2022-06-06 19:25:41 +02:00
Tomasz Grabiec	374234cf76	test: mutation: Compare against compacted mutations Memtables and cache will compact eagerly, so tests should not expect readers to produce exact mutations written, only those which are equivalent after applying compaction.	2022-06-06 19:25:40 +02:00
Tomasz Grabiec	604e720706	compacting_reader: Drop irrelevant tombstones The compacting reader created using make_compacting_reader() was not dropping range_tombstone_change fragments which were shadowed by the partition tombstones. As a result the output fragment stream was not minimal. Lack of this change would cause problems in unit tests later in the series after the change which makes memtables lazily compact partition versions. In test_reverse_reader_reads_in_native_reverse_order we compare output of two readers, and assume that compacted streams are the same. If compacting reader doesn't produce minimal output, then the streams could differ if one of them went through the compaction in the memtable (which is minimal).	2022-06-06 19:23:37 +02:00
Tomasz Grabiec	0e3c4fc641	mvcc: Apply mutations in memtable with preemption enabled Prerequisite for eagerly applying tombstones, which we want to be preemptible. Before the patch, apply path to the memtable was not preemptible. Because merging can now be deferred, we need to involve snapshots to kick off background merging in case of preemption. This requires us to propagate region and cleaner objects, in order to create a snapshot.	2022-06-06 19:23:37 +02:00
Tomasz Grabiec	0e78ad50ea	test: memtable: Make failed_flush_prevents_writes() immune to background merging Before the change, the test artificially set the soft pressure condition hoping that the background flusher will flush the memtable. It won't happen if by the time the background flusher runs the LSA region is updated and soft pressure (which is not really there) is lifted. Once apply() becomes preemptible, background partition version merging can lift the soft pressure, making the memtable flush not occur and making the test fail. Fix by triggering soft pressure on retries.	2022-06-06 19:23:37 +02:00
Alejo Sanchez	98061c8960	test.py: shutdown connection manually To prevent async scheduling issues of reconnection after tests are done, manually close the connection after fixture ends. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-03 12:09:18 +02:00
Alejo Sanchez	17afcff228	test.py: fix port type passed to Cassandra driver Port is expected to be int, not str. Using a str causes errors for exception formatting. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2022-06-03 12:09:06 +02:00
Botond Dénes	49215fcff7	Merge 'Remove `flat_mutation_reader` (v1)' from Michael Livshin - Introduce a simpler substitute for `flat_mutation_reader`-resulting-from-a-downgrade that is adequate for the remaining uses but is _not_ a full-fledged reader (does not redirect all logic to an `::impl`, does not buffer, does not really have `::peek()`), so hopefully carries a smaller performance overhead. The name `mutation_fragment_v1_stream` is kind of a mouthful but it's the best I have - (not tests) Use the above instead of `downgrade_to_v1()` - Plug it in as another option in `mutation_source`, in and out - (tests) Substitute deliberate uses of `downgrade_to_v1()` with `mutation_fragment_v1_stream()` - (tests) Replace all the previously-overlooked occurrences of `mutation_source::make_reader()` with `mutation_source::make_reader_v2()`, or with `mutation_source::make_fragment_v1_stream()` where deliberate or still required (see below) - (tests) This series still leaves some tests with `mutation_fragment_v1_stream` (i.e. at v1) where not called for by the test logic per se, because another missing piece of work is figuring out how to properly feed `mutation_fragment_v2` (i.e. range tombstone changes) to `mutation_partition`. 
While that is not done (and I think it's better to punt on it in this PR), we have to produce `mutation_fragment` instances in tests that `apply()` them to `mutation_partition`, thus we still use downgraded readers in those tests - Remove the `flat_mutation_reader` class and things downstream of it Fixes #10586 Closes #10654 * github.com:scylladb/scylla: fix "ninja dev-headers" flat_mutation_reader ist tot tests: downgrade_to_v1() -> mutation_fragment_v1_stream() tests: flat_reader_assertions: refactor out match_compacted_mutation() tests: ms.make_reader() -> ms.make_fragment_v1_stream() repair/row_level: mutation_fragment_v1_stream() instead of downgrade_to_v1() stream_transfer_task: mutation_fragment_v1_stream() instead of downgrade_to_v1() sstables_loader: mutation_fragment_v1_stream() instead of downgrade_to_v1() mutation_source: add ::make_fragment_v1_stream() introduce mutation_fragment_v1_stream tests: ms.make_reader() -> ms.make_reader_v2() tests: remove test_downgrade_to_v1_clear_buffer() mutation_source_test: fix indentation tests: remove some redundant calls to downgrade_to_v1() tests: remove some to-become-pointless ms.make_reader()-using tests tests: remove some to-become-pointless reader downgrade tests	2022-06-03 07:26:29 +03:00
Kamil Braun	72f629c2b6	test: cdc_enable_disable_test: remove non-determinism The test sometimes fails because the order of rows in the SELECT results depends on how stream IDs for the different partition keys get generated. In some runs the stream ID for pk=1 may go before the stream ID for pk=4, in some runs the other way. The fix is to use the same partition key but different clustering keys for the different rows. Refs: #10601 Closes #10718	2022-06-02 19:40:07 +03:00


3218 Commits