scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-08 16:03:20 +00:00

Author	SHA1	Message	Date
Kamil Braun	2032d7dbe4	test: scylla_cluster: return the new IP from `change_ip` API Also simplify the API by getting rid of `ActionReturn` and returning errors through exceptions (which are correctly forwarded to the client for some time already).	2023-07-06 10:24:46 +02:00
Kamil Braun	00f51ea753	test: node replace with `ignore_dead_nodes` test Regression test for #14487 on steroids. It performs 3 consecutive node replace operations, starting with 3 dead nodes. In order to have a Raft majority, we have to boot a 7-node cluster, so we enable this test only in one mode; the choice was between `dev` and `release`, I picked `dev` because it compiles faster and I develop on it.	2023-07-06 10:24:46 +02:00
Kamil Braun	9b136ee574	test: scylla_cluster: accept `ignore_dead_nodes` in `ReplaceConfig`	2023-07-06 10:24:46 +02:00
Tomasz Grabiec	c25201c1a3	Merge 'view: fix range tombstone handling on flushes in view_updating_consumer' from Michał Chojnowski View update routines accept `mutation` objects. But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects. To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2. To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_updating_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error. This patch fixes the problem by closing the active range tombstone (and reopening it at the same position in the next `mutation` object). The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic. Fixes https://github.com/scylladb/scylladb/issues/14503 Closes #14502 * github.com:scylladb/scylladb: test: view_build_test: add range tombstones to test_view_update_generator_buffering test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations view_updating_consumer: make buffer limit a variable view: fix range tombstone handling on flushes in view_updating_consumer	2023-07-05 21:21:43 +02:00
Michał Chojnowski	f6203f2bd4	test: view_build_test: add range tombstones to test_view_update_generator_buffering This patch adds a full-range tombstone to the compacted mutation. This raises the coverage of the test. In particular, it reproduces issue #14503, which should have been caught by this test, but wasn't.	2023-07-05 17:33:49 +02:00
Michał Chojnowski	aab10402ce	test: view_build_test: add test_view_udate_generator_buffering_with_random_mutations A random mutation test for view_updating_consumer's buffering logic. Reproduces #14503.	2023-07-05 17:33:49 +02:00
Michał Chojnowski	ac29b6f198	view_updating_consumer: make buffer limit a variable The limit doesn't change at runtime, but this patch makes it a variable for unit testing purposes.	2023-07-05 17:33:47 +02:00
Raphael S. Carvalho	5d34db2532	test: Extend sstable partition skipping test to cover fast forward using token Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-07-05 11:38:58 -03:00
Pavel Emelyanov	e91f95a629	Merge 's3/test: restructure object_store test into a pytest based test suite' from Kefu Chai in this series, test/object_storage is restructured into a pytest based test. this paves the way for a test suite that covers more use cases, so we can add some more lower-level tests for tiered/caching-store. Closes #14165 * github.com:scylladb/scylladb: s3/test: do not return ip in managed_cluster() s3/test: verify the behavior with asserts s3/test: restructure object_store/run into a pytest s3/test: extract get_scylla_with_s3_cmd() out s3/test: s/restart_with_dir/kill_with_dir/ s3/test: vendor run_with_dir() and friends s3/test: remove get_tempdir() s3/test: extract managed_cluster() out	2023-07-05 15:40:43 +03:00
Kefu Chai	9080f8842b	s3/test: do not return ip in managed_cluster() let's just use cluster.contact_points for retrieving the IP address of the scylla node in this single-node cluster. so the name of managed_cluster() is less weird. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-05 17:07:39 +08:00
Kefu Chai	ec6410653f	s3/test: verify the behavior with asserts instead of assigning to "success", let's use assert for this purpose. simpler this way. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-05 17:07:21 +08:00
Kefu Chai	471d75c6c6	s3/test: restructure object_store/run into a pytest instead of using a single run to perform the test, restructure it into a pytest based test suite with a single test case. this should allow us to add more tests exercising the object-storage and cached/tiered storage in future. * add fixtures so they can be reused by tests * use tmpdir fixture for managing the tmpdir, see https://docs.pytest.org/en/6.2.x/tmpdir.html#the-tmpdir-fixture * perform part of the teardown in the "test_tempdir()" fixture * change the type of test from "Run" to "Python" * rename "run" to "test_basic.py" * optionally start the minio server if the settings are not found in command line or env variables, so that the tests are self-contained without the fixture setup by test.py. * instead of sys.exit(), use assert statement, as this is what pytest uses. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-05 17:05:13 +08:00
Petr Gusev	b69bc97673	repair_test: add test_reader_with_different_strategies	2023-07-05 13:02:17 +04:00
Kefu Chai	bffaf84395	s3/test: extract get_scylla_with_s3_cmd() out * define a dedicated S3_server class which duck types MinioServer. it will be used to represent S3 server in place of MinioServer if S3 is used for testing * prepare object_storage.yaml in get_scylla_with_s3(), so it is more clear that we are using the same set of settings for launching scylla Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-05 16:49:04 +08:00
Kefu Chai	f74218f434	s3/test: s/restart_with_dir/kill_with_dir/ replace the restart_with_dir() with kill_with_dir(), so that we can simplify the usage of managed_cluster() by enabling it to start and stop the single-node cluster. with this change, the caller does not need to run the scylla and pass its pid to this function any more. since the restart_with_dir() call is superseded by managed_cluster(), which tears down the cluster, teardown() is now only responsible to print out the log file. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-05 16:48:25 +08:00
Kefu Chai	a6bb5864ff	s3/test: vendor run_with_dir() and friends so we don't need to mess with cql-pytest/run.py, which is used by cql-pytest. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-05 16:48:04 +08:00
Kefu Chai	b45049c968	s3/test: remove get_tempdir() to match with another call of managed_cluster(), so it's clear that we are just reusing test_tempdir. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-05 16:45:14 +08:00
Kefu Chai	a5a87d81c6	s3/test: extract managed_cluster() out for setting up the cluster and tearing it down. this helps to indent the code so that the lifecycle of the cluster is visually explicit. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-05 16:45:14 +08:00
Kefu Chai	1faf50fc05	test/pylib: do not hardwire alias to "local" define a variable for it. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-05 15:58:41 +08:00
Kefu Chai	d55cfdc152	test/pylib: retry if minio_server is not ready there is a chance that minio_server is not ready to serve after launching the server executable process. so we need to retry until the first "mc" command is able to talk to it. in this change, a method `mc()` is added to run the minio client, so we can retry the command before it times out. it also allows us to ignore the failure or specify the timeout. this should make the minio server ready before tests start to connect to it. Fixes #1719 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-05 15:57:59 +08:00
Michał Chojnowski	5ad0846bff	view: fix range tombstone handling on flushes in view_updating_consumer View update routines accept `mutation` objects. But what comes out of staging sstable readers is a stream of mutation_fragment_v2 objects. To build view updates after a repair/streaming, we have to convert the fragment stream into `mutation`s. This is done by piping the stream to mutation_rebuilder_v2. To keep memory usage limited, the stream for a single partition might have to be split into multiple partial `mutation` objects. view_updating_consumer does that, but in an improper way -- when the split/flush happens inside an active range tombstone, the range tombstone isn't closed properly. This is illegal, and triggers an internal error. This patch fixes the problem by closing the active range tombstone (and reopening it at the same position in the next `mutation` object). The tombstone is closed just after the last seen clustered position. This is not necessary for correctness -- for example we could delay all processing of the range tombstone until we see its end bound -- but it seems like the most natural semantic. Fixes #14503	2023-07-04 20:33:21 +02:00
Marcin Maliszkiewicz	6424dd5ec4	alternator: close output_stream when exception is thrown during response streaming When exception occurs and we omit closing output_stream then the whole process is brought down by an assertion in ~output_stream. Fixes https://github.com/scylladb/scylladb/issues/14453 Relates https://github.com/scylladb/scylladb/issues/14403 Closes #14454	2023-07-04 16:15:08 +03:00
Kefu Chai	c005b6dce0	test/pylib: chmod +x minio_server.py add a shebang line. so we can just launch a minio_server using ```console test/pylib/minio_server.py --host 127.0.0.1 ``` Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-04 13:19:34 +08:00
Kefu Chai	2bae0b9aa8	test/pylib: allow run minio_server.py as a stand-alone tool this would allow developer to run a minio server for testing, for instance, s3_test, using something like: ```console $ python3 test/pylib/minio_server.py --host 127.0.0.1 tempdir='/tmp/tmpfoobar-minio' export S3_SERVER_ADDRESS_FOR_TEST=127.0.0.1 export S3_SERVER_PORT_FOR_TEST=900 export S3_PUBLIC_BUCKET_FOR_TEST=testbucket ``` and developer is supposed to copy-and-paste the `export` commands to prepare the environmental variables for the test using the minio server. the tempdir is used for the rundir of minio, and it is also used for holding the log file of this tool. one might want to check it when necessary. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-04 13:14:42 +08:00
Tomasz Grabiec	f2ed9fcd7e	schema_mutations, migration_manager: Ignore empty partitions in per-table digest Schema digest is calculated by querying for mutations of all schema tables, then compacting them so that all tombstones in them are dropped. However, even if the mutation becomes empty after compaction, we still feed its partition key. If the same mutations were compacted prior to the query, because the tombstones expire, we won't get any mutation at all and won't feed the partition key. So schema digest will change once an empty partition of some schema table is compacted away. Tombstones expire 7 days after the schema change which introduces them. If one of the nodes is restarted after that, it will compute a different table schema digest on boot. This may cause performance problems. When sending a request from coordinator to replica, the replica needs the schema_ptr of the exact schema version requested by the coordinator. If it doesn't know that version, it will request it from the coordinator and perform a full schema merge. This adds latency to every such request. Schema versions which are not referenced are currently kept in cache for only 1 second, so if the request flow has a low-enough rate, this situation results in perpetual schema pulls. After `ae8d2a550d`, it is more likely to run into this situation, because table creation generates tombstones for all schema tables relevant to the table, even the ones which will be otherwise empty for the new table (e.g. computed_columns). This change introduces a cluster feature which when enabled will change digest calculation to be insensitive to expiry by ignoring empty partitions in digest calculation. When the feature is enabled, schema_ptrs are reloaded so that the window of discrepancy during transition is short and no rolling restart is required. A similar problem was fixed for per-node digest calculation in 18f484cc753d17d1e3658bcb5c73ed8f319d32e8. 
Per-table digest calculation was not fixed at that time because we didn't persist enabled features and they were not enabled early-enough on boot for us to depend on them in digest calculation. Now they are enabled before non-system tables are loaded so digest calculation can rely on cluster features. Fixes #4485.	2023-07-03 23:06:55 +02:00
Nadav Har'El	ec77172b4b	Merge 'cql3: convert the SELECT clause evaluation phase to expressions' from Avi Kivity SELECT clause components (selectors) are currently evaluated during query execution using a stateful class hierarchy. This state is needed to hold intermediate state while aggregating over multiple rows. Because the selectors are stateful, we must re-create them each query using a selector_factory hierarchy. We'd like to convert all of this to the unified expression evaluation machinery, so we can have just one grammar for expressions, and just one way to evaluate expressions, but the statefulness makes this complex. In commit `59ab9aac44` "(Merge 'functions: reframe aggregate functions in terms of scalar functions' from Avi Kivity)", we made aggregate functions stateless, moving their state to aggregate_function_selector::_accumulator, and therefore into the class hierarchy we're addressing now. Another reason for keeping state is that selectors that aren't aggregated capture the first value they see in a GROUP BY group. Since expressions can't contain state directly, we break apart expressions that contain aggregate functions into two: an inner expression that processes incoming rows within a group, and an outer expression that generates the group's output. The two expressions communicate via a newly introduced expression element: a temporary. The problem of non-aggregated columns requiring state is solved by encapsulating those columns in an internal aggregate function, called the "first" function. In terms of performance, this series has little effect, since the common case of selectors that only contain direct column references without transformations is evaluated via a fast path (`simple_selection`). This fast-path is preserved with almost no changes. While the series makes it possible to start to extend the grammar and unify expression syntaxes, it does not do so. The grammar is unchanged. 
There is just one breaking change: the `SELECT JSON` statement generates json object field names based on the input selectors. In one case the name of the field has changed, but it is an esoteric case (where a function call is selected as part of `SELECT JSON`), and the new behavior is compatible with Cassandra. Closes #14467 * github.com:scylladb/scylladb: cql3: selection: drop selector_factories, selectables, and selectors cql3: select_statement: stop using selector_factories in SELECT JSON cql3: selection: don't create selector_factories any more cql3: selection: collect column_definitions using expressions cql3: selection: reimplement selection::is_aggregate() cql3: selection: evaluate aggregation queries via expr::evaluate() cql3: selection, select_statement: fine tune add_column_for_post_processing() usage cql3: selection: evaluate non-aggregating complex selections using expr::evaluate() cql3: selection: store primary key in result_set_builder cql3: expression: fix field_selection::type interpretation by evaluate() cql3: selection: make result_set_builder::current non-optional<> cql3: selection: simplify row/group processing cql3: selection: convert requires_thread to expressions cql: selection: convert used_functions() to expressions cql3: selection: convert is_reducible/get_reductions to expressions cql3: selection: convert is_count() to expressions cql3: selection convert contains_ttl/contains_writetime to work on expressions cql3: selection: make simple_selectors stateless cql3: expression: add helper to split expressions with aggregate functions cql3: selection: short-circuit non-aggregations cql3: selection: drop validate_selectors cql3: select_statement: force aggregation if GROUP BY is used cql3: select_statement: levellize aggregation depth cql3: selection: skip first_function when collecting metadata cql3: select_statement: explicitly disable automatic parallelization with no aggregates cql3: expression: introduce temporaries cql3: 
select_statement: use prepared selectors cql3: selection: avoid selector_factories in collect_metadata() cql3: expressions: add "metadata mode" formatter for expressions cql3: selection: convert collect_metadata() to the prepared expression domain cql3: selection: convert processes_selection to work on prepared expressions cql3: selection: prepare selectors earlier cql3: raw_selector: deinline cql3: expression: reimplement verify_no_aggregate_functions() cql3: expression: add helpers to manage an expression's aggregation depth cql3: expression: improve printing of prepared function calls cql3: functions: add "first" aggregate function	2023-07-03 23:21:33 +03:00
Avi Kivity	d9cf81f1a6	cql3: select_statement: stop using selector_factories in SELECT JSON SELECT JSON uses selector_factories to obtain the names of the fields to insert into the json object, and we want to drop selector_factories entirely. Switch instead to the ":metadata" mode of printing expressions, which does what we want. Unfortunately, the switch changes how system functions are converted into field names. A function such as unixtimestampof() is now rendered as "system.unixtimestampof()"; before it did not have the keyspace prefix. This is a compatibility problem, albeit an obscure one. Since the new behavior matches Cassandra, and the odds of hitting this are very low, I think we can allow the change.	2023-07-03 19:45:17 +03:00
Avi Kivity	0021f77e30	cql3: expression: fix field_selection::type interpretation by evaluate() field_selection::type refers to the type of the selection operation, not the type of the structure being selected. This is what prepare_expression() generates and how all other expression elements work, but evaluate() for field_selection thinks it's the type of the structure, and so fails when it gets an expression from prepare_expression(). Fix that, and adjust the tests.	2023-07-03 19:45:17 +03:00
Avi Kivity	b1b4a18ad8	cql3: expression: add helpers to manage an expression's aggregation depth We define the "aggregation depth" of an expression by how many nested aggregation functions are applied. In CQL/SQL, legal values are 0 and 1, but for generality we deal with any aggregation depth. The first helper measures the maximum aggregation depth along any path in the expression graph. If it's 2 or greater, we have something like max(max(x)) and we should reject it (though these helpers don't). If we get 1 it's a simple aggregation. If it's zero then we're not aggregating (though CQL may decide to aggregate anyway if GROUP BY is used). The second helper edits an expression to make sure the aggregation depth along any path that reaches a column is the same. Logically, `SELECT x, max(y)` does not make sense, as one is a vector of values and the other is a scalar. CQL resolves the problem by defining x as "the first value seen". We apply this resolution by converting the query to `SELECT first(x), max(y)` (where `first()` is an internal aggregate function), so both selectors refer to scalars that consume vectors. When a scalar is consumed by an aggregate function (for example, `SELECT max(x), min(17)`) we don't have to bother, since a scalar is implicitly promoted to a vector by evaluating it every row. There is some ambiguity if the scalar is a non-pure function (e.g. `SELECT max(x), min(random())`), but it's not worth following. A small unit test is added.	2023-07-03 19:45:16 +03:00
Alejo Sanchez	520bd90008	test/boost/memtable_test: split test plain/reverse Split long running test test_memtable_with_many_versions_conforms_to_mutation_source to 2 tests for _plain and _reverse. Refs #13905 Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com> Closes #14447	2023-07-03 15:20:12 +03:00
Michał Jadwiszczak	58eb7a45b7	cql-pytest:test_describe: add test for generic UDT/UDF/UDA desc	2023-06-30 14:50:08 +02:00
Piotr Dulikowski	ee9bfb583c	combined: mergers: remove recursion in operator()() In mutation_reader_merger and clustering_order_reader_merger, the operator()() is responsible for producing mutation fragments that will be merged and pushed to the combined reader's buffer. Sometimes, it might have to advance existing readers, open new and / or close some existing ones, which requires calling a helper method and then calling operator()() recursively. In some unlucky circumstances, a stack overflow can occur: - Readers have to be opened incrementally, - Most or all readers must not produce any fragments and need to report end of stream without preemption, - There have to be enough readers opened within the lifetime of the combined reader (~500), - All of the above needs to happen within a single task quota. In order to prevent such a situation, the code of both reader merger classes was modified not to perform recursion at all. Most of the code of the operator()() was moved to maybe_produce_batch, which does not recurse if it is not possible for it to produce a fragment; instead it returns std::nullopt and operator()() calls this method in a loop via seastar::repeat_until_value. A regression test is added. Fixes: scylladb/scylladb#14415 Closes #14452	2023-06-30 12:07:13 +03:00
Kamil Braun	ff386e7a44	service: raft: force initial snapshot transfer in new cluster When we upgrade a cluster to use Raft, or perform manual Raft recovery procedure (which also creates a fresh group 0 cluster, using the same algorithm as during upgrade), we start with a non-empty group 0 state machine; in particular, the schema tables are non-empty. In this case we need to ensure that nodes which join group 0 receive the group 0 state. Right now this is not the case. In previous releases, where group 0 consisted only of schema, and schema pulls were also done outside Raft, those nodes received schema through this outside mechanism. In `91f609d065` we disabled schema pulls outside Raft; we're also extending group 0 with other things, like topology-specific state. To solve this, we force snapshot transfers by setting the initial snapshot index on the first group 0 server to `1` instead of `0`. During replication, Raft will see that the joining servers are behind, triggering snapshot transfer and forcing them to pull group 0 state. It's unnecessary to do this for a cluster which bootstraps with Raft enabled right away but it also doesn't hurt, so we keep the logic simple and don't introduce branches based on that. Extend Raft upgrade tests with a node bootstrap step at the end to prevent regressions (without this patch, the step would hang - node would never join, waiting for schema). Fixes: #14066 Closes #14336	2023-06-29 22:46:42 +02:00
Konstantin Osipov	3d81408a58	test.py: make `experimental: raft` the default for all tests Make sure all tests use the new centralized topology coordinator. This is a step forward towards maturing the coordinator implementation. Closes #14039	2023-06-29 14:44:00 +02:00
Botond Dénes	2a58b4a39a	Merge 'Compaction resharding tasks' from Aleksandra Martyniuk Task manager's tasks covering resharding compaction on table and shard level. Closes #14044 * github.com:scylladb/scylladb: test: extend test_compaction_task.py to test resharding compaction compaction: add shard_reshard_sstables_compaction_task_impl compaction: invoke resharding on sharded database compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run() replica: delete unused functions and struct compaction: add reshard_sstables_compaction_task_impl compaction: replica: copy struct and functions from distributed_loader.cc compaction: create resharding_compaction_task_impl	2023-06-29 12:10:54 +03:00
Nadav Har'El	dd63169077	Merge 'test/boost/index_with_paging_test: reduce running time' from Alecco Reduce test string value size, parallelize inserts, and use a prepared statement. The debug running time for this test is reduced from 13:18 to 7:52. Refs #13905 Closes #14380 * github.com:scylladb/scylladb: test/boost/index_with_paging_test: parallel insert test/boost/index_with_paging_test: prepared statement test/boost/index_with_paging_test: reduce running time	2023-06-29 10:45:01 +03:00
Avi Kivity	f6f974cdeb	cql3: selection: fix GROUP BY, empty groups, and aggregations A GROUP BY combined with aggregation should produce a single row per group, except for empty groups. This is in contrast to an aggregation without GROUP BY, which produces a single row no matter what. The existing code only considered the case of no grouping and forced a row into the result, but this caused an unwanted row if grouping was used. Fix by refining the check to also consider GROUP BY. XFAIL tests are relaxed. Fixes #12477. Note, forward_service requires that aggregation produce exactly one row, but since it can't work with grouping, it isn't affected. Closes #14399	2023-06-28 18:56:22 +03:00
Kamil Braun	b912eeade5	Merge 'merge raft commands to group0 before applying them whenever possible' from Gleb Since most group0 commands are just mutations, it is easy to combine them before passing them to the subsystem they are destined for, which is more efficient. The logic that handles those mutations in a subsystem will run once for each batch of commands instead of for each individual command. This is especially useful when a node catches up to a leader and gets a lot of commands together. The patch here does exactly that. It combines commands into a single command if possible, but it preserves the order between commands, so each time it encounters a command to a different subsystem it flushes the already combined batch and starts a new one. This extra safety assumes that there are dependencies between subsystems managed by group0, so the order matters. It may not be the case now, but we prefer to be on the safe side. Broadcast table commands are not mutations, so they are never combined. * 'raft-merge-cmds' of https://github.com/gleb-cloudius/scylla: test: add test for group0 raft command merging service: raft: respect max mutation size limit when persisting raft entries group0_state_machine: merge commands before applying them whenever possible	2023-06-28 17:21:07 +02:00
Alejo Sanchez	d4697ed21e	test/boost/index_with_paging_test: parallel insert Parallelize inserts for long-running test_index_with_paging. Run time in debug mode reduced by 1 minute 48 seconds. Refs #13905 Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-06-28 16:11:58 +02:00
Alejo Sanchez	70a3179888	test/boost/index_with_paging_test: prepared statement Prepare statement for insert. Run time in debug mode reduced by 9 seconds. Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-06-28 14:49:21 +02:00
Michał Jadwiszczak	0a8fcead08	cql3: Specify arguments types in UDA creation errors Display not only function name but also expected arguments if `state_function` or `final_function` was not found. Fixes: #12088 Closes #14278	2023-06-28 15:27:49 +03:00
Alejo Sanchez	48d24269f1	test/boost/index_with_paging_test: reduce running time Reduce test string value size for test_index_with_paging from 4096 to 100. With 100 bytes it should make the base row significantly larger than the key so the test will exercise both types of paging in the scanning code. The debug running time for this test is reduced from 9 minutes to 6 minutes. Refs #13905 Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>	2023-06-28 13:55:52 +02:00
Nadav Har'El	49c8c06b1b	Merge 'cql: fix crash on empty clustering range in LWT' from Jan Ciołek LWT queries with empty clustering range used to cause a crash. For example in: ```cql UPDATE tab SET r = 9000 WHERE p = 1 AND c = 2 AND c = 2000 IF r = 3 ``` The range of `c` is empty - there are no valid values. This caused a segfault when accessing the `first` range: ```c++ op.ranges.front() ``` Cassandra rejects such queries at the preparation stage. It doesn't allow two `EQ` restrictions on the same clustering column when an IF is involved. We reject them during runtime, which is a worse solution. The user can prepare a query with `c = ? AND c = ?`, and then run it, but unexpectedly it will throw an `invalid_request_exception` when the two bound variables are different. We could ban such queries as well, we already ban the usage of `IN` in conditional statements. The problem is that this would be a breaking change. A better solution would be to allow empty ranges in `LWT` statements. When an empty range is detected we just wouldn't apply the change. This would be a larger change, for now let's just fix the crash. Fixes: https://github.com/scylladb/scylladb/issues/13129 Closes #14429 * github.com:scylladb/scylladb: modification_statement: reject conditional statements with empty clustering key statements/cas_request: fix crash on empty clustering range in LWT	2023-06-28 14:43:54 +03:00
Aleksandra Martyniuk	bf3e0744c1	test: extend test_compaction_task.py to test resharding compaction	2023-06-28 11:43:12 +02:00
Jan Ciolek	ccdb26bf9e	statements/cas_request: fix crash on empty clustering range in LWT LWT queries with empty clustering range used to cause a crash. For example in: ```cql UPDATE tab SET r = 9000 WHERE p = 1 AND c = 2 AND c = 2000 IF r = 3 ``` The range of `c` is empty - there are no valid values. This caused a segfault when accessing the `first` range: ```c++ op.ranges.front() ``` To fix it let's throw an exception when the clustering range is empty. Cassandra also rejects queries with `c = 1 AND c = 2`. There's also a check for empty partition range, as it used to crash in the past, can't really hurt to add it. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2023-06-28 10:18:06 +02:00
Kamil Braun	96bc78905d	readers: evictable_reader: don't accidentally consume the entire partition The evictable reader must ensure that each buffer fill makes forward progress, i.e. the last fragment in the buffer has a position larger than the last fragment from the previous buffer-fill. Otherwise, the reader could get stuck in an infinite loop between buffer fills, if the reader is evicted in-between. The code guaranteeing this forward progress had a bug: the comparison between the position after the last buffer-fill and the current last fragment position was done in the wrong direction. So if the condition that we wanted to achieve was already true, we would continue filling the buffer until partition end which may lead to OOMs such as in #13491. There was already a fix in this area to handle `partition_start` fragments correctly - #13563 - but it missed that the position comparison was done in the wrong order. Fix the comparison and adjust one of the tests (added in #13563) to detect this case. Fixes #13491	2023-06-27 14:37:29 +02:00
Kamil Braun	5800ce8ddd	test: flat_mutation_reader_assertions: squash `r_t_c`s with the same position test_range_tombstones_v2 is too strict for this reader -- it expects a particular sequence of `range_tombstone_change`s, but multishard_combining_reader, when tested with a small buffer, may generate -- as expected -- additional (redundant) range tombstone change pairs (end+start). Currently we don't observe these redundant fragments due to a bug in `evictable_reader_v2` but they start appearing once we fix the bug and the test must be prepared first. To prepare the test, modify `flat_reader_assertions_v2` so it squashes redundant range tombstone change pairs. This happens only in non-exact mode. Enable exact mode in `test_sstable_reversing_reader_random_schema` for comparing two readers -- the squashing of `r_t_c`s may introduce an artificial difference.	2023-06-27 14:37:25 +02:00
Gleb Natapov	945f476363	test: add test for group0 raft command merging Add a test that submits 3 large commands, each a little bit larger than 1/3 of the maximum mutation size. Check that in the end 2 commands were executed (the first 2 were merged and the third was executed separately).	2023-06-27 14:59:55 +03:00
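The arithmetic behind the expected result can be checked with a small model. The greedy merging policy and the name `merge_commands` are assumptions for illustration; the commit message only states the sizes and the outcome.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Greedily merges consecutive commands while the merged size stays within
// max_size. Returns the size of each resulting (possibly merged) command.
inline std::vector<std::size_t> merge_commands(const std::vector<std::size_t>& sizes,
                                               std::size_t max_size) {
    std::vector<std::size_t> merged;
    for (std::size_t s : sizes) {
        assert(s <= max_size); // a single command must fit on its own
        if (!merged.empty() && merged.back() + s <= max_size) {
            merged.back() += s;
        } else {
            merged.push_back(s);
        }
    }
    return merged;
}
```

With a limit of 900 and three commands of size 301, the first two merge into 602, and adding the third would give 903, which exceeds the limit, so two commands result.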
Botond Dénes	f5e3b8df6d	Merge 'Optimize creation of reader excluding staging for view building' from Raphael "Raph" Carvalho View building from staging creates a reader from scratch (memtable + sstables - staging) for every partition, in order to calculate the diff between new staging data and data in base sstable set, and then pushes the result into the view replicas. perf shows that the reader creation is very expensive: ``` + 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes + 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator() + 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare + 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare + 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state + 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping + 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment + 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.07% 2.07% reactor-3 scylla [.] 
logalloc::region_impl::free + 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator() + 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe + 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64 + 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe + 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables:: + 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small + 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects + 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run + 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate + 1.19% 0.05% reactor-3 libc.so.6 [.] syscall + 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst + 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert ``` That shows some significant amount of work for inserting sstables into the interval map and maintaining the sstable run (which sorts fragments by first key and checks for overlapping). 
The interval map is known for having issues with L0 sstables, as it will have to be replicated almost to every single interval stored by the map, causing terrible space and time complexity. With enough L0 sstables, it can fall into quadratic behavior. This overhead is fixed by not building a new fresh sstable set when recreating the reader, but rather supplying a predicate to sstable set that will filter out staging sstables when creating either a single-key or range scan reader. This could have another benefit over today's approach which may incorrectly consider a staging sstable as non-staging, if the staging sst wasn't included in the current batch for view building. With this improvement, view building was measured to be 3x faster. from `INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s` to `INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s` Refs https://github.com/scylladb/scylladb/issues/14089. Fixes scylladb/scylladb#14244. Closes #14364 * github.com:scylladb/scylladb: table: Optimize creation of reader excluding staging for view building view_update_generator: Dump throughput and duration for view update from staging utils: Extract pretty printers into a header	2023-06-27 07:25:30 +03:00
Raphael S. Carvalho	1d8cb32a5d	table: Optimize creation of reader excluding staging for view building View building from staging creates a reader from scratch (memtable + sstables - staging) for every partition, in order to calculate the diff between new staging data and data in base sstable set, and then pushes the result into the view replicas. perf shows that the reader creation is very expensive: + 12.15% 10.75% reactor-3 scylla [.] lexicographical_tri_compare<compound_type<(allow_prefixes)0>::iterator, compound_type<(allow_prefixes)0>::iterator, legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator()(managed_bytes_basic_view<(mutable_view)0>, managed_bytes + 10.01% 9.99% reactor-3 scylla [.] boost::icl::is_empty<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 8.95% 8.94% reactor-3 scylla [.] legacy_compound_view<compound_type<(allow_prefixes)0> >::tri_comparator::operator() + 7.29% 7.28% reactor-3 scylla [.] dht::ring_position_tri_compare + 6.28% 6.27% reactor-3 scylla [.] dht::tri_compare + 4.11% 3.52% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 4.09% 4.07% reactor-3 scylla [.] sstables::index_consume_entry_context<sstables::index_consumer>::process_state + 3.46% 0.93% reactor-3 scylla [.] sstables::sstable_run::will_introduce_overlapping + 2.53% 2.53% reactor-3 libstdc++.so.6 [.] std::_Rb_tree_increment + 2.45% 2.45% reactor-3 scylla [.] boost::icl::non_empty::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.14% 2.13% reactor-3 scylla [.] boost::icl::exclusive_less<boost::icl::continuous_interval<compatible_ring_position_or_view, std::less> > + 2.07% 2.07% reactor-3 scylla [.] 
logalloc::region_impl::free + 2.06% 1.91% reactor-3 scylla [.] sstables::index_consumer::consume_entry(sstables::parsed_partition_index_entry&&)::{lambda()#1}::operator()() const::{lambda()#1}::operator() + 2.04% 2.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst+ 1.87% 0.00% reactor-3 [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe + 1.86% 0.00% reactor-3 [kernel.kallsyms] [k] do_syscall_64 + 1.39% 1.38% reactor-3 libc.so.6 [.] __memcmp_avx2_movbe + 1.37% 0.92% reactor-3 scylla [.] boost::icl::segmental::join_left<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables:: + 1.34% 1.33% reactor-3 scylla [.] logalloc::region_impl::alloc_small + 1.33% 1.33% reactor-3 scylla [.] seastar::memory::small_pool::add_more_objects + 1.30% 0.35% reactor-3 scylla [.] seastar::reactor::do_run + 1.29% 1.29% reactor-3 scylla [.] seastar::memory::allocate + 1.19% 0.05% reactor-3 libc.so.6 [.] syscall + 1.16% 1.04% reactor-3 scylla [.] boost::icl::interval_base_map<boost::icl::interval_map<compatible_ring_position_or_view, std::unordered_set<seastar::lw_shared_ptr<sstables::sstable>, std::hash<seastar::lw_shared_ptr<sstables::sstable> >, std::equal_to<seastar::lw_shared_ptr<sstables::sst + 1.07% 0.79% reactor-3 scylla [.] sstables::partitioned_sstable_set::insert That shows some significant amount of work for inserting sstables into the interval map and maintaining the sstable run (which sorts fragments by first key and checks for overlapping). 
The interval map is known for having issues with L0 sstables, as it will have to be replicated almost to every single interval stored by the map, causing terrible space and time complexity. With enough L0 sstables, it can fall into quadratic behavior. This overhead is fixed by not building a new fresh sstable set when recreating the reader, but rather supplying a predicate to sstable set that will filter out staging sstables when creating either a single-key or range scan reader. This could have another benefit over today's approach which may incorrectly consider a staging sstable as non-staging, if the staging sst wasn't included in the current batch for view building. With this improvement, view building was measured to be 3x faster. from INFO 2023-06-16 12:36:40,014 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 963957ms = 50kB/s to INFO 2023-06-16 14:47:12,129 [shard 0] view_update_generator - Processed keyspace1.standard1: 5 sstables in 319899ms = 150kB/s Refs #14089. Fixes #14244. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-06-26 22:30:39 -03:00
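The idea of passing a predicate instead of building a fresh sstable set can be sketched as below. All names (`sstable_info`, `sstable_predicate`, `select_for_reader`) are hypothetical and the model is far simpler than the real sstable set; it only shows that filtering at reader-creation time avoids copying or re-inserting the set contents.

```c++
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, minimal description of an sstable.
struct sstable_info {
    std::string name;
    bool staging;
};

using sstable_predicate = std::function<bool(const sstable_info&)>;

// Collects the sstables a reader would be built over, skipping those the
// predicate rejects. The existing set is only iterated, never rebuilt.
inline std::vector<const sstable_info*> select_for_reader(
        const std::vector<sstable_info>& set, const sstable_predicate& pred) {
    std::vector<const sstable_info*> selected;
    for (const sstable_info& sst : set) {
        if (pred(sst)) {
            selected.push_back(&sst);
        }
    }
    return selected;
}
```

A caller would pass something like `[](const sstable_info& s) { return !s.staging; }`, so the decision is made per sstable from its current state rather than from membership in a particular batch.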

11801 Commits