Commit Graph

4102 Commits

Kamil Braun
2ebac52d2d test/pylib: scylla_cluster: return error details from test framework endpoints
If an endpoint handler throws an exception, the details of the exception
are not returned to the client. Normally this is desirable so that
information is not leaked, but in this test framework we do want to
return the details to the client so it can log a useful error message.

Do it by wrapping every handler into a catch clause that returns
the exception message.

Also modify how HTTPErrors are rendered a bit, so it's easier to discern
the actual body of the error from other details (such as the params used
to make the request).

Before:
```
E test.pylib.rest_client.HTTPError: HTTP error 500: 500 Internal Server Error
E
E Server got itself in trouble, params None, json None, uri http+unix://api/cluster/before-test/test_stuff
```

After:
```
E test.pylib.rest_client.HTTPError: HTTP error 500, uri: http+unix://api/cluster/before-test/test_stuff, params: None, json: None, body:
E Failed to start server at host 127.155.129.1.
E Check the log files:
E /home/kbraun/dev/scylladb/testlog/test.py.dev.log
E /home/kbraun/dev/scylladb/testlog/dev/scylla-1.log
```
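
The wrapping is conceptually simple; a minimal sketch (aiohttp-style handler; the names and details here are illustrative, not the actual framework code):
```python
# Hypothetical sketch - not the real test/pylib code.
import traceback
from aiohttp import web

def return_exception_details(handler):
    """Wrap an endpoint handler so exception text reaches the test client."""
    async def wrapper(request: web.Request) -> web.Response:
        try:
            return await handler(request)
        except Exception:
            # Leaking details is fine here: the client is the test itself,
            # and it needs the message to log a useful error.
            return web.Response(status=500, text=traceback.format_exc())
    return wrapper
```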

Closes #12563

(cherry picked from commit 2f84e820fd)
2023-02-07 17:04:37 +01:00
Kamil Braun
b536614913 test/pylib: scylla_cluster: release cluster IPs when stopping ScyllaClusterManager
When we obtained a new cluster for a test case after the previous test
case left a dirty cluster, we would release the old cluster's used IP
addresses (`_before_test` function). However, we would not release the
last cluster's IPs after the last test case. We would run out of IPs with
sufficiently many test files or `--repeat` runs. Fix this.

Also reorder the operations a bit: stop the cluster (and release its
IPs) before freeing up space in the cluster pool (i.e. call
`self.cluster.stop()` before `self.clusters.steal()`). This reduces
concurrency a bit - fewer Scyllas running at the same time, which is
good (the pool size gives a limit on the desired max number of
concurrently running clusters). Killing a cluster is quick so it won't
make a significant difference for the next test waiting on the pool.
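
Schematically, the reordered teardown looks like this (a sketch only; apart from `stop()` and `steal()`, which are mentioned above, the helper names are assumptions):
```python
# Illustrative ordering only; release_ips() is an assumed helper and the real
# signatures may differ.
async def stop_manager(self) -> None:
    # Stop the cluster first, returning its IPs to the host registry...
    await self.cluster.stop()
    await self.cluster.release_ips()
    # ...and only then free a slot in the pool for the next test file waiting on it.
    self.clusters.steal()
```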

Closes #12564

(cherry picked from commit 3ed3966f13)
2023-02-07 17:04:19 +01:00
Kamil Braun
85df0fd2b1 test/pylib: scylla_cluster: mark cluster as dirty if it fails to boot
If a cluster fails to boot, it saves the exception in
the `self.start_exception` variable; the exception will be rethrown when
a test tries to start using this cluster. As explained in `before_test`:
```
    def before_test(self, name) -> None:
        """Check that  the cluster is ready for a test. If
        there was a start error, throw it here - the server is
        running when it's added to the pool, which can't be attributed
        to any specific test, throwing it here would stop a specific
        test."""
```
It's arguable whether we should blame some random test for a failure
that it didn't cause, but nevertheless, there's a problem here: the
`start_exception` will be rethrown and the test will fail, but then the
cluster will be simply returned to the pool and the next test will
attempt to use it... and so on.

Prevent this by marking the cluster as dirty the first time we rethrow
the exception.
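
A minimal sketch of the fix (`start_exception` comes from the message above; the dirty flag name is an assumption):
```python
# Hypothetical sketch of before_test() after the change.
def before_test(self, name: str) -> None:
    if self.start_exception is not None:
        # Mark the cluster dirty on the first rethrow so it is thrown away
        # instead of being handed back to the pool for the next test.
        self.is_dirty = True
        raise self.start_exception
```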

Closes #12560

(cherry picked from commit 147dd73996)
2023-02-07 17:03:56 +01:00
Avi Kivity
cdf9fe7023 test: disable commitlog O_DSYNC, preallocation
Commitlog O_DSYNC is intended to make Raft and schema writes durable
in the face of power loss. To make O_DSYNC performant, we preallocate
the commitlog segments, so that the commitlog writes only change file
data and not file metadata (which would require the filesystem to commit
its own log).

However, in tests, this causes each ScyllaDB instance to write 384MB
of commitlog segments. This overloads the disks and slows everything
down.

Fix this by disabling O_DSYNC (and therefore preallocation) during
the tests. They can't survive power loss anyway, and they already run
with --unsafe-bypass-fsync.

Closes #12542

(cherry picked from commit 9029b8dead)
2023-02-07 17:02:59 +01:00
Tomasz Grabiec
563998b69a Merge 'raft: improve group 0 reconfiguration failure handling' from Kamil Braun
Make it so that failures in `removenode`/`decommission` don't lead to reduced availability, and any leftovers in group 0 can be removed by `removenode`:
- In `removenode`, make the node a non-voter before removing it from the token ring. This removes the possibility of having a group 0 voting member which doesn't correspond to a token ring member. We can still be left with a non-voter, but that doesn't reduce the availability of group 0.
- As above but for `decommission`.
- Make it possible to remove group 0 members that don't correspond to token ring members from group 0 using `removenode`.
- Add an API to query the current group 0 configuration.

Fixes #11723.

Closes #12502

* github.com:scylladb/scylladb:
  test: test_topology: test for removing garbage group 0 members
  test/pylib: move some utility functions to util.py
  db: system_keyspace: add a virtual table with raft configuration
  db: system_keyspace: improve system.raft_snapshot_config schema
  service: storage_service: better error handling in `decommission`
  service: storage_service: fix indentation in removenode
  service: storage_service: make `removenode` work for group 0 members which are not token ring members
  service/raft: raft_group0: perform read_barrier in wait_for_raft
  service: storage_service: make leaving node a non-voter before removing it from group 0 in decommission/removenode
  test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove
  service/raft: raft_group0: link to Raft docs where appropriate
  service/raft: raft_group0: more logging
  service/raft: raft_group0: separate function for checking and waiting for Raft
2023-01-17 21:23:15 +01:00
Kamil Braun
d134c458e5 test/pylib: increase timeout when waiting for cluster before test
Increase the timeout from the default 5 minutes to 10 minutes.
Sent as a workaround for #12546 to unblock next promotions.

Closes #12547
2023-01-17 21:03:09 +02:00
Kamil Braun
4f1c317bdc test: test_raft_upgrade: stop servers gracefully in test_recovery_after_majority_loss
This test is frequently failing due to a timeout when we try to restart
one of the nodes. The shutdown procedure apparently hangs when we try to
stop the `hints_manager` service, e.g.:
```
INFO  2023-01-13 03:18:02,946 [shard 0] hints_manager - Asked to stop
INFO  2023-01-13 03:18:02,946 [shard 0] hints_manager - Stopped
INFO  2023-01-13 03:18:02,946 [shard 0] hints_manager - Asked to stop
INFO  2023-01-13 03:18:02,946 [shard 1] hints_manager - Asked to stop
INFO  2023-01-13 03:18:02,946 [shard 1] hints_manager - Stopped
INFO  2023-01-13 03:18:02,946 [shard 1] hints_manager - Asked to stop
INFO  2023-01-13 03:18:02,946 [shard 1] hints_manager - Stopped
INFO  2023-01-13 03:22:56,997 [shard 0] hints_manager - Stopped
```
Observe the 5-minute delay at the end.

There is a known issue about `hints_manager` stop hanging: #8079.

Now, for some reason, this is the only test case that is hitting this
issue. We don't completely understand why. There is one significant
difference between this test case and others: this is the only test case
which kills 2 (out of 3) servers in the cluster and then tries to
gracefully shut down the last server. There's a hypothesis that the last
server gets stuck trying to send hints to the killed servers. We weren't
able to prove or falsify it yet. But if it's true, then this patch will:
- unblock next promotions,
- give us some important information when we see that the issue stops
  appearing.
In the patch we shut down all servers gracefully instead of killing them,
like we do in the other test cases.
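
Schematically (the ManagerClient method names below are assumptions, not necessarily the real API):
```python
# Hedged illustration of the change: stop every server gracefully instead of killing it.
for srv in await manager.running_servers():
    await manager.server_stop_gracefully(srv.server_id)   # was: hard-kill the server
```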

Closes #12548
2023-01-17 20:51:09 +02:00
Kamil Braun
5545547d07 test: test_topology: test for removing garbage group 0 members
Verify that `removenode` can remove group 0 members which are not token
ring members.
2023-01-17 12:28:00 +01:00
Kamil Braun
c959ec455a test/pylib: move some utility functions to util.py
They were used in test_raft_upgrade, but we want to use them in other
test files too.
2023-01-17 12:28:00 +01:00
Kamil Braun
a483915c62 db: system_keyspace: add a virtual table with raft configuration
Add a new virtual table `system.raft_state` that shows the currently
operating Raft configuration for each present group. The schema is the
same as `system.raft_snapshot_config` (the latter shows the config from
the last snapshot). In the future we plan to add more columns to this
table, showing more information (like the current leader and term),
hence the generic name.

Adding the table requires some plumbing of
`sharded<raft_group_registry>&` through function parameters to make it
accessible from `register_virtual_tables`, but it's mostly
straightforward.

Also added some APIs to `raft_group_registry` to list all groups and
find a given group (returning `nullptr` if one isn't found, not throwing
an exception).
2023-01-17 12:28:00 +01:00
Kamil Braun
1eee349a17 test: test_raft_upgrade: remove test_raft_upgrade_with_node_remove
The test would create a scenario where one node was down while the others
started the Raft upgrade procedure. The procedure would get stuck, but
it was possible to `removenode` the downed node using one of the alive
nodes, which would unblock the Raft upgrade procedure.

This worked because:
1. the upgrade procedure starts by ensuring that all peers can be
   contacted,
2. `removenode` starts by removing the node from the token ring.

After removing the node from the token ring, the upgrade procedure
becomes able to contact all peers (the peers set no longer contains the
down node). At the end, after removing the node from the token ring,
`removenode` would actually get stuck for a while, waiting for the
upgrade procedure to finish before removing the peer from group 0.
After the upgrade procedure finished, `removenode` would also finish.
(so: first the upgrade procedure waited for removenode, then removenode
waited for the upgrade procedure).

We want to modify the `removenode` procedure and include a new step
before removing the node from the token ring: making the node a
non-voter. The purpose is to improve the possible failure scenarios.
Previously, if the `removenode` procedure failed after removing the node
from the token ring but before removing it from group 0, the cluster
would contain a 'garbage' group 0 member which is a voter - reducing
group 0's availability. If the node is made a non-voter first, then this
failure will not be as big of a problem, because the leftover group 0
member will be a non-voter.

However, to correctly perform group 0 operations including making
someone a non-voter, we must first wait for the Raft upgrade procedure to
finish (or at least wait until everyone joins group 0). Therefore by
including this 'make the node a non-voter' step at the beginning of
`removenode`, we make it impossible to remove a token ring member in the
middle of the upgrade procedure, on which the test case relied. The test
case would get stuck waiting for the `removenode` operation to finish,
which would never finish because it would wait for the upgrade procedure
to finish, which would not finish because of the dead peer.

We remove the test case; it was "lucky" to pass in the first place. We
have a dedicated mechanism for handling dead peers during Raft upgrade
procedure: the manual Raft group 0 RECOVERY procedure. There are other
test cases in this file which are using that procedure.
2023-01-17 12:28:00 +01:00
Nadav Har'El
5bf94ae220 cql: allow disabling of USING TIMESTAMP sanity checking
As requested by issue #5619, commit 2150c0f7a2
added a sanity check for USING TIMESTAMP - the number specified in the
timestamp must not be more than 3 days into the future (when viewed as
a number of microseconds since the epoch).

This sanity checking helps avoid some annoying client-side bugs and
mis-configurations, but some users genuinely want to use arbitrary
or futuristic-looking timestamps and are hindered by this sanity check
(which Cassandra doesn't have, by the way).

So in this patch we add a new configuration option, restrict_future_timestamp.
If set to "true", futuristic timestamps (more than 3 days into the future)
are forbidden. The "true" setting is the default (as has been the case
since #5619). Setting this option to "false" will allow using any 64-bit
integer as a timestamp, as is allowed in Cassandra (and was allowed in
Scylla prior to #5619).

The error message in the case where a futuristic timestamp is rejected
now mentions the configuration parameter that can be used to disable this
check (this, and the option's name "restrict_*", is similar to other
so-called "safe mode" options).

This patch also includes a test, which works in Scylla and Cassandra,
with either setting of restrict_future_timestamp, checking the right
thing in all these cases (the futuristic timestamp can either be written
and read, or can't be written). I used this test to manually verify that
the new option works, defaults to "true", and when set to "false" Scylla
behaves like Cassandra.
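
As an illustration only (cql-pytest style; `cql` and `table1` stand in for the suite's fixtures and are assumptions here), the behavior being tested is roughly:
```python
# Hypothetical sketch, not the actual test.
far_future = 2**62  # microseconds since the epoch, far beyond "now + 3 days"
stmt = f"INSERT INTO {table1} (p, v) VALUES (1, 2) USING TIMESTAMP {far_future}"
# With restrict_future_timestamp=true (the default) this insert is rejected;
# with restrict_future_timestamp=false it succeeds and the row can be read back.
cql.execute(stmt)
```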

Fixes #12527

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12537
2023-01-16 23:18:56 +02:00
Nadav Har'El
feef3f9dda test/cql-pytest: test more than one restriction on same clustering column
Cassandra refuses a request with more than one relation to the same
clustering column, for example

    DELETE FROM tbl WHERE p = ? and c = ? AND c > ?

It complains that

    c cannot be restricted by more than one relation if it includes an Equal

But it produces different error messages for different operators and
even different operator orders.

Currently, Scylla doesn't consider such requests an error. Whether or
not we should be compatible with Cassandra here is discussed in
issue #12472. But as long as we do accept these queries, we should be
sure we do the right thing: "WHERE c = 1 AND c > 2" should match
nothing, "WHERE c = 1 AND c > 0" should match the matches of c = 1,
and so on. This patch adds a test to verify that these requests indeed
yield correct results. The test is scylla_only because, as explained
above, Cassandra doesn't support these requests at all.

Refs #12472

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12498
2023-01-16 20:41:16 +02:00
Avi Kivity
0b418fa7cf cql3, transport, tests: remove "unset" from value type system
The CQL binary protocol introduced "unset" values in version 4
of the protocol. Unset values can be bound to variables, which
cause certain CQL fragments to be skipped. For example, the
fragment `SET a = :var` will not change the value of `a` if `:var`
is bound to an unset value.

Unsets, however, are very limited in where they can appear. They
can only appear at the top-level of an expression, and any computation
done with them is invalid. For example, `SET list_column = [3, :var]`
is invalid if `:var` is bound to unset.

This causes the code to be littered with checks for unset, and there
are plenty of tests dedicated to catching unsets. However, a simpler
way is possible - prevent the infiltration of unsets at the point of
entry (when evaluating a bind variable expression), and introduce
guards to check for the few cases where unsets are allowed.

This is what this long patch does. It performs the following:

(general)

1. unset is removed from the possible values of cql3::raw_value and
   cql3::raw_value_view.

(external->cql3)

2. query_options is fortified with a vector of booleans,
   unset_bind_variable_vector, where each boolean corresponds to a bind
   variable index and is true when it is unset.
3. To avoid churn, two compatibility structs are introduced:
   cql3::raw_value{,_view}_vector_with_unset, which can be constructed
   from a std::vector<raw_value{,_view}>, which is what most callers
   have. They can also be constructed with explicit unset vectors, for
   the few cases they are needed.

(cql3->variables)

4. query_options::get_value_at() now throws if the requested bind variable
   is unset. This replaces all the throwing checks in expression evaluation
   and statement execution, which are removed.
5. A new query_options::is_unset() is added for the users that can tolerate
   unset; though it is not used directly.
6. A new cql3::unset_operation_guard class guards against unsets. It accepts
   an expression, and can be queried whether an unset is present. Two
   conditions are checked: the expression must be a singleton bind
   variable, and at runtime it must be bound to an unset value.
7. The modification_statement operations are split into two, via two
   new subclasses of cql3::operation. cql3::operation_no_unset_support
   ignores unsets completely. cql3::operation_skip_if_unset checks if
   an operand is unset (luckily all operations have at most one operand that
   tolerates unset) and applies unset_operation_guard to it.
8. The various sites that accept expressions or operations are modified
   to check for should_skip_operation(). These are the loops around
   operations in update_statement and delete_statement, and the checks
   for unset in attributes (LIMIT and PER PARTITION LIMIT).

(tests)

9. Many unset tests are removed. It's now impossible to enter an
   unset value into the expression evaluation machinery (there's
   just no unset value), so it's impossible to test for it.
10. Other unset tests now have to be invoked via bind variables,
   since there's no way to create an unset cql3::expr::constant.
11. Many tests have their exception message match strings relaxed.
   Since unsets are now checked very early, we don't know the context
   where they happen. It would be possible to reintroduce it (by adding
   a format string parameter to cql3::unset_operation_guard), but it
   seems not to be worth the effort. Usage of unsets is rare, and it is
   explicit (at least with the Python driver, an unset cannot be
   introduced by omission).

I tried as an alternative to wrap cql3::raw_value{,_view} (that doesn't
recognize unsets) with cql3::maybe_unset_value (that does), but that
caused huge amounts of churn, so I abandoned that in favor of the
current approach.

Closes #12517
2023-01-16 21:10:56 +02:00
Kamil Braun
7510144fba Merge 'Add replace-node-first-boot option' from Benny Halevy
Allow replacing a node given its Host ID rather than its ip address.

This series adds a replace_node_first_boot option to db/config
and makes use of it in storage_service.

The new option takes priority over the legacy replace_address* options.
When the latter are used, a deprecation warning is printed.

Documentation is updated accordingly.

And a cql unit_test is added.

Ref #12277

Closes #12316

* github.com:scylladb/scylladb:
  docs: document the new replace_node_first_boot option
  dist/docker: support --replace-node-first-boot
  db: config: describe replace_address* options as deprecated
  test: test_topology: test replace using host_id
  test: pylib: ServerInfo: add host_id
  storage_service: get rid of get_replace_address
  storage_service: is_replacing: rely directly on config options
  storage_service: pass replacement_info to run_replace_ops
  storage_service: pass replacement_info to booststrap
  storage_service: join_token_ring: reuse replacement_info.address
  storage_service: replacement_info: add replace address
  init: do not allow cfg.replace_node_first_boot of seed node
  db: config: add replace_node_first_boot option
2023-01-16 15:08:31 +01:00
Botond Dénes
3d9ab1d9eb Merge 'Get recursive tasks' statuses with task manager api call' from Aleksandra Martyniuk
The PR adds an API call that allows getting the statuses of a given
task and all its descendants.

The parent-child tree is traversed in BFS order and the list of
statuses is returned to user.

Closes #12317

* github.com:scylladb/scylladb:
  test: add test checking recursive task status
  api: get task statuses recursively
  api: change retrieve_status signature
2023-01-16 11:44:50 +02:00
Benny Halevy
90faeedb77 test: test_topology: test replace using host_id
Add test cases exercising the --replace-node-first-boot option
by replacing nodes using their host_id rather
than their IP address.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:36:09 +02:00
Benny Halevy
7d0d9e28f1 test: pylib: ServerInfo: add host_id
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-13 18:36:07 +02:00
Kamil Braun
bed555d1e5 db: system_keyspace: rename 'raft_config' to 'raft_snapshot_config'
Make it clear that the table stores the snapshot configuration, which is
not necessarily the currently operating configuration (the last one
appended to the log).

In the future we plan to have a separate virtual table for showing the
currently operating configuration, perhaps we will call it
`system.raft_config`.
2023-01-12 16:21:26 +01:00
Botond Dénes
210738c9ce Merge 'test.py: improve logging' from Kamil Braun
Make it easy to see which clusters are operated on by which tests in which build modes and so on.
Add some additional logs.

These improvements would have saved me a lot of debugging time if I had them last week, and we would have arrived at https://github.com/scylladb/scylladb/pull/12482 much faster.

Closes #12483

* github.com:scylladb/scylladb:
  test.py: harmonize topology logs with test.py format
  test/pylib: additional logging during cluster setup
  test/pylib: prefix cluster/manager logs with the current test name
  test/pylib: pool: pass *args and **kwargs to the build function from get()
  test.py: include mode in ScyllaClusterManager logs
2023-01-11 16:32:56 +02:00
Aleksandra Martyniuk
fcb3f76e78 test: add test checking recursive task status
A REST API test checking whether the task manager API returns recursive tasks'
statuses properly in BFS order.
2023-01-11 12:34:17 +01:00
Konstantin Osipov
f3440240ee test.py: harmonize topology logs with test.py format
We need millisecond resolution in the log to be able to
correlate test log with test.py log and scylla logs. Harmonize
the log format for tests which actively manage scylla servers.
2023-01-11 10:09:42 +01:00
Kamil Braun
79712185d5 test/pylib: additional logging during cluster setup
This would have saved me a lot of debugging time.
2023-01-11 10:09:42 +01:00
Kamil Braun
4f7e5ee963 test/pylib: prefix cluster/manager logs with the current test name
The log file produced by test.py combines logs coming from multiple
concurrent test runs. Each test has its own log file as well, but this
"global" log file is useful when debugging problems with topology tests,
since many events related to managing clusters are stored there.

Make the logs easier to read by including information about the test case
that's currently performing operations such as adding new servers to
clusters and so on. This includes the mode, test run name and the name
of the test case.

We do this by using custom `Logger` objects (instead of calling
`logging.info` etc. which uses the root logger) with `LoggerAdapter`s
that include the prefixes. A bit of boilerplate 'plumbing' through
function parameters is required but it's mostly straightforward.
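
The mechanism boils down to something like this (a sketch; the prefix format follows the example below, the class name is illustrative):
```python
# Hypothetical sketch of the prefixing approach.
import logging

class PrefixAdapter(logging.LoggerAdapter):
    """Prepend '[mode/test_name]' to every message going through this logger."""
    def process(self, msg, kwargs):
        return f"[{self.extra['prefix']}] {msg}", kwargs

base = logging.getLogger("test.py")
cluster_log = PrefixAdapter(base, {"prefix": "dev/topology.test_topology.1"})
cluster_log.info("adding server...")  # -> [dev/topology.test_topology.1] adding server...
```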

This doesn't apply to all events, e.g. boost test cases which don't
setup a "real" Scylla cluster. These events don't have additional
prefixes.

Example:
```

17:41:43.531 INFO> [dev/topology.test_topology.1] Cluster ScyllaCluster(name: 7a414ffc-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(1, 127.40.246.1, 29c4ec73-8912-45ca-ae19-8bfda701a6b5), ScyllaServer(4, 127.40.246.4, 75ae2afe-ff9b-4760-9e19-cd0ed8d052e7), ScyllaServer(7, 127.40.246.7, 67a27df4-be63-4b4c-a70c-aeac0506304f), stopped: ) adding server...
17:41:43.531 INFO> [dev/topology.test_topology.1] installing Scylla server in /home/kbraun/dev/scylladb/testlog/dev/scylla-10...
17:41:43.603 INFO> [dev/topology.test_topology.1] starting server at host 127.40.246.10 in scylla-10...
17:41:43.614 INFO> [dev/topology.test_topology.2] Cluster ScyllaCluster(name: 7a497fce-903c-11ed-bafb-f4d108a9e4a3, running: ScyllaServer(2, 127.40.246.2, f59d3b1d-efbb-4657-b6d5-3fa9e9ef786e), ScyllaServer(5, 127.40.246.5, 9da16633-ce53-4d32-8687-e6b4d27e71eb), ScyllaServer(9, 127.40.246.9, e60c69cd-212d-413b-8678-dfd476d7faf5), stopped: ) adding server...
17:41:43.614 INFO> [dev/topology.test_topology.2] installing Scylla server in /home/kbraun/dev/scylladb/testlog/dev/scylla-11...
17:41:43.670 INFO> [dev/topology.test_topology.2] starting server at host 127.40.246.11 in scylla-11...
```
2023-01-11 10:09:39 +01:00
Kamil Braun
2bda0f9830 test/pylib: pool: pass *args and **kwargs to the build function from get()
This will be used to specify a custom logger when building new clusters
before starting tests, allowing to easily pinpoint which tests are
waiting for clusters to be built and what's happening to these
particular clusters.
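
A sketch of the idea (class and attribute names are illustrative, not the real pool implementation):
```python
# Hypothetical sketch of forwarding get() arguments to the build function.
class Pool:
    def __init__(self, build):
        self._build = build      # async factory used to create new objects
        self._ready = []         # objects returned to the pool

    async def get(self, *args, **kwargs):
        if self._ready:
            return self._ready.pop()
        # Forward the caller's arguments (e.g. a per-test logger) to the factory.
        return await self._build(*args, **kwargs)
```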
2023-01-10 17:41:54 +01:00
Nadav Har'El
0edb090c67 test/cql-pytest: add simple tests for SELECT DISTINCT
This patch adds a few simple functional tests for the SELECT DISTINCT
feature, and how it interacts with other features, especially GROUP BY.

2 of the 5 new tests are marked xfail, and reproduce one old and one
newly-discovered issue:

Refs #5361: LIMIT doesn't work when using GROUP BY (the test here uses
            LIMIT and GROUP BY together with SELECT DISTINCT, so the
            LIMIT isn't honored).

Refs #12479: SELECT DISTINCT doesn't refuse GROUP BY with clustering
             column.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12480
2023-01-10 13:29:26 +02:00
Michał Radwański
dcab289656 boost/mvcc_test: use failure_injecting_allocation_strategy where it is meant to
In test_apply_is_atomic, a basic form of exception testing is used.
There is a failure_injecting_allocation_strategy, which, however, is not
used for any allocation because, for some reason,
`with_allocator(r.allocator()` is used instead of
`with_allocator(alloc`. Fix that.

Closes #12354
2023-01-10 12:01:36 +01:00
Tomasz Grabiec
ebcd736343 cache: Fix undefined behavior when populating with non-full keys
Regression introduced in 23e4c8315.

view_and_holder position_in_partiton::after_key() triggers undefined
behavior when the key was not full because the holder is moved, which invalidates the view.

Fixes #12367

Closes #12447
2023-01-10 12:51:54 +02:00
Kamil Braun
822410c49b test/pylib: scylla_cluster: release IPs when cluster is no longer needed
With sufficiently many test cases we would eventually run out of IP
addresses, because IPs (which are leased from a global host registry)
would only be released at the end of an entire test suite.

In fact we already hit this during next promotions, causing much pain
indeed.

Release IPs when a cluster, after being marked dirty, is stopped and
thrown away.

Closes #12482
2023-01-10 06:59:41 +02:00
Avi Kivity
e71e1dc964 Merge 'tools/scylla-sstable: add lua scripting support' from Botond Dénes
Introduce a new "script" operation, which loads a script from the specified path, then feeds the mutation fragment stream to it. The script can then extract, process and present information from the sstable as it wishes.
For now only Lua scripts are supported, for the simple reason that Lua is easy to write bindings for, it is simple and lightweight, and, more importantly, we already have Lua included in the Scylla binary, as it is used as the implementation language for UDF/UDA. We might consider WASM support in the future, but for now we don't have any WASM language support available.

Example:
```lua
function new_stats(key)
    return {
        partition_key = key,
        total = 0,
        partition = 0,
        static_row = 0,
        clustering_row = 0,
        range_tombstone_change = 0,
    };
end

total_stats = new_stats(nil);

function inc_stat(stats, field)
    stats[field] = stats[field] + 1;
    stats.total = stats.total + 1;
    total_stats[field] = total_stats[field] + 1;
    total_stats.total = total_stats.total + 1;
end

function on_new_sstable(sst)
    max_partition_stats = new_stats(nil);
    if sst then
        current_sst_filename = sst.filename;
    else
        current_sst_filename = nil;
    end
end

function consume_partition_start(ps)
    current_partition_stats = new_stats(ps.key);
    inc_stat(current_partition_stats, "partition");
end

function consume_static_row(sr)
    inc_stat(current_partition_stats, "static_row");
end

function consume_clustering_row(cr)
    inc_stat(current_partition_stats, "clustering_row");
end

function consume_range_tombstone_change(crt)
    inc_stat(current_partition_stats, "range_tombstone_change");
end

function consume_partition_end()
    if current_partition_stats.total > max_partition_stats.total then
        max_partition_stats = current_partition_stats;
    end
end

function on_end_of_sstable()
    if current_sst_filename then
        print(string.format("Stats for sstable %s:", current_sst_filename));
    else
        print("Stats for stream:");
    end
    print(string.format("\t%d fragments in %d partitions - %d static rows, %d clustering rows and %d range tombstone changes",
        total_stats.total,
        total_stats.partition,
        total_stats.static_row,
        total_stats.clustering_row,
        total_stats.range_tombstone_change));
    print(string.format("\tPartition with max number of fragments (%d): %s - %d static rows, %d clustering rows and %d range tombstone changes",
        max_partition_stats.total,
        max_partition_stats.partition_key,
        max_partition_stats.static_row,
        max_partition_stats.clustering_row,
        max_partition_stats.range_tombstone_change));
end
```
Running this script will yield the following:
```
$ scylla sstable script --script-file fragment-stats.lua --system-schema system_schema.columns /var/lib/scylla/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/me-1-big-Data.db
Stats for sstable /var/lib/scylla/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f//me-1-big-Data.db:
        397 fragments in 7 partitions - 0 static rows, 362 clustering rows and 28 range tombstone changes
        Partition with max number of fragments (180): system - 0 static rows, 179 clustering rows and 0 range tombstone changes
```

Fixes: https://github.com/scylladb/scylladb/issues/9679

Closes #11649

* github.com:scylladb/scylladb:
  tools/scylla-sstable: consume_reader(): improve pause heuristincs
  test/cql-pytest/test_tools.py: add test for scylla-sstable script
  tools: add scylla-sstable-scripts directory
  tools/scylla-sstable: remove custom operation
  tools/scylla-sstable: add script operation
  tools/sstable: introduce the Lua sstable consumer
  dht/i_partitioner.hh: ring_position_ext: add weight() accessor
  lang/lua: export Scylla <-> lua type conversion methods
  lang/lua: use correct lib name for string lib
  lang/lua: fix type in aligned_used_data (meant to be user_data)
  lang/lua: use lua_State* in Scylla type <-> Lua type conversions
  tools/sstable_consumer: more consistent method naming
  tools/scylla-sstable: extract sstable_consumer interface into own header
  tools/json_writer: add accessor to underlying writer
  tools/scylla-sstable: fix indentation
  tools/scylla-sstable: export mutation_fragment_json_writer declaration
  tools/scylla-sstable: mutation_fragment_json_writer un-implement sstable_consumer
  tools/scylla-sstable: extract json writing logic from json_dumper
  tools/scylla-sstable: extract json_writer into its own header
  tools/scylla-sstable: use json_writer::DataKey() to write all keys
  tools/scylla-types: fix use-after-free on main lambda captures
2023-01-09 20:54:42 +02:00
Raphael S. Carvalho
05ffb024bb replica: Kill table::calculate_shard_from_sstable_generation()
Inferring the shard from the generation is long gone. We still use it in
some scripts, but it's no longer needed in Scylla when loading
SSTables, and it also conflicts with the ongoing work on UUID-based
generations.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #12476
2023-01-09 20:17:57 +02:00
Nadav Har'El
d6e6820f33 Merge 'Drop support for cql binary protocols versions 1 and 2' from Avi Kivity
The CQL binary protocol version 3 was introduced in 2014. All Scylla
versions support it, as do Cassandra versions 2.1 and newer.

Versions 1 and 2 have 16-bit collection sizes, while protocol 3 and newer
use 32-bit collection sizes.

Unfortunately, we implemented support for multiple serialization formats
very intrusively, by pushing the format everywhere. This avoids the need
to re-serialize (sometimes) but is quite obnoxious. It's also likely to be
broken, since it's almost untested and it's too easy to write
cql_serialization_format::internal() instead of propagating the client
specified value.

Since protocols 1 and 2 have been obsolete for 9 years, just drop them. It's
easy to verify that they are no longer in use on a running system by
examining the `system.clients` table before upgrade.

Fixes #10607

Closes #12432

* github.com:scylladb/scylladb:
  treewide: drop cql_serialization_format
  cql: modification_statement: drop protocol check for LWT
  transport: drop cql protocol versions 1 and 2
2023-01-09 18:52:41 +02:00
Botond Dénes
1d222220e0 test/cql-pytest/test_tools.py: add test for scylla-sstable script
To test the script operation, we use some of the example scripts from
the example directory. Namely, dump.lua and slice.lua. These two scripts
together have a very good coverage of the entire script API. Testing
their functionality therefore also provides a good coverage of the lua
bindings. A further advantage is that, since both scripts dump output in
a format identical to that of the data-dump operation, it is trivial to do
a comparison against this already tested operation.
A targeted test is written for the sstable skip functionality of the
consumer API.
2023-01-09 09:46:57 -05:00
Tomasz Grabiec
f97268d8f2 row_cache: Fix violation of the "oldest version are evicted first" when evicting last dummy
Consider the following MVCC state of a partition:

   v2: ==== <7> [entry2] ==== <9> ===== <last dummy>
   v1: ================================ <last dummy> [entry1]

Where === means a continuous range and --- means a discontinuous range.

After two LRU items are evicted (entry1 and entry2), we will end up with:

   v2: ---------------------- <9> ===== <last dummy>
   v1: ================================ <last dummy> [entry1]

This will cause readers to incorrectly think there are no rows before
entry <9>, because the range is continuous in v1, and continuity of a
snapshot is a union of continuous intervals in all versions. The
cursor will see the interval before <9> as continuous and the reader
will produce no rows.

This is only temporary, because current MVCC merging rules are such
that the flag on the latest entry wins, so we'll end up with this once
v1 is no longer needed:

   v2: ---------------------- <9> ===== <last dummy>

...and the reader will go to sstables to fetch the evicted rows before
entry <9>, as expected.

The bug is in rows_entry::on_evicted(), which treats the last dummy
entry in a special way, and doesn't evict it, and doesn't clear the
continuity by omission.

The situation is not easy to trigger because it requires a certain
eviction pattern concurrent with multiple reads of the same partition
in different versions, i.e. across memtable flushes.

Closes #12452
2023-01-09 16:10:52 +02:00
Nadav Har'El
2d845b6244 test/cql-pytest: a test for more than one equality in WHERE
Cassandra refuses a request with more than one equality relation to the
same column, for example

    DELETE FROM tbl WHERE partitionKey = ? AND partitionKey = ?

It complains that

    partitionkey cannot be restricted by more than one relation if it
    includes an Equal

Currently, Scylla doesn't consider such requests an error. Whether or
not we should be compatible with Cassandra here is discussed in
issue #12472. But as long as we do accept this query, we should be
sure we do the right thing: "WHERE p = 1 AND p = 2" should match
nothing (not the first or last value being tested), and "WHERE p = 1
AND p = 1" should match the matches of p = 1. This patch adds a test
to verify that these requests indeed yield correct results. The
test is scylla_only because, as explained above, Cassandra doesn't
support this feature at all.
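
A hedged sketch of what such a test looks like (cql-pytest style; the fixtures and schema are illustrative):
```python
# Hypothetical sketch; 'cql' and 'table1' (with int partition key 'p') stand in
# for the suite's fixtures.
def test_multiple_equalities_on_same_column(cql, table1):
    cql.execute(f"INSERT INTO {table1} (p, v) VALUES (1, 10)")
    # Contradictory equalities should match nothing...
    assert list(cql.execute(f"SELECT * FROM {table1} WHERE p = 1 AND p = 2")) == []
    # ...while repeating the same equality matches exactly what a single copy does.
    assert len(list(cql.execute(f"SELECT * FROM {table1} WHERE p = 1 AND p = 1"))) == 1
```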

Refs #12472

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12473
2023-01-09 11:56:39 +02:00
Alejo Sanchez
d632e1aa7a test/pytest: add missing import, remove unused import
Add the missing `time` import and remove an unused name import.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>

Closes #12446
2023-01-08 17:38:46 +02:00
Avi Kivity
5ffe4fee6d Merge 'Remove legacy half reverse' from Michał Radwański
This commit removes consume_in_reverse::legacy_half_reverse, an option
once used to indicate that the given key ranges are sorted descending,
based on the clustering key of the start of the range, and that the
range tombstones inside a partition would be sorted (descending, as all
the mutation fragments would) according to their end (but range
tombstones would still be stored according to their start bound).

As it turns out, mutation::consume, when called with legacy_half_reverse
option produces invalid fragment stream, one where all the row
tombstone changes come after all the clustering rows. This was not an
issue, since when constructing results from the query, Scylla would not
pass the tombstones to the client, but instead compact data beforehand.

In this commit, the consume_in_reverse::legacy_half_reverse is removed,
along with all the uses.

As for the swap out in mutation_partition.cc in query_mutation and
to_data_query_result:

The downstream was not prepared to deal with legacy_half_reverse.
mutation::consume contains

```
     if (reverse == consume_in_reverse::yes) {
         while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::yes>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
             co_await yield();
        }
     } else {
         while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::no>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
             co_await yield();
         }
     }
```

So why did it work at all? to_data_query_result deals with a single slice.
The used consumer (compact_for_query_v2) compacts away the range tombstone
changes, and thus the only difference between consume_in_reverse::no
and consume_in_reverse::yes was that one was ordered increasing with respect
to clustering keys and the second one was ordered decreasing. This property is
maintained if
we swap out for the consume_in_reverse::yes format.

Refs: #12353

Closes #12453

* github.com:scylladb/scylladb:
  mutation{,_consumer,_partition}: remove consume_in_reverse::legacy_half_reverse
  mutation_partition_view: treat query::partition_slice::option::reversed in to_data_query_result as consume_in_reverse::yes
  mutation: move consume_in_reverse def to mutation_consumer.hh
2023-01-08 15:42:00 +02:00
Botond Dénes
c4688563e3 sstables: track decompressed buffers
Convert decompressed temporary buffers into tracked buffers just before
returning them to the upper layer. This ensures these buffers are known
to the reader concurrency semaphore, so it has an accurate view of the
actual memory consumption of reads.

Fixes: #12448

Closes #12454
2023-01-08 15:34:28 +02:00
Kamil Braun
b77df84543 test: test_topology: make test_nodes_with_different_smp less hacky
The test would use a trick to start a separate Scylla cluster from the
one provided originally by the test framework. This is not supported by
the test framework and may cause unexpected problems.

Change the test to perform regular node operations. Instead of starting
a fresh cluster of 3 nodes, we join the first of these nodes to the
original framework-provided cluster, then decommission the original
nodes, then bootstrap the other 2 fresh nodes.

Also add some logging to the test.

Refs: #12438, #12442

Closes #12457
2023-01-08 15:33:17 +02:00
Avi Kivity
02c9968e73 Merge 'Add WASM UDF implementation in Rust' from Wojciech Mitros
This series adds the implementation and usage of rust wasmtime bindings.

The WASM UDFs introduced by this patch are interruptable and use memory allocated using the seastar allocator.

This series includes #11102 (the first two commits) because #11102 required disabling wasm UDFs completely. This patch disables them in the middle of the series, and enables them again at the end.
After this patch, `libwasmtime.a` can be removed from the toolchain.
This patch also removes the workaround for https://github.com/scylladb/scylladb/issues/9387 but it hasn't been tested with ARM yet - if the ARM test causes issues I'll revert this part of the change.

Closes #11351

* github.com:scylladb/scylladb:
  build: remove references to unused c bindings of wasmtime
  test: assert that WASM allocations can fail without crashing
  wasm: limit memory allocated using mmap
  wasm: add configuration options for instance cache and udf execution
  test: check that wasmtime functions yield
  wasm: use the new rust bindings of wasmtime
  rust: add Wasmtime bindings
  rust: add build profiles more aligned with ninja modes
  rust: adjust build according to cxxbridge's recommendations
  tools: toolchain: dbuild: prepare for sharing cargo cache
2023-01-08 15:31:09 +02:00
Nadav Har'El
f5cda3cfc3 test/cql-pytest: add more tests for "timestamp" column type
In issue #3668, a discussion spanning several years theorized that several
things are wrong with the "timestamp" type. This patch begins by adding
several tests that demonstrate that Scylla is in fact behaving correctly,
and mostly identically to Cassandra, except for one esoteric error handling
case.

However, after eliminating the red herrings, we are left with the real
issue that prompted opening #3668, which is a duplicate of issues #2693
and #2694, and this patch also adds a reproducer for that. The issue is
that Cassandra 4 added support for arithmetic expressions on values,
and durations can be added to (or subtracted from) timestamps, for example:

        '2011-02-03 04:05:12.345+0000' - 1d

is a valid timestamp - and we don't currently support this syntax.
So the new test - which passes on Cassandra 4 and fails on Scylla
(or Cassandra 3) - is marked xfail.

Refs #2693
Refs #2694

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes #12436
2023-01-08 15:00:49 +02:00
Wojciech Mitros
996a942e05 test: assert that WASM allocations can fail without crashing
The main source of big allocations in the WASM UDF implementation
is the WASM Linear Memory. We do not want Scylla to crash even if
a memory allocation for the WASM Memory fails, so we assert that
an exception is thrown instead.

The wasmtime runtime does not actually fail on an allocation failure
(assuming the memory allocator does not abort and returns nullptr
instead - which our seastar allocator does). What happens then
depends on the failed allocation handling of the code that was
compiled to WASM. If the original code threw an exception or aborted,
the resulting WASM code will trap. To make sure that we can handle
the trap, we need to allow wasmtime to handle SIGILL signals, because
that is what is used to carry information about WASM traps.

The new test uses a special WASM Memory allocator that fails after
n allocations, and the allocations include both memory growth
instructions in WASM, as well as growing memory manually using the
wasmtime API.

Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
2023-01-06 14:07:29 +01:00
Wojciech Mitros
f05d612da8 wasm: limit memory allocated using mmap
The wasmtime runtime allocates memory for the executable code of
the WASM programs using mmap and not the seastar allocator. As
a result, the memory that Scylla actually uses becomes not only
the memory preallocated for the seastar allocator but the sum of
that and the memory allocated for executable codes by the WASM
runtime.
To keep limiting the memory used by Scylla, we measure how much
memory the WASM programs use, and if they use too much, compiled
WASM UDFs (modules) that are currently not in use are evicted to
make room.
To evict a module it is required to evict all instances of this
module (the underlying implementation of modules and instances uses
shared pointers to the executable code). For this reason, we add
reference counts to modules. Each instance using a module is a
reference. When an instance is destroyed, a reference is removed.
If all references to a module are removed, the executable code
for this module is deallocated.
The eviction of a module is actually achieved by evicting all
its references. When we want to free memory for a new module we
repeatedly evict instances from the wasm_instance_cache using its
LRU strategy until some module loses all its instances. This
process may not succeed if the instances currently in use (so not
in the cache) use too much memory - in this case the query also
fails. Otherwise the new module is added to the tracking system.
This strategy may evict some instances unnecessarily, but evicting
modules should not happen frequently, and any more efficient
solution requires an even bigger intervention into the code.
2023-01-06 14:07:29 +01:00
Wojciech Mitros
b8d28a95bf wasm: add configuration options for instance cache and udf execution
Different users may require different limits for their UDFs. This
patch allows them to configure the size of their WASM instance cache,
the maximum size of individual instances stored in the cache, the
time after which the instances are evicted, the fuel that all WASM
UDFs are allowed to consume before yielding (to control
latency), the fuel that WASM UDFs are allowed to consume in total
(to allow performing longer computations in the UDF without
detecting an infinite loop), and the hard limit on the size of UDFs
that are executed (to avoid large allocations).
2023-01-06 14:07:27 +01:00
Wojciech Mitros
3214f5c2db test: check that wasmtime functions yield
The new implementation for WASM UDFs allows executing the UDFs
in pieces. This commit adds a test asserting that the UDF is in fact
divided and that each of the execution segments takes no longer than
1ms.
2023-01-06 14:05:53 +01:00
Michał Radwański
1fbf433966 mutation{,_consumer,_partition}: remove consume_in_reverse::legacy_half_reverse
This commit removes consume_in_reverse::legacy_half_reverse, an option
once used to indicate that the given key ranges are sorted descending,
based on the clustering key of the start of the range, and that the
range tombstones inside a partition would be sorted (descending, as all
the mutation fragments would) according to their end (but range
tombstones would still be stored according to their start bound).

As it turns out, mutation::consume, when called with the legacy_half_reverse
option, produces an invalid fragment stream, one where all the row
tombstone changes come after all the clustering rows. This was not an
issue, since when constructing results from the query, Scylla would not
pass the tombstones to the client, but instead compact data beforehand.

In this commit, the consume_in_reverse::legacy_half_reverse is removed,
along with all the uses.

As for the swap out in mutation_partition.cc in query_mutation and
to_data_query_result:

The downstream was not prepared to deal with legacy_half_reverse.
mutation::consume contains

```
     if (reverse == consume_in_reverse::yes) {
         while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::yes>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
             co_await yield();
        }
     } else {
         while (!(stop_opt = consume_clustering_fragments<consume_in_reverse::no>(_ptr->_schema, partition, consumer, cookie, is_preemptible::yes))) {
             co_await yield();
         }
     }
```

So why did it work at all? to_data_query_result deals with a single slice.
The used consumer (compact_for_query_v2) compacts away the range tombstone
changes, and thus the only difference between consume_in_reverse::no
and consume_in_reverse::yes was that one was ordered increasing with respect
to clustering keys and the second one was ordered decreasing. This property is
maintained if
we swap out for the consume_in_reverse::yes format.
2023-01-05 18:48:55 +01:00
Kamil Braun
09da661eeb Merge 'raft: replace experimental raft option with dedicated flag' from Gleb Natapov
Unlike other experimental features, we want Raft to be opt-in even
after it leaves experimental mode. For that we need to have a separate
option to enable it. The patch adds the binary option
"consistent-cluster-management" for that.

* 'consistent-cluster-management-flag' of github.com:scylladb/scylla-dev:
  raft: replace experimental raft option with dedicated flag
  main: move supervisor notification about group registry start where it actually starts
2023-01-05 15:21:35 +01:00
Kamil Braun
4268b1bbc2 Merge 'raft: raft_group0, register RPC verbs on all shards' from Gusev Petr
raft_group0 used to register RPC verbs only on shard 0. This worked on
clusters with the same --smp setting on all nodes, since RPCs in this
case are processed on the same shard as the calling code, and
raft_group0 methods only run on shard 0.

A new test test_nodes_with_different_smp was added to identify the
problem. Since --smp can only be specified via the command line, a
corresponding parameter was added to the ManagerClient.server_add
method. It allows overriding the default parameters set by the
SCYLLA_CMDLINE_OPTIONS variable by changing, adding or deleting
individual items.
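
For illustration only (the exact parameter name is not shown in this log, so treat it as an assumption):
```python
# Hypothetical usage; the real ManagerClient.server_add signature may differ.
server = await manager.server_add(cmdline=["--smp", "3"])
```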

Fixes: #12252

Closes #12374

* github.com:scylladb/scylladb:
  raft: raft_group0, register RPC verbs on all shards
  raft: raft_append_entries, copy entries to the target shard
  test.py, allow to specify the node's command line in test
2023-01-04 11:11:21 +01:00
Michał Jadwiszczak
83bb77b8bb test/boost/cql_query_test: enable parallelized_aggregation
Run tests for parallelized aggregation with
`enable_parallelized_aggregation` always set to true, so the tests work
even if the default value of the option is false.

Closes #12409
2023-01-04 10:11:25 +02:00
Avi Kivity
2739ac66ed treewide: drop cql_serialization_format
Now that we don't accept cql protocol version 1 or 2, we can
drop cql_serialization_format everywhere, except in the IDL
(since it's part of the inter-node protocol).

A few functions had duplicate versions, one with and one without
a cql_serialization_format parameter. They are deduplicated.

Care is taken that `partition_slice`, which communicates
the cql_serialization_format across nodes, still presents
a valid cql_serialization_format to other nodes when
transmitting itself, and rejects protocol 1 and 2 serialization
format when receiving. The IDL is unchanged.

One test checking the 16-bit serialization format is removed.
2023-01-03 19:54:13 +02:00