No longer requiring these to be extracted from the sstable dir path.
In practice this allows opening la/mx sstables located at
non-standard paths. This will be used by the `scylla-sstable` tool,
which needs to be flexible about where the sstables it opens are
located.
sstables_manager supersedes the previous sstables_tracker
implementation for tracking the lifetime of tables. Update
scylla-gdb.py to use sstables_manager in a backwards-compatible way,
as sstables_manager is not available in Scylla Enterprise 2020.1.
Add an explicit test for the "scylla sstables" command, as previously
only "scylla active-sstables" was tested.
Closes #9439
pytest supports - if the "repeat" extension is installed - a convenient
and efficient way to repeat the same test (or all of them) multiple times.
Since it's very useful, let's document it in cql-pytest/README.md.
By the way, our test.py also has a "--repeat" option, but it can only run
all cql-pytest tests, not just repeat a single small test, and it is also
slower (and arguably, different) because it restarts Scylla instead of
running a test 100 times on the same Scylla.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211007122146.624210-1-nyh@scylladb.com>
The test generates random mutations and eliminates mutations whose keys
tokenize to 0, in particular it eliminates mutations with empty
partition keys (which should not end up in sstables).
However it would do that after using the randomly generated mutations to
create their reversed versions. So the reversed versions of mutations
with empty partition keys would stay.
Fix by placing the workaround earlier in the test.
Closes #9447
rewrite_sstables() can be asked to stop either on shutdown or on a
user-triggered compaction which forces all ongoing compactions to
stop, like scrub.
It turns out we weren't actually bailing out of do_until() when the
task cannot proceed, so rewrite_sstables() could run into an infinite
loop, which in turn causes shutdown (or anything else waiting on it)
to hang forever.
Found this while auditing the code.
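The failure mode can be modeled in a few lines of Python (a toy
sketch with made-up names, not Scylla's actual code):

```python
class StopRequested(Exception):
    """Raised when an ongoing rewrite is asked to stop."""

def do_until(done, action):
    """Toy model of seastar::do_until: call action() until done()."""
    while not done():
        action()

def rewrite_sstables(sstables, task):
    """Rewrite sstables one by one, bailing out on a stop request."""
    rewritten = []
    def step():
        if task["stopping"]:
            # Without this explicit bail-out, do_until() spins forever
            # once the task can no longer make progress.
            raise StopRequested()
        rewritten.append(sstables.pop())
    do_until(lambda: not sstables, step)
    return rewritten
```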
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211005233601.155442-1-raphaelsc@scylladb.com>
I stumbled upon this failure in dev mode:
```
test/boost/sstable_conforms_to_mutation_source_test.cc(0): Entering test case "test_sstable_reversing_reader_random_schema"
sstable_conforms_to_mutation_source_test: ./seastar/src/core/fstream.cc:205: virtual seastar::file_data_source_impl::~file_data_source_impl(): Assertion `_reads_in_progress == 0' failed.
Aborting on shard 0.
```
Since dev mode has no debug symbols, I can't decode the stack trace,
so I'm not 100% sure about the root cause, and I couldn't reproduce
it in release or debug modes yet.
One vulnerability in the current code is that r1 won't be closed if
an exception is thrown before r1 and r2 are moved to
`compare_readers`, so this change adds a deferred close of r1 for
this case.
Test: sstable_conforms_to_mutation_source_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211006144009.696412-1-bhalevy@scylladb.com>
This is a translation of Cassandra's CQL unit test source file
validation/operations/SelectOrderByTest.java into our cql-pytest
framework.
This test file includes 17 tests for various features and corners of
SELECT's "ORDER BY" feature. All these tests pass on Cassandra, but
three fail on Scylla and are marked as xfail:
One previously-unknown Scylla bug:
Refs #9435: SELECT with IN, ORDER BY and function call does not obey
the ORDER BY
And two new reproducers for already known bugs:
Refs #2247: ORDER BY should allow skipping equality-restricted
clustering columns
Refs #7751: Allow selecting map values and set elements, like in
Cassandra 4.0
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211005174140.571056-1-nyh@scylladb.com>
Introduce new syntax in IDL compiler to allow generating
registration/sending code for RPC verbs:
```
verb [[attr1, attr2, ...]] my_verb(args...) -> return_type;
```
`my_verb` RPC verb declaration corresponds to the
`netw::messaging_verb::MY_VERB` enumeration value to identify the
new RPC verb.
For a given `idl_module.idl.hh` file, a registrator class named
`idl_module_rpc_verbs` will be created if there are any RPC verbs
registered within the IDL module file.
These are the methods being created for each RPC verb:
```
static void register_my_verb(netw::messaging_service* ms, std::function<return_type(args...)>&&);
static future<> unregister_my_verb(netw::messaging_service* ms);
static future<> send_my_verb(netw::messaging_service* ms, netw::msg_addr id, args...);
```
Each method accepts a pointer to an instance of `messaging_service`
object, which contains the underlying seastar RPC protocol
implementation, that is used to register verbs and pass messages.
There is also a method to unregister all verbs at once:
```
static future<> unregister(netw::messaging_service* ms);
```
The following attributes are supported when declaring an RPC verb
in the IDL:
* `[[with_client_info]]` - the handler will contain a const reference to
an `rpc::client_info` as the first argument.
* `[[with_timeout]]` - an additional `time_point` parameter is supplied
to the handler function and `send*` method uses `send_message_*_timeout`
variant of internal function to actually send the message.
* `[[one_way]]` - the handler function is annotated by
`future<rpc::no_wait_type>` return type to designate that a client
doesn't need to wait for an answer.
The `-> return_type` clause is optional for two-way messages. If omitted,
the return type is set to be `future<>`.
For one-way verbs, the use of a return clause is prohibited, and the
generated `send*` function always returns `future<>`.
No existing code is affected.
Ref: #1456
Closes #9359
* github.com:scylladb/scylla:
idl: support generating boilerplate code for RPC verbs
idl: allow specifying multiple attributes in the grammar
message: messaging_service: extract RPC protocol details and helpers into a separate header
Let's major compact the smallest tables first, increasing chances of
success if low on disk space.
parallel_for_each() didn't have any effect on space requirement as
compaction_manager serializes major compaction in a shard.
As parallel_for_each() is no longer used, find_column_family() is now
used before each compact_all_sstables() to avoid a race with table drop.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211005135257.31931-1-raphaelsc@scylladb.com>
Ever since we started testing Alternator with tests written in Python
and using Amazon's "boto3" library, one limitation kept annoying us:
Boto3 verifies the validity of the request parameters before passing
them on to the server. It verifies that mandatory parameters are not
missing, that parameters have the right types, and sometimes even the
right ranges - all in the library before ever sending the request.
This meant that in many cases, we couldn't get good test coverage for
Alternator's server-side handling of *wrong* parameters.
As it turns out, it is trivial to tell boto3 to *not* do its client-side
request validation, with the `parameter_validation=False` config flag.
We just never noticed that such a flag existed :-)
So this patch adds this flag. It then fixes a few tests which expected
ParameterValidationError - this error is the client-side validation
failure, but should now be replaced by checking the server-side error.
The patch also adds a couple of invalid parameter checks that we
couldn't do before because of boto3's eagerness to check them on the
client side. We can add a lot more of these error tests in the future,
now that we got rid of client-side validation.
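For reference, a sketch of the flag in use (the endpoint URL, region,
and credentials here are placeholders for a local Alternator/DynamoDB
endpoint):

```python
import boto3
from botocore.config import Config

# parameter_validation=False makes boto3 send requests as-is, letting
# the server - not the client library - reject invalid parameters.
no_validation = Config(parameter_validation=False)
dynamodb = boto3.resource('dynamodb',
                          endpoint_url='http://localhost:8000',  # placeholder
                          region_name='us-east-1',               # placeholder
                          aws_access_key_id='fake',              # placeholder
                          aws_secret_access_key='fake',          # placeholder
                          config=no_validation)
```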
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211005095514.537226-1-nyh@scylladb.com>
This patch adds yet another test for Alternator's unimplemented feature
of adding a GSI to an already existing table (issue #5022), but this
test is for a very specific corner case - tables which contain string
attributes with an empty value - the corner case described in
issue #9424:
DynamoDB used to forbid any string attributes from being set to an empty
string, but this changed in May 2020, and since then empty strings are
allowed - but NOT as keys. So although it is legal to set a string
attribute to an empty string, if this table has a GSI whose key is that
specific attribute, the update command is refused. We already had a
test for this - test_gsi_empty_value.
However, the case in this patch is the case where a GSI is added to a
table *after* the table already has data. In this case (as this test
demonstrates), we are supposed to drop the items which have the empty
string key from the GSI.
Even when #5022 (the ability to add GSIs to existing tables) is done,
this test will continue to fail. The unique problem of this test is
that Scylla's materialized views *do* allow empty strings as
clustering keys (right now) and even partition keys (once #9375 is
solved), while we don't want them to enter the GSI. We will probably
need to add to the view's filter, which right now contains (as
required) "x IS NOT NULL", also the filter "x != ''" (when x's type
is string or binary) so that items with empty-string keys will be
dropped.
Refs #5022
Refs #9375
Refs #9424
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211003170636.477582-1-nyh@scylladb.com>
DynamoDB has a rather baroque definition of numbers, and in particular
it does *not* allow numeric attributes to be set to infinity or NaN.
Although I did manually check invalid numbers in the past, I was
never able to write a unit test for this - because the boto3 library
catches such errors on the client side, and prevents the test from
sending broken requests to the server. So in this patch, I finally came up
with a solution - a context manager client_no_transform() which
yields a client which does NOT do any transformation or validation
on the request's parameters, allowing us to use boto3 to create
improper requests - and test the server's handling of them.
The test in this patch passes - it did not discover a new bug, but
it is a useful regression test and the client_no_transform() trick
can be used in more error-case tests which until now we were unable
to write.
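The general shape of such a context manager is the standard
save/replace/restore idiom shown below (a generic sketch; the real
client_no_transform targets boto3's internal hooks, which are not
reproduced here):

```python
from contextlib import contextmanager

@contextmanager
def patched(obj, attr, replacement):
    """Temporarily replace obj.attr, restoring the original on exit."""
    saved = getattr(obj, attr)
    setattr(obj, attr, replacement)
    try:
        yield obj
    finally:
        setattr(obj, attr, saved)
```

client_no_transform() applies this idea to the machinery boto3 uses
to transform and validate request parameters, so that within the
`with` block requests reach the server untouched.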
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211004161809.520236-1-nyh@scylladb.com>
This allows us to forward-declare raw_selector, which in turn reduces
indirect inclusions of expression.hh from 147 to 58, reducing rebuilds
when anything in that area changes.
Includes that were lost due to the change are restored in individual
translation units.
Closes #9434
base64.hh pulls in the huge rjson.hh, so if someone just wants
a base64 codec they have to pull in the entire rapidjson library.
Move the json related parts of base64.hh to rjson.hh and adjust
includes and namespaces.
In practice it doesn't make much difference, as all users of base64
appear to want json too. But it's cleaner not to mix the two.
Closes #9433
"
There's a run_mutation_source_tests lib helper that runs a bunch of
tests sequentially. The problem is that it applies 4 different
flavours of them, each being a certain decoration over the provided
reader. This amplification makes some test cases run for an enormous
amount of time without any chance of parallelism.
The simplest way to help running those cases in parallel is to teach
the slowest ones to run the different flavours of mutation source
tests in dedicated cases. This patch makes it so.
The resulting timings are:
                                    dev    debug
  sequential run:                   2m1s   53m50s
  --parallel-cases (+ this patch):  1m3s   31m15s
tests: unit(dev, debug)
"
* 'br-parallel-mutation-source-tests' of https://github.com/xemul/scylla:
test: Split multishard combining reader case
test: Split database test case
test: Split run_mutation_source_tests
It's possible that the server drops the snapshot in the same iteration
of `io_fiber` loop as it tries to send it (the sending of messages
happens after snapshot dropping). Handle this case by throwing an
exception.
As a preparation we also fix the code in `server_impl::send_snapshot` so
it works correctly when `rpc::send_snapshot` throws or returns a ready
future.
Refs #9407.
* kbr/snapshot-handle-errors:
test: raft: randomized_nemesis_test: remove an obsolete comment
test: raft: randomized_nemesis_test: handle missing snapshot in `rpc::send_snapshot`
raft: server: handle `rpc::send_snapshot` returning instantly
It's possible that the server drops the snapshot in the same iteration
of `io_fiber` loop as it tries to send it (the sending of messages
happens after snapshot dropping). Handle this case.
Refs #9407.
If `rpc::send_snapshot` returned immediately with a ready future, or if
it threw, the code in `server_impl::send_snapshot` would not update
`_snapshot_transfers` correctly.
The code assumed that the continuation attached to `rpc::send_snapshot`
(with `then_wrapped`) was executed after `_snapshot_transfer` below the
`rpc::send_snapshot` call was updated. That would not necessarily be true
(the continuation may even not have been executed at all if
`rpc::send_snapshot` threw).
Fix that by wrapping the `rpc::send_snapshot` call into a continuation
attached to `later()`.
Originally authored by Gleb <gleb@scylladb.com>, I added a comment.
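The hazard can be modeled with a tiny future type in which
continuations attached to an already-ready future run inline, as
Seastar continuations can (Python toy code with made-up names; the
'fixed' variant here simply registers the transfer before attaching
the continuation, while the actual patch achieves the same ordering
by attaching the continuation to `later()`):

```python
class ToyFuture:
    """Continuations attached after resolution run inline."""
    def __init__(self):
        self._done = False
        self._callbacks = []

    def resolve(self):
        self._done = True
        for cb in self._callbacks:
            cb()

    def then(self, cb):
        if self._done:
            cb()   # ready future: runs before then() even returns
        else:
            self._callbacks.append(cb)
        return self

def send_snapshot(transfers, server_id, rpc_send, fixed):
    fut = rpc_send()
    cleanup = lambda: transfers.pop(server_id, None)
    if fixed:
        # register first, then attach the continuation
        transfers[server_id] = fut
        fut.then(cleanup)
    else:
        # buggy: if rpc_send() returned a ready future, cleanup() runs
        # right here, before the transfer is registered below, so the
        # stale entry is never removed
        fut.then(cleanup)
        transfers[server_id] = fut
```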
All the cases in this test also run mutation source tests, and the
case with a single-fragment buffer takes several times longer to
execute than the others.
Splitting this single case so that it runs the mutation source test
flavours in different cases improves the test parallelism.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test_database_with_data_in_sstables_is_a_mutation_source case
runs the mutation source tests in one go. The problem is that on each
step a whole new ks:cf is created, which takes the majority of the
test's time. At the end of the day this case is the slowest one in
the suite, being up to two times longer (depending on mode) than the
#2 on this list.
This patch splits the case into 4 so that each mutation source
flavour runs in a separate case.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are 4 flavours of mutation source tests that are all run
sequentially -- plain, reversed, and the upgrade/downgrade ones that
check v1<->v2 conversions.
This patch splits them all into individual calls so that tests may
have dedicated cases for each flavour. "By default" they are all run
as before.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
set_local_host_id() accepts a UUID reference and saves it in the
local keyspace and in all shards' local caches. Before the function
was coroutinized, the UUID was copied into the captures and survived;
after coroutinization it remains a reference. The problem is that
callers pass local variables as arguments, and those go away "really
soon".
Fix it by accepting the UUID by value; it's short enough to copy
safely and cheaply.
fixes: #9425
tests: dtest.ReplaceAddress_rbo_enabled.replace_node_diff_ip(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211004145421.32137-1-xemul@scylladb.com>
The previous check would find a leader once and assume that it does not
change, and that the first attempt at sending a request to this leader
succeeds. In reality the leader may change at the end of the test (e.g.
it may be in the middle of stepping down when we find it) and in general
it may take some time for the cluster to stabilize. The new check tries
a few times to find a leader and perform a request - until a time limit
is reached.
* kbr/nemesis-liveness-check:
test: raft: randomized_nemesis_test: better liveness check at the end of generator test
test: raft: randomized_nemesis_test: take `time_point` instead of `duration` in `wait_for_leader`
The previous check would find a leader once and assume that it does not
change, and that the first attempt at sending a request to this leader
succeeds. In reality the leader may change at the end of the test (e.g.
it may be in the middle of stepping down when we find it) and in general
it may take some time for the cluster to stabilize. The new check tries
a few times to find a leader and perform a request - until a time limit
is reached.
The commit also removes an incorrect assertion inside `wait_for_leader`.
Until now reversed queries were implemented inside
`querier::consume_page` (more precisely, inside the free function
`consume_page` used by `querier::consume_page`) by wrapping the
passed-in reader into `make_reversing_reader` and then consuming
fragments from the resulting reversed reader.
The first couple of commits change that by pushing the reversing down below
the `make_combined_reader` call in `table::query`. This allows
working on improving reversing for memtables independently from
reversing for sstables.
We then extend the `index_reader` with functions that allow
reading the promoted index in reverse.
We introduce `partition_reversing_data_source`, which wraps an sstable data
file and returns data buffers with contents of a single chosen partition
as if the rows were stored in reverse order.
We use the reversing source and the extended index reader in
`mx_sstable_mutation_reader` to implement efficient (at least in theory)
reversed single-partition reads.
The patchset disables cache for reversed reads. Fast-forwarding
is not supported in the mx reader for reversed queries at this point.
Details in commit messages. Read the commits in topological order
for best review experience.
Refs: #9134
(not saying "Fixes" because it's only for single-partition queries
without forwarding)
Closes #9281
* github.com:scylladb/scylla:
table: add option to automatically bypass cache for reversed queries
test: reverse sstable reader with random schema and random mutations
sstables: mx: implement reversed single-partition reads
sstables: mx: introduce partition_reversing_data_source
sstables: index_reader: add support for iterating over clustering ranges in reverse
clustering_key_filter: clustering_key_filter_ranges owning constructor
flat_mutation_reader: mention reversed schema in make_reversing_reader docstring
clustering_key_filter: document clustering_key_filter_ranges::get_ranges
Currently the new reversing sstable algorithms do not support fast
forwarding and the cache does not yet handle reversed results. This
forced us to disable the cache for reversed queries if we want to
guarantee bounded memory. We introduce an option that does this
automatically (without specifying `bypass cache` in the query) and turn
it on by default.
If the user decides that they prefer to keep the cache at the
cost of fetching entire partitions into memory (which may be viable
if their partitions are small) during reversed queries, the option can
be turned off. It is live-updateable.
The test generates a random set of mutations and creates two readers:
- one by reversing the mutations, creating an sstable out of the result,
and querying it in reverse,
- one by creating an sstable directly from the mutations and querying it
in forward mode.
It checks that the readers give equal results.
The test already managed to find a bug where offsets returned by the
sstable index were interpreted incorrectly as absolute instead of
relative. It also helped find another bug unrelated to reversing (#9352).
Surprisingly few tests use the random schema and random mutation
utilities, which seem to be quite powerful.
We use partition_reversing_data_source and the new `index_reader` methods
to implement single-partition reads in `mx_sstable_mutation_reader`.
The parsing logic does not need to change: the buffers returned by the
source already contain rows in reversed clustering order.
Some changes were required in `mp_row_consumer_m` which processes the
parsed rows and emits appropriate mutation fragments. The consumer uses
`mutation_fragment_filter` underneath to decide whether a fragment
should be ignored or not (e.g. the parsed fragment may come from outside
the requested clustering range), among other things. Previously
`mutation_fragment_filter` was provided a `partition_slice`. If the
slice was reversed, the filter would use
`clustering_key_filter_ranges::get_ranges` to obtain the clustering
ranges from the slice in unreversed order (they were reversed in the
slice) since we didn't perform any reversing in the reader. Now the
reader provides the ranges directly instead of the slice; furthermore,
the ranges are provided in native-reversed format (the order of ranges
is reversed and the ranges themselves are also reversed), and the schema
provided to the filter is also reversed. Thus to the filter everything
appears as if it was used during a non-reversed query but on a table
with reversed schema, which works correctly given the fact that the
reader is feeding parsed rows into the consumer in reversed order.
During reversed queries the reader uses alternative logic for skipping
to a later range (or, speaking in non-reversed terms, to an earlier range),
which happens in `advance_context`. It asks the index to advance its
upper bound in reverse so that the reversing_data_source notices the
change of the index end position and returns following buffers with rows
from the new range.
There is a slight difference in behavior of the reader from
`mp_row_consumer_m`'s point of view. For non-reversed reads, after
the consumer obtains the beginning of a row (`consume_row_start`)
- which contains the row's position but not the columns - and tells the
reader that the row won't be emitted because we need to skip to a later
range, the reader would tell the data source (the 'context') immediately
to skip to a later range by calling `skip_to`. This caused the source
not to return the rest of the row, and the rest of the row would not
be fed to the consumer (`consume_row_end`). However, for reversed reads,
the data source performs skipping 'on its own', after it notices that
the index end position has changed. This may happen 'too late', causing
the rest of the row to be returned anyway. We are prepared for this
situation inside `mp_row_consumer` by consulting the mutation fragment
filter again when the rest of the row arrives.
Fast forwarding is not supported at this point, which is fine given that
the cache is disabled for reversed queries for now (and the cache is the
only user of fast forwarding).
The `partition_slice` provided by callers is provided in 'half-reversed'
format for reversed queries, where the order of clustering ranges is
reversed, but the ranges themselves are not. This means we need to modify
the slice sometimes: for non-single-partition queries the mx reader must
use a non-reversed slice, and for single-partition queries the mx reader
must use a native-reversed slice (where the clustering ranges themselves
are reversed as well). The modified slice must be stored somewhere; we
store it inside the mx reader itself so we don't need to allocate more
intermediate readers at the call sites. This causes the interface of
`mx::make_reader` to be a bit weird: for non-single-partition queries
where the provided slice is reversed the reader will actually return a
non-reversed stream of fragments, telling the user to reverse the stream
on their own. The interface has been documented in detail with
appropriate comments.
This patch adds an implementation of a data source that wraps an sstable
data file and returns data buffers with contents of one partition in the
sstable as if the rows of the partition were present in a reversed
order. In other words, to the user of the source the partition appears
to be reversed. We shall call this an 'intermediary' data source.
As part of the interface of the intermediary source the user is also
given read access to the source's current position over the data file,
and the constructor of the source takes a reference to `index_reader`.
This is necessary because the index operates directly on data file
offsets and we want the user to be able to use the index to skip
sequences of rows.
In order to ask the source to skip a sequence of rows - e.g. when jumping
between clustering ranges - the user must advance the index's upper bound
in reverse (to an earlier position). The source will then notice that
the end position of the index has changed and take appropriate action.
An alternative would be to translate the data positions of
`index_reader` to 'reversed positions' of the intermediary and then use
`skip_to` for skipping, as we do for forward reads. However this
solution would introduce more complexity to `index_reader` and the
intermediary source. One reason for the complexity is that we would
have two kinds of skips: a single-row skip and a skip to a clustering
range. We know the offset of the next row, so we could check that to
differentiate the two. We would also need to add information about
the position of the first clustering row and the end of the last one
to the index_reader. Skipping by checking the index seems simpler
overall.
For simplicity, the intermediary stream always starts by parsing the
partition header and (if present) the static row, returning the
corresponding bytes as the result of the first read.
After partition header and static row we must find the last row entry of
the requested range. If the range ends before the partition end (i.e.
there are more row entries after the range) we can use the 'previous
unfiltered size' of the row following the range; otherwise we must scan
the last promoted index block and take its last row.
After finding the data range of the last row, we parse rows
consecutively in reversed order. We must parse the rows partially to
learn their lengths and the positions of previous rows. We use
constructs similar to those in the sstable parser, but ours contains
only a small part of the parsing coroutine and doesn't perform any
correctness checks. The row parser still turned out rather big,
mostly because we can't always deduce the size of the clustering
blocks without reading the block header.
The parser allows reading rows while skipping their bodies also in
non-reversed order, which we are making use of while reading the
last promoted index block.
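The core idea -- walking rows backwards through each row's recorded
'previous row size' -- can be illustrated with a toy serialization
(this is not the real sstable format; in the real reader the offset
of the last row comes from the promoted index):

```python
import struct

def encode_partition(payloads):
    """Toy row format: [u32 prev_row_size][u32 payload_len][payload].
    Returns the blob plus the offset of the last row (which the real
    reader would obtain from the promoted index)."""
    blob = b""
    prev_size = 0
    last_row_start = 0
    for p in payloads:
        last_row_start = len(blob)
        row = struct.pack("<II", prev_size, len(p)) + p
        blob += row
        prev_size = len(row)
    return blob, last_row_start

def rows_reversed(blob, last_row_start):
    """Yield payloads in reverse order by following prev_row_size."""
    off = last_row_start
    while True:
        prev_size, plen = struct.unpack_from("<II", blob, off)
        yield blob[off + 8 : off + 8 + plen]
        if off == 0:
            break          # reached the first row of the partition
        off -= prev_size
```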
The intermediary data source has one more utility: reversing range
tombstones. When we read a tombstone bound/boundary, we modify
the data buffer so that the resulting bound/boundary has the reversed
kind (so we don't read ends before starts) and the boundaries have their
before/after timestamps swapped.
In the sstable reader, we iterate over clustering ranges using the
index_reader, which normally only accepts advancing to increasing
positions. In this patch we add methods for advancing the index
reader in reverse.
To simplify our job we restrict our attention to a single implementation
of the promoted index block cursor: `bsearch_clustered_cursor`. The
`index_reader` methods for advancing in reverse will thus assume that
this implementation is used. The assumption is correct given that we're
working only with sstables of versions >= mc, which is indeed the
intended use case. We add some documentation in appropriate places to
make this obvious.
We extend `bsearch_clustered_cursor` with two methods:
`advance_past(pos)`, which advances the cursor to the first block after
`pos` (or to the end if there is no such block), and
`last_block_offset()`, which returns the data file offset of the first
row from the last promoted index block.
To efficiently find the position in the data file of the last row
of the partition (which we need when performing a reversed query)
the sstable reader may need to read the span of the entire last promoted
index block in the data file. To learn where the block starts it can use
`index_reader::last_block_offset()`, which is implemented in terms of
`bsearch_clustered_cursor::last_block_offset()`.
When performing a single partition read in forward order, the reader
asks the index to position its lower bound at the start of the partition
and its upper bound after the end of the slice. It starts by reading the
first range. After exhausting a range it jumps to the next one by asking
the index to advance the lower bound.
For reverse single partition reads we'll take a similar approach: the
initial bound positions are as in the forward case. However, we start
with the last range and after exhausting a range we want to jump to a
previous one; we will do it by advancing the upper bound in reverse
(i.e. moving it closer to the beginning of the partition). For this
we introduce the `index_reader::advance_reverse` function.
"
The storage_service is involved in the cdc_generation_service guts
more than needed:
- the bool _for_testing bit is cdc-only
- there's an API-only cdc_generation_service getter
- part of the cdc::generation_service startup code sits in the
  storage_service one
This patch cleans up most of the above, leaving only the startup
_cdc_gen_id on board.
tests: unit(dev)
refs: #2795
"
* 'br-storage-service-vs-cdc-2' of https://github.com/xemul/scylla:
api: Use local sharded<cdc::generation_service> reference
main: Push cdc::generation_service via API
storage_service: Ditch for_testing boolean
cdc: Replace db::config with generation_service::config
cdc: Drop db::config from description_generator
cdc: Remove all arguments from maybe_rewrite_streams_descriptions
cdc: Move maybe_rewrite_streams_descriptions into after_join
cdc: Squash two methods into one
cdc: Turn make_new_cdc_generation a service method
cdc: Remove ring-delay arg from make_new_cdc_generation
cdc: Keep database reference on generation_service
The end_point_hints_manager's field _last_written_rp is initialized in
its regular constructor, but is not copied in the move constructor.
Because the move constructor is always invoked when creating a new
endpoint manager, the _last_written_rp field is effectively always
initialized by the zero-argument constructor, and is thus set to
zero.
This can cause the following erroneous situation to occur:
- Node A accumulates hints towards B.
- Sync point is created at A.
It will be used later to wait for currently accumulated hints.
- Node A is restarted.
The endpoint manager A->B is created which has bogus value in the
_last_written_rp (it is set to zero).
- Node A replays its hints but does not write any new ones.
- A hint flush occurs.
If there are no hint segments on disk after flush, the endpoint
manager sets its last sent position to the last written position,
which is by design. However, the last written position has incorrect
value, so the last sent position also becomes incorrect and too low.
- Try to wait for the sync point created earlier.
The sync point waiting mechanism waits until last sent hint position
reaches or goes past the position encoded in the sync point, but it
will not happen because the last sent position is incorrect.
The above bug can be (sometimes) reproduced in
hintedhandoff_sync_point_api_test dtest.
Now, the _last_written_rp field is properly initialized in the move
constructor, which prevents the bug described above.
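The chain of events above can be condensed into a toy model
(illustrative names, not the hinted-handoff code):

```python
class EndpointManager:
    def __init__(self, last_written_rp):
        # The bug: the move constructor dropped this field, so after a
        # restart it was zero instead of the real replay position.
        self.last_written_rp = last_written_rp
        self.last_sent_rp = 0

    def flush(self, segments_left_on_disk):
        if not segments_left_on_disk:
            # no segments on disk: everything written counts as sent
            self.last_sent_rp = self.last_written_rp

def sync_point_reached(manager, sync_point_rp):
    """A sync point waiter wakes up only once last_sent_rp reaches it."""
    return manager.last_sent_rp >= sync_point_rp
```

With the field preserved, the waiter wakes up once the flush runs;
with it zero-initialized, the condition never becomes true and the
waiter hangs.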
Fixes: #9320
Closes #9426
We must abort the environment before the ticker as the environment may
require time to keep advancing during abort in order for all operations
to finish, e.g. operations that can finish only due to timeout.
Currently such operations may cause the test to hang indefinitely
at the end.
The test requires a small modification to ensure that
`delivery_queue::push` is not called after the queue was aborted.
Message-Id: <20210930143539.157727-1-kbraun@scylladb.com>
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."
* 'compaction_manager_layer_violation_fix/v4' of github.com:raphaelsc/scylla:
compaction: split compaction info and data for control
compaction_manager: use task when stopping a given compaction type
compaction: remove start_size and end_size from compaction_info
compaction_manager: introduce helpers for task
compaction_manager: introduce explicit ctor for task
compaction: kill sstables field in compaction_info
compaction: kill table pointer in compaction_info
compaction: simplify procedure to stop ongoing compactions
compaction: move management of compaction_info to compaction_manager
compaction: move output run id from compaction_info into task
The script assumes the remote name is "origin", a fair
assumption, but not universally true. Read it from configuration
instead of guessing it.
Closes #9423
Since May 2020 empty strings are allowed in DynamoDB as attribute
values (see the announcement in [1]). However, they are still not
allowed as keys. We had tests that they are not allowed in keys of an
LSI or GSI, but missed tests that they are not allowed as keys
(partition or sort key) of base tables. This patch adds these missing
tests.
These tests pass - we already had code that checked for empty keys and
generated an appropriate error.
Note that for compatibility with DynamoDB, Alternator will forbid empty
strings as keys even though Scylla *does* support this possibility
(Scylla always supported empty strings as clustering key, and empty
partition keys will become possible with issue #9352).
[1] https://aws.amazon.com/about-aws/whats-new/2020/05/amazon-dynamodb-now-supports-empty-values-for-non-key-string-and-binary-attributes-in-dynamodb-tables/
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211003122842.471001-1-nyh@scylladb.com>
The gist of this series is splitting `compaction_abort_exception`
from `compaction_stop_exception` and their respective error messages,
to differentiate between a compaction being stopped due to e.g.
shutdown or an API event, and a compaction aborting due to a scrub
validation error.
While at it, clean up the existing retry logic related to
`compaction_stop_exception`.
Test: unit(dev)
Dtest: nodetool_additional_test.py:TestNodetool.{{scrub,validate}_sstable_with_invalid_fragment_test,{scrub,validate}_ks_sstable_with_invalid_fragment_test,{scrub,validate}_with_one_node_expect_data_loss_test} (dev, w/ https://github.com/scylladb/scylla-dtest/pull/2267)
Closes #9321
* github.com:scylladb/scylla:
compaction: split compaction_aborted_exception from compaction_stopped_exception
compaction_manager: maybe_stop_on_error: rely on retry=false default
compaction_manager: maybe_stop_on_error: sync return value with error message.
compaction: drop retry parameter from compaction_stop_exception
compaction_manager: move errors stats accounting to maybe_stop_on_error
Just like EBS disks for EC2, we want to use persistent disk on GCE.
We won't recommend using it, but we still need to support it.
Related scylladb/scylla-machine-image#215
Closes #9395
* seastar e6db0cd587...0ba6c36cc3 (6):
> semaphore: add try_get_units
> build: adjust compilation for libfmt 8+
> alloc_failure_injector: add explicit zero-initialization
> Change wakeup() from private to public in reactor.hh
> app-template: separate seastar options into --seastar-help
> files: Don't ignore FS info for read-only files
The pull_github_pr.sh script is preferred over colorful github
buttons, because it's designed to always assign proper authors.
It also works for both single- and multi-patch series, which
makes the merging process more universal.
Message-Id: <b982b650442456b988e1cea59aa5ad221207b825.1633101849.git.sarna@scylladb.com>
by setting _alloc_count initially to 0.
The _alloc_count wasn't explicitly initialized. As the allocator has
usually been an automatic variable, _alloc_count initially held some
unspecified contents. This probably means that cases where the first
few allocations passed and a later one failed may never have been
tested. The good thing is that most of the users have been
transferred to the Seastar failure injector, which (by accident) has
been correct.
Closes #9420